{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# scikit-learn\n",
"\n",
"### Instalacja\n",
"\n",
"W laboratoriach pakiet `sklearn` powinien być zainstalowany pod Windowsem i Linuxem.\n",
"\n",
"* Pod Linuxem można zainstalować lokalnie:\n",
"        pip install --user sklearn\n",
"* lub na własnym komputerze:\n",
"        sudo pip install sklearn\n",
"\n",
"Konieczne może być ponowne uruchomienie IPython, jeśli był uruchomiony podczas instalacji.\n",
"\n",
"### Dokumentacja\n",
"\n",
"Pełna dokumentacja: http://scikit-learn.org/0.16. Należy zmienić numer wersji zgodnie z zainstalowaną wersją, aby przejść do właściwej dokumentacji.\n",
"\n",
"## 1. Zapoznanie z pakietem\n",
"\n",
"* Zapoznaj się z [dokumentacją](http://scikit-learn.org/0.16) _scikit-learn_. \n",
"* Na podstawie [API](http://scikit-learn.org/0.16/modules/classes.html), podaj listę dostępnych rodzajów klasyfikatorów i regresorów. Które z metod poznaliśmy na wykładach, a które są nowe?\n",
"* Jakie inne zagadnienia (inne niż modele klasyifikacji/regresji) omawiane na wykładach są uwzględniane w pakiecie `sklearn`? Krótko je opisz."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lista Regresorów\n",
"\n",
"An AdaBoost regressor ---\n",
"ensemble.AdaBoostRegressor([base_estimator, ...])\t\n",
"\n",
"A Bagging regressor ---\n",
"ensemble.BaggingRegressor([base_estimator, ...])\n",
"\n",
"An extra-trees regressor. (nowe) ---\n",
"ensemble.ExtraTreesRegressor([n_estimators, ...])\n",
"\n",
"Gradient Boosting for regression. ---\n",
"ensemble.GradientBoostingRegressor([loss, ...])\n",
"\n",
"A random forest regressor. (nowe) ---\n",
"ensemble.RandomForestRegressor([...])\n",
"\n",
"Regression based on k-nearest neighbors. ---\n",
"neighbors.KNeighborsRegressor([n_neighbors, ...])\n",
"\n",
"Regression based on neighbors within a fixed radius. ---\n",
"neighbors.RadiusNeighborsRegressor([radius, ...])\n",
"\n",
"Passive Aggressive Regressor (nowe) ---\n",
"linear_model.PassiveAggressiveRegressor([C, ...])\n",
"\n",
"RANSAC (RANdom SAmple Consensus) algorithm. (nowe) ---\n",
"linear_model.RANSACRegressor([...])\n",
"\n",
"Linear model fitted by minimizing a regularized empirical loss with SGD ---\n",
"linear_model.SGDRegressor([loss, penalty, ...])\n",
"\n",
"Theil-Sen Estimator: robust multivariate regression model. (nowe) ---\n",
"linear_model.TheilSenRegressor([...])\n",
"\n",
"\n",
"### Lista klasyfikatorów\n",
"\n",
"An AdaBoost classifier. ---\n",
"ensemble.AdaBoostClassifier([...])\n",
"\n",
"A Bagging classifier. ---\n",
"ensemble.BaggingClassifier([base_estimator, ...])\n",
"\n",
"An extra-trees classifier. (nowe) ---\n",
"ensemble.ExtraTreesClassifier([...])\n",
"\n",
"Gradient Boosting for classification. ---\n",
"ensemble.GradientBoostingClassifier([loss, ...])\n",
"\n",
"A random forest classifier. (nowe) ---\n",
"ensemble.RandomForestClassifier([...])\n",
"\n",
"Linear classifiers (SVM, logistic regression, a.o.) with SGD training. ---\n",
"linear_model.SGDClassifier([loss, penalty, ...])\n",
"\n",
"Classifier implementing the k-nearest neighbors vote. ---\n",
"neighbors.KNeighborsClassifier([...])\n",
"\n",
"Classifier implementing a vote among neighbors within a given radius ---\n",
"neighbors.RadiusNeighborsClassifier([...])\n",
"\n",
"\n",
"### Znajome zagadnienia\n",
"\n",
"Mini-Batch K-Means clustering ---\n",
"cluster.MiniBatchKMeans([n_clusters, init, ...])\n",
"\n",
"Perceptron ---\n",
"linear_model.Perceptron([penalty, alpha, ...])\n",
"\n",
"Normalize samples individually to unit norm. ----\n",
"preprocessing.Normalizer([norm, copy])\n",
" \n",
"Binding of the cross-validation routine (low-level routine) ---\n",
"svm.libsvm.cross_validation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 2. Tutorial\n",
"\n",
"* Zapoznaj się z [wprowadzeniem](http://scikit-learn.org/0.15/tutorial/basic/tutorial.html) do `sklearn`.\n",
"* Opracuj [tutorial \"text analytics\"](http://scikit-learn.org/0.16/tutorial/text_analytics/working_with_text_data.html) w formie notatnika IPython w języku polskim. Opracowanie nie może być wyłącznie tłumaczeniem."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"sckikit-learn jest jedną z najlepszyhc bibliotek służocych do implementacji algorytmów uczenia maszynowego. Poniżej, na przykładzie pracy z tekstem zaprezentuję, jak działa ta biblioteka."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Wczytanie i podstawowe operacje na danych tekstowych\n",
"\n",
"Będziemy korzystach z gotowych danych, które oferuje nam bilbioteka sklearn. Jest to zestaw 20000 dokumentów pogrupowanych w 20 różnych kategoriach.\n",
"\n",
"Na potrzeby szybszych czasów odpalania wybiorę 4 grupy i na ich podstawie wczytam listę artykułów"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']\n",
"from sklearn.datasets import fetch_20newsgroups\n",
"twenty_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sprawdzimy teraz, czy pliki zostały prawidłowo wczytana, na postawie ilości załadowanych plików i przykładowego fragmentu tekstu"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Liczba plików:\n",
"2257\n",
"Przykładowy dokument:\n",
"From: geb@cs.pitt.edu (Gordon Banks)\n",
"Subject: Re: Blindsight\n",
"Reply-To: geb@cs.pitt.edu (Gordon Banks)\n",
"Organization: Univ. of Pittsburgh Computer Science\n",
"Lines: 18\n",
"\n",
"In article <werner-240393161954@tol7mac15.soe.berkeley.edu> werner@soe.berkeley.edu (John Werner) writes:\n",
">In article <19213@pitt.UUCP>, geb@cs.pitt.edu (Gordon Banks) wrote:\n",
">> \n",
">> Explain.  I thought there were 3 types of cones, equivalent to RGB.\n",
">\n",
">You're basically right, but I think there are just 2 types.  One is\n",
">sensitive to red and green, and the other is sensitive to blue and yellow. \n",
">This is why the two most common kinds of color-blindness are red-green and\n",
">blue-yellow.\n",
">\n",
"\n",
"Yes, I remember that now.  Well, in that case, the cones are indeed\n",
"color sensitive, contrary to what the original respondent had claimed.\n",
"-- \n",
"----------------------------------------------------------------------------\n",
"Gordon Banks  N3JXP      | \"Skepticism is the chastity of the intellect, and\n",
"geb@cadre.dsl.pitt.edu   |  it is shameful to surrender it too soon.\" \n",
"----------------------------------------------------------------------------\n",
"\n"
]
}
],
"source": [
"print \"Liczba plików:\"\n",
"print len(twenty_train.data)\n",
"print \"Przykładowy dokument:\"\n",
"print(\"\\n\".join(twenty_train.data[8].split(\"\\n\")))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Aby móc pracować algorytmami uczenia maszynowego potrzebujemy poznać kategorię dokumentów. Możemy to zrobić w następujący sposób:"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[1 1 3 3 3]\n",
"comp.graphics\n",
"comp.graphics\n",
"soc.religion.christian\n",
"soc.religion.christian\n",
"soc.religion.christian\n"
]
}
],
"source": [
"print twenty_train.target[:5]\n",
"for t in twenty_train.target[:5]:\n",
"    print(twenty_train.target_names[t])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Wydobywanie cech z tekstu\n",
"\n",
"Aby zacząć pracować nad tekstami, należy wydobyć z nich cechy.\n",
"Skorzystamy z gotowych funkcji i zbudujemy na podstawie naszych tekstów słownik."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import CountVectorizer\n",
"count_vect = CountVectorizer()\n",
"X_train_counts = count_vect.fit_transform(twenty_train.data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Możemy teraz sprawdzić ile razy występuje dany wyraz w naszym słowniku (wyraz ten jest naszą cechą)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Wyraz 'computer' występuje 9338 razy\n"
]
}
],
"source": [
"print \"Wyraz 'computer' występuje {} razy\".format(count_vect.vocabulary_.get(u'computer'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Kiedy potrafimy utworzyć już słownik z gotowych tekstów jesteśmy na dobrej drodze, aby zacząć analizować częstotliwość wystąpienia danych słów. Pamiętajmy również, że dłuże teksty mogą mieć więcej wystąpień danego słowa, a krótsze mniej, lecz nadal mogą dotyczyć tego samego tematu. Częstotliwość wystąpień danego wyrazu w obrębie tekstu będziemy nazywać tf (od ang. \"Term Frequencies\"). Z kolei tf-idf będziemy nazywać mało istotne słowa które często się pojawiają i nie mają dużego wpływu na grupę dokumentu. tf i tf-idf możemy obliczyć w następujący sposób:"
]
},
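{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a reference point, the textbook form of the tf-idf weighting can be written as below. This is a sketch of the standard definition, not the exact formula applied by `TfidfTransformer`, which by default uses a smoothed idf and normalizes each document vector. The symbols are introduced here for illustration only: $N$ is the number of documents, $\\mathrm{df}(t)$ is the number of documents containing the term $t$, and $\\mathrm{tf}(t, d)$ is the frequency of $t$ in document $d$.\n",
"\n",
"$$\\mathrm{idf}(t) = \\log\\frac{N}{\\mathrm{df}(t)}, \\qquad \\mathrm{tfidf}(t, d) = \\mathrm{tf}(t, d) \\cdot \\mathrm{idf}(t)$$\n",
"\n",
"A term that occurs in every document has $\\mathrm{df}(t) = N$, so its idf is $\\log 1 = 0$ and it contributes nothing, while a term found in only a few documents receives a large weight."
]
},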
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.feature_extraction.text import TfidfTransformer\n",
"tfidf_transformer = TfidfTransformer()\n",
"X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Trenowanie klasyfikatora\n",
"\n",
"Kiedy mamy już nasze cechy (słowa) możemy przystąpić do trenowania klasyfikatora, który postara się przewidzieć kategorię dokumentu\n",
"Zacznijmy od naiwnego klasyfikatora bayesowskiego."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from sklearn.naive_bayes import MultinomialNB\n",
"clf = MultinomialNB().fit(X_train_tfidf, twenty_train.target)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Spróbujmy teraz podać krótki przykład tekstu i zobaczyć do jakiej kategorii zostanie przypisany"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"In monotheism, God is conceived of as the Supreme Being and principal object of faith. The concept of God as described by most theologians includes the attributes of omniscience (infinite knowledge), omnipotence (unlimited power), omnipresence (present everywhere), divine simplicity, and as having an eternal and necessary existence.\n",
"Powyższy tekst należy do kategorii: soc.religion.christian\n",
"\n",
"A video card (also called a display card, graphics card, display adapter or graphics adapter) is an expansion card which generates a feed of output images to a display (such as a computer monitor). Frequently, these are advertised as discrete or dedicated graphics cards, emphasizing the distinction between these and integrated graphics.\n",
"Powyższy tekst należy do kategorii: comp.graphics\n",
"\n",
"Penicillin (PCN or pen) is a group of antibiotics which include penicillin G (intravenous use), penicillin V (use by mouth), and procaine penicillin, and benzathine penicillin (intramuscular use). Penicillin antibiotics were among the first medications to be effective against many bacterial infections caused by staphylococci and streptococci.\n",
"Powyższy tekst należy do kategorii: sci.med\n",
"\n",
"Atheism is, in the broadest sense, the absence of belief in the existence of deities. Less broadly, atheism is the rejection of belief that any deities exist. In an even narrower sense, atheism is specifically the position that there are no deities. Atheism is contrasted with theism, which, in its most general form, is the belief that at least one deity exists.\n",
"Powyższy tekst należy do kategorii: alt.atheism\n",
"\n"
]
}
],
"source": [
"docs_new = ['In monotheism, God is conceived of as the Supreme Being and principal object of faith. The concept of God as described by most theologians includes the attributes of omniscience (infinite knowledge), omnipotence (unlimited power), omnipresence (present everywhere), divine simplicity, and as having an eternal and necessary existence.', \n",
"            'A video card (also called a display card, graphics card, display adapter or graphics adapter) is an expansion card which generates a feed of output images to a display (such as a computer monitor). Frequently, these are advertised as discrete or dedicated graphics cards, emphasizing the distinction between these and integrated graphics.',\n",
"            'Penicillin (PCN or pen) is a group of antibiotics which include penicillin G (intravenous use), penicillin V (use by mouth), and procaine penicillin, and benzathine penicillin (intramuscular use). Penicillin antibiotics were among the first medications to be effective against many bacterial infections caused by staphylococci and streptococci.',\n",
"            'Atheism is, in the broadest sense, the absence of belief in the existence of deities. Less broadly, atheism is the rejection of belief that any deities exist. In an even narrower sense, atheism is specifically the position that there are no deities. Atheism is contrasted with theism, which, in its most general form, is the belief that at least one deity exists.']\n",
"X_new_counts = count_vect.transform(docs_new)\n",
"X_new_tfidf = tfidf_transformer.transform(X_new_counts)\n",
"\n",
"predicted = clf.predict(X_new_tfidf)\n",
"\n",
"for doc, category in zip(docs_new, predicted):\n",
"    print(doc)\n",
"    print \"Powyższy tekst należy do kategorii: \"+twenty_train.target_names[category]\n",
"    print"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Budowanie przepływu\n",
"Używając przepływu, można zapisać to co analizowaliśmy wsześniej w jednej linijce w następujący sposób:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"from sklearn.pipeline import Pipeline\n",
"text_clf = Pipeline([('vect', CountVectorizer()),\n",
"                    ('tfidf', TfidfTransformer()),\n",
"                    ('clf', MultinomialNB()),\n",
"])\n",
"text_clf = text_clf.fit(twenty_train.data, twenty_train.target)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sprawdzanie poprawności na zbiorze testowym\n",
"Sprawdzanie poprawności na zbiorze jest bardzo proste i możemy to uzyskać wykonując następujące operacje:"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nasz klasyfikator uzyskał skuteczność na poziomie 83.4886817577%\n"
]
}
],
"source": [
"import numpy as np\n",
"twenty_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)\n",
"docs_test = twenty_test.data\n",
"predicted = text_clf.predict(docs_test)\n",
"print \"Nasz klasyfikator uzyskał skuteczność na poziomie {}%\".format(np.mean(predicted == twenty_test.target)*100)   "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Spróbujmy uzyskać lepszą skuteczność posługując się innym klasyfikatorem. Wykorzystajmy do tego maszynę wektorów nośnych, która dla danych tekstowych może przynieść lepszą skuteczność, lecz niestety z dłuższym czasem przetwarzania."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Nasz klasyfikator uzyskał skuteczność na poziomie 91.2782956059%\n"
]
}
],
"source": [
"from sklearn.linear_model import SGDClassifier\n",
"text_clf = Pipeline([('vect', CountVectorizer()),\n",
"                    ('tfidf', TfidfTransformer()),\n",
"                    ('clf', SGDClassifier(loss='hinge', penalty='l2',\n",
"                    alpha=1e-3, n_iter=5, random_state=42)),\n",
"])\n",
"_ = text_clf.fit(twenty_train.data, twenty_train.target)\n",
"predicted = text_clf.predict(docs_test)\n",
"print \"Nasz klasyfikator uzyskał skuteczność na poziomie {}%\".format(np.mean(predicted == twenty_test.target)*100) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Jeżeli zwykły procentowy wynik to dla nas za mało, może równie łatwo zdobyć bardziej szczegółowe statystyki w następujący sposób:"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"                        precision    recall  f1-score   support\n",
"\n",
"           alt.atheism       0.95      0.81      0.87       319\n",
"         comp.graphics       0.88      0.97      0.92       389\n",
"               sci.med       0.94      0.90      0.92       396\n",
"soc.religion.christian       0.90      0.95      0.93       398\n",
"\n",
"           avg / total       0.92      0.91      0.91      1502\n",
"\n"
]
}
],
"source": [
"from sklearn import metrics\n",
"print(metrics.classification_report(twenty_test.target, predicted,\n",
"    target_names=twenty_test.target_names))"
]
}
],
"metadata": {
"anaconda-cloud": {},
"hide_input": false,
"kernelspec": {
"display_name": "Python [conda root]",
"language": "python",
"name": "conda-root-py"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 2
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython2",
"version": "2.7.12"
}
},
"nbformat": 4,
"nbformat_minor": 0
}