Introduction to Machine Learning.md


        
            Introduction
Machine learning is more than simply computing averages or performing some data manipulation. It actually involves making predictions about new observations based on previous information.
Types

Supervised (Predictive Analytics) There exists a relationship to exploit between some inputs (predictors) and one or more outputs (responses or outcomes). An algorithm builds a function that approximates this relationship from observations with known outputs, and that function is then used to map new inputs whose outputs are unknown.

Classification for Qualitative Output(s)

k-Nearest Neighbors (kNN)
Naive Bayes
Logistic Regression
Recursive Partitioning (or Decision Tree)
Random Forest
Support Vector Machine

Regression for Quantitative Output(s)

Linear Regression
Lasso Regression
Ridge Regression
Poisson Regression (counting, discrete data)


Unsupervised (Pattern Discovery) There is no concept of predictor and response: the algorithm looks for structure in the data itself.

Dimensionality Reduction for Feature Selection

Principal Components (PCA)
Factor Analysis (FA)
Manifold Learning (IsoMap)

Clustering for grouping objects together such that the objects are similar within each group, and dissimilar between different groups. Notice that there is no prior knowledge of what the resulting groups could or should look like. 

k-means, k-medians, k-modes, k-prototypes
Hierarchical
Mean-shift
DBSCAN

Anomaly Detection for Outlier Analysis

Isolation Forests

Data Imputation for Missing Values
Natural Language Processing ==> Topic Modeling


Use cases

Spam Filtering. Predictors: word frequency, character frequency, the number of consecutive capital letters
Fraud Detection
Credit Scoring
Customer Segmentation
Shopping Basket Analysis
Content Tagging
Image/Text Recognition
Recommender System
Medical Diagnosis

Best-known R packages

caret: Classification And REgression Training
mlr: Machine Learning in R
glmnet
rpart / party Recursive Partitioning and Regression Trees
ROCR Visualizing the performance of scoring classifiers
e1071
randomForest
nnet
igraph Collection of Network Analysis tools
kernlab
neuralnet
h2o
tensorflow Deep Learning
keras Deep Learning

Steps for Supervised Models


EDA: preprocess and explore the data 


dataset splitting

The model should be built only on a subset of the available observations. The remaining units should be used to assess the predictive power of the model. In order to have a fair distribution of the output variable in each set, it's important that the units are sampled randomly from the dataset or, equivalently, that the dataset is shuffled beforehand and the subsets are then extracted sequentially.


Data splitting into validation / train / test partitions


model selection and validation, using default Hyper-Parameters


fit the model on the training set


fine-tune ==> act on the Hyper-Parameters of the model structure to find the best configuration


generate predictions by applying the model to the test set


evaluate the model

Classification: Confusion Matrix, accuracy, precision (1 - false discovery rate), recall (1 - false negative rate)
Regression: mean absolute error, median absolute error, R^2 score


interpret the results
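The steps above can be sketched end to end on a small example. This is only a minimal sketch under stated assumptions: the built-in mtcars data, a 70/30 split, and a linear model of mpg on wt and hp are illustrative choices, not part of the original notes, and the tuning step is skipped because lm has no hyper-parameters.

```r
# Illustrative assumptions: mtcars data, 70/30 split, model mpg ~ wt + hp
set.seed(1)
dat <- mtcars

# split: sample row indices at random for the training set
idx_train <- sample.int(nrow(dat), round(0.7 * nrow(dat)))
dat_train <- dat[idx_train, ]
dat_test  <- dat[-idx_train, ]

# fit the model on the training set only
fit <- lm(mpg ~ wt + hp, data = dat_train)

# generate predictions on the held-out test set
pred <- predict(fit, newdata = dat_test)

# evaluate with the regression metrics listed above
err   <- dat_test$mpg - pred
mae   <- mean(abs(err))      # mean absolute error
medae <- median(abs(err))    # median absolute error
r2    <- 1 - sum(err^2) / sum((dat_test$mpg - mean(dat_test$mpg))^2)  # R^2 on the test set
c(mae = mae, medae = medae, r2 = r2)
```

With only 32 rows the test-set figures vary a lot with the seed, which is one motivation for the cross-validation approach described later.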


In problems that have a random aspect, the set.seed(n) function should be used to enforce reproducibility. After fixing the seed, the random numbers that are generated are always the same, and so should be all the subsequent results.
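A quick way to see this is to draw the same random sample twice after setting the same seed. The seed value 123 and the variable names are arbitrary choices made for this sketch.

```r
set.seed(123)
first <- sample(1:100, 5)

set.seed(123)               # same seed again
second <- sample(1:100, 5)

identical(first, second)    # TRUE

# without resetting the seed, the random stream simply continues
third <- sample(1:100, 5)
```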
R built-in functions

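lm fits linear models
predict is a generic function: it takes a model (1st arg) and some unseen observations (2nd arg), then returns the outcomes predicted by the model for those observations.
kmeans performs k-means clustering
set.seed fixes the seed of the random number generator, enforcing reproducibility in problems that have a random aspect. If you fix the seed, the random numbers that are generated are always the same.
rpart fits recursive partitioning (classification and regression) trees; it comes from the rpart package, which ships with R as a recommended package and is loaded with library(rpart)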
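As a minimal illustration of how these functions fit together, the sketch below uses the built-in cars dataset (columns speed and dist). The dataset, the new speed values, and the choice of two clusters are assumptions made only for the example.

```r
# lm: fit a linear model of stopping distance on speed
fit <- lm(dist ~ speed, data = cars)
coef(fit)                                   # intercept and slope

# predict: the new observations must be a data frame whose
# column names match the predictors used in the model
new_obs <- data.frame(speed = c(10, 20))
pred <- predict(fit, newdata = new_obs)

# kmeans: the seed is fixed because the starting centers are chosen at random
set.seed(1)
groups <- kmeans(cars, centers = 2)
table(groups$cluster)                       # cluster sizes
```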

Examples


load the packages
library(data.table)
library(broom)
library(rpart)


when following a supervised algorithm, it's important to first randomly split the available data into a training and a test subset, while ensuring reproducibility: 
set.seed(n)                                  # n: any fixed integer
dts <- fread('/path/to/filename.csv')
pct_train <- p                               # p: share of rows used for training, a value between 0 and 1
units_train <- sample.int(nrow(dts), round(pct_train * nrow(dts)))  # row numbers of the training units
dts_train <- dts[units_train]
dts_test  <- dts[!units_train]               # data.table syntax: all rows except units_train
nrow(dts_train) + nrow(dts_test) == nrow(dts)  # every unit ends up in exactly one subset
An alternative way is to first shuffle the rows of the dataset, and then extract the two subsets sequentially.
shuffled <- dts[sample(.N)]                  # sample(dts) would shuffle the columns, not the rows
n_train <- round(nrow(shuffled) * pct_train)
dts_train <- shuffled[1:n_train]
dts_test  <- shuffled[(n_train + 1):nrow(shuffled)]
all.equal(shuffled, rbind(dts_train, dts_test))


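classification: kNN
# the original leaves the call blank; class::knn is one possible choice (numeric predictors only)
# knn() has no separate fit step: it returns the predicted classes directly
x_cols <- setdiff(names(dts_train), 'y')
pred <- class::knn(train = dts_train[, ..x_cols], test = dts_test[, ..x_cols], cl = dts_train$y, k = k)
table(dts_test$y, pred) # Confusion Matrix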


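classification: Naive Bayes
# the original leaves the call blank; e1071::naiveBayes is one possible choice
fit <- e1071::naiveBayes(y ~ ., data = dts_train)
pred <- predict(fit, dts_test)
conf_mat <- table(dts_test$y, pred) # Confusion Matrix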


classification: recursive partitioning (decision tree)
fit <- rpart(y ~ x1 + ... + xn, data = dts_train, method = "class")
pred <- predict(fit, dts_test, type = 'class')
conf_mat <- table(dts_test$y, pred)  # rows: actual, columns: predicted
# the indices below assume a binary outcome whose first level is the positive class
TP <- conf_mat[1, 1] 
FN <- conf_mat[1, 2] 
FP <- conf_mat[2, 1] 
TN <- conf_mat[2, 2] 
accuracy <- (TP + TN) / (TP + FP + TN + FN)
precision <- TP / (TP + FP)
recall <- TP / (TP + FN)


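classification: logistic regression
fit <- glm(y ~ x1 + ... + xn, data = dts_train, family = 'binomial')  # glm, not lm, accepts family
prob <- predict(fit, dts_test, type = 'response')  # predicted probabilities
pred <- as.integer(prob > 0.5)                     # 0.5 is a common default cut-off, assuming y is coded 0/1
table(dts_test$y, pred) # Confusion Matrix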


linear regression
fit <- lm(y ~ x1 + ... + xn + x1*x2 + ..., data = dts_train)  # x1*x2 adds the interaction between x1 and x2
broom::tidy(fit)                                               # coefficients as a tidy table
pred <- predict(fit, dts_test)
rmse <- sqrt( 1 / nrow(dts_test) * sum( (dts_test$y - pred)^2 ) )


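clustering 
set.seed(1)
groups <- kmeans(dts, n)                  # dts: numeric features only; n: number of clusters
table(z, groups$cluster)                  # z is not defined in these notes; assumed to hold known group labels for comparison
plot(dts, col = groups$cluster)
points(groups$centers, pch = 22, bg = c(1, 2), cex = 2)  # overlay the cluster centroids
groups$tot.withinss / groups$betweenss    # WSS / BSS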


Performance Measures
Bias vs Variability Trade Off
Overfitting vs Underfitting
Confusion Matrix
In supervised learning it is possible to compare the model's predictions with the truth. For a binary outcome, the four possible combinations form what is called the Confusion Matrix:


| n = | Predicted: YES | Predicted: NO |
|---|---|---|
| Actual: YES | TP = true positives | FN = false negatives |
| Actual: NO | FP = false positives | TN = true negatives |


Out of the n cases:

the classifier predicted in total TPP = TP + FP positives and TPN = TN + FN negatives
in reality, there were TAP = TP + FN positives and TAN = TN + FP negatives.

From the above, the following metrics can be quickly calculated:

Accuracy (TP + TN) / n Overall, how often is the classifier correct?
Error Rate (FP + FN) / n = 1 - Accuracy Overall, how often is the classifier wrong? Also called Misclassification Rate
Precision TP / (TP + FP) = TP / TPP When it predicts a positive, how often is it correct?
Recall or Sensitivity TP / (TP + FN) = TP / TAP When the case is actually positive, how often does it predict a positive?
False Positive Rate FP / (FP + TN) = FP / TAN When the case is actually negative, how often does it predict a positive?
Specificity TN / (FP + TN) = TN / TAN When the case is actually negative, how often does it predict a negative?
Prevalence (TP + FN) / n = TAP / n How often do positives actually occur in the sample?
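The formulas in this list can be checked with a few lines of R. The four counts below are hypothetical numbers chosen only to make the arithmetic concrete; the names in the metrics vector are likewise illustrative.

```r
# Hypothetical counts, laid out as in the confusion matrix above
TP <- 100; FN <- 5
FP <- 10;  TN <- 50
n  <- TP + FN + FP + TN

metrics <- c(
  accuracy    = (TP + TN) / n,
  error_rate  = (FP + FN) / n,
  precision   = TP / (TP + FP),
  recall      = TP / (TP + FN),
  fpr         = FP / (FP + TN),
  specificity = TN / (FP + TN),
  prevalence  = (TP + FN) / n
)
round(metrics, 3)
```

Two identities make useful sanity checks: accuracy and error rate sum to 1, and so do the false positive rate and specificity.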

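Particular attention should be paid to the terms accuracy and precision. Many people use them interchangeably, and the classification metrics above give each of them one specific definition. In statistics and measurement, however, the two words have distinct, more general meanings: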

accuracy refers to how closely a measurement or observation comes to measuring a true value, since measurements and observations are always subject to error. It's a similar notion to the unbiasedness of a statistical estimator.
precision refers to how closely repeated measurements or observations come to duplicating measured or observed values. It's a similar notion to the variability of a statistical estimator.
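The distinction can be illustrated with a small simulation of two hypothetical measuring instruments. The true value, the offset, the standard deviations, and the seed are all made-up numbers for this sketch.

```r
set.seed(7)
true_value <- 10

# instrument A: small spread but systematically off by +2 (precise, not accurate)
a <- rnorm(1000, mean = true_value + 2, sd = 0.1)
# instrument B: centred on the truth but noisy (accurate, not precise)
b <- rnorm(1000, mean = true_value, sd = 1)

res <- data.frame(
  instrument = c("A", "B"),
  bias       = c(mean(a), mean(b)) - true_value,  # closeness to the true value
  spread     = c(sd(a), sd(b))                    # closeness of repeated measurements to each other
)
res
```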

Cross Validation
A better way to assess the predictive power of a supervised algorithm is to use all units in both the training and the test phase, in turn. This process is called k-fold Cross-Validation. The original dataset is split into k folds; each fold is used once as the test set while the model is trained on the remaining folds, and the accuracy is calculated for each fold. The mean of these accuracies forms a more robust estimate of the model's true accuracy in predicting unseen data, because it is less dependent on the choice of training and test sets.
# Set the number of folds
n_fold <- n
# Set the number of units in each fold (floor keeps every index within the dataset; leftover rows are only used for training)
k_units <- floor( nrow(dts) / n_fold )
# Set random seed for reproducibility
set.seed(n)
# Shuffle the rows of the original dataset for fair representation (sample(dts) would shuffle the columns)
shuffled <- dts[sample(nrow(dts)),]
# Initialize the vector storing the folds accuracies
accs <- rep(0, n_fold)
# run the model over a loop
for (i in 1:n_fold){
  # Calculate the indices that address the current test set
  k_idx <- ((i-1) * k_units + 1):(i * k_units)
  # Form the train set
  dts_train <- shuffled[-k_idx,]
  # Form the test set
  dts_test <- shuffled[k_idx,]
  # Build the model 
  tree <- rpart(y ~ ., dts_train, method = 'class')
  # Make a prediction based on the test set
  pred <- predict(tree, dts_test, type = 'class')
  # Calculate the confusion matrix
  conf <- table(dts_test$y, pred)
  # Assign the accuracy to the ith index in accs
  accs[i] <- sum( diag(conf) ) / sum(conf)
}
# Print out the mean of accs as final accuracy of the model
mean(accs)
Receiver Operating Characteristic (ROC) Curve
RMSE
When using a regression algorithm, a standard measure for assessing the quality of the resulting model is the Root Mean Squared Error (RMSE): the square root of the mean squared difference between the observed values and the model's predictions. It measures the error incurred in substituting the true value with its prediction: the lower the RMSE, the better the model. 
However, as a standalone number the RMSE is hard to judge, because it is expressed in the units of the outcome variable. To give it meaning, it should be compared to the RMSE of a different model for the same problem. Note that adding complexity to a model, in this case adding predictors, generally decreases the RMSE computed on the training data, so the comparison is more informative on data that was not used for fitting.
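As a rough illustration, the sketch below compares the RMSE of two nested linear models on both the rows used for fitting and the held-out rows. The mtcars data, the 22/10 split, and the chosen predictors are assumptions made for the example; rmse is a small helper defined here, not a built-in function.

```r
# helper: root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(1)
idx_train <- sample.int(nrow(mtcars), 22)
train <- mtcars[idx_train, ]
test  <- mtcars[-idx_train, ]

small <- lm(mpg ~ wt, data = train)
large <- lm(mpg ~ wt + hp + qsec + drat, data = train)   # same model plus extra predictors

# on the training rows, the larger nested model cannot have a higher RMSE
rmse_train <- c(small = rmse(train$mpg, fitted(small)),
                large = rmse(train$mpg, fitted(large)))

# on the held-out rows there is no such guarantee
rmse_test <- c(small = rmse(test$mpg, predict(small, test)),
               large = rmse(test$mpg, predict(large, test)))

rbind(train = rmse_train, test = rmse_test)
```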
Dunn's Index
Clustering algorithms are based on:

maximising the similarity within groups. We can measure this by using:

Within Sum of Squares (WSS)
Average Diameter of clusters

The smaller the better

minimizing the similarity between groups. We can measure this by using:

Between Sum of Squares (BSS)
Average Intercluster Distances

The higher the better

Notice that the Total Sum of Squares (TSS) is simply the sum of Within and Between sums: TSS = BSS + WSS.


A popular measure of performance for clustering algorithms is Dunn's Index, defined as the ratio between the Minimal Intercluster Distance and the Maximal Diameter: the higher it is, the more compact and well separated the clusters are. Using a similar approach, another measure of performance is defined as the ratio between the Within cluster Sum of Squares and the Between cluster Sum of Squares, WSS/BSS: here the smaller the ratio, the better.
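Both ratios can be computed with base R alone. The sketch below is one possible reading of the definitions, under stated assumptions: the standardized numeric columns of the built-in iris data, three clusters, intercluster distance taken as the smallest distance between two points in different clusters, and diameter taken as the largest distance between two points in the same cluster. Other definitions of these two distances exist.

```r
set.seed(1)
x <- scale(iris[, 1:4])                        # numeric features only, standardized
groups <- kmeans(x, centers = 3, nstart = 10)

# ratio of within to between sum of squares
wss_bss <- groups$tot.withinss / groups$betweenss

# check the identity TSS = WSS + BSS
tss_ok <- isTRUE(all.equal(groups$totss, groups$tot.withinss + groups$betweenss))

# Dunn's index by hand, from the matrix of pairwise distances
d    <- as.matrix(dist(x))
same <- outer(groups$cluster, groups$cluster, "==")  # TRUE where two points share a cluster
min_intercluster <- min(d[!same])   # closest pair of points in different clusters
max_diameter     <- max(d[same])    # farthest pair of points within one cluster
dunn <- min_intercluster / max_diameter

c(wss_bss = wss_bss, dunn = dunn)
```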

Algorithms
k Nearest Neighbors (kNN)
Naive Bayes
Decision Tree
Starting from Automatic Interaction Detection (AID), a wide range of methods has been suggested, usually termed Recursive Partitioning, Decision Trees, or tree(-structured) models. The following algorithms have been, and still are, particularly influential:

CART, Classification And Regression Trees
CHAID, Chi-square Automatic Interaction Detector
CTree, Conditional Inference
MoB, Model-Based 
EvTree, Evolutionary Learning of Globally Optimal Classification and Regression Trees
Ensemble Methods

Bagging
Random Forest
Gradient boosting

Xgboost
GBM

Decision trees are of two main types:


Classification tree, when the response variable is categorical, or a discretized numeric.
Regression tree, when the response variable is numeric.
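With rpart, the two types differ mainly in the method argument. A minimal sketch, assuming the built-in iris and mtcars datasets as stand-ins for a categorical and a numeric response:

```r
library(rpart)

# classification tree: the response Species is a factor
tree_class <- rpart(Species ~ ., data = iris, method = "class")
pred_class <- predict(tree_class, iris, type = "class")   # predicted labels
head(pred_class)

# regression tree: the response mpg is numeric
tree_reg <- rpart(mpg ~ wt + hp, data = mtcars, method = "anova")
pred_reg <- predict(tree_reg, mtcars)                     # predicted numeric values
head(pred_reg)
```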

Within R the list of prominent packages includes rpart (CART), mvpart (multivariate CART), party (CTree, MOB), and partykit (CTree, MOB).
Basic Terminology used with Decision trees:

Root Node: It represents the entire population or sample, which is then divided into two or more homogeneous sets.
Splitting: The process of dividing a node into two or more sub-nodes.
Decision Node: A sub-node that splits into further sub-nodes.
Leaf or Terminal Node: A node that does not split any further.
Branch / Sub-Tree: A subsection of the entire tree.
Parent vs Child Nodes: A node that is divided into sub-nodes is called the parent of those sub-nodes, whereas the sub-nodes are its children.
Pruning: The process of removing the sub-nodes of a decision node. It is the opposite of splitting.
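These terms map onto the object returned by rpart. The sketch below is illustrative only: it assumes the built-in iris data and uses a deliberately large cp value to show pruning, not a recommended setting.

```r
library(rpart)

fit <- rpart(Species ~ ., data = iris, method = "class")

# fit$frame has one row per node; row names are node numbers and the root is node 1
nodes  <- fit$frame
root_n <- nodes["1", "n"]                 # the root node holds the whole sample

is_leaf    <- nodes$var == "<leaf>"       # terminal nodes are marked "<leaf>"
n_leaves   <- sum(is_leaf)
n_decision <- sum(!is_leaf)               # nodes that split further (root included)

# node numbering: the children of node k are 2k and 2k + 1
rownames(nodes)

# pruning: a large complexity parameter removes splits from the fitted tree
pruned <- prune(fit, cp = 1)
nrow(pruned$frame)                        # fewer nodes than nrow(fit$frame)
```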

Regression Tree
Clustering
An important part of machine learning is understanding and interpreting the results. In the case of clustering, visualization is key, and the simplest way to achieve it is by plotting the features in the dataset and coloring the points based on their corresponding cluster. The cluster centroids are typically added to the plot, as each is a good representation of all the observations in its cluster. They are often used to summarize the clusters.
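A minimal sketch of such a plot, assuming two numeric columns of the built-in iris data and three clusters (both are choices made for the example). It also shows why centroids work as summaries: each one is the mean of the observations assigned to its cluster.

```r
set.seed(1)
x <- iris[, c("Petal.Length", "Petal.Width")]
groups <- kmeans(x, centers = 3, nstart = 10)

# color each observation by its cluster, then overlay the centroids
plot(x, col = groups$cluster, pch = 19)
points(groups$centers, pch = 22, bg = 1:3, cex = 2)

# each centroid equals the per-cluster mean of the features
agg <- aggregate(x, by = list(cluster = groups$cluster), FUN = mean)
agg
```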
Interpretability and Explainability
In the context of Modeling and Machine Learning systems, interpretability is the ability to explain or to present a machine learning model in terms understandable to a human. As long as a model has no significant impact in the real world, its interpretability doesn’t matter as much. But when a model’s predictions have implications, be they financial or social, interpretability becomes paramount. Machine learning models are increasingly being used to make decisions that affect people’s lives. With this power comes a responsibility to ensure, for example, that the model predictions are fair and not discriminatory. It’s not enough to know if a model works; we need to know how it works. A model can often work correctly using multiple, even infinitely many, different combinations of inputs. We want to choose the one(s) with the least negative impact. To be able to do that, we first need to be sure that the model can be interpreted, and then to build techniques that (try to) explain how different features affect the prediction of a model.
Permutation Importance
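The notes give no detail for this technique, so the following is only a rough sketch of the general idea: shuffle one feature at a time, re-predict with the already fitted model, and record how much the error grows. The mtcars data, the linear model, the RMSE as error measure, and the 50 repetitions are all assumptions made for illustration. For brevity the error is measured on the training rows, whereas held-out data is normally preferable.

```r
# helper: root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

set.seed(1)
fit      <- lm(mpg ~ wt + hp + qsec, data = mtcars)
baseline <- rmse(mtcars$mpg, predict(fit, mtcars))

# for each feature: shuffle its column, re-predict, and average the increase in error
features <- c("wt", "hp", "qsec")
importance <- sapply(features, function(f) {
  increases <- replicate(50, {
    permuted <- mtcars
    permuted[[f]] <- sample(permuted[[f]])   # breaks the link between f and the outcome
    rmse(mtcars$mpg, predict(fit, permuted)) - baseline
  })
  mean(increases)
})
sort(importance, decreasing = TRUE)
```

A feature whose shuffling barely changes the error contributes little to the model's predictions, while a large increase marks a feature the model relies on.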
Partial Dependence Plots
SHAP Values
Advanced Uses of SHAP Values
LIME: Locally Interpretable Model-Agnostic Explanations