Introduction
Machine learning is more than simply computing averages or performing some data manipulation. It actually involves making predictions about new observations based on previous information.
Types
- Supervised (Predictive Analytics): a relationship exists between some inputs (predictors) and one or more outputs (responses or outcomes). An algorithm exploits it to build a function that approximates this relationship, and that function is then used to map new inputs, whose outputs are unknown, to predicted outputs.
  - Classification for Qualitative Output(s)
    - k-Nearest Neighbors (kNN)
    - Naive Bayes
    - Logistic Regression
    - Recursive Partitioning (or Decision Tree)
    - Random Forest
    - Support Vector Machine
  - Regression for Quantitative Output(s)
    - Linear Regression
    - Lasso Regression
    - Ridge Regression
    - Poisson Regression (count, discrete data)
- Unsupervised (Pattern Discovery): there is no concept of predictor and response; the goal is to discover structure in the inputs alone.
  - Dimensionality Reduction for Feature Selection
    - Principal Components (PCA)
    - Factor Analysis (FA)
    - Manifold Learning (IsoMap)
  - Clustering for grouping objects together such that the objects are similar within each group, and dissimilar between different groups. Notice that there is no prior knowledge of what the resulting groups could or should look like.
    - k-means, k-medians, k-modes, k-prototypes
    - Hierarchical
    - Mean-shift
    - DBSCAN
  - Anomaly Detection for Outlier Analysis
    - Isolation Forests
  - Data Imputation for Missing Values
  - Natural Language Processing ==> Topic Modeling
Use cases
- Spam Filtering. Predictors: word frequency, character frequency, the number of consecutive capital letters
- Fraud Detection
- Credit Scoring
- Customer Segmentation
- Shopping Basket Analysis
- Content Tagging
- Image/Text Recognition
- Recommender System
- Medical Diagnosis
Best-known R packages
- caret: Classification And REgression Training
- mlr
- glmnet
- rpart / party: Recursive Partitioning and Regression Trees
- ROCR: Visualizing the performance of scoring classifiers
- e1071
- randomForest
- nnet
- igraph: Collection of Network Analysis tools
- kernlab
- neuralnet
- h2o
- tensorflow: Deep Learning
- keras: Deep Learning
Steps for Supervised Models
- EDA: preprocess and explore the data
- dataset splitting
  The model should be built only on a subset of the available observations. The remaining units should be used to assess the predictive power of the model. In order to have a fair distribution of the output variable in each set, it's important that the units are sampled randomly from the dataset or, equivalently, that the dataset is shuffled beforehand and the units are then extracted sequentially.
- data splitting into validation / train / test partitions
- model selection and validation, using default Hyper-Parameters
- fit the model on the training set
- fine-tune ==> act on the Hyper-Parameters of the model structure to find the best configuration
- generate prediction values by applying the model to the test set
- evaluate the model
  - Classification: Confusion Matrix, accuracy, precision (1 - false discovery rate), recall (1 - Type II error rate)
  - Regression: mean absolute error, median absolute error, R^2 score
- interpret the results
In problems that have a random aspect, the set.seed(n) function should be used to enforce reproducibility. After fixing the seed, the random numbers that are generated are always the same, and so should be all the subsequent results.
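A minimal sketch of this behaviour, using base R only. The seed value 42 and the variable names are arbitrary choices made for illustration.

```r
set.seed(42)                # 42 is an arbitrary seed
first <- sample(1:100, 5)   # five random draws

set.seed(42)                # reset the generator to the same state
second <- sample(1:100, 5)  # the same five draws come out again

identical(first, second)    # TRUE: same seed, same random numbers

# Without resetting the seed, the generator simply continues its stream
third <- sample(1:100, 5)
```

The seed has to be set before every run of the random step, not only once per session, if each run is expected to reproduce the same result.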
R built-in functions
- lm: fits linear models
- predict: takes a model (1st arg) and some unseen observations (2nd arg), then returns the predicted outcomes from the model for those observations
- kmeans: performs k-means clustering
- set.seed: enforces reproducibility in problems that have a random aspect. If you fix the seed, the random numbers that are generated are always the same.
- rpart
Examples
- load the packages
  library(data.table)
  library(broom)
  library(rpart)
- when following a supervised algorithm, it's important to first randomly split the available data into a training and a test subset, while ensuring reproducibility:
  set.seed(n)
  dts <- fread('/path/to/filename.csv')
  pct_train <- n  # proportion of units assigned to the training set
  units_train <- sample.int(nrow(dts), round(pct_train * nrow(dts)))  # row indices of the training units
  dts_train <- dts[units_train]
  dts_test <- dts[-units_train]  # negative indices drop the training rows
  nrow(dts) == nrow(dts_train) + nrow(dts_test)  # every unit ends up in exactly one set
  An alternative way is to first shuffle the dataset, and then extract the two subsets sequentially.
  shuffled <- dts[sample(nrow(dts))]  # shuffle the rows; sample(dts) would shuffle the columns
  n_train <- round(nrow(shuffled) * pct_train)
  dts_train <- shuffled[1:n_train]
  dts_test <- shuffled[(n_train + 1):nrow(shuffled)]
  all.equal(shuffled, rbind(dts_train, dts_test))
- classification: kNN
  fit <- ...  # fit the kNN model on dts_train
  pred <- predict(fit, dts_test)
  table(dts_test$y, pred)  # Confusion Matrix
- classification: Naive Bayes
  fit <- ...  # fit the Naive Bayes model on dts_train
  pred <- predict(fit, dts_test)
  conf_mat <- table(dts_test$y, pred)  # Confusion Matrix
- classification: recursive partitioning (decision tree)
  fit <- rpart(y ~ x1 + ... + xn, data = dts_train, method = "class")
  pred <- predict(fit, dts_test, type = 'class')
  conf_mat <- table(dts_test$y, pred)  # rows: actual, columns: predicted
  # the positive class is assumed to be the first level of y
  TP <- conf_mat[1, 1]
  FN <- conf_mat[1, 2]
  FP <- conf_mat[2, 1]
  TN <- conf_mat[2, 2]
  accuracy <- (TP + TN) / (TP + FP + TN + FN)
  precision <- TP / (TP + FP)
  recall <- TP / (TP + FN)
- classification: logistic regression
  fit <- glm(y ~ x1 + ... + xn, data = dts_train, family = 'binomial')
  prob <- predict(fit, dts_test, type = 'response')  # predicted probabilities
  pred <- as.integer(prob > 0.5)  # 0.5 is an assumed cut-off
  table(dts_test$y, pred)  # Confusion Matrix
- linear regression
  fit <- lm(y ~ x1 + ... + xn + x1*x2 + ..., data = dts_train)
  broom::tidy(fit)
  pred <- predict(fit, dts_test)
  rmse <- sqrt(mean((dts_test$y - pred)^2))
- clustering
  set.seed(1)
  groups <- kmeans(dts, n)  # n: number of clusters
  table(z, groups$cluster)  # z: known group labels, when available
  plot(dts, col = groups$cluster)
  points(groups$centers, pch = 22, bg = c(1, 2), cex = 2)
  groups$tot.withinss / groups$betweenss  # WSS / BSS ratio
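The snippets above use placeholders for the file path, the formula and the split proportion. As a runnable illustration of the same workflow, the sketch below assumes the built-in mtcars dataset, a 70% training share, the predictor wt and a 0.5 probability cut-off; none of these choices come from the original notes.

```r
# End-to-end sketch: split, fit, predict, evaluate (base R only)
set.seed(1)       # arbitrary seed, for reproducibility
dts <- mtcars     # built-in data, used here instead of a CSV file
pct_train <- 0.7  # assumed share of units used for training

# Random split into training and test sets
units_train <- sample.int(nrow(dts), round(pct_train * nrow(dts)))
dts_train <- dts[units_train, ]
dts_test <- dts[-units_train, ]

# Logistic regression: transmission type (am, coded 0/1) explained by weight
fit <- glm(am ~ wt, data = dts_train, family = "binomial")

# Predicted probabilities, turned into classes with an assumed 0.5 cut-off
prob <- predict(fit, dts_test, type = "response")
pred <- factor(as.integer(prob > 0.5), levels = c(0, 1))

# Confusion matrix: rows are actual classes, columns are predicted classes
actual <- factor(dts_test$am, levels = c(0, 1))
conf_mat <- table(actual = actual, predicted = pred)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)

print(conf_mat)
print(accuracy)
```

Fixing the factor levels on both sides keeps the confusion matrix 2 x 2 even when one class never shows up among the predictions.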
Performance Measures
Bias vs Variability Trade Off
Overfitting vs Underfitting
Confusion Matrix
In the case of supervised learning, it's possible to compare the outcomes of the model with the truth. The four possible situations are summarized in what is called a Confusion Matrix:
| n = | Predicted: YES | Predicted: NO |
|---|---|---|
| Actual: YES | TP = true positives | FN = false negatives |
| Actual: NO | FP = false positives | TN = true negatives |
Out of the n cases:
- the classifier predicted in total TPP = TP + FP positives and TPN = TN + FN negatives
- in reality, there have been TAP = TP + FN positives and TAN = TN + FP negatives.
From the above, the following metrics can be quickly calculated:
- Accuracy: (TP + TN) / n. Overall, how often is the classifier correct?
- Error Rate: (FP + FN) / n = 1 - Accuracy. Overall, how often is the classifier wrong? Also called Misclassification Rate.
- Precision: TP / (TP + FP) = TP / TPP. When it predicts a positive, how often is it correct?
- Recall or Sensitivity: TP / (TP + FN) = TP / TAP. When the case is actually positive, how often does it predict a positive?
- False Positive Rate: FP / (FP + TN) = FP / TAN. When the case is actually negative, how often does it wrongly predict a positive?
- Specificity: TN / (FP + TN) = TN / TAN. When the case is actually negative, how often does it predict a negative?
- Prevalence: (TP + FN) / n = TAP / n. How often do positives actually occur in the sample?
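The metrics listed above can be collected in a small helper. This is only a sketch: the function name confusion_metrics and the counts passed to it are hypothetical, chosen to make the arithmetic easy to follow.

```r
# Hypothetical helper: computes the metrics above from the four cell counts
confusion_metrics <- function(TP, FN, FP, TN) {
  n <- TP + FN + FP + TN
  list(
    accuracy            = (TP + TN) / n,
    error_rate          = (FP + FN) / n,
    precision           = TP / (TP + FP),
    recall              = TP / (TP + FN),
    false_positive_rate = FP / (FP + TN),
    specificity         = TN / (FP + TN),
    prevalence          = (TP + FN) / n
  )
}

# Made-up counts, for illustration only
m <- confusion_metrics(TP = 40, FN = 10, FP = 5, TN = 45)
print(unlist(m))
```

With these counts, accuracy is 85 / 100 = 0.85 and recall is 40 / 50 = 0.8, while the false positive rate and specificity add up to 1, as they always do.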
Particular attention should be paid to the concepts of accuracy and precision. Many people use these terms interchangeably. However, these words have different meanings in statistics:
- accuracy refers to how closely a measurement or observation comes to measuring a true value, since measurements and observations are always subject to error. It's a similar notion to the unbiasedness of a statistical estimator.
- precision refers to how closely repeated measurements or observations come to duplicating measured or observed values. It's a similar notion to the variability of a statistical estimator.
Cross Validation
A better way to assess the predictive power of a supervised algorithm is to use all units in both the training and the test phase, in turn. This process is called k-fold Cross-Validation. The original dataset is split into k folds; each fold is used once as the test set while the remaining k - 1 folds form the training set, and the accuracy is calculated for each fold. The mean of these accuracies forms a more robust estimate of the model's true accuracy in predicting unseen data, because it is less dependent on the choice of training and test sets.
# Set the number of folds
n_fold <- n
# Set the number of units in each fold (rounded down so that no index exceeds nrow(dts))
k_units <- floor(nrow(dts) / n_fold)
# Set random seed for reproducibility
set.seed(n)
# Shuffle the rows of the original dataset for fair representation
# (sample(dts) would shuffle the columns instead)
shuffled <- dts[sample(nrow(dts)), ]
# Initialize the vector storing the folds accuracies
accs <- rep(0, n_fold)
# Run the model over a loop
for (i in 1:n_fold) {
  # Calculate the indices that address the current test set
  k_idx <- ((i - 1) * k_units + 1):(i * k_units)
  # Form the train set
  dts_train <- shuffled[-k_idx, ]
  # Form the test set
  dts_test <- shuffled[k_idx, ]
  # Build the model
  tree <- rpart(y ~ ., dts_train, method = 'class')
  # Make a prediction based on the test set
  pred <- predict(tree, dts_test, type = 'class')
  # Calculate the confusion matrix
  conf <- table(dts_test$y, pred)
  # Assign the accuracy to the ith index in accs
  accs[i] <- sum(diag(conf)) / sum(conf)
}
# Print out the mean of accs as final accuracy of the model
mean(accs)
Receiver Operating Characteristic (ROC) Curve
RMSE
When using a regression algorithm, a standard measure for assessing the quality of the resulting model is the Root Mean Squared Error (RMSE): the square root of the mean squared difference between the observed values and the values predicted by the model. It therefore measures the error incurred in substituting the true value with its prediction: the lower the value of the RMSE, the better the model.
However, as a standalone number, the RMSE doesn't tell anything meaningful, as it is expressed in the units of the outcome variable. In order to derive its meaning, it should be compared to the RMSE of a different model for the same problem. It should be noted that, in general, adding complexity to a model, in this case adding predictors, decreases the value of the RMSE computed on the data used for fitting.
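To make the comparison concrete, the sketch below computes the RMSE of two nested linear models. The helper name rmse, the built-in mtcars dataset and the chosen predictors are assumptions made for illustration.

```r
# Hypothetical helper: root mean squared error between observed and predicted values
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Two nested linear models on the built-in mtcars data (an assumed example)
fit_small <- lm(mpg ~ wt, data = mtcars)
fit_large <- lm(mpg ~ wt + hp, data = mtcars)

# RMSE computed on the same data used for fitting
rmse_small <- rmse(mtcars$mpg, predict(fit_small, mtcars))
rmse_large <- rmse(mtcars$mpg, predict(fit_large, mtcars))

print(c(small = rmse_small, large = rmse_large))
```

Because the second model contains the first one, its RMSE on the fitting data cannot be higher. That is why a fair comparison between models of different complexity is better made on a test set the models have not seen.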
Dunn's Index
Clustering algorithms are based on:
- maximising the similarity within groups. We can measure this by using:
  - Within Sum of Squares (WSS)
  - Average Diameter of clusters
  The smaller the better
- minimizing the similarity between groups. We can measure this by using:
  - Between Sum of Squares (BSS)
  - Average Intercluster Distances
  The higher the better
Notice that the Total Sum of Squares (TSS) is simply the sum of the Within and Between sums: TSS = BSS + WSS.
A popular measure of performance for different clustering algorithms is Dunn's Index, which is defined as the ratio between the Minimal Intercluster Distance and the Maximal Diameter. Using a similar approach, another measure of performance is defined as the ratio between the Within cluster Sum of Squares and the Between cluster Sum of Squares, so WSS/BSS. The two measures read in opposite directions: the higher Dunn's Index and the lower WSS/BSS, the more compact and well separated the clusters are.
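Both measures can be computed from the output of kmeans and a matrix of pairwise distances. The sketch below assumes the built-in iris measurements, 3 clusters, and one common reading of the definitions: the intercluster distance is taken between the closest points of two different clusters, and the diameter between the farthest points of the same cluster. Other definitions exist.

```r
set.seed(1)                       # arbitrary seed
x <- as.matrix(iris[, 1:4])       # built-in data, an assumed example
groups <- kmeans(x, centers = 3)  # 3 clusters is an assumption

# Sum-of-squares decomposition returned by kmeans
wss <- groups$tot.withinss
bss <- groups$betweenss
tss <- groups$totss
ratio <- wss / bss                # lower is better

# Dunn's Index from the matrix of pairwise distances
d <- as.matrix(dist(x))
same <- outer(groups$cluster, groups$cluster, "==")  # TRUE when two points share a cluster
min_intercluster <- min(d[!same])  # closest pair of points in different clusters
max_diameter <- max(d[same])       # farthest pair of points in the same cluster
dunn <- min_intercluster / max_diameter  # higher is better

print(c(ratio = ratio, dunn = dunn))
```

The identity TSS = BSS + WSS can be checked directly with all.equal(tss, wss + bss).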
Algorithms
k Nearest Neighbors (kNN)
Naive Bayes
Decision Tree
Since the introduction of Automatic Interaction Detection (AID), a wide range of methods has been suggested, usually termed Recursive Partitioning, Decision Trees or tree(-structured) models. The following algorithms have been, and still are, particularly influential:
- CART, Classification And Regression Trees
- CHAID, Chi-square Automatic Interaction Detector
- CTree, Conditional Inference Trees
- MoB, Model-Based Recursive Partitioning
- EvTree, Evolutionary Learning of Globally Optimal Classification and Regression Trees
- Ensemble Methods
  - Bagging
  - Random Forest
  - Gradient boosting
    - Xgboost
    - GBM
Decision trees are of two main types:
- Classification tree, when the response variable is categorical, or a discretized numeric.
- Regression tree, when the response variable is numeric.
Within R, the list of prominent packages includes rpart (CART), mvpart (multivariate CART), party (CTree, MOB) and partykit (CTree, MOB).
Basic Terminology used with Decision trees:
- Root Node: it represents the entire population or sample, which is then divided into two or more homogeneous sets.
- Splitting: the process of dividing a node into two or more sub-nodes.
- Decision Node: a sub-node that splits into further sub-nodes.
- Leaf or Terminal Node: a node that does not split.
- Branch / Sub-Tree: a subsection of the entire tree.
- Parent vs Child Nodes: a node that is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
- Pruning: the process of removing the sub-nodes of a decision node. It is the opposite of splitting.
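These terms map onto the object returned by rpart. The sketch below assumes the built-in iris dataset and an arbitrary complexity parameter for pruning; the variable names are illustrative.

```r
library(rpart)

# Classification tree on the built-in iris data (an assumed example)
tree <- rpart(Species ~ ., data = iris, method = "class")

# Each row of tree$frame is a node; leaves are marked "<leaf>" in the var column
is_leaf <- tree$frame$var == "<leaf>"
n_leaves <- sum(is_leaf)      # leaf / terminal nodes
n_decision <- sum(!is_leaf)   # nodes that split, root node included
root_size <- tree$frame$n[1]  # the root node holds the whole sample

# Pruning: remove the splits whose complexity does not exceed cp
pruned <- prune(tree, cp = 0.45)  # 0.45 is an arbitrary value
n_leaves_pruned <- sum(pruned$frame$var == "<leaf>")

print(c(leaves = n_leaves, decision = n_decision, leaves_pruned = n_leaves_pruned))
```

rpart only produces binary splits, so the number of leaves is always one more than the number of nodes that split. Calling printcp(tree) lists the complexity values that can guide the choice of cp.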
Regression Tree
Clustering
An important part of machine learning is understanding and interpreting the results. In the case of clustering, visualization is key, and the simplest way to achieve it is by plotting the features in the dataset and coloring the points based on their corresponding cluster. The cluster centroids are typically added as a good representation of all the observations in a cluster. They are often used to summarize the clusters.
Interpretability and Explainability
In the context of Modeling and Machine Learning systems, interpretability is the ability to explain or to present a machine learning model in terms understandable to a human. As long as a model has no significant impact in the real world, its interpretability doesn’t matter as much. But when a model’s predictions have implications, be they financial or social, interpretability becomes paramount. Machine learning models are increasingly being used to make decisions that affect people’s lives. With this power comes a responsibility to ensure, for example, that the model predictions are fair and not discriminatory. It’s not enough to know if a model works, we need to know how it works. A model can often work correctly using many, even infinitely many, different combinations of inputs. We want to choose the one(s) with the least negative impact. To be able to do that, we first need to be sure that the model can be interpreted, and to build techniques that (try to) explain how different features affect the model’s predictions.