Introduction to Machine Learning

Introduction

Machine learning is more than simply computing averages or performing data manipulation: it involves making predictions about new observations based on previous information.

Types

  • Supervised (Predictive Analytics) There exists a predefined relationship to exploit between some inputs (predictors) and one or more outputs (responses or outcomes), so that an algorithm can build a function that approximates this relationship in some sense, and that will be used to map new inputs whose outputs are unknown.
    • Classification for Qualitative Output(s)
      • k-Nearest Neighbors (kNN)
      • Naive Bayes
      • Logistic Regression
      • Recursive Partitioning (or Decision Tree)
      • Random Forest
      • Support Vector Machine
    • Regression for Quantitative Output(s)
      • Linear Regression
      • Lasso Regression
      • Ridge Regression
      • Poisson Regression (counting, discrete data)
  • Unsupervised (Pattern Discovery) There are no concepts of predictor and response; the goal is instead to discover patterns in the data itself.
    • Dimensionality Reduction for Feature Selection
      • Principal Components (PCA)
      • Factor Analysis (FA)
      • Manifold Learning (IsoMap)
    • Clustering for grouping objects together such that the objects are similar within each group, and dissimilar between different groups. Notice that there is no prior knowledge of what the resulting groups could or should look like.
      • k-means, k-medians, k-modes, k-prototypes
      • Hierarchical
      • Mean-shift
      • DBSCAN
    • Anomaly Detection for Outlier Analysis
      • Isolation Forests
    • Data Imputation for Missing Values
    • Natural Language Processing ==> Topic Modeling

Use cases

  • Spam Filtering. Predictors: word frequency, character frequency, the number of consecutive capital letters
  • Fraud Detection
  • Credit Scoring
  • Customer Segmentation
  • Shopping Basket Analysis
  • Content Tagging
  • Image/Text Recognition
  • Recommender System
  • Medical Diagnosis

Best known R packages

  • caret: Classification And REgression Training
  • mlr: Machine Learning in R
  • glmnet: Lasso and Elastic-Net Regularized Generalized Linear Models
  • rpart / party: Recursive Partitioning and Regression Trees
  • ROCR: Visualizing the performance of scoring classifiers
  • e1071: misc statistical functions, including SVM and Naive Bayes
  • randomForest: Breiman and Cutler's Random Forests
  • nnet: Feed-forward Neural Networks and Multinomial Log-Linear Models
  • igraph: Collection of Network Analysis tools
  • kernlab: Kernel-based Machine Learning Lab
  • neuralnet: Training of Neural Networks
  • h2o: R interface to the H2O scalable Machine Learning platform
  • tensorflow: Deep Learning
  • keras: Deep Learning

Steps for Supervised Models

  • EDA: preprocess and explore the data

  • dataset splitting
    The model should be built on only a subset of the available observations; the remaining units should be used to assess the predictive power of the model. In order to have a fair distribution of the output variable in each set, it's important that the units are sampled randomly from the dataset or, equivalently, that the dataset is shuffled beforehand and the subsets then extracted sequentially.

  • data splitting into train / validation / test partitions

  • model selection and validation, using default Hyper-Parameters

  • fit the model on the training set

  • fine-tune ==> act on the Hyper-Parameters of the model structure to find the best configuration

  • generate prediction values by applying the model to the test set

  • evaluate the model

    • Classification: Confusion Matrix, accuracy, precision, recall (the latter being 1 minus the Type II error rate)
    • Regression: mean absolute error, median absolute error, R^2 score
  • interpret the results

In problems that have a random aspect, the set.seed(n) function should be used to enforce reproducibility. After fixing the seed, the random numbers that are generated are always the same, and so are all the subsequent results.
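A quick illustration: calling set.seed with the same value before each draw yields identical "random" numbers (the seed value 42 below is arbitrary).

# same seed ==> same "random" numbers
set.seed(42)
sample(10, 3)  # some triple drawn from 1..10
set.seed(42)
sample(10, 3)  # exactly the same triple as above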

R built-in functions

  • lm builds linear models
  • predict takes a model (1st arg) and some unseen observations (2nd arg), then returns the predicted outcomes from the model for those observations.
  • kmeans performs k-means clustering
  • set.seed enforces reproducibility in problems that have a random aspect. If you fix the seed, the random numbers that are generated are always the same.
  • rpart builds classification and regression trees (strictly speaking from the rpart package, not base R)

Examples

  • load the packages

    library(data.table)
    library(broom)
    library(rpart)
    library(class)   # for kNN
    library(e1071)   # for Naive Bayes
  • when following a supervised algorithm, it's important to first randomly split the available data into a training and a test subdataset, while assuring reproducibility:

    set.seed(n)
    dts <- fread('/path/to/filename.csv')
    pct_train <- p   # fraction of units used for training, e.g. 0.7
    units_train <- sample.int(nrow(dts), round(pct_train * nrow(dts)))
    dts_train <- dts[units_train]
    dts_test  <- dts[!units_train]   # data.table: ! negates the row indices
    all.equal(dts, rbind(dts_train, dts_test), ignore.row.order = TRUE)

    An alternative way is to first shuffle the dataset, and then extract the two subsets sequentially.

    shuffled <- dts[sample(nrow(dts))]   # shuffle the rows, not the columns
    n_train <- round(nrow(shuffled) * pct_train)
    dts_train <- shuffled[1:n_train]
    dts_test  <- shuffled[(n_train + 1):nrow(shuffled)]
    all.equal(shuffled, rbind(dts_train, dts_test))
  • classification: kNN

    # kNN has no separate fit step: class::knn() returns the predictions directly
    pred <- knn(train = dts_train[, !'y'], test = dts_test[, !'y'],
                cl = dts_train$y, k = 5)  # k = 5 is an arbitrary choice
    table(dts_test$y, pred) # Confusion Matrix
  • classification: Naive Bayes

    fit <- naiveBayes(y ~ ., data = dts_train)  # e1071::naiveBayes
    pred <- predict(fit, dts_test)
    conf_mat <- table(dts_test$y, pred) # Confusion Matrix
  • classification: recursive partitioning (decision tree)

    fit <- rpart(y ~ x1 + ... + xn, data = dts_train, method = "class")
    pred <- predict(fit, dts_test, type = 'class')
    conf <- table(dts_test$y, pred)
    TP <- conf[1, 1]
    FN <- conf[1, 2]
    FP <- conf[2, 1]
    TN <- conf[2, 2]
    accuracy <- (TP + TN) / (TP + FP + TN + FN)
    precision <- TP / (TP + FP)
    recall <- TP / (TP + FN)
  • classification: logistic regression

    fit <- glm(y ~ x1 + ... + xn, data = dts_train, family = 'binomial')
    probs <- predict(fit, dts_test, type = 'response')  # predicted probabilities
    pred <- ifelse(probs > 0.5, 1, 0)  # 0.5 is the usual default cutoff
    table(dts_test$y, pred) # Confusion Matrix
  • linear regression

    fit <- lm(y ~ x1 + ... + xn + x1*x2 + ..., data = dts_train)
    broom::tidy(fit)
    pred <- predict(fit, dts_test)
    rmse <- sqrt( 1 / nrow(dts_test) * sum( (dts_test$y - pred)^2 ) )
  • clustering

    set.seed(1)
    groups <- kmeans(dts, n)  # n = chosen number of clusters
    table(z, groups$cluster)  # z = vector of true labels, when known
    plot(dts, col = groups$cluster)
    points(groups$centers, pch = 22, bg = seq_len(nrow(groups$centers)), cex = 2)
    groups$tot.withinss / groups$betweenss  # WSS/BSS ratio: the smaller the better

Performance Measures

Bias vs Variance Trade-Off

Overfitting vs Underfitting

Confusion Matrix

In the case of supervised learning, it's possible to compare the outcomes of the model with the truth, and the four possible situations lead to what is called the Confusion Matrix:

    n            Predicted: YES           Predicted: NO
    Actual: YES  TP = true positives      FN = false negatives
    Actual: NO   FP = false positives     TN = true negatives

Out of the n cases:

  • the classifier predicted in total TPP = TP + FP positives and TPN = TN + FN negatives
  • in reality, there has been TAP = TP + FN positives and TAN = TN + FP negatives.

From the above table, the following metrics can be quickly calculated:

  • Accuracy (TP + TN) / n Overall, how often is the classifier correct?
  • Error Rate (FP + FN) / n = 1 - Accuracy Overall, how often is the classifier wrong? Also called Misclassification Rate
  • Precision TP / (TP + FP) = TP / TPP When it predicts a positive, how often is it correct?
  • Recall or Sensitivity TP / (TP + FN) = TP / TAP How often does it catch an actual positive?
  • False Positive Rate FP / (FP + TN) = FP / TAN How often does it wrongly flag an actual negative as positive?
  • Specificity TN / (FP + TN) = TN / TAN How often does it correctly identify an actual negative?
  • Prevalence (TP + FN) / n = TAP / n How many positives actually occur in the sample?
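As a minimal sketch in R, assuming conf is a 2x2 confusion matrix laid out as above (actual classes on the rows, predicted classes on the columns, positives first):

# conf <- table(actual, predicted), with the positive class first in both margins
TP <- conf[1, 1]; FN <- conf[1, 2]
FP <- conf[2, 1]; TN <- conf[2, 2]
n <- sum(conf)
accuracy    <- (TP + TN) / n
error_rate  <- (FP + FN) / n       # = 1 - accuracy
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)      # also called sensitivity
fpr         <- FP / (FP + TN)      # false positive rate
specificity <- TN / (FP + TN)
prevalence  <- (TP + FN) / n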

Particular attention should be paid to the concepts of accuracy and precision. Many people use these terms interchangeably; however, these words have different meanings in statistics:

  • accuracy refers to how closely a measurement or observation comes to measuring a true value, since measurements and observations are always subject to error. It's a similar notion to the unbiasedness of a statistical estimator.
  • precision refers to how closely repeated measurements or observations come to duplicating measured or observed values. It's a similar notion to the variability of a statistical estimator.

Cross Validation

A better way to assess the predictive power of a supervised algorithm is to alternately use all units in both the training and test phases. This process is called k-fold Cross-Validation: the original dataset is split into k folds, each fold serves once as the test set while the remaining folds form the training set, and the accuracy is calculated for each fold. The mean of these accuracies forms a more robust estimate of the model's true accuracy on unseen data, because it is less dependent on the choice of training and test sets.

# Set the number of folds
n_fold <- n
# Set the number of units in each fold
k_units <- round( (1/n_fold) * nrow(dts) )
# Set random seed for reproducibility
set.seed(n)
# Shuffle the rows of the original dataset for fair representation
shuffled <- dts[sample(nrow(dts))]
# Initialize the vector storing the folds accuracies
accs <- rep(0, n_fold)
# run the model over a loop
for (i in 1:n_fold){
  # Calculate the indices that address the current test set
  k_idx <- ((i-1) * k_units + 1):(i * k_units)
  # Form the train set
  dts_train <- shuffled[-k_idx,]
  # Form the test set
  dts_test <- shuffled[k_idx,]
  # Build the model 
  tree <- rpart(y ~ ., dts_train, method = 'class')
  # Make a prediction based on the test set
  pred <- predict(tree, dts_test, type = 'class')
  # Calculate the confusion matrix
  conf <- table(dts_test$y, pred)
  # Assign the accuracy to the ith index in accs
  accs[i] <- sum( diag(conf) ) / sum(conf)
}
# Print out the mean of accs as final accuracy of the model
mean(accs)

Receiver Operating Characteristic (ROC) Curve
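The ROC curve plots the True Positive Rate (recall) against the False Positive Rate for every possible classification threshold; the Area Under the Curve (AUC) summarizes it in a single number, with 1 a perfect classifier and 0.5 a random one. A minimal sketch using the ROCR package listed above, assuming probs holds predicted probabilities and labels the true classes of the test set:

library(ROCR)
pred_obj <- prediction(probs, labels)
# True Positive Rate vs False Positive Rate, one point per threshold
roc <- performance(pred_obj, measure = 'tpr', x.measure = 'fpr')
plot(roc)
performance(pred_obj, measure = 'auc')@y.values[[1]]  # area under the curve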

RMSE

When using a regression algorithm, a standard measure for assessing the quality of the resulting model is the Root Mean Squared Error (RMSE), which is the mean distance between the estimates and the observed values, and therefore represents a measure of the error incurred in substituting the true value with its prediction: the lower the value of the RMSE, the better the model.
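In formula form, with $y_i$ the observed values, $\hat{y}_i$ the predictions, and $n$ the number of test units:

$$ RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} $$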

However, as a standalone number, the RMSE doesn't tell much, as it is expressed in the units of the outcome variable. In order to derive its meaning, it should be compared to the RMSE of a different model for the same problem. It should also be noted that, in general, adding complexity to a model (in this case, adding predictors) decreases the value of the RMSE on the training data, which is one more reason to evaluate it on a held-out test set.

Dunn's Index

Clustering algorithms are based on:

  • maximizing the similarity within groups. We can measure this by using:
    • Within Sum of Squares (WSS)
    • Average Diameter of clusters
      The smaller the better
  • minimizing the similarity between groups. We can measure this by using:
    • Between Sum of Squares (BSS)
    • Average Intercluster Distances
      The higher the better
      Notice that the Total Sum of Squares (TSS) is simply the sum of Within and Between sums: TSS = BSS + WSS.

A popular measure of performance for different clustering algorithms is Dunn's Index, defined as the ratio between the Minimal Intercluster Distance and the Maximal Diameter: the higher it is, the better separated and more compact the clusters are. Using a similar approach, another measure of performance is the ratio between the Within cluster Sum of Squares and the Between cluster Sum of Squares, WSS/BSS, for which instead the smaller the value, the better the clustering.
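A minimal sketch of computing Dunn's Index by hand, assuming a numeric matrix X of observations and a vector cl of cluster labels (ready-made implementations also exist, e.g. in the clValid package):

# Dunn's Index = minimal intercluster distance / maximal cluster diameter
dunn_index <- function(X, cl) {
  d  <- as.matrix(dist(X))    # pairwise distances between all observations
  ks <- sort(unique(cl))
  # maximal diameter: the largest within-cluster distance
  max_diam <- max(sapply(ks, function(k) max(d[cl == k, cl == k])))
  # minimal intercluster distance over all pairs of clusters
  prs <- combn(ks, 2)
  min_inter <- min(apply(prs, 2, function(p) min(d[cl == p[1], cl == p[2]])))
  min_inter / max_diam        # the higher, the better
}

km <- kmeans(iris[, 1:4], 3)
dunn_index(as.matrix(iris[, 1:4]), km$cluster)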


Algorithms

k Nearest Neighbors (kNN)

Naive Bayes

Decision Tree

Starting from Automatic Interaction Detection (AID), a wide range of methods has been suggested that is usually termed Recursive Partitioning, Decision Trees, or tree(-structured) models. Particularly influential have been, and still are, the following algorithms:

  • CART, Classification And Regression Trees
  • CHAID, Chi-square Automatic Interaction Detector
  • CTree, Conditional Inference
  • MoB, Model-Based
  • EvTree, Evolutionary Learning of Globally Optimal Classification and Regression Trees
  • Ensemble Methods
    • Bagging
    • Random Forest
    • Gradient boosting
      • Xgboost
      • GBM

Decision trees are of two main types:
  • Classification tree, when the response variable is categorical, or a discretized numeric.
  • Regression tree, when the response variable is numeric.

Within R, the list of prominent packages includes rpart (CART), mvpart (multivariate CART), party (CTree, MOB), and partykit (CTree, MOB).

Basic Terminology used with Decision trees:

  • Root Node: represents the entire population or sample, which further gets divided into two or more homogeneous sets.
  • Splitting: the process of dividing a node into two or more sub-nodes.
  • Decision Node: a sub-node that splits into further sub-nodes.
  • Leaf or Terminal Node: a node that does not split.
  • Branch / Sub-Tree: a sub-section of the entire tree.
  • Parent vs Child Nodes: a node that is divided into sub-nodes is called the parent of those sub-nodes, whereas the sub-nodes are its children.
  • Pruning: removing the sub-nodes of a decision node; the opposite process of splitting (see the sketch below).
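Pruning in practice, as a minimal sketch with rpart: grow a full tree, then cut it back at the complexity parameter (cp) that minimizes the cross-validated error (dts_train is the training set from the examples above):

tree <- rpart(y ~ ., data = dts_train, method = 'class')
printcp(tree)  # table of cp values with their cross-validated error (xerror)
best_cp <- tree$cptable[which.min(tree$cptable[, 'xerror']), 'CP']
pruned <- prune(tree, cp = best_cp)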

Regression Tree

Clustering

An important part of machine learning is understanding and interpreting the results. In the case of clustering, visualization is key, and the simplest way to achieve it is by plotting the features in the dataset and coloring the points based on their corresponding cluster. The cluster centroids are typically added to the plot as well, being good representations, and summaries, of all the observations in each cluster.

Interpretability and Explainability

In the context of Modeling and Machine Learning systems, interpretability is the ability to explain or to present a machine learning model in understandable terms to a human. As long as a model has no significant impact on the real world, its interpretability doesn't matter much. But when there are implications based on a model's predictions, be they financial or social, the concept of interpretability becomes paramount. Machine learning models are increasingly being used to make decisions that affect people's lives. With this power comes a responsibility to ensure, for example, that the model predictions are fair and not discriminating. It's not enough to know that a model works; we need to know how it works. A model can often work in the right way using multiple, even infinite, different combinations of inputs. We want to choose the one(s) with the least negative impact, and to be able to do that we first need to be sure that the model can be interpreted, and to build techniques that (try to) explain how different features affect the prediction of a model.

Permutation Importance
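Permutation importance measures how much a model's performance drops when the values of a single feature are randomly shuffled, which breaks that feature's link with the outcome. A minimal sketch, assuming a fitted classification model fit, a data.table test set dts_test, and an outcome column y as in the examples above:

# baseline accuracy on the untouched test set
base_acc <- mean(predict(fit, dts_test, type = 'class') == dts_test$y)
importance <- sapply(setdiff(names(dts_test), 'y'), function(v) {
  permuted <- copy(dts_test)                           # don't mutate the original
  set(permuted, j = v, value = sample(permuted[[v]]))  # shuffle one feature
  base_acc - mean(predict(fit, permuted, type = 'class') == permuted$y)
})
sort(importance, decreasing = TRUE)  # bigger accuracy drop = more important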

Partial Dependence Plots

SHAP Values

Advanced Uses of SHAP Values

LIME: Locally Interpretable Model-Agnostic Explanations
