- Problem:
- Use data science to improve image testing workflow.
- Questions:
- 1. What are some ways data science can be used to improve the image testing workflow?
- a. image anomaly detection
- b. report generation
- c. generally pointing reviewers in the right direction of where errors are
- 2. What is SIFT?
- https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
- "Scale Invariant Feature Transform"
- Effectively, the algorithm selects a set of keypoints in an image by finding extrema in difference-of-Gaussian scale space, then computes a descriptor around each keypoint.
- The benefit is that it focuses on keypoints whose descriptors stay stable when the image is scaled, which simplifies comparing images of different sizes.
- This may be what our project is actually looking for, as screenshots could occur on differently sized monitors.
- SIFT Python Tut:
- https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_sift_intro/py_sift_intro.html
- 3. What is a "Difference of Gaussian detector"?
- https://en.wikipedia.org/wiki/Difference_of_Gaussians
- It is an image edge-detection algorithm.
- It works by blurring images. Blurring an image reduces high-frequency spatial information.
- Subtracting a heavily blurred copy from a lightly blurred copy then acts as a band-pass filter that keeps mostly edges.
- Can be used to outline images in a scene. May work well for detecting significant components of a given screenshot.
- Some relevant Python code.
- https://stackoverflow.com/questions/22050199/python-implementation-of-the-laplacian-of-gaussian-edge-detection
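The blur-and-subtract idea can be sketched with scipy alone; the sigma values below are arbitrary choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussians(image, sigma_small=1.0, sigma_large=2.0):
    """Subtract a heavily blurred copy from a lightly blurred copy.

    High-frequency detail (edges) survives the small blur but not the
    large one, so the difference highlights edges and blobs.
    """
    img = image.astype(float)
    return gaussian_filter(img, sigma_small) - gaussian_filter(img, sigma_large)

# Toy "screenshot": a bright block on a dark background.
image = np.zeros((64, 64))
image[20:44, 20:44] = 1.0

edges = difference_of_gaussians(image)
# The response is strongest near the block's border and near zero in flat regions.
```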
- 4. What is RANSAC?
- "Random Sample Consensus"
- RANSAC is a robust model-fitting method that tolerates outliers; the points it rejects can double as detected anomalies. This will likely be an extremely useful feature for the project.
- It iteratively fits a model to random subsets of the data and keeps the fit supported by the most inliers.
- Python code for RANSAC implementation:
- https://scikit-learn.org/stable/auto_examples/linear_model/plot_ransac.html
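A small scikit-learn sketch (the data values are made up for illustration): the estimator recovers a line despite ten grossly corrupted points, and `inlier_mask_` exposes exactly the outlier flags we care about:

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)

# Inlier line y = 2x + 1 with small noise, plus ten gross outliers.
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.1, 100)
y[:10] += 30  # corrupt the first 10 points

# Default base estimator is LinearRegression; residuals above the
# threshold mark a point as an outlier.
ransac = RANSACRegressor(residual_threshold=1.0, random_state=0)
ransac.fit(X, y)

slope = ransac.estimator_.coef_[0]       # ~2, unaffected by the outliers
intercept = ransac.estimator_.intercept_  # ~1
outliers = ~ransac.inlier_mask_           # the corrupted points
```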
- 5. Main Problem.
- a) How will data be reduced into a feedable set and
- b) what will it be fed into?
- We will likely run the data through dimensionality-reducing steps first, then feed the reduced data into anomaly detection.
- a) Things that reduce data are discussed here, like SIFT and Gaussian blob detection.
- This will be good for working with chunks of screenshots.
- b) For anomaly detection, below has plenty of information.
- K-Means clustering is a likely candidate for what is going to be used in our model.
- The Gaussian model presented in Andrew Ng's course is also very attractive, as it closely relates to our problem.
- Input would be images.
- Output would maybe be a snapshot/brief report to help those working in testing to quickly find anomalies.
- Maybe make that snapshot a series of plain-language interpretations aggregated from two or more models.
- Articles that were helpful:
- -"25 Questions to a data scientist on image processing":
- part1:
- https://towardsdatascience.com/my-take-on-25-questions-to-test-a-data-scientist-on-image-processing-with-interactive-code-part-1-a6196f535008
- part2:
- https://towardsdatascience.com/my-take-on-25-questions-to-test-a-data-scientist-on-image-processing-with-interactive-code-part-2-77eacfd96cf9
- Significance:
- The article is just a series of quiz-like questions about data science related to image processing.
- The questions are scattered, but that breadth means the article(s) cover a wide set of ideas and general concepts that will likely prove useful in this project.
- More importantly, as someone rather new to data science, I picked up a lot of terminology I would otherwise not have come in contact with.
- Question 15 was particularly significant.
- Q15. "
- Which of the following methods is used as a model fitting method for edge detection?
- a) SIFT
- b) Difference of Gaussian detector
- c) RANSAC"
- A15. "Now lets go back to the question, SIFT is a feature detector, Gaussain Detector can be understood as blob detection. Hence the answer have to be RANSAC."
- This snippet pointed me in multiple directions and added multiple questions to my question bank.
- What are SIFT, Gaussian detection, and RANSAC?
- -"Introduction to Anomaly Detection in Python":
- https://blog.floydhub.com/introduction-to-anomaly-detection-in-python/
- Significance:
- The article provides a great deal of sample code that can help with the project.
- It also demonstrates an application-based understanding of the concept.
- Libraries used in this tut:
- pandas, numpy, matplotlib.pyplot (for data visualization), scipy, PyOD
- It also brings up a series of techniques for anomaly detection:
- 1. box plots
- 2. k-means-clustering
- kmeans() produces cluster centers called "centroids".
- The Euclidean distance (sqrt((x2-x1)^2 + (y2-y1)^2 + (z2-z1)^2 + ...)) from a point to its nearest centroid
- can then be thresholded as a way of detecting anomalies.
- 3. treat anomaly detection as a classification problem:
- use k-NN classification method (find most nearby neighbors).
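Techniques 2 and 3 above can be combined into a minimal centroid-distance detector (toy 2-D data; the 3-sigma threshold is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two tight clusters of "normal" feature vectors plus one obvious anomaly.
normal = np.vstack([
    rng.normal([0, 0], 0.3, size=(50, 2)),
    rng.normal([5, 5], 0.3, size=(50, 2)),
])
anomaly = np.array([[10.0, -10.0]])
data = np.vstack([normal, anomaly])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Euclidean distance from each point to its assigned centroid.
dists = np.linalg.norm(data - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag points far from every centroid as anomalies.
threshold = dists[:-1].mean() + 3 * dists[:-1].std()
flags = dists > threshold  # the last point should stand out
```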
- -"Andrew Ng's Machine Learning Course in Python (Anomaly Detection)"
- https://towardsdatascience.com/andrew-ngs-machine-learning-course-in-python-anomaly-detection-1233d23dba95
- Significance:
- Implements a Gaussian model to detect anomalies in a 2D dataset.
- This is almost the exact problem of this project, though this project will have a higher-dimensional dataset.
- Has a significant Python code base.
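The same idea (fit a per-feature Gaussian to normal data, then flag low-density points) can be sketched as follows; the `epsilon` threshold would normally be tuned on a labeled validation set:

```python
import numpy as np

def fit_gaussian(X):
    # Per-feature mean and variance (independent-feature Gaussian model).
    return X.mean(axis=0), X.var(axis=0)

def probability(X, mu, var):
    # Product of per-feature univariate Gaussian densities.
    p = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return p.prod(axis=1)

rng = np.random.default_rng(0)
train = rng.normal([2.0, 3.0], [0.5, 0.5], size=(500, 2))
mu, var = fit_gaussian(train)

test_points = np.array([[2.1, 3.0],    # typical point
                        [6.0, -1.0]])  # far from the training cloud
p = probability(test_points, mu, var)
epsilon = 1e-3                         # threshold; tune on labeled data
anomalous = p < epsilon
```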
- Brings up:
- 1. Gradient descent
- Effectively, iterative minimization of a cost function by repeatedly stepping opposite its gradient. Useful for fitting the model whose predictions deviations are measured against.
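As a concrete sketch, gradient descent fitting a line y = w*x + b by minimizing mean squared error (learning rate and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 3 * x + 2 + rng.normal(0, 0.01, 50)  # true line: w=3, b=2

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    pred = w * x + b
    grad_w = 2 * ((pred - y) * x).mean()  # d(MSE)/dw
    grad_b = 2 * (pred - y).mean()        # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b
# w and b converge toward 3 and 2
```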
- Has links at the bottom to topics:
- 1) Linear Regression - line fitting
- 2) Logistic Regression - classification with discrete values //would not be terribly useful in this project
- 3) Regularized Logistic Regression + Lasso Regression - regularizing methods to shrink coefficients to avoid overfitting. Will probably be handy.
- 4) Neural Networks - Better for deep learning (natural language / vision); probably not suited for this problem.
- 5) Support Vector Machines - For supervised learning. Can be used for regression analysis. Will maybe be used here.
- 6) Unsupervised Learning - A broad topic. K-Means Clustering, PCA.
- PCA - Converts a set of correlated vars to uncorrelated vars.
- Kmeans was discussed above.
- Article: https://towardsdatascience.com/andrew-ngs-machine-learning-course-in-python-kmeans-clustering-pca-b7ba6fafa74
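A small sketch of PCA decorrelating two redundant features, assuming scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two strongly correlated features (e.g. redundant pixel statistics).
x1 = rng.normal(0, 1, 200)
x2 = 2 * x1 + rng.normal(0, 0.1, 200)
X = np.column_stack([x1, x2])

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)

# The transformed components are uncorrelated...
corr = np.corrcoef(Z.T)[0, 1]
# ...and almost all variance lands on the first component,
# so the second could be dropped to reduce dimensionality.
ratio = pca.explained_variance_ratio_
```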
- -"Various ways to evaluate a machine learning model’s performance"
- https://towardsdatascience.com/various-ways-to-evaluate-a-machine-learning-models-performance-230449055f15
- Significance:
- provides methods of evaluation!
- Important terms:
- 1) Confusion matrix
- Let P be the Predicted label and R the Real label. Each can be True or False, giving pairs (P, R):
- True Positive: (T, T)
- False Positive: (T, F)
- True Negatives: (F, F)
- False Negatives: (F, T)
- A confusion matrix takes the above ideas and sorts them into a matrix.
- Instances in the dataset can be sorted like this.
- Resource for Python confusion matrix implementation:
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
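A minimal example of scikit-learn's confusion_matrix on made-up labels (1 = anomaly, 0 = normal):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 0, 1]  # real labels
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]  # predicted labels

# For binary labels {0, 1} the rows are true classes and the
# columns predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
```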
- 2) Accuracy
- (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
- 3) Precision
- (True Positives) / (True Positives + False Positives)
- 4) Recall/Sensitivity/True Positive Rate (TPR)
- (True Positive) / (True Positive + False Negative) //rate of finding positives
- 5) Specificity
- (True Negative) / (True Negative + False Positive) //rate of finding negatives
- 6) F1 Score
- It is the harmonic mean of precision and recall. (3 and 4)
- = 2 / ( (1/precision) + (1/recall) )
- Higher is better
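The formulas from 2) through 6), worked by hand on made-up confusion-matrix counts:

```python
# Made-up counts from a confusion matrix.
tp, tn, fp, fn = 3, 3, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 6/8 = 0.75
precision = tp / (tp + fp)                  # 3/4 = 0.75
recall = tp / (tp + fn)                     # 3/4 = 0.75
specificity = tn / (tn + fp)                # 3/4 = 0.75
f1 = 2 / (1 / precision + 1 / recall)       # harmonic mean of precision and recall
```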
- 7) PR Curve ("Precision Recall Curve")
- "It is the curve between precision and recall for various threshold values."
- Python implementation of a PR Curve:
- https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/
- https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
- X: Recall
- Y: Precision
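A minimal sketch with scikit-learn's precision_recall_curve (labels and scores are made up); it sweeps the decision threshold over the scores and returns one precision/recall pair per threshold:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1])          # real labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])  # model scores

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# By convention the curve ends at precision=1, recall=0.
```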
- 8) ROC curve
- Plots the True Positive Rate against the False Positive Rate (1 - Specificity) across threshold values.
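The ROC curve and its area under the curve can be sketched similarly with scikit-learn (labels and scores are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# One (FPR, TPR) point per threshold; the curve runs from (0,0) to (1,1).
fpr, tpr, thresholds = roc_curve(y_true, scores)

# AUC = probability a random positive is scored above a random negative.
auc = roc_auc_score(y_true, scores)
```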