- Problem:
- Use data science to improve image testing workflow.
- Questions:
- 1. What are some ways data science can be used to improve the image testing workflow?
- a. image anomaly detection
- b. report generation
- c. generally pointing reviewers in the right direction of where errors are
- 2. What is SIFT?
- https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
- "Scale Invariant Feature Transform"
- Effectively, the algorithm selects a set of keypoints in an image by finding extrema in difference-of-Gaussian scale space, then computes a descriptor around each keypoint.
- The benefit is that it focuses on keypoints whose descriptors stay stable when the image is scaled, which simplifies comparing images of different sizes.
- This may be what our project is actually looking for, as screenshots could occur on differently sized monitors.
- SIFT Python Tut:
- https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_sift_intro/py_sift_intro.html
- 3. What is a "Difference of Gaussian detector"?
- https://en.wikipedia.org/wiki/Difference_of_Gaussians
- It is an image edge-detection algorithm.
- It works by blurring images. Blurring an image reduces high-frequency spatial information.
- Subtracting a heavily blurred copy from a lightly blurred copy then acts as a band-pass filter that keeps mostly edges.
- Can be used to outline images in a scene. May work well for detecting significant components of a given screenshot.
- Some relevant Python code.
- https://stackoverflow.com/questions/22050199/python-implementation-of-the-laplacian-of-gaussian-edge-detection
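The blur-and-subtract idea can be sketched with scipy alone; the sigma values below are arbitrary choices:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussians(image, sigma_small=1.0, sigma_large=2.0):
    """Subtract a heavily blurred copy from a lightly blurred copy.

    High-frequency detail (edges) survives the small blur but not the
    large one, so the difference highlights edges and blobs.
    """
    img = image.astype(float)
    return gaussian_filter(img, sigma_small) - gaussian_filter(img, sigma_large)

# Toy "screenshot": a bright block on a dark background.
image = np.zeros((64, 64))
image[20:44, 20:44] = 1.0

edges = difference_of_gaussians(image)
# The response is strongest near the block's border and near zero in flat regions.
```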
- 4. What is RANSAC?
- "Random Sample Consensus"
- RANSAC is a robust model-fitting method that tolerates outliers; the points it rejects can double as detected anomalies. This will likely be an extremely useful feature for the project.
- It iteratively fits a model to random subsets of the data and keeps the fit supported by the most inliers.
- Python code for RANSAC implementation:
- https://scikit-learn.org/stable/auto_examples/linear_model/plot_ransac.html
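A small scikit-learn sketch (the data values are made up for illustration): the estimator recovers a line despite ten grossly corrupted points, and `inlier_mask_` exposes exactly the outlier flags we care about:

```python
import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)

# Inlier line y = 2x + 1 with small noise, plus ten gross outliers.
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X.ravel() + 1 + rng.normal(0, 0.1, 100)
y[:10] += 30  # corrupt the first 10 points

# Default base estimator is LinearRegression; residuals above the
# threshold mark a point as an outlier.
ransac = RANSACRegressor(residual_threshold=1.0, random_state=0)
ransac.fit(X, y)

slope = ransac.estimator_.coef_[0]       # ~2, unaffected by the outliers
intercept = ransac.estimator_.intercept_  # ~1
outliers = ~ransac.inlier_mask_           # the corrupted points
```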
- 5. Main Problem.
- a) How will data be reduced into a feedable set and
- b) what will it be fed into?
- We will likely run the data through dimensionality-reducing steps first, then feed the reduced data into anomaly detection.
- a) Things that reduce data are discussed here, like SIFT and Gaussian blob detection.
- This will be good for working with chunks of screenshots.
- b) For anomaly detection, below has plenty of information.
- K-Means clustering is a likely candidate for what is going to be used in our model.
- The Gaussian model presented in Andrew Ng's course is also very attractive, as it closely relates to our problem.
- Input would be images.
- Output would maybe be a snapshot/brief report to help those working in testing to quickly find anomalies.
- Maybe make that snapshot a series of plain-language interpretations aggregated from two or more models.
- Articles that were helpful:
- -"25 Questions to a data scientist on image processing":
- part1:
- https://towardsdatascience.com/my-take-on-25-questions-to-test-a-data-scientist-on-image-processing-with-interactive-code-part-1-a6196f535008
- part2:
- https://towardsdatascience.com/my-take-on-25-questions-to-test-a-data-scientist-on-image-processing-with-interactive-code-part-2-77eacfd96cf9
- Significance:
- The article is just a series of quiz-like questions about data science related to image processing.
- The questions are scattered, but that breadth means the article(s) cover a wide set of ideas and general concepts that will likely prove useful in this project.
- More importantly, as someone rather new to data science, I picked up a lot of terminology I would otherwise not have come in contact with.
- Question 15 was particularly significant.
- Q15. "
- Which of the following methods is used as a model fitting method for edge detection?
- a) SIFT
- b) Difference of Gaussian detector
- c) RANSAC"
- A15. "Now lets go back to the question, SIFT is a feature detector, Gaussain Detector can be understood as blob detection. Hence the answer have to be RANSAC."
- This snippet pointed me in multiple directions and added multiple questions to my question bank.
- What are SIFT, Gaussian detection, and RANSAC?
- -"Introduction to Anomaly Detection in Python":
- https://blog.floydhub.com/introduction-to-anomaly-detection-in-python/
- Significance:
- The article provides a great deal of sample code that can help with the project.
- It also demonstrates an application-based understanding of the concept.
- Libraries used in this tut:
- pandas, numpy, matplotlib.pyplot (for data visualization), scipy, PyOD
- It also brings up a series of techniques for anomaly detection:
- 1. box plots
- 2. k-means-clustering
- kmeans() produces cluster centers called "centroids".
- The Euclidean distance (sqrt((x2-x1)^2 + (y2-y1)^2 + (z2-z1)^2 + ...)) from a point to its nearest centroid
- can then be thresholded as a way of detecting anomalies.
- 3. treat anomaly detection as a classification problem:
- use k-NN classification method (find most nearby neighbors).
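Techniques 2 and 3 above can be combined into a minimal centroid-distance detector (toy 2-D data; the 3-sigma threshold is an arbitrary choice):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two tight clusters of "normal" feature vectors plus one obvious anomaly.
normal = np.vstack([
    rng.normal([0, 0], 0.3, size=(50, 2)),
    rng.normal([5, 5], 0.3, size=(50, 2)),
])
anomaly = np.array([[10.0, -10.0]])
data = np.vstack([normal, anomaly])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

# Euclidean distance from each point to its assigned centroid.
dists = np.linalg.norm(data - kmeans.cluster_centers_[kmeans.labels_], axis=1)

# Flag points far from every centroid as anomalies.
threshold = dists[:-1].mean() + 3 * dists[:-1].std()
flags = dists > threshold  # the last point should stand out
```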
- -"Andrew Ng's Machine Learning Course in Python (Anomaly Detection)"
- https://towardsdatascience.com/andrew-ngs-machine-learning-course-in-python-anomaly-detection-1233d23dba95
- Significance:
- Implements a Gaussian model to detect anomalies in a 2D dataset.
- This is almost the exact problem of this project, though this project will have a higher-dimensional dataset.
- Has a significant Python code base.
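The same idea (fit a per-feature Gaussian to normal data, then flag low-density points) can be sketched as follows; the `epsilon` threshold would normally be tuned on a labeled validation set:

```python
import numpy as np

def fit_gaussian(X):
    # Per-feature mean and variance (independent-feature Gaussian model).
    return X.mean(axis=0), X.var(axis=0)

def probability(X, mu, var):
    # Product of per-feature univariate Gaussian densities.
    p = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return p.prod(axis=1)

rng = np.random.default_rng(0)
train = rng.normal([2.0, 3.0], [0.5, 0.5], size=(500, 2))
mu, var = fit_gaussian(train)

test_points = np.array([[2.1, 3.0],    # typical point
                        [6.0, -1.0]])  # far from the training cloud
p = probability(test_points, mu, var)
epsilon = 1e-3                         # threshold; tune on labeled data
anomalous = p < epsilon
```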
- Brings up:
- 1. Gradient descent
- Effectively, iterative minimization of a cost function by repeatedly stepping opposite its gradient. Useful for fitting the model whose predictions deviations are measured against.
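As a concrete sketch, gradient descent fitting a line y = w*x + b by minimizing mean squared error (learning rate and iteration count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 50)
y = 3 * x + 2 + rng.normal(0, 0.01, 50)  # true line: w=3, b=2

w, b, lr = 0.0, 0.0, 0.5
for _ in range(2000):
    pred = w * x + b
    grad_w = 2 * ((pred - y) * x).mean()  # d(MSE)/dw
    grad_b = 2 * (pred - y).mean()        # d(MSE)/db
    w -= lr * grad_w
    b -= lr * grad_b
# w and b converge toward 3 and 2
```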
- Has links at the bottom to topics:
- 1) Linear Regression - line fitting
- 2) Logistic Regression - classification with discrete values //would not be terribly useful in this project
- 3) Regularized Logistic Regression + Lasso Regression - regularizing methods to shrink coefficients to avoid overfitting. Will probably be handy.
- 4) Neural Networks - Better for deep learning (natural language / vision); probably not suited for this problem.
- 5) Support Vector Machines - For supervised learning. Can be used for regression analysis. Will maybe be used here.
- 6) Unsupervised Learning - A broad topic. K-Means Clustering, PCA.
- PCA - Converts a set of correlated vars to uncorrelated vars.
- Kmeans was discussed above.
- Article: https://towardsdatascience.com/andrew-ngs-machine-learning-course-in-python-kmeans-clustering-pca-b7ba6fafa74
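A small sketch of PCA decorrelating two redundant features, assuming scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two strongly correlated features (e.g. redundant pixel statistics).
x1 = rng.normal(0, 1, 200)
x2 = 2 * x1 + rng.normal(0, 0.1, 200)
X = np.column_stack([x1, x2])

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)

# The transformed components are uncorrelated...
corr = np.corrcoef(Z.T)[0, 1]
# ...and almost all variance lands on the first component,
# so the second could be dropped to reduce dimensionality.
ratio = pca.explained_variance_ratio_
```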
- -"Various ways to evaluate a machine learning model’s performance"
- https://towardsdatascience.com/various-ways-to-evaluate-a-machine-learning-models-performance-230449055f15
- Significance:
- provides methods of evaluation!
- Important terms:
- 1) Confusion matrix
- Let P be the Predicted label and R the Real label. Each can be True or False, giving pairs (P, R):
- True Positive: (T, T)
- False Positive: (T, F)
- True Negatives: (F, F)
- False Negatives: (F, T)
- A confusion matrix takes the above ideas and sorts them into a matrix.
- Instances in the dataset can be sorted like this.
- Resource for Python confusion matrix implementation:
- https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
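A minimal example of scikit-learn's confusion_matrix on made-up labels (1 = anomaly, 0 = normal):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 0, 0, 1]  # real labels
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]  # predicted labels

# For binary labels {0, 1} the rows are true classes and the
# columns predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()
```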
- 2) Accuracy
- (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)
- 3) Precision
- (True Positives) / (True Positives + False Positives)
- 4) Recall/Sensitivity/True Positive Rate (TPR)
- (True Positive) / (True Positive + False Negative) //rate of finding positives
- 5) Specificity
- (True Negative) / (True Negative + False Positive) //rate of finding negatives
- 6) F1 Score
- It is the harmonic mean of precision and recall. (3 and 4)
- = 2 / ( (1/precision) + (1/recall) )
- Higher is better
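The formulas from 2) through 6), worked by hand on made-up confusion-matrix counts:

```python
# Made-up counts from a confusion matrix.
tp, tn, fp, fn = 3, 3, 1, 1

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 6/8 = 0.75
precision = tp / (tp + fp)                  # 3/4 = 0.75
recall = tp / (tp + fn)                     # 3/4 = 0.75
specificity = tn / (tn + fp)                # 3/4 = 0.75
f1 = 2 / (1 / precision + 1 / recall)       # harmonic mean of precision and recall
```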
- 7) PR Curve ("Precision Recall Curve")
- "It is the curve between precision and recall for various threshold values."
- Python implementation of a PR Curve:
- https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/
- https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html
- X: Recall
- Y: Precision
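A minimal sketch with scikit-learn's precision_recall_curve (labels and scores are made up); it sweeps the decision threshold over the scores and returns one precision/recall pair per threshold:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

y_true = np.array([0, 0, 1, 1, 0, 1])          # real labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])  # model scores

precision, recall, thresholds = precision_recall_curve(y_true, scores)
# By convention the curve ends at precision=1, recall=0.
```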
- 8) ROC curve
- Plots the True Positive Rate against the False Positive Rate (1 - Specificity) across threshold values.
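The ROC curve and its area under the curve can be sketched similarly with scikit-learn (labels and scores are made up):

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# One (FPR, TPR) point per threshold; the curve runs from (0,0) to (1,1).
fpr, tpr, thresholds = roc_curve(y_true, scores)

# AUC = probability a random positive is scored above a random negative.
auc = roc_auc_score(y_true, scores)
```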