Problem:
Use data science to improve the image testing workflow.

Questions:
1. What are some ways data science can be used to improve the image testing workflow?
a. image anomaly detection
b. report generation
c. pointing reviewers toward the likely locations of errors

2. What is SIFT?
https://en.wikipedia.org/wiki/Scale-invariant_feature_transform
"Scale-Invariant Feature Transform"
Effectively, the algorithm selects a set of key points in an image, then computes Difference-of-Gaussians responses around those points.
The benefit is that the image is reduced to a set of points whose descriptors stay stable as the image scales.
This may be exactly what our project is looking for, as screenshots could come from differently sized monitors.

SIFT Python tutorial:
https://opencv-python-tutroals.readthedocs.io/en/latest/py_tutorials/py_feature2d/py_sift_intro/py_sift_intro.html
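A minimal sketch of extracting SIFT keypoints (assumptions: opencv-python >= 4.4, where SIFT lives in the main module after the patent expired; the linked tutorial targets older builds that need cv2.xfeatures2d.SIFT_create(); the filename is a placeholder):

import cv2

img = cv2.imread("screenshot.png", cv2.IMREAD_GRAYSCALE)  # placeholder input file
sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(img, None)
# Each descriptor is a 128-dimensional vector; descriptors from two screenshots
# can be matched even when the images were captured at different resolutions.
out = cv2.drawKeypoints(img, keypoints, None,
                        flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_keypoints.png", out)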

3. What is a "Difference of Gaussians detector"?
https://en.wikipedia.org/wiki/Difference_of_Gaussians
It is an image edge-detection technique.
It works by blurring images: blurring removes high-frequency spatial information.
Subtracting a version of the image blurred at one radius from a version blurred at another radius then acts as a band-pass filter.
It can be used to outline objects in a scene, and may work well for detecting the significant components of a given screenshot.

Some relevant Python code:
https://stackoverflow.com/questions/22050199/python-implementation-of-the-laplacian-of-gaussian-edge-detection
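A small Difference-of-Gaussians sketch (assuming scipy is available; the sigma values and the threshold are arbitrary illustrative choices, not tuned for screenshots):

import numpy as np
from scipy.ndimage import gaussian_filter

def difference_of_gaussians(img, sigma_small=1.0, sigma_large=2.0):
    # Blur at two radii and subtract: a band-pass filter that keeps
    # structure between the two scales (edges, small blobs).
    img = img.astype(float)
    return gaussian_filter(img, sigma_small) - gaussian_filter(img, sigma_large)

img = np.random.rand(480, 640)                 # stand-in grayscale screenshot
edges = difference_of_gaussians(img)
significant = np.abs(edges) > 2 * edges.std()  # crude mask of "interesting" pixels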

4. What is RANSAC?
"Random Sample Consensus"
RANSAC is an outlier-detection method, which will likely be extremely useful for the project.
It iteratively fits a model to random subsets of the data and keeps the fit that agrees with the most points (the inliers); whatever never fits is an outlier.

Python code for a RANSAC implementation:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_ransac.html
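A hedged sketch along the lines of that scikit-learn example: fit a line to data containing outliers and read off the inlier mask (the data here is synthetic):

import numpy as np
from sklearn.linear_model import RANSACRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X.ravel() + 1.0 + rng.normal(0, 0.5, 100)
y[:10] += 30                           # inject obvious outliers

ransac = RANSACRegressor().fit(X, y)   # default base estimator is linear regression
inliers = ransac.inlier_mask_          # True where the model considers a point an inlier
outliers = ~inliers                    # candidate anomalies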

5. Main Problem.
a) How will data be reduced into a feedable set, and
b) what will it be fed into?

We will likely feed the data through steps that reduce it, then feed that reduced data through anomaly detection.
a) Reduction techniques are discussed above: SIFT and Gaussian blob detection.
These should work well on chunks of screenshots.
b) For anomaly detection, the notes below have plenty of information.
K-means clustering is a likely candidate for our model.
The Gaussian model presented by Andrew Ng is also very attractive, as it closely matches our problem.

Input would be images.
Output would be a snapshot/brief report that helps testers quickly find anomalies.
Maybe make that snapshot a series of plain-language interpretations from 2+ models.
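A very rough skeleton of that pipeline (every function, feature choice, and scoring convention here is a hypothetical placeholder, not a settled design):

import numpy as np

def reduce_image(img):
    # Placeholder for a real reduction step (SIFT descriptors, DoG blobs, ...);
    # crude per-image summary statistics stand in for real features.
    return np.array([img.mean(), img.std()])

def screenshot_report(named_images, detector):
    # 'detector' is assumed to follow the scikit-learn convention where
    # score_samples() returns lower values for more anomalous points.
    feats = np.array([reduce_image(img) for _, img in named_images])
    scores = detector.score_samples(feats)
    for (name, _), s in sorted(zip(named_images, scores), key=lambda t: t[1]):
        print(f"{name}: score {s:.3f}")    # most anomalous screenshots first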

Articles that were helpful:
-"25 Questions to a data scientist on image processing":
part 1:
https://towardsdatascience.com/my-take-on-25-questions-to-test-a-data-scientist-on-image-processing-with-interactive-code-part-1-a6196f535008
part 2:
https://towardsdatascience.com/my-take-on-25-questions-to-test-a-data-scientist-on-image-processing-with-interactive-code-part-2-77eacfd96cf9

Significance:
The article is a series of quiz-like questions about data science as applied to image processing.
The questions are scattershot, but that breadth means the article(s) cover a wide set of ideas and general concepts that will likely prove useful in this project.
More importantly, as someone fairly new to data science, it gave me a lot of terminology I would otherwise not have encountered.

Question 15 was particularly significant:
Q15. "
Which of the following methods is used as a model fitting method for edge detection?
a) SIFT
b) Difference of Gaussian detector
c) RANSAC"
A15. "Now lets go back to the question, SIFT is a feature detector, Gaussain Detector can be understood as blob detection. Hence the answer have to be RANSAC."

This snippet pointed me in multiple directions and added several questions to my question bank:
what are SIFT, Gaussian detection, and RANSAC?

-"Introduction to Anomaly Detection in Python":
https://blog.floydhub.com/introduction-to-anomaly-detection-in-python/

Significance:
The article provides a great deal of sample code that can help with the project.
It also demonstrates an application-based understanding of the concept.
Libraries used in this tutorial:
pandas, numpy, matplotlib.pyplot (for data visualization), scipy, PyOD

It also brings up a series of techniques for anomaly detection:
1. box plots
2. k-means clustering
kmeans() generates cluster centers called "centroids".
The Euclidean distance from a point to its nearest centroid, sqrt((x2-x1)^2 + (y2-y1)^2 + (z2-z1)^2 + ...), then serves as an anomaly score (see the sketch after this list).
3. treat anomaly detection as a classification problem:
use the k-NN classification method (find the most nearby neighbors).
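A minimal sketch of centroid-distance anomaly detection with scikit-learn's KMeans (the cluster count, the synthetic data, and the 95th-percentile threshold are all arbitrary choices):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(500, 2)                     # stand-in feature vectors
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

# Euclidean distance from each point to its assigned centroid,
# i.e. the sqrt(...) formula above.
dists = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
threshold = np.percentile(dists, 95)           # flag the farthest 5%
anomalies = X[dists > threshold]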

-"Andrew Ng's Machine Learning Course in Python (Anomaly Detection)"
https://towardsdatascience.com/andrew-ngs-machine-learning-course-in-python-anomaly-detection-1233d23dba95

Significance:
Implements a Gaussian model to detect anomalies in a 2-D dataset (sketched below).
This is almost exactly our problem, except our dataset will have more dimensions.
Has a significant Python code base.

Brings up:
1. Gradient descent
An iterative optimization method for fitting model parameters; effectively, higher-dimensional curve fitting. Useful for measuring deviation from a predicted curve.
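A hedged sketch of that Gaussian model: fit a per-feature mean and variance, score each point by the product of its per-feature Gaussian densities, and flag points below a threshold epsilon (the data, the independence assumption across features, and the percentile-based epsilon are all illustrative):

import numpy as np

def fit_gaussian(X):
    return X.mean(axis=0), X.var(axis=0)

def probability(X, mu, var):
    # Product of independent per-feature Gaussian densities.
    p = np.exp(-((X - mu) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)
    return p.prod(axis=1)

X = np.random.randn(300, 2)        # stand-in 2-D dataset
mu, var = fit_gaussian(X)
p = probability(X, mu, var)
epsilon = np.percentile(p, 2)      # hypothetical threshold
anomalies = X[p < epsilon]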

Has links at the bottom to these topics:
1) Linear Regression - line fitting
2) Logistic Regression - classification with discrete values //would not be terribly useful in this project
3) Regularized Logistic Regression + Lasso Regression - regularization methods that shrink coefficients to avoid overfitting. Will probably be handy.
4) Neural Networks - better for deep learning (natural language / vision); probably not suited for this problem.
5) Support Vector Machines - for supervised learning; can be used for regression analysis. May be used here.
6) Unsupervised Learning - a broad topic: k-means clustering, PCA.
PCA - converts a set of correlated variables to uncorrelated variables (see the sketch below).
k-means was discussed above.
Article: https://towardsdatascience.com/andrew-ngs-machine-learning-course-in-python-kmeans-clustering-pca-b7ba6fafa74
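A tiny PCA sketch with scikit-learn (the dimensions and the component count are arbitrary): project correlated features onto uncorrelated principal components and check how much variance survives.

import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(200, 50)                   # stand-in high-dimensional features
pca = PCA(n_components=10).fit(X)
X_reduced = pca.transform(X)                  # 10 uncorrelated components
print(pca.explained_variance_ratio_.sum())    # fraction of variance retained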

-"Various ways to evaluate a machine learning model's performance"
https://towardsdatascience.com/various-ways-to-evaluate-a-machine-learning-models-performance-230449055f15

Significance:
provides methods of evaluation!

Important terms:
1) Confusion matrix
Let P be the Predicted label and R be the Real label; each can be True or False, giving a pair (P, R).
True Positive: (T, T)
False Positive: (T, F)
True Negative: (F, F)
False Negative: (F, T)

A confusion matrix sorts the instances in a dataset into these four cells.

Resource for a Python confusion matrix implementation:
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html
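A short sketch matching that scikit-learn function (the labels are made up; 1 marks an anomaly):

from sklearn.metrics import confusion_matrix

y_real = [1, 0, 1, 1, 0, 0, 1, 0]      # real labels
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]      # predicted labels
tn, fp, fn, tp = confusion_matrix(y_real, y_pred).ravel()
print(tp, fp, tn, fn)                   # -> 3 1 3 1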

2) Accuracy
(True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

3) Precision
(True Positives) / (True Positives + False Positives)

4) Recall/Sensitivity/True Positive Rate (TPR)
(True Positives) / (True Positives + False Negatives) //rate of finding positives

5) Specificity
(True Negatives) / (True Negatives + False Positives) //rate of finding negatives

6) F1 Score
The harmonic mean of precision and recall (3 and 4):
= 2 / ( (1/precision) + (1/recall) )
Higher is better.
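The formulas above, written out as code for reference (a sketch; the function names are mine):

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):                    # a.k.a. sensitivity / TPR
    return tp / (tp + fn)

def specificity(tn, fp):
    return tn / (tn + fp)

def f1(p, r):                          # harmonic mean of precision and recall
    return 2 / (1 / p + 1 / r)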

7) PR Curve ("Precision-Recall Curve")
"It is the curve between precision and recall for various threshold values."
Python implementations of a PR curve:
https://machinelearningmastery.com/roc-curves-and-precision-recall-curves-for-classification-in-python/
https://scikit-learn.org/stable/auto_examples/model_selection/plot_precision_recall.html

X: Recall
Y: Precision

8) ROC curve
True Positive Rate vs. False Positive Rate (1 - Specificity), again swept across threshold values.
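A hedged sketch of computing both curves with scikit-learn, in the spirit of the links above (the labels and scores are random stand-ins for a real model's outputs):

import numpy as np
from sklearn.metrics import precision_recall_curve, roc_curve

rng = np.random.default_rng(0)
y_real = rng.integers(0, 2, 200)       # stand-in true labels
y_score = rng.random(200)              # stand-in predicted probabilities

prec, rec, _ = precision_recall_curve(y_real, y_score)   # Y: precision, X: recall
fpr, tpr, _ = roc_curve(y_real, y_score)                 # X: FPR = 1 - specificity, Y: TPR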