Jun 19th, 2018
// Sayat implemented ```cos``` (cosine similarity), as tf-idf prescribes, but implemented it incorrectly: he either doesn’t know or isn’t fluent with operator precedence (the `*` should be in parentheses). He saw that ```cos``` decreased quality and, instead of fixing the real issue, decided to use ```dot``` instead of ```cos```, removing the whole concept that accounts for the influence of document size on term importance.
double dotProduct = vector.dot(other);
public class SparseVector {
    public double dot(SparseVector other) {
        double product = 0;

        for (Map.Entry<Integer, Double> word : other.weights.entrySet()) {
            int wordOtherId = word.getKey();
            if(weights.containsKey(wordOtherId)) {
                product += weights.get(wordOtherId) * word.getValue();
            }
        }
        return product;
    }

    public double cosine(SparseVector other) {
        double dotProduct = other.dot(this);
        // `/` and `*` have equal precedence and associate left to right,
        // so this evaluates as (dotProduct / |this|) * |other|
        return dotProduct / this.getL2Norm() * other.getL2Norm();
    }
}
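// A minimal sketch of the cosine with the denominator parenthesized. CosineSketch is a
// hypothetical stand-in for the reviewed SparseVector (its full source is not shown here),
// and returning 0 for an empty vector is my assumption, not something the original defines.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for SparseVector; only what the example needs.
class CosineSketch {
    private final Map<Integer, Double> weights = new HashMap<>();

    void add(int wordId, double weight) {
        weights.merge(wordId, weight, Double::sum);
    }

    double dot(CosineSketch other) {
        double product = 0;
        for (Map.Entry<Integer, Double> e : other.weights.entrySet()) {
            Double mine = weights.get(e.getKey());
            if (mine != null) {
                product += mine * e.getValue();
            }
        }
        return product;
    }

    double l2Norm() {
        return Math.sqrt(dot(this));
    }

    double cosine(CosineSketch other) {
        // Parentheses make the whole product of norms the denominator.
        double denominator = l2Norm() * other.l2Norm();
        // Assumption: similarity with an empty vector is defined as 0.
        return denominator == 0 ? 0 : dot(other) / denominator;
    }
}
```

// With weights {1: 3, 2: 4} and {1: 6, 2: 8} the vectors are parallel, so this returns 1.0;
// the unparenthesized expression would evaluate 50 / 5 * 10 = 100 for the same input.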
// The “inverse document frequency” part is not implemented at all, at least I don’t see it.
// Because ```cos``` is not used and “inverse document frequency” is not implemented, tf-idf is reduced to a dot product of term frequencies inside documents. So `and`, `a` and `the` will be the most important words. This was covered up by using the stop words from Lucene, but of course the importance of all other words is skewed too. If you implement tf-idf correctly, you don't need any stop-word set to reduce the importance of common words.
// To demonstrate it: if you remove the stop-word filter he plugs in here, accuracy falls to 0.4:
filter = new StopFilter(filter, StandardAnalyzer.STOP_WORDS_SET);
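// A rough sketch of the missing idf part, to show why a correct tf-idf does not need stop words.
// IdfSketch and its method names are hypothetical, documents are plain token lists, and
// idf = ln(N / df) is just one common variant (assumed here, unsmoothed).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

class IdfSketch {
    /** Counts, for every term, the number of documents that contain it. */
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            // A set, so a term counts once per document.
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }
        return df;
    }

    /** ln(N / df); a term unseen in the corpus gets 0 (an assumption). */
    static double idf(String term, Map<String, Integer> df, int totalDocs) {
        int n = df.getOrDefault(term, 0);
        return n == 0 ? 0 : Math.log((double) totalDocs / n);
    }

    /** Term frequency (count / document length) multiplied by idf. */
    static Map<String, Double> tfIdf(List<String> doc, Map<String, Integer> df, int totalDocs) {
        Map<String, Double> weights = new HashMap<>();
        for (String term : doc) {
            weights.merge(term, 1.0 / doc.size(), Double::sum);
        }
        weights.replaceAll((term, tf) -> tf * idf(term, df, totalDocs));
        return weights;
    }
}
```

// A word that occurs in every document has df = N, so its idf is ln(1) = 0 and it drops out
// of the dot product by itself.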


// So, OK: here we use ```encode``` purely for its side effect of storing the result; the return value is ignored.
public void fit(Classification train) {
    Multiset<String> features = getFeatures(train.getDocument());
    encode(features, Optional.of(train.getCategory()));
}
// Here we pass a magic empty value to a getter to avoid storing. This is the only place where this getter is used, and it always receives that magic empty value.
public String predict(String html) {
    String text = TextUtil.parseHtml(html);
    SparseVector vector = getSparseVector(text, Optional.empty());
    …
}
private SparseVector getSparseVector(String text, Optional<String> category) {
    Multiset<String> features = getFeatures(text);
    return encode(features, category);
}
// Finally, the ```encode``` method, which modifies internal state (this::getFeatureVector returns a vector stored inside the instance). It is basically an incremental save method if we pass a category, and an actual encoder if we pass an empty one. Pretty cool usage of Optional: load the persisted vector if ```Optional category``` is not empty, and use a fresh one if it is empty.
private SparseVector encode(Multiset<String> tokens, Optional<String> category) {
    SparseVector vector = category.map(this::getFeatureVector).orElse(SparseVector.create());
    for (Multiset.Entry<String> entry: tokens.entrySet()) {
        int wordId = getWordId(entry.getElement());
        double weight = entry.getCount() * 1.0 / tokens.size();
        vector.add(wordId, weight);
    }
    return vector;
}
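// One possible way to untangle the two roles, as a sketch only: a side-effect-free ```encode```
// plus a separate method that accumulates into the store. EncoderSketch and its names are
// hypothetical; plain maps stand in for SparseVector and for Guava's Multiset (token -> count).

```java
import java.util.HashMap;
import java.util.Map;

class EncoderSketch {
    private final Map<String, Integer> wordIds = new HashMap<>();
    private final Map<String, Map<Integer, Double>> featureStore = new HashMap<>();

    /** Builds a fresh vector of term frequencies; never touches featureStore. */
    Map<Integer, Double> encode(Map<String, Integer> tokenCounts) {
        int total = 0;
        for (int count : tokenCounts.values()) {
            total += count;
        }
        Map<Integer, Double> vector = new HashMap<>();
        for (Map.Entry<String, Integer> e : tokenCounts.entrySet()) {
            vector.merge(wordId(e.getKey()), e.getValue() * 1.0 / total, Double::sum);
        }
        return vector;
    }

    /** The only method that mutates featureStore. */
    void accumulate(String category, Map<String, Integer> tokenCounts) {
        Map<Integer, Double> stored = featureStore.computeIfAbsent(category, k -> new HashMap<>());
        encode(tokenCounts).forEach((id, weight) -> stored.merge(id, weight, Double::sum));
    }

    /** Read-only view helper for the example. */
    Map<Integer, Double> stored(String category) {
        return featureStore.getOrDefault(category, new HashMap<>());
    }

    // Note: the word dictionary still grows on first sight of a word.
    private int wordId(String word) {
        Integer id = wordIds.get(word);
        if (id == null) {
            id = wordIds.size();
            wordIds.put(word, id);
        }
        return id;
    }
}
```

// With this split, ```fit``` would call accumulate and ```predict``` would call encode, and the
// Optional parameter is no longer needed.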

// Inefficient: putIfAbsent instead of computeIfAbsent for a non-constant value (a new SparseVector is allocated on every call, even when the key already exists)
// The result of putIfAbsent is ignored, so a second lookup is needed
private SparseVector getFeatureVector(String category) {
    featureStore.putIfAbsent(category, new SparseVector());
    return featureStore.get(category);
}
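// A small sketch of the difference. FeatureStoreSketch, Vec and the allocation counter are
// hypothetical and exist only to make the extra allocations visible.

```java
import java.util.HashMap;
import java.util.Map;

class FeatureStoreSketch {
    // Counts constructor calls so the two variants can be compared.
    static int allocations = 0;

    static class Vec {
        Vec() {
            allocations++;
        }
    }

    private final Map<String, Vec> featureStore = new HashMap<>();

    /** The reviewed pattern: allocates on every call, then looks the key up again. */
    Vec viaPutIfAbsent(String category) {
        featureStore.putIfAbsent(category, new Vec());
        return featureStore.get(category);
    }

    /** Allocates only when the key is absent and returns the stored value directly. */
    Vec viaComputeIfAbsent(String category) {
        return featureStore.computeIfAbsent(category, k -> new Vec());
    }
}
```

// Calling viaComputeIfAbsent three times for the same category allocates one Vec;
// viaPutIfAbsent allocates three and throws two of them away.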

// Inefficient: weights.containsKey followed by get does two lookups per word
if(weights.containsKey(wordOtherId)) {
    product += weights.get(wordOtherId) * word.getValue();
}
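// The same loop with one lookup per word, as a sketch. SingleLookupDot is a hypothetical name,
// plain maps stand in for the vectors, and I assume null is never stored as a weight.

```java
import java.util.Map;

class SingleLookupDot {
    static double dot(Map<Integer, Double> weights, Map<Integer, Double> other) {
        double product = 0;
        for (Map.Entry<Integer, Double> word : other.entrySet()) {
            // One lookup; null means the word is absent from `weights`.
            Double mine = weights.get(word.getKey());
            if (mine != null) {
                product += mine * word.getValue();
            }
        }
        return product;
    }
}
```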

// Here I'm more or less ready to forgive a ```get``` that modifies state, because it’s encapsulated and easy to track
// ConcurrentHashMap mixed with non-thread-safe code: more or less fine
// Where are putIfAbsent and computeIfAbsent here, where we need them most?
public class WordSet {
    private final Map<String, Integer> dict = new ConcurrentHashMap<>();
    /**
     * Returns the word's unique id if it was already seen in the text. Otherwise, assigns a new id.
     */
    public int getWordId(String s) {
        if (!dict.containsKey(s)) {
            // This will work, but can anybody get it at first sight? Such id generation should be done in a separate method with a description of why it’s legitimate at all
            dict.put(s, dict.size());
        }
        return dict.get(s);
    }
    public int size() {
        return dict.size();
    }
}
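// A sketch of what I would expect here instead. WordSetSketch and the AtomicInteger counter
// are my assumptions; the point is that ConcurrentHashMap.computeIfAbsent applies the function
// at most once per key, so containsKey + put + get collapses into one atomic call.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class WordSetSketch {
    private final Map<String, Integer> dict = new ConcurrentHashMap<>();
    // Explicit id source instead of relying on dict.size() at insertion time.
    private final AtomicInteger nextId = new AtomicInteger();

    /** Returns the id of a known word, or assigns the next free id to a new one. */
    int getWordId(String s) {
        return dict.computeIfAbsent(s, k -> nextId.getAndIncrement());
    }

    int size() {
        return dict.size();
    }
}
```

// Ids are still dense (0, 1, 2, ...), and the generation rule now lives in one place.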