Question 1:

Input: String text block, List[String] blacklist words
Output: List[(detected word, count)]
Optimize

How long is each block? How long is each potential word? How similar are the words? Do we need to normalize upfront (e.g. stemming, capitalization)?

Let's take a first pass with some simple answers to understand the problem:

- Assume no normalization
- Assume words are dissimilar
-- If words were similar to one another, like 'aardvark', 'aardvarking', 'aardvarker', we could share memory between them--e.g. using a List[Char]
- Assume strings come from a very large set of words
-- If the set of words were small, we might benefit from reusing instances of words; the JVM, for example, caches boxed Integer values from -128 to 127. If the strings came from a known dictionary, we could share that underlying storage between instances (like BytesRef in Lucene). See the sketch after this list.
- Assume the text block fits into memory.
- For the first pass, prioritize simplicity over memory & compute usage.
- Assume the words are of arbitrary length
-- Pushes us to using hash-based maps & sets instead of tries: tries are great for lookup bounded by the length of the string, but unbounded length & a diverse dictionary can significantly increase memory usage
- Assume that the number of occurrences fits in an Int
-- Otherwise, we could use checked operations to identify overflows

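A minimal sketch of that instance-reuse idea, assuming a small, known dictionary (the dictionary contents and the canonicalize helper are hypothetical, not part of the question):

// Map every token to a single canonical String instance so repeated words share
// one object instead of allocating a new substring per occurrence -- analogous to
// the JVM's boxed Integer cache. (String.intern() is another option, with its own costs.)
val dictionary: Set[String] = Set("the", "quick", "brown", "fox")
val canonical: Map[String, String] = dictionary.map(word => word -> word).toMap

def canonicalize(token: String): String =
  canonical.getOrElse(token, token) // unknown words keep their freshly allocated instance
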
import scala.collection.immutable

def counts(
  block: String,
  blacklistedWords: immutable.Set[String]
): immutable.Map[String, Int] =
  block.split(" ")
    .filterNot(word => blacklistedWords.contains(word))
    .groupBy(identity)
    .map { case (word, occurrences) => word -> occurrences.length }
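
A quick sanity check on hypothetical input (not part of the original question) shows the shape of the result:

counts("the cat sat on the mat", immutable.Set("on", "the"))
// Map("cat" -> 1, "sat" -> 1, "mat" -> 1)
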
Let's imagine we profile this and see...

We notice that the blocks are long and the words are short, so we're creating a ton of new objects. This isn't normally a problem for the JVM: allocations are just a pointer bump, and the new generation is cheap to collect; the assumption is that most objects die young. (On older JVMs these substrings were also just views into the underlying text block; since Java 7u6, substring and split copy their characters, so the allocated bytes scale with the block size.) But perhaps the string is so large, and our new generation so small, that we're tenuring large *lists* of substrings and triggering expensive collections of the tenured generation.
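
One way to test that hypothesis is to watch the collector directly; the flags below are standard HotSpot options, and the jar name is just a placeholder:

# JDK 8: log collection details and how quickly objects are tenured
java -XX:+PrintGCDetails -XX:+PrintTenuringDistribution -jar wordcount.jar
# JDK 9+: the unified-logging equivalent
java -Xlog:gc*,gc+age=trace:file=gc.log -jar wordcount.jar
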
We're materializing all of the lists into memory and only counting them at the end. The code was easy to write--and I'd start there until it was proven performance-critical--but we really don't care about holding the whole list of occurrences for each word; we just care about each list's length. So let's try a new approach that doesn't hold the lists of occurrences in memory:

import scala.collection.{immutable, mutable}

def counts(block: String, blacklistedWords: immutable.Set[String]): immutable.Map[String, Int] = {
  val words: Seq[String] = block.split(" ").toSeq
  // Remove any blacklisted words; don't bother counting them
  val countableWords = words.filterNot(word => blacklistedWords.contains(word))
  // For very long lists, we could parallelize this operation if we handled
  // cross-thread access well; since this is a first pass, keep it simple.
  val countableWordToCount = mutable.Map.empty[String, Int].withDefaultValue(0)
  countableWords.foreach { countableWord =>
    countableWordToCount(countableWord) = countableWordToCount(countableWord) + 1
  }
  // Make it immutable, so the caller is assured they can safely store it without copying
  countableWordToCount.toMap
}

Let's say the performance is still unacceptable, and we use the profiler to identify a cause:

* Let's say we noticed blacklistedWords.contains was slow.
** Perhaps there were many hash collisions among the blacklisted words, so lookups hit the worst-case O(n) behavior instead of the average O(1) case. In that case, we might adjust the hash function to distribute keys across buckets more evenly. (Crosby and Wallach describe an algorithmic complexity attack against IDS software that exploits exactly this worst-case behavior; Rust's standard HashMap randomizes its hash keys at runtime to resist such attacks.) The same applies to the countableWordToCount lookup.
* Perhaps converting to an immutable Map takes a while, because it's copying (as the original reference is still mutable). We *could* pass the mutable map back to the client, but we'd need to understand the client's behavior (if it just copies the result into its own immutable map, we've only moved the work around) and make our guarantees clear.
* Perhaps the number of countable words is very large, so our loop takes a while to run. We could split the words into smaller groups and parallelize the operation. If the string is enormous, we could split it, too: for example, split it into k substrings (each ending at a tokenization boundary), run the function on each in parallel, and sum the results. In Scala, the `par` feature on collections can help; see the sketch after this list.

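To make that last bullet concrete, here is a rough sketch of the split-and-merge idea. The countsParallel name and chunking scheme are mine, it reuses the counts function above, and it assumes Scala 2.12, where parallel collections ship with the standard library:

import scala.collection.immutable

// Split the block at word boundaries, count each chunk in parallel with the
// counts function defined above, then merge the per-chunk maps by summing counts.
def countsParallel(
  block: String,
  blacklistedWords: immutable.Set[String],
  chunks: Int // assumed > 0
): immutable.Map[String, Int] = {
  val words = block.split(" ")
  val chunkSize = math.max(1, words.length / chunks)
  words
    .grouped(chunkSize) // each chunk ends at a tokenization boundary
    .toSeq
    .par                // one task per chunk; no shared mutable state
    .map(chunk => counts(chunk.mkString(" "), blacklistedWords))
    .fold(immutable.Map.empty[String, Int]) { (left, right) =>
      right.foldLeft(left) { case (acc, (word, count)) =>
        acc.updated(word, acc.getOrElse(word, 0) + count)
      }
    }
}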