Bigrams and trigrams using bash

https://davidxmoody.com/2015/word-frequency-analysis-with-command-line-tools/

# The first awk statement prints the previous word and the current word on the same line (skipping the very first word, which has no predecessor).
# The second statement stores the current word as the previous word for use on the next line. It could be prettier, but it works well.

tr -sc "[A-Z][a-z][א-ת][0-9]'" '[\012*]' < "$IN_FILE" | \
  tr '[A-Z]' '[a-z]' | \
  awk -- 'prev!="" { print prev,$0; } { prev=$0; }' | \
  sort | uniq -c | sort -nr | \
  head -n200
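
# To see what the awk step does on its own, here is a minimal sketch that feeds it four made-up words, one per line, the same shape the tr commands produce. The sample words and the variable name `bigrams` are illustrative assumptions, not part of the original script.

```shell
# Feed four words, one per line, through the bigram awk program.
# The first word only seeds `prev`; every later word is printed
# together with the word that preceded it.
bigrams=$(printf 'the\ncat\nsat\ndown\n' |
  awk -- 'prev!="" { print prev,$0; } { prev=$0; }')

printf '%s\n' "$bigrams"
# prints:
# the cat
# cat sat
# sat down
```

# Four input words give three bigrams, which is why the counts further down the pipeline are always one less than the number of words.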

# This next script prints trigrams instead of bigrams using the same method. It could also be generalized with a for loop to n-grams of any size.

tr -sc "[A-Z][a-z][א-ת][0-9]'" '[\012*]' < "$IN_FILE" | \
  tr '[A-Z]' '[a-z]' | \
  awk -- 'first!=""&&second!="" { print first,second,$0; } { first=second; second=$0; }' | \
  sort | uniq -c | sort -nr | \
  head -n200
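
# One possible way to do the for-loop generalization mentioned above: a rough sketch, not taken from the linked article. The function names `ngrams` and `top_ngrams` and their arguments are hypothetical. The awk program keeps the last n words in a small circular buffer and joins them with a for loop.

```shell
# ngrams N: read one word per line on stdin and print every run of
# N consecutive words on a single line (N=2 gives bigrams, N=3 trigrams).
ngrams() {
  awk -v n="$1" '
    $0 == "" { next }            # skip the empty line tr emits when the input starts with a separator
    {
      buf[c % n] = $0            # overwrite the oldest slot with the current word
      c++
      if (c >= n) {              # enough words seen to form a full n-gram
        line = buf[c % n]        # the oldest of the last n words
        for (i = 1; i < n; i++)
          line = line " " buf[(c + i) % n]
        print line
      }
    }'
}

# top_ngrams FILE N: the same pipeline as above with the awk step swapped out.
top_ngrams() {
  tr -sc "[A-Z][a-z][א-ת][0-9]'" '[\012*]' < "$1" |
    tr '[A-Z]' '[a-z]' |
    ngrams "$2" |
    sort | uniq -c | sort -nr |
    head -n200
}
```

# Usage would then look like `top_ngrams "$IN_FILE" 4` for the 200 most frequent 4-grams.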