Norod78

bigrams and trigrams using bash

Feb 20th, 2020
# Source: https://davidxmoody.com/2015/word-frequency-analysis-with-command-line-tools/

# Bigrams. The first awk rule prints the previous word and the current word
# on the same line (skipping the very first word). The second rule stores the
# current word as the previous word for use on the next line.
# I'm sure it could be prettier, but it works well.
# Note: GNU tr works on single bytes, so the Hebrew range א-ת may not behave
# as a true character range in a UTF-8 locale.

tr -sc "[A-Z][a-z][א-ת][0-9]'" '[\012*]' < "$IN_FILE" | \
  tr '[A-Z]' '[a-z]' | \
  awk -- 'prev!="" { print prev,$0; } { prev=$0; }' | \
  sort | uniq -c | sort -nr | \
  head -n200

# Trigrams. This next script prints trigrams instead of bigrams using the same
# kind of method. This could also be done with a for loop for n-grams of any size.

tr -sc "[A-Z][a-z][א-ת][0-9]'" '[\012*]' < "$IN_FILE" | \
  tr '[A-Z]' '[a-z]' | \
  awk -- 'first!=""&&second!="" { print first,second,$0; } { first=second; second=$0; }' | \
  sort | uniq -c | sort -nr | \
  head -n200
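
The trigram comment mentions that a for loop could handle n-grams of any size. A minimal sketch of that idea follows. The function name `ngrams` and its sliding-window approach are assumptions for illustration and are not part of the original paste. It expects one word per line on stdin, as produced by the two `tr` steps, and takes the n-gram size (1 or greater) as its only argument.

```shell
# Hypothetical helper (not from the original paste): print every run of
# n consecutive words. Reads one word per line from stdin.
ngrams() {
  awk -v n="$1" '
    BEGIN { n += 0 }                 # force n to be treated as a number
    $0 == "" { next }                # skip the empty line tr can emit first
    {
      count++
      window[count % n] = $0         # circular buffer holding the last n words
      if (count >= n) {
        line = window[(count + 1) % n]             # oldest word in the window
        for (i = 2; i <= n; i++)
          line = line " " window[(count + i) % n]  # then the newer ones in order
        print line
      }
    }'
}

# Example use in place of the awk step (4-grams):
#   ... | tr "[A-Z]" "[a-z]" | ngrams 4 | sort | uniq -c | sort -nr | head -n200
```

With `ngrams 2` this should print the same pairs as the bigram script, and with `ngrams 3` the same triples as the trigram script.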