- https://davidxmoody.com/2015/word-frequency-analysis-with-command-line-tools/
- #The first awk statement prints the previous word and the current word on the same line (skipping the very first word, which has no predecessor).
- #The second statement stores the current word as the previous word for use on the next line. I'm sure it could be prettier, but it works well.
- tr -sc "[A-Z][a-z][א-ת][0-9]'" '[\012*]' < "$IN_FILE" | \
- tr '[A-Z]' '[a-z]' | \
- awk -- 'prev!="" { print prev,$0; } { prev=$0; }' | \
- sort | uniq -c | sort -nr | \
- head -n200
- #This next script prints trigrams instead of bigrams using the same method. It could also be done with a for loop for n-grams of any size.
- tr -sc "[A-Z][a-z][א-ת][0-9]'" '[\012*]' < "$IN_FILE" | \
- tr '[A-Z]' '[a-z]' | \
- awk -- 'first!=""&&second!="" { print first,second,$0; } { first=second; second=$0; }' | \
- sort | uniq -c | sort -nr | \
- head -n200
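The comment above mentions that a for loop could handle n-grams of any size. A minimal sketch of that idea follows. It is not from the original paste: the function name `ngrams`, its arguments, and the circular-buffer approach are assumptions made for illustration, and the word pattern here covers only ASCII letters, digits, and apostrophes (the Hebrew range is left out).

```shell
# Hypothetical helper (not part of the original paste).
# Usage: ngrams N FILE [LIMIT]
# Prints the LIMIT most frequent N-word sequences in FILE, most frequent first.
ngrams() {
    ng_n="$1"
    ng_file="$2"
    ng_limit="${3:-200}"

    # One lowercase word per line; every run of other characters becomes a newline.
    tr -sc "A-Za-z0-9'" '[\012*]' < "$ng_file" |
    tr 'A-Z' 'a-z' |
    awk -v n="$ng_n" '
        $0 == "" { next }                  # skip an empty first line, if any
        {
            count++
            buf[count % n] = $0            # circular buffer holding the last n words
            if (count >= n) {
                # The oldest buffered word sits at index (count + 1) % n.
                line = buf[(count + 1) % n]
                for (i = 2; i <= n; i++)
                    line = line " " buf[(count + i) % n]
                print line
            }
        }' |
    sort | uniq -c | sort -nr |
    head -n "$ng_limit"
}
```

Under these assumptions, `ngrams 2 "$IN_FILE"` should behave like the bigram pipeline and `ngrams 3 "$IN_FILE"` like the trigram one, apart from the narrower character set.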