Untitled

By: a guest on Jul 20th, 2012
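Identifying the top recurring words in a set of e-mails, based on a dictionary of interesting words

# One job per e-mail: convert it to text and grep for the words; --tag prefixes each hit with the file name.
find . -type f | parallel --tag 'eml-to-text {} | grep -o -n -b -f /tmp/list_of_interesting_words'

# Alternative: unpack every e-mail to a text file first ({#} is the job number), then let each grep handle many files (-X).
find . -type f | parallel 'eml-to-text {} >/tmp/unpacked/{#}'
find /tmp/unpacked -type f | parallel -X grep -H -o -n -b -f /tmp/list_of_interesting_words
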
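# Split the word list into blocks of about 10 kB; --files stores each block in a temporary file and prints its name.
cat /tmp/list_of_interesting_words | parallel --pipe --block 10k --files > /tmp/blocks_of_words

# For each block of words in turn (-j1), grep the unpacked files in parallel.
find /tmp/unpacked -type f | parallel -j1 -I ,, parallel --arg-file-sep // -X grep -H -o -n -b -f ,, {} // - :::: /tmp/blocks_of_words
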
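The block-splitting step can be approximated in Python as follows. This is only a sketch of the idea (whole lines grouped into blocks of roughly a given byte size), not GNU parallel's actual algorithm; the function name split_into_blocks and the block_size parameter are made up for illustration.

```python
def split_into_blocks(words, block_size=10_000):
    """Group words into lists whose newline-terminated size stays near block_size bytes.

    Hypothetical helper: a block is closed as soon as adding the next word
    would exceed block_size, so a single oversized word still gets its own block.
    """
    blocks, current, current_bytes = [], [], 0
    for word in words:
        word_bytes = len(word.encode("utf-8")) + 1  # +1 for the newline
        if current and current_bytes + word_bytes > block_size:
            blocks.append(current)
            current, current_bytes = [], 0
        current.append(word)
        current_bytes += word_bytes
    if current:
        blocks.append(current)
    return blocks


if __name__ == "__main__":
    demo = ["alpha", "beta", "gamma", "delta"]
    for block in split_into_blocks(demo, block_size=12):
        print(block)
```

Each resulting block could then be written to its own file and passed to grep -f, which is what the nested parallel call does with the files listed in /tmp/blocks_of_words.
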
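# Sort the grep -H output by word (field 4 of file:line:byte:word).
... | sort -k4 -t: > index.by.word

# Same, but also count how often each word occurs.
... | sort -k4 -t: | tee index.by.word | awk -F: '{print $4}' | uniq -c

# Everything in one pipeline. With --tag the file name is separated by a tab, so the word is field 3.
find . -type f | parallel --tag 'eml-to-text {} | grep -F -w -o -n -b -f /tmp/list_of_interesting_words' | sort -k3 -t: | tee index.by.word | awk -F: '{print $3}' | uniq -c
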
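The uniq -c stage only counts; to get the top recurring words, the counts still have to be ranked. A minimal Python sketch of that last step, assuming index lines shaped like file:line:byte:word (as printed by grep -H -o -n -b) and file names without colons. The function name top_words and the sample lines are hypothetical.

```python
from collections import Counter


def top_words(index_lines, n=10):
    """Return the n most frequent words from file:line:byte:word index lines."""
    counts = Counter()
    for line in index_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split at most 3 times so a colon inside the matched text stays in the word.
        parts = line.split(":", 3)
        if len(parts) == 4:
            counts[parts[3]] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    sample = [
        "1:3:120:invoice",
        "1:9:481:urgent",
        "2:1:17:invoice",
    ]
    for word, count in top_words(sample):
        print(count, word)
```

In a shell-only pipeline, a similar ranking could be had by appending sort -rn to the uniq -c output.
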
results <- empty list
for each email e:
    for each word w in e:
        if is_interesting_word(w, string_data_structure):
            add (filename, line_number, start_position, word) to results

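The pseudocode above could be fleshed out roughly as below. This is a sketch under assumptions: a plain Python set stands in for string_data_structure, the e-mails are already converted to text, and a word is any run of \w characters. find_interesting_words and its parameters are illustrative names, not part of the original.

```python
import re

WORD_RE = re.compile(r"\w+")


def find_interesting_words(texts, interesting_words):
    """Yield (filename, line_number, start_position, word) for every hit.

    texts maps a file name to the already extracted text of that e-mail.
    start_position is the 0-based character offset within the line
    (grep -b reports a byte offset from the start of the file instead).
    """
    for filename, text in texts.items():
        for line_number, line in enumerate(text.splitlines(), start=1):
            for match in WORD_RE.finditer(line):
                word = match.group()
                # The set lookup plays the role of is_interesting_word.
                if word in interesting_words:
                    yield (filename, line_number, match.start(), word)


if __name__ == "__main__":
    emails = {"mail1.txt": "Please pay the invoice\nurgent: invoice overdue"}
    for hit in find_interesting_words(emails, {"invoice", "urgent"}):
        print(hit)
```

A set gives constant-time membership tests on average; for very large dictionaries a trie or similar structure might be preferable, but that is beyond this sketch.
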
# Python 3: report every interesting word in a file by line and word position.
interesting_words = {'a', 'bunch', 'of', 'interesting', 'words'}  # a set, so membership tests are fast

with open("file") as f:
    for linepos, line in enumerate(f, start=1):
        for wordpos, word in enumerate(line.split(), start=1):
            if word in interesting_words:
                print("%s found at line %s, word %s" % (word, linepos, wordpos))