Untitled

a guest
Oct 22nd, 2017
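#
# Usage: python3 clean_chinese_corpus.py <corpus-to-clean> > <output>
#

import argparse
import re


# Parse args
parser = argparse.ArgumentParser(description='Clean Chinese corpus')
parser.add_argument('filename', metavar='F', type=str, nargs=1,
                    help='file to be cleaned')

args = parser.parse_args()


# Keep only CJK Unified Ideographs (U+4E00-U+9FFF) and whitespace.
# Inside a character class '|' is a literal pipe, not alternation,
# so it is dropped from the pattern.
CHINESE_OR_SPACE = re.compile(r'[\u4e00-\u9fff\s]+')

# Clean corpus (the input file is assumed to be UTF-8 encoded)
with open(args.filename[0], encoding='utf-8') as f:
    for line in f:
        # Strip the trailing newline first so that each input line
        # produces exactly one output line instead of two.
        kept = CHINESE_OR_SPACE.findall(line.rstrip('\n'))
        print(''.join(kept))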
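
The filtering step can also be tried on a single string without a corpus file. The following is only a minimal sketch: the names KEEP and clean_line are hypothetical and not part of the script, and the pattern assumes that "Chinese" means the CJK Unified Ideographs block (U+4E00 to U+9FFF) plus whitespace, with no literal pipe in the character class.

```python
import re

# Assumed pattern: CJK Unified Ideographs (U+4E00-U+9FFF) plus whitespace.
KEEP = re.compile(r'[\u4e00-\u9fff\s]+')


def clean_line(line):
    """Return only the Chinese characters and whitespace found in line."""
    # findall returns each matching run; joining them drops everything else.
    return ''.join(KEEP.findall(line))


if __name__ == '__main__':
    # Latin letters and punctuation are removed; spaces between runs remain.
    print(repr(clean_line('Hello 你好, world 世界!')))
```

Because whitespace is kept, spaces that surrounded removed words survive in the output. Collapsing them would need an extra pass, which this sketch leaves out.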