Untitled

a guest
Dec 29th, 2011
#!/usr/bin/python
# Prune the database dump of all articles whose titles start with the
# Wikipedia: namespace prefix.
# Usage: pv enwiki-latest-pages-articles.xml.bz2 | bunzip2 | ../prune-wiki | bzip2 > out.xml.bz2
# Files compressed with pbzip2 don't seem to work with WikiTaxi.
# File must end in .xml.bz2

# Get the dump:
# wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2


import sys, re

# TODO: try cutting out <ref> to </ref>?

# Cutting out File:|MediaWiki:|Help: saves only 2%.
# Test various cuts before publishing.
# Note what's cut in the readme.txt.
# List of articles in namespace:
# http://en.wikipedia.org/w/index.php?title=Special%3ASearch&redirs=1&search=a&fulltext=Search&ns8=1&title=Special%3ASearch&advanced=1&fulltext=Advanced+search

# Help: help on editing Wikipedia (how to start a page)
# MediaWiki: various website scripts?
# File: information about files

# Stats:
# Original: 6.8 GB
# Wikipedia:|MediaWiki:|Help:|File: 5.5 GB
# Wikipedia:|Help:|MediaWiki: 5.6 GB
# Wikipedia: 5.6 GB
bad_titles = 'Wikipedia:'             # <-- Best default cut
#bad_titles += '|Help:|MediaWiki:'
#bad_titles += '|File:'
sys.stderr.write('Cutting ' + bad_titles + '\n')  # ; exit(1)

output = True
for line in sys.stdin:
    # Each <title> line decides whether output stays on until the next title.
    if line.startswith('    <title>'):
        title = line.split('<title>', 1)[1]
        # re.match anchors at the start, so only the title prefix is tested.
        if re.match(bad_titles, title):
            output = False
        else:
            output = True
    if output:
        # line already ends with a newline; print would add a second one.
        sys.stdout.write(line)
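
The loop above switches output on and off at each <title> line, so the <page> line that precedes a cut title has already been written, and everything after a cut final page (including the closing tag of the dump) is dropped. A minimal sketch of a variant that buffers each page and drops it as a whole is shown below. It assumes Python 3 and that <page>, <title> and </page> each start on their own line, as in the dump this script targets; the function name prune_pages is hypothetical and not part of the original script.

```python
import re


def prune_pages(lines, bad_titles="Wikipedia:"):
    """Yield dump lines, skipping every <page> block whose title matches.

    `lines` is any iterable of text lines (for example sys.stdin).
    `bad_titles` is a regular expression matched at the start of the title.
    """
    pattern = re.compile(bad_titles)
    page = None   # buffered lines of the current page, or None outside a page
    keep = True   # whether the buffered page should be emitted
    for line in lines:
        stripped = line.strip()
        if stripped == "<page>":
            # Start buffering a new page; keep it unless its title says otherwise.
            page, keep = [line], True
        elif page is not None:
            page.append(line)
            if stripped.startswith("<title>"):
                title = stripped[len("<title>"):]
                keep = pattern.match(title) is None
            if stripped == "</page>":
                # Page complete: emit it in one go, or drop it entirely.
                if keep:
                    for buffered in page:
                        yield buffered
                page = None
        else:
            # Lines outside any page (header, siteinfo, closing tag) pass through.
            yield line
```

Used as a filter in the same pipeline, this would be driven by something like sys.stdout.writelines(prune_pages(sys.stdin)). Buffering a whole page costs memory proportional to the largest page, which is the trade-off for never emitting half of one.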