Untitled

a guest Dec 29th, 2011 107 Never
#!/usr/bin/python
# Prune from the database dump all pages whose titles start with the "Wikipedia:" namespace.
# Usage: pv enwiki-latest-pages-articles.xml.bz2 | bunzip2 | ../prune-wiki | bzip2 > out.xml.bz2
# Note: .bz2 files made with pbzip2 don't seem to work with WikiTaxi.
# File must end in .xml.bz2

# Get the dump:
# wget http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2


import sys, re

# TODO: try cutting out <ref> to </ref>?

# Cutting out File:|MediaWiki:|Help: saves only 2%
# Test various cuts before publishing.
# Note what's cut in the readme.txt
# List of articles in namespace:
# http://en.wikipedia.org/w/index.php?title=Special%3ASearch&redirs=1&search=a&fulltext=Search&ns8=1&title=Special%3ASearch&advanced=1&fulltext=Advanced+search

# Help: help on editing Wikipedia (how to start a page)
# MediaWiki: various website scripts?
# File: information about files

# Stats (size after each cut):
# Original: 6.8 GB
# Wikipedia:|MediaWiki:|Help:|File: 5.5 GB
# Wikipedia:|Help:|MediaWiki: 5.6 GB
# Wikipedia: 5.6 GB
bad_titles = 'Wikipedia:'                         # <-- Best default cut
#bad_titles += '|Help:|MediaWiki:'
#bad_titles += '|File:'
sys.stderr.write('Cutting ' + bad_titles + '\n')  #; exit(1)

output = True
for line in sys.stdin:
    # A <title> line starts a new page: decide whether to keep what follows.
    if re.match('    <title>', line):
        title = re.split('<title>', line)[1]
        # re.match anchors at the start, so only the namespace prefix is tested.
        output = not re.match(bad_titles, title)
    if output:
        # line already ends with a newline; "print line" would add a second one.
        sys.stdout.write(line)
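
The loop above switches output on and off at each <title> line, so the <page> tag that precedes a cut title has already been written by the time the title is seen. If the dump puts <page> on the line before <title> (an assumption about the dump layout, not something the script checks), a cut page at the very end of the input would leave a dangling <page> and drop the closing lines of the file. A minimal sketch of a page-buffered variant follows; prune_pages and its parameter names are hypothetical, and it assumes each <page>, <title> and </page> tag sits on its own line.

```python
import re

def prune_pages(lines, bad_titles):
    """Yield dump lines, dropping whole <page>...</page> blocks with a bad title.

    Hypothetical helper: assumes <page>, <title> and </page> each sit on
    their own line, which the original script does not verify either.
    """
    bad = re.compile(bad_titles)
    page = None   # buffered lines of the page being read, or None outside a page
    keep = True
    for line in lines:
        stripped = line.strip()
        if stripped == '<page>':
            # Start buffering a new page; keep it unless its title says otherwise.
            page, keep = [line], True
        elif page is None:
            # Header or footer lines outside any page pass straight through.
            yield line
        else:
            page.append(line)
            if stripped.startswith('<title>'):
                # Same prefix test as the script: match anchors at the title start.
                keep = not bad.match(stripped[len('<title>'):])
            elif stripped == '</page>':
                if keep:
                    for buffered in page:
                        yield buffered
                page = None

# Tiny made-up sample, not taken from a real dump.
sample = [
    '<mediawiki>\n',
    '  <page>\n',
    '    <title>Apple</title>\n',
    '    <text>fruit</text>\n',
    '  </page>\n',
    '  <page>\n',
    '    <title>Wikipedia:About</title>\n',
    '    <text>meta</text>\n',
    '  </page>\n',
    '</mediawiki>\n',
]
result = list(prune_pages(sample, 'Wikipedia:'))
```

In the pipeline from the usage line, this could be wired up as sys.stdout.writelines(prune_pages(sys.stdin, bad_titles)). The trade-off is that one page at a time is held in memory before it is written.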