Untitled

a guest
Oct 25th, 2014
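import pandas as pd
import re
import urllib2  # Python 2 module; on Python 3 use urllib.request.urlopen instead

# urlopen() returns a file-like response object, so read() it to get the text.
data = urllib2.urlopen('http://www.census.gov/acs/www/Downloads/data_documentation/pums/DataDict/PUMSDataDict13.txt').read()

## Replace newline characters with '|' so that one pattern can span what were
## line breaks ('.' does not match '\n' by default). A blank line (double
## newline) becomes '||', which a lookahead assertion could use as an
## end-of-entry marker.
data = data.replace('\n', '|')

## Capture the variable name, its width, and the description from the next line.
datadict = pd.DataFrame(
    re.findall(r"([A-Z]{2,8})\s{2,9}([0-9]{1})\s{2,6}\|\s{2,4}([A-Za-z\-() ]{3,85})", data, re.MULTILINE),
    columns=['variable', 'width', 'description'])
datadict.head(6)
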
+----+----------+-------+------------------------------------------------+
|    | variable | width | description                                    |
+----+----------+-------+------------------------------------------------+
| 0  | RT       | 1     | Record Type                                    |
+----+----------+-------+------------------------------------------------+
| 1  | SERIALNO | 7     | Housing unit                                   |
+----+----------+-------+------------------------------------------------+
| 2  | DIVISION | 1     | Division code                                  |
+----+----------+-------+------------------------------------------------+
| 3  | PUMA     | 5     | Public use microdata area code (PUMA) based on |
+----+----------+-------+------------------------------------------------+
| 4  | REGION   | 1     | Region code                                    |
+----+----------+-------+------------------------------------------------+
| 5  | ST       | 2     | State Code                                     |
+----+----------+-------+------------------------------------------------+

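The flatten-then-match idea used above can also be tried offline, without downloading anything. The following is a minimal sketch using only the standard library. `SAMPLE` is a hypothetical excerpt whose spacing (including the trailing spaces after each width) was chosen to satisfy the pattern and is not copied from the Census file, and `parse_entries` is an illustrative helper name, not part of the original paste.

```python
import re

# Hypothetical excerpt (an assumption, not the real Census file): a NAME/width
# line padded with trailing spaces, then an indented description line, with a
# blank line between entries.
SAMPLE = (
    "RT      1  \n"
    "    Record Type\n"
    "\n"
    "SERIALNO    7  \n"
    "    Housing unit/GQ person serial number\n"
    "\n"
    "DIVISION    1  \n"
    "    Division code\n"
)

# Name, width, then the description that followed the (former) line break.
ENTRY = re.compile(r"([A-Z]{2,8})\s{2,9}([0-9])\s{2,6}\|\s{2,4}([A-Za-z\-() ]{3,85})")


def parse_entries(text):
    """Flatten newlines to '|' and return (variable, width, description) tuples."""
    flat = text.replace("\n", "|")
    return ENTRY.findall(flat)


rows = parse_entries(SAMPLE)
for row in rows:
    print(row)
```

Because '/' is not in the description character class, the SERIALNO description is cut at "Housing unit", which agrees with the table above. The resulting list of tuples can be handed to `pd.DataFrame(rows, columns=['variable', 'width', 'description'])`.
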
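## Expanded pattern: also capture the first value code and its label for each variable.
datadict_exp = pd.DataFrame(
    re.findall(r"([A-Z]{2,9})\s{2,9}([0-9]{1})\s{2,6}\|\s{4}([A-Za-z\-();<> 0-9]{2,85})\|\s{11,15}([a-z0-9]{0,2})[ ]\.([A-Za-z/\-() ]{2,120})",
               data, re.MULTILINE),
    columns=['variable', 'width', 'description', 'value_1', 'label_1'])
datadict_exp.head(5)
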
+----+----------+-------+---------------------------------------------------+---------+--------------+
| id | variable | width | description                                       | value_1 | label_1      |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 0  | DIVISION | 1     | Division code                                     | 0       | Puerto Rico  |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 1  | REGION   | 1     | Region code                                       | 1       | Northeast    |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 2  | ST       | 2     | State Code                                        | 1       | Alabama/AL   |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 3  | NP       | 2     | Number of person records following this housin... | 0       | Vacant unit  |
+----+----------+-------+---------------------------------------------------+---------+--------------+
| 4  | TYPE     | 1     | Type of unit                                      | 1       | Housing unit |
+----+----------+-------+---------------------------------------------------+---------+--------------+
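
The expanded pattern can be exercised on a small inline sample as well. This sketch assumes value lines are indented by at least eleven spaces and written as `code .label`, which is what the pattern implies. `SAMPLE` is a hypothetical excerpt (not the real Census file) and `parse_first_values` is an illustrative name. The pattern has a single value/label group, so only the first pair under each variable is captured.

```python
import re

PAD = " " * 11  # indentation assumed for value lines (the pattern allows 11 to 15)

# Hypothetical excerpt (an assumption): each variable has a name/width line,
# a description line, and one or more "code .label" value lines.
SAMPLE = (
    "DIVISION    1  \n"
    "    Division code\n"
    + PAD + "0 .Puerto Rico\n"
    + PAD + "1 .New England (Northeast region)\n"
    "\n"
    "REGION      1  \n"
    "    Region code\n"
    + PAD + "1 .Northeast\n"
    + PAD + "2 .Midwest\n"
)

# Name, width, description, then the first value code and its label.
ENTRY_EXP = re.compile(
    r"([A-Z]{2,9})\s{2,9}([0-9])\s{2,6}"        # variable name and width
    r"\|\s{4}([A-Za-z\-();<> 0-9]{2,85})"       # description on the next line
    r"\|\s{11,15}([a-z0-9]{0,2})[ ]\."          # first value code, then ' .'
    r"([A-Za-z/\-() ]{2,120})"                  # label of that first value
)


def parse_first_values(text):
    """Return (variable, width, description, value_1, label_1) tuples."""
    return ENTRY_EXP.findall(text.replace("\n", "|"))


rows = parse_first_values(SAMPLE)
for row in rows:
    print(row)
```

Later value lines, such as the second DIVISION code in the sample, are skipped. Collecting every pair would need a second pass over each entry rather than a longer single pattern.
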