vikramk3

tags.py

Oct 15th, 2014
#!/usr/bin/env python
# -*- coding: utf-8 -*-
"""
Created on Wed Sep 24 21:56:27 2014

@author: vikramk3
"""

import xml.etree.ElementTree as ET
import pprint
import re
"""
Your task is to explore the data a bit more.
Before you process the data and add it to MongoDB, you should check the "k"
value of each "<tag>" to see whether it can be a valid key in MongoDB, and
whether there are any other potential problems.

We have provided you with 3 regular expressions to check for certain patterns
in the tags. As we saw in the quiz earlier, we would like to change the data model
and expand keys of the "addr:street" type into a dictionary like this:
{"address": {"street": "Some value"}}
So we have to see whether we have such tags, and whether any tags contain
problematic characters.
Please complete the function 'key_type'.
"""


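# Keys made up only of lowercase letters and underscores, e.g. "highway".
lower = re.compile(r'^([a-z]|_)*$')
# Keys with exactly one colon between two such parts, e.g. "addr:street".
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')
# Keys containing at least one character that is problematic in a MongoDB key.
problemchars = re.compile(r'[=\+/&<>;\'"\?%#$@\,\. \t\r\n]')

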
def key_type(element, keys):
    """Classify the "k" value of a <tag> element and update the counts in keys."""
    if element.tag == "tag":
        k_value = element.get("k")
        if k_value is None:
            # A <tag> without a "k" attribute has no key to classify.
            return keys
        # Check the patterns in order and count the first one that matches.
        if lower.search(k_value):
            keys["lower"] += 1
        elif lower_colon.search(k_value):
            keys["lower_colon"] += 1
        elif problemchars.search(k_value):
            keys["problemchars"] += 1
        else:
            keys["other"] += 1

    return keys


def process_map(filename):
    """Count the key types of all <tag> elements in the OSM file."""
    keys = {"lower": 0, "lower_colon": 0, "problemchars": 0, "other": 0}
    for _, element in ET.iterparse(filename):
        keys = key_type(element, keys)
        # Clear each element once it has been processed to avoid memory
        # problems with the iterative parsing.
        element.clear()

    return keys


def test():
    keys = process_map('C:\\Users\\vikramk3\\Documents\\Courses\\Data_Wrangling\\austin_texas.osm')
    pprint.pprint(keys)


if __name__ == "__main__":
    test()
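
The task description mentions expanding keys of the "addr:street" type into a nested dictionary, but the script only counts them. Below is a minimal sketch of what that expansion might look like. It assumes the tags of one element have already been collected into a flat dict of k/v pairs. The names `expand_tags` and `PREFIX_NAMES` are hypothetical and not part of the original script; only the "addr" to "address" mapping comes from the task description.

```python
import re

# Same pattern as lower_colon in the script above.
lower_colon = re.compile(r'^([a-z]|_)*:([a-z]|_)*$')

# Hypothetical mapping from key prefixes to nested field names.
# Only "addr" -> "address" is taken from the task description.
PREFIX_NAMES = {"addr": "address"}


def expand_tags(tags):
    """Nest 'prefix:suffix' keys whose prefix is listed in PREFIX_NAMES.

    Assumes no tag is itself named like a nested field (e.g. "address").
    """
    doc = {}
    for key, value in tags.items():
        prefix, _, suffix = key.partition(":")
        if lower_colon.match(key) and prefix in PREFIX_NAMES:
            # Group all suffixes of the same prefix under one sub-dictionary.
            doc.setdefault(PREFIX_NAMES[prefix], {})[suffix] = value
        else:
            # Every other key is copied through unchanged.
            doc[key] = value
    return doc


if __name__ == "__main__":
    print(expand_tags({"addr:street": "Some value", "amenity": "cafe"}))
```

Keys with more than one colon, such as "addr:street:name", do not match `lower_colon` and are left flat in this sketch; whether to nest them further is a separate design decision.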