a guest
Jul 27th, 2017
nyt regex~

Torri Raines, 3:19 PM
to me
The problem: the re.sub on line 38 works just fine (output is the "cleaned up" file), but leaves the

"                Please verify you're not a robot by clicking the box.
               Invalid email address. Please re-enter.
               You must select a newsletter to subscribe to.

               Sign Up

               You agree to receive occasional updates and special offers for The New York Times's products and services.

       Thank you for subscribing.
       An error has occurred. Please try again later.
       You are already subscribed to this email.
       View all New York Times newsletters.


               See Sample
               Manage Email Preferences
       Not you?
       Privacy Policy
       Opt out or contact us anytime"

junk in the middle, which I want to get rid of. So I'm trying to do a more focused replace before it, on line 37, except it's just not doing anything. Using Ctrl+F in Notepad++ suggests my regex is fine, so I don't know why it won't replace. I've tried it with and without DOTALL, with and without having the regex string in its own variable, and with a couple of different beginnings for the regex string.

I also want to get rid of the stuff higher up in the article that starts with "<h2 class="visually-hidden" id="newsletter-promo-heading">Newsletter Sign Up</h2>", but I just tried that for the beginning of the regex string and it didn't work either. The messy file is before any subbing.
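The behavior described above is consistent with a common re.sub pitfall: the fourth positional parameter of re.sub is count, not flags, so writing re.sub(pattern, repl, string, re.DOTALL) quietly passes re.DOTALL (the integer 16) as count while the pattern is compiled without DOTALL. A minimal sketch (the HTML snippet here is made up, standing in for the multi-line junk block):

```python
import re

# Hypothetical stand-in for the messy article text; the real junk block
# spans multiple lines, which is exactly why DOTALL matters.
text = 'before <div class="control input-control">junk\nmore junk</div> after'
pattern = '<div class="control input-control">.*?</div>'

# re.sub(pattern, repl, string, count=0, flags=0): the 4th positional
# argument is count, so re.DOTALL (== 16) becomes count=16 and the pattern
# is compiled WITHOUT DOTALL -- '.' refuses to cross the newline, nothing
# matches, and the string comes back unchanged.
broken = re.sub(pattern, '', text, re.DOTALL)

# Passing flags by keyword lets '.*?' span the newline as intended.
fixed = re.sub(pattern, '', text, flags=re.DOTALL)
```

Here `broken` equals `text` unchanged while `fixed` is `'before  after'`; the same keyword fix should apply to the sub on line 37 (and to the `<h2 ...>Newsletter Sign Up</h2>` attempt, if it also spans newlines).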
import urllib
import re
from collections import defaultdict

def process_article(article_string, full_article):
    badchars = [';', ':', '!', '?', '\\', '/', '*', '"', '<', '>', '|']
    htmljunk = ['<p>', '</p>', '<b>', '</b>']
    if re.search('<meta property="article:tag" content="(.*?)" />', article_string):  # stray trailing quote removed from the pattern
        tags = re.findall('<meta property="article:tag" content="(.*?)" />', article_string)
        for tag in tags:
            yearsOfTags[year][tag] += 1
    else:
        print "no tags"
        #no_tags_counter += 1
        print article_string


    if re.search('<meta name="author" content="(.*?)" />', article_string, re.DOTALL):
        author = re.findall('<meta name="author" content="(.*?)" />', article_string, re.DOTALL)
        author = str(author[0])

    elif re.search('<meta name="byl" content="(.*?)" />', article_string):
        author = re.findall('<meta name="byl" content="(.*?)" />', article_string)
        author = str(author[0])

    else:
        author = "None_found"
        print article_string
    author = author.strip(' ')
    author = author.replace('By ', '')
    author = "~" + author.replace(' ', '_')


    for marker in htmljunk:
        full_article = full_article.replace(marker, '')
    junkstring = '<div class="control input-control">.*?<div id="#continues-post-newsletter"></div>'
    full_article = re.sub(junkstring, '', full_article, flags=re.DOTALL)  # 4th positional arg of re.sub is count, so flags must be passed by keyword
    full_article = re.sub('<.*?>', '', full_article)
    if re.search('<meta property="og:title" content="(.*?)" />', article_string):
        title = re.findall('<meta property="og:title" content="(.*?)" />', article_string)
        title = str(title[0])
    elif re.search('<title>(.*?)</title>', article_string):
        title = re.findall('<title>(.*?)</title>', article_string)
        title = str(title[0])
    else:
        title = "article"
        #print article_string
    for char in badchars:
        title = title.replace(char, '')
    if len(title) > 150:
        title = title.split()
        title = title[:5]
        title = ' '.join(title)
    title = str(year) + "_" + str(month) + "_" + str(page) + "_" + title + ".txt"
    title = outfolder + "/" + title

    writeFile = open(title, 'w')
    print >>writeFile, author, full_article
    writeFile.close()
    print "found article"


outfolder = "Raw Articles/Test"

no_tags_counter = 0
failed = 0

yearsOfTags = defaultdict(lambda: defaultdict(int))
tagsOutfile = "test tags.txt"
tagsOutfile = open(outfolder + "/" + tagsOutfile, 'w')

baseurl = 'https://query.nytimes.com/svc/add/v1/sitesearch.json?end_date=20170103&begin_date=20170101&sort=asc&page=0&fq=document_type%3A%22article%22&facet=true'

for year in range(2006,2007):
    #print year
    for month in range(1,13):
        month = str(month).zfill(2)     #convert month formatting
        #print month
        for x in range(0,1):
            start_date = x * 10 + 1
            start_date = str(start_date).zfill(2)
            #print "start: ", start_date
            end_date = (x + 1) * 10
            #print "end: ", end_date
            #end_date = str(end_date).zfill(2)
            for page in range(1,11):
                url = 'https://query.nytimes.com/svc/add/v1/sitesearch.json?end_date=' + str(year) + str(month) + str(end_date) + '&begin_date=' + str(year) + str(month) + str(start_date) + '&sort=asc&page=' + str(page) + '&fq=document_type%3A%22article%22&facet=true'

                sitestring = str(urllib.urlopen(url).read()).replace('\\', '')
                #print sitestring
                if re.search('"web_url":"(.*?)",', sitestring):
                    article_urls = re.findall('"web_url":"(.*?)",', sitestring, re.DOTALL)
                #do a search for '"web_url":"(.*?)",' and, if that returns something, a findall for the same thing

                    for article in article_urls:
                        #article_url = 'https://88h6obas83.execute-api.us-east-1.amazonaws.com/dev/get_article?id=' + article
                        #pass the results of the findall into a urlopen that appends the results to the end of https://88h6obas83.execute-api.us-east-1.amazonaws.com/dev/get_article?id=
                        article_string = str(urllib.urlopen(article).read()).replace('\\', '')
                        #print article_string
                        if re.search('<p class="story-body-text story-content" data.*?">(.*?)</p>', article_string, re.DOTALL):  # was '<\p>', an unknown escape that matched '<p>' rather than '</p>'
                            found_article = re.findall('<p class="story-body-text story-content" data.*?">(.*?)</p>', article_string, re.DOTALL)
                            full_article = ' '.join(found_article)
                            process_article(article_string, full_article)
                        elif re.search('<p itemprop="articleBody">(.*?)</p>', article_string, re.DOTALL):  # was '<\p>' here as well
                            found_article = re.findall('<p itemprop="articleBody">(.*?)</p>', article_string, re.DOTALL)
                            full_article = ' '.join(found_article)
                            process_article(article_string, full_article)
                        else:
                            print article_string
                            failed += 1
    yearlyOutfile = str(year) + "_tags.txt"
    yearlyOutfile = open(outfolder + "/" + yearlyOutfile, 'w')
    for yr in sorted(yearsOfTags):      # renamed from 'year' so the outer loop variable isn't clobbered
        print >>yearlyOutfile, "\n" + str(yr)
        for tag in yearsOfTags[yr]:
            print >>yearlyOutfile, "%s\t%s" % (tag, yearsOfTags[yr][tag])
    yearlyOutfile.close()

print "Failed: ", failed
print "No tags: ", no_tags_counter

for year in sorted(yearsOfTags):
    print >>tagsOutfile, "\n" + str(year)
    for tag in yearsOfTags[year]:
        print >>tagsOutfile, "%s\t%s" % (tag, yearsOfTags[year][tag])
tagsOutfile.close()
NYTcrawler_no-outline.py
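A side note on the paragraph-extraction patterns near the bottom of the script: they were written with `<\p>` where `</p>` is meant. `\p` is not a recognized regex escape, and the two Pythons handle that differently, which makes the mistake easy to miss. A small demonstration (the sample HTML is made up):

```python
import re

# '\p' is not a valid regex escape: Python 3's re rejects it outright with
# re.error ("bad escape"), while Python 2 quietly treated it as a literal
# 'p', so a pattern ending in '<\p>' matched '<p>' instead of '</p>'.
try:
    re.compile(r'<\p>')
    bad_escape_rejected = False
except re.error:
    bad_escape_rejected = True

# With the escape corrected to '</p>', the story-body pattern from the
# crawler captures paragraph text as intended:
html = '<p class="story-body-text story-content" data-para-count="5">Hello</p>'
paras = re.findall(r'<p class="story-body-text story-content" data.*?">(.*?)</p>',
                   html, re.DOTALL)
```

Under Python 3 the compile attempt raises, and the corrected pattern yields `['Hello']` for the sample.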