a guest
Jul 21st, 2019
### Program to connect to the University of Calgary's Libraries and
### Cultural Resources API and retrieve the "Alberta Gazette 1908".
### The website https://cdm22007.contentdm.oclc.org/digital/collection/p22007coll9/id/492409/rec/2
### does this via a fetch request, which can be viewed in the developer tools network tab.
### This program emulates that behavior. It makes a request to the API, which returns a JSON object
### whose text is then written, UTF-8 encoded, to a file called book.txt.
### The following improvements are needed:
#   Error handling in case the server fails to respond (at the moment, just rerun the program).
import json
import time

import requests

# Original URL for reference
# url = "https://cdm22007.contentdm.oclc.org/digital/api/collections/p22007coll9/items/491899/false"

page_count = 512

url_prefix = "https://cdm22007.contentdm.oclc.org/digital/api/collections/p22007coll9/items/"
url_page_count = 491899
url_suffix = "/false"

# Construct a list of URLs that will serve as a work queue.
# Each URL is built from the prefix, the incremented item ID and the suffix.
# Note that the first ID requested is url_page_count + 1.
urls = []
for i in range(page_count):
    url_page_count += 1
    urls.append(url_prefix + str(url_page_count) + url_suffix)

# Not specifying the encoding causes some errors
with open("book.txt", "w+", encoding="utf-8") as f:
    for j, url in enumerate(urls):
        print(j)

        # Pause every 50 requests to give the server time to process
        # and in order not to exceed the request limit
        if j % 50 == 0:
            time.sleep(7)

        # The 30-second timeout keeps a stalled connection from hanging forever
        r = requests.get(url, timeout=30)
        json_data = json.loads(r.text)

        # Save the page number, URL and text in the file
        f.write("Page #" + str(j) + "\n")
        f.write("URL :" + url + "\n")
        f.write(json_data['text'])

        # Page divider
        # https://www.asciiart.eu/art-and-design/dividers
        f.write("\n\n\n\n\n")
        f.write("^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^v^")
        f.write("\n\n\n\n\n")

print("done")
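
The header comment lists error handling as a needed improvement. One possible approach is a small retry helper, sketched below. The name `fetch_page_json`, its `get` parameter (a stand-in for `requests.get` so the helper can be exercised without network access) and the retry, backoff and timeout defaults are all assumptions made for illustration, not part of the original script.

```python
import json
import time


def fetch_page_json(url, get=None, retries=3, backoff_seconds=5, timeout=30):
    """Fetch url and return its decoded JSON body, retrying on failure.

    Hypothetical helper: `get` is any callable shaped like requests.get
    and defaults to requests.get itself.
    """
    if get is None:
        import requests  # imported lazily so the helper can be tested offline
        get = requests.get

    last_error = None
    for attempt in range(1, retries + 1):
        try:
            r = get(url, timeout=timeout)
            r.raise_for_status()  # turn HTTP 4xx/5xx responses into exceptions
            return json.loads(r.text)
        except (OSError, ValueError) as exc:
            # requests' exceptions derive from OSError; malformed JSON
            # raises a ValueError subclass.
            last_error = exc
            if attempt < retries:
                time.sleep(backoff_seconds * attempt)  # wait longer each time
    raise RuntimeError("giving up on " + url) from last_error
```

In the download loop, `json_data = fetch_page_json(urls[j])` could then stand in for the `requests.get` and `json.loads` pair, so a single failed request no longer requires rerunning the whole program.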