Untitled
a guest
Jul 19th, 2019
"""If you are like me, you have folders full of papers, mostly from arxiv.org, and
need to fetch their BibTeX entries from Google Scholar every time you write something.
This script helps with that by sending very polite (rate-limited) requests to Google
Scholar. This means it takes a while to complete, but it is still recommended to run
it behind a VPN; the free ProtonVPN works well for this purpose.

Copy fetch_bibtex.py into a folder with your papers. It generates a
_generated_bibliography.bib file with all the BibTeX entries pulled from
Google Scholar, using only the filenames of the PDFs within the folder.

Scenarios that work well:
- Arxiv ID in the filename. The script tries the whole title first, then falls
  back to just the arxiv ID. Versioned arxiv IDs (e.g. 1907.00000v3) are handled.
- Words in filenames separated by '_', ' - ' or spaces.

Limitations:
- It only works if the correct paper is the first search result on Google Scholar.
- When a lookup fails, it prints info to the terminal for you to handle manually.

Usage:
    python fetch_bibtex.py
"""

import os
import re
from time import sleep

import scholarly

# import requests

# Regex strings for new-style arxiv IDs
# (YYMM.NNNN before 2015, YYMM.NNNNN from 2015 on)
versioned_arxiv = r"(([0-9]{4}\.[0-9]{4,5})v[0-9]+)"
arxiv = r"([0-9]{4}\.[0-9]{4,5})"

# Build search queries from the PDF filenames in the current folder
papers = os.listdir(".")
titles = []
for p in papers:
    if p[-4:] == ".pdf":
        # Capture versioned arxiv IDs and strip the version
        m = re.match(versioned_arxiv, p)
        if m:
            p = p.replace(m.group(1), m.group(2))

        p = p.replace(" - ", " ")
        p = p.replace("_", " ")
        titles.append(p[:-4])
    else:
        print(f"Skipping file: {p}")


for t in titles:
    r = scholarly.search_pubs_query(t)
    hit = None
    try:
        hit = next(r)
    except StopIteration:
        # No result for the full title: fall back to the bare arxiv ID, if any
        m = re.match(arxiv, t)
        if m:
            print(f"Trying Arxiv ID only for: '{t}'")
            sleep(3)
            r = scholarly.search_pubs_query(m.group(1))
            try:
                hit = next(r)
            except StopIteration:
                print(f"Still no results on: {m.group(1)}")
        else:
            print(f"No results found for: '{t}'")

    if hit:
        # Fetch the BibTeX entry for the first search result and append it
        bibtext = scholarly.scholarly._get_page(hit.url_scholarbib)
        # bibtext = requests.get(hit.url_scholarbib).text
        with open("_generated_bibliography.bib", "a") as f:
            f.write(bibtext)
            f.write("\n")
        sleep(3)
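The filename normalization above can be exercised in isolation, without hitting
Google Scholar. A minimal sketch follows; the helper name `normalize_title` is mine,
not part of the script, but the steps mirror the script's first loop exactly:

```python
import re

# Same pattern as the script, for versioned new-style arxiv IDs
versioned_arxiv = r"(([0-9]{4}\.[0-9]{4,5})v[0-9]+)"


def normalize_title(filename):
    """Turn a PDF filename into a Google Scholar query string,
    mirroring the script's normalization steps."""
    m = re.match(versioned_arxiv, filename)
    if m:
        # Strip the version suffix from a versioned arxiv ID
        filename = filename.replace(m.group(1), m.group(2))
    filename = filename.replace(" - ", " ")
    filename = filename.replace("_", " ")
    return filename[:-4]  # drop the ".pdf" extension


print(normalize_title("1907.00000v3 - Some_Great_Paper.pdf"))
# → 1907.00000 Some Great Paper
```

Running this on a filename without an arxiv ID, such as
`Attention_Is_All_You_Need.pdf`, simply yields the underscore-free title.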