aweebbyanyothername

Untitled

Aug 11th, 2020
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import urllib3


with open('test.html', encoding='utf-8') as htmlToOpen:
    soup = BeautifulSoup(htmlToOpen, 'html.parser')
#This opens the html file that you want to work in and creates a BeautifulSoup object from it.
#I'm using Python's built-in html.parser. The with block closes test.html for me.
#To do multiple html files, use a for loop and replace the hard-coded file names with variables.
#https://stackoverflow.com/questions/1120707/using-python-to-execute-a-command-on-every-file-in-a-folder

imageWebPath = ""
#This just declares an empty string variable that I will replace with the image web path.

newImagePath = "image.jpg"
#This is the new image. To do multiple images, use a for loop and append a number to the file name using a counter of some sort.

imagesInHtmlFile = soup.find_all('img', src=True)
#This finds all of the img tags in the html file that have a src attribute.
#https://stackoverflow.com/questions/43982002/extract-src-attribute-from-img-tag-using-beautifulsoup/47166671


for image in imagesInHtmlFile:
    imageWebPath = image['src']
    image['src'] = newImagePath

#This for loop remembers each image's original src value and replaces it with the new path.
#Every image ends up pointing at the same file, and imageWebPath keeps only the last src.
#You will need to add to this for loop for multiple images and use a counter.


with open('whatever.html', 'w', encoding='utf-8') as htmlFile:
    htmlFile.write(str(soup))
#This opens a new whatever.html file and writes the modified html to it.
#You will need to use variables instead of values to change the naming convention.
#Also, w creates the file if none exists and overwrites it if it does. The with block closes it for me.

##################################################################################

if imageWebPath:
    http = urllib3.PoolManager()
    #https://urllib3.readthedocs.io/en/latest/ I need to create a PoolManager object.

    response = http.request('GET', imageWebPath)
    #I send a GET request so that I can download the image.
    #The first argument is the HTTP method (GET, POST, etc.).
    #The second argument is the image path. It has to be a full http(s) URL; a relative src will not work here.

    if response.status == 200:
        with open(newImagePath, 'wb') as imageFile:
            imageFile.write(response.data)
        #response.data holds the downloaded bytes. This is the simplest way that I can think of.
        #This writes the data to image.jpg, the same name the html now points to.
        #wb creates the file if it doesn't exist. The b means binary mode, so I can write raw bytes.
    else:
        print('Download failed with HTTP status', response.status)
        #Anything other than 200 means I did not get the image, so I don't write a file.
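
The comments in the script suggest two extensions: appending a counter to the file name when there are multiple images, and looping over a folder when there are multiple html files. The following is a minimal sketch of one way that could look, using only the standard library. The helper names (plan_image_names, html_files_in, output_name_for), the "_local" suffix, and the ".jpg" fallback extension are all hypothetical choices made for illustration, not part of the original script.

```python
import posixpath
from pathlib import Path
from urllib.parse import urlsplit


def plan_image_names(srcs, stem="image", default_ext=".jpg"):
    """Map each distinct src to a numbered local file name (hypothetical helper)."""
    mapping = {}
    for src in srcs:
        if src in mapping:
            continue  # the same image appears twice, so reuse its name
        # Keep the extension from the URL path, ignoring any ?query part.
        ext = posixpath.splitext(urlsplit(src).path)[1] or default_ext
        # The counter is simply how many names have been handed out so far, plus one.
        mapping[src] = f"{stem}{len(mapping) + 1}{ext}"
    return mapping


def html_files_in(folder):
    """Return the .html files directly inside folder, in sorted order."""
    return sorted(Path(folder).glob("*.html"))


def output_name_for(html_path, suffix="_local"):
    """Build an output path next to the input, e.g. page.html -> page_local.html."""
    path = Path(html_path)
    return path.with_name(path.stem + suffix + path.suffix)
```

With these, the image loop could collect every src first, call plan_image_names on that list, and then set image['src'] to the planned name for each tag; each original src would be downloaded to its planned name. Note that html_files_in would also pick up files written by output_name_for on a second run, so writing the output to a separate folder may be the safer choice.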