Python253

pdf2html

Mar 14th, 2024
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Filename: pdf2html.py
# Version: 1.0.0
# Author: Jeoi Reqi

"""
Description:
This script converts a PDF file (.pdf) to an HTML file (.html).
It extracts text and formatting information from each page of the PDF
and writes the pages, in order, to a single HTML file.

Requirements:
- Python 3.x
- PyMuPDF library (install using: pip install PyMuPDF)

Usage:
1. Save this script as 'pdf2html.py'.
2. Ensure your PDF file ('example.pdf') is in the same directory as the script.
3. Install the PyMuPDF library using the command: 'pip install PyMuPDF'
4. Run the script: 'python3 pdf2html.py'

Note: Adjust the 'pdf_filename' and 'html_filename' variables in the script as needed.
"""

import fitz  # PyMuPDF


def pdf_to_html(pdf_filename, html_filename):
    """Convert every page of 'pdf_filename' to HTML and write it to 'html_filename'."""
    # Open the PDF in a 'with' block so the document is closed even on error
    with fitz.open(pdf_filename) as pdf_document:
        # Collect the HTML for each page, then join once instead of
        # growing a string inside the loop
        html_content = "".join(page.get_text("html") for page in pdf_document)

    # Write the combined HTML as UTF-8
    with open(html_filename, 'w', encoding='utf-8') as html_file:
        html_file.write(html_content)


if __name__ == "__main__":
    # Set the filenames for the PDF and HTML files
    pdf_filename = 'example.pdf'
    html_filename = 'pdf2html.html'

    # Convert the PDF to an HTML file
    pdf_to_html(pdf_filename, html_filename)

    print(f"Converted '{pdf_filename}' to '{html_filename}'.")
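The script writes the per-page HTML back to back, with no surrounding `<html>`, `<head>`, or charset declaration. A browser will usually render that anyway, but a complete document is safer. The sketch below is one possible way to wrap the page output. The helper name `wrap_fragments` and the default title are hypothetical and not part of the original script, and the sketch assumes each page's output is a fragment rather than a full document.

```python
from html import escape


def wrap_fragments(fragments, title="Converted PDF"):
    """Join per-page HTML fragments into one complete HTML document.

    'fragments' is a list of strings, e.g. one page.get_text("html")
    result per page. This helper is illustrative only.
    """
    body = "\n".join(fragments)
    return (
        "<!DOCTYPE html>\n"
        "<html>\n"
        "<head>\n"
        '<meta charset="utf-8">\n'
        # Escape the title so characters such as '&' or '<' stay valid HTML
        f"<title>{escape(title)}</title>\n"
        "</head>\n"
        "<body>\n"
        f"{body}\n"
        "</body>\n"
        "</html>\n"
    )


if __name__ == "__main__":
    # Stand-in fragments; in the real script these would come from PyMuPDF
    print(wrap_fragments(["<div>page 1</div>", "<div>page 2</div>"], title="example.pdf"))
```

To use it, `pdf_to_html` would collect the per-page strings into a list and write `wrap_fragments(pages, title=pdf_filename)` instead of the raw concatenation.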