nicuf

pytesseract - OCR: convert PDF to TXT

Aug 7th, 2023
import pdf2image
import pytesseract

# Convert the PDF into images, one per page
images = pdf2image.convert_from_path('path_to_the_pdf_file')

# Extract the text from each image
text = ""
for img in images:
    text += pytesseract.image_to_string(img)

# Save the text to a .txt file
with open('path_to_the_txt_file', 'w', encoding='utf-8') as txt_file:
    txt_file.write(text)
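The same steps can also be wrapped in a reusable function. The following is only a minimal sketch, not part of the original paste: the function name `pdf_to_text` and the `render` and `ocr` hook parameters are hypothetical, added so the page loop can be tried without Tesseract or Poppler installed. The `dpi` argument of `pdf2image.convert_from_path` and the `lang` argument of `pytesseract.image_to_string` are existing parameters of those libraries.

```python
from pathlib import Path


def pdf_to_text(pdf_path, txt_path, lang="eng", dpi=300, render=None, ocr=None):
    """OCR every page of a PDF and write the result to a UTF-8 text file.

    `render` and `ocr` are hypothetical hooks (not part of either library):
    render(path) returns the page images, ocr(image) returns a string.
    When they are omitted, pdf2image and pytesseract are used.
    """
    if render is None:
        # Imported lazily so the function can be exercised without pdf2image
        from pdf2image import convert_from_path
        render = lambda path: convert_from_path(path, dpi=dpi)
    if ocr is None:
        # Imported lazily so the function can be exercised without pytesseract
        import pytesseract
        ocr = lambda img: pytesseract.image_to_string(img, lang=lang)

    # Recognize each page, trim surrounding whitespace, and separate
    # pages with a blank line
    pages = [ocr(img).strip() for img in render(pdf_path)]
    Path(txt_path).write_text("\n\n".join(pages), encoding="utf-8")
    return len(pages)  # number of pages processed
```

A call might look like `pdf_to_text('scan.pdf', 'scan.txt', lang='ron')` for Romanian text, assuming the matching Tesseract language data is installed; the file names here are placeholders.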