Using Tesseract
Key points excerpted from https://pymupdf.readthedocs.io/en/latest/about.html. See the site for details.
Features
- Supports Multiple Document Formats: PDF, XPS, EPUB, MOBI, FB2, CBZ, SVG, TXT, Image
- Implementation: Python and C
- Render Document Pages
- Write Text to PDF Page
- Supports CJK characters
- Extract Text
- Extract Text as Markdown
- Extract Tables
- Extract Vector Graphics
- Draw Vector Graphics (PDF)
- Based on Existing, Mature Library: MuPDF
- Automatic Repair of Damaged PDFs
- Encrypted PDFs
- Linearized PDFs
- Incremental Updates
- Integrates with Jupyter and IPython Notebooks
- Joining / Merging PDF with other Document Types
- OCR API for Seamless Integration with Tesseract
- Integrated Checkpoint / Restart Feature (PDF)
- PDF Optional Content
- PDF Embedded Files
- PDF Redactions
- PDF Annotations
- PDF Form Fields
- PDF Page Labels
- Support Font Sub-Setting
Installation
pip install --upgrade pymupdf
Build and install from a local PyMuPDF source tree
Clone the source code
git clone https://github.com/pymupdf/PyMuPDF.git
Build and install
cd PyMuPDF && pip install .
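After either installation route, it can help to confirm that the package is visible in the current Python environment. The sketch below uses only the standard library; the function name installed_version is an assumption made for this example, and "pymupdf" is the distribution name used in the pip command.

```python
from importlib import metadata


def installed_version(dist_name="pymupdf"):
    """Return the installed version string of a distribution, or None."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version()
if version is None:
    print("PyMuPDF is not installed in this environment")
else:
    print(f"PyMuPDF {version} is installed")
```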
To use the OCR features in PyMuPDF, Tesseract's language support data is required. Complete the following steps:
Locate Tesseract's language support folder. Typical locations are:
Windows: C:/Program Files/Tesseract-OCR/tessdata
Unix systems: /usr/share/tesseract-ocr/4.00/tessdata
Set the TESSDATA_PREFIX environment variable:
Windows: setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"
Unix systems: declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
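The same configuration can also be done from inside a Python script, which avoids changing the system-wide environment. This is a minimal sketch, not part of PyMuPDF: the function name configure_tessdata and the DEFAULT_TESSDATA table are assumptions for illustration, and the default paths are just the ones quoted in the steps above, so a real installation may use a different folder.

```python
import os
import sys
from pathlib import Path

# Default locations quoted in the steps above; a real installation may differ.
DEFAULT_TESSDATA = {
    "win32": "C:/Program Files/Tesseract-OCR/tessdata",
    "other": "/usr/share/tesseract-ocr/4.00/tessdata",
}


def configure_tessdata(folder=None):
    """Set TESSDATA_PREFIX for the current process and return the folder.

    Resolution order: explicit argument, existing environment variable,
    then the platform default from DEFAULT_TESSDATA.
    """
    if folder is None:
        folder = os.environ.get("TESSDATA_PREFIX")
    if folder is None:
        key = "win32" if sys.platform == "win32" else "other"
        folder = DEFAULT_TESSDATA[key]
    path = Path(folder)
    if not path.is_dir():
        # Fail early with a clear message instead of a later OCR error.
        raise FileNotFoundError(f"tessdata folder not found: {path}")
    os.environ["TESSDATA_PREFIX"] = str(path)
    return path
```

A value set this way applies only to the running process and its children. Some PyMuPDF releases appear to read the variable when the module is imported, so calling configure_tessdata() before importing PyMuPDF is the safer order. Recent releases also accept a tessdata argument on OCR calls such as Page.get_textpage_ocr; check the documentation for the installed version.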