Using Tesseract
====================================

Key points excerpted from https://pymupdf.readthedocs.io/en/latest/about.html. See the site for details.

Feature
------------

- Supports Multiple Document Formats

  - PDF, XPS, EPUB, MOBI, FB2, CBZ, SVG, TXT, Image

- Implementation

  - Python and C

- Render Document Pages
- Write Text to PDF Page
- Supports CJK characters
- Extract Text
- Extract Text as Markdown
- Extract Tables
- Extract Vector Graphics
- Draw Vector Graphics (PDF)
- Based on Existing, Mature Library

  - MuPDF

- Automatic Repair of Damaged PDFs
- Encrypted PDFs
- Linearized PDFs
- Incremental Updates
- Integrates with Jupyter and IPython Notebooks
- Joining / Merging PDF with other Document Types
- OCR API for Seamless Integration with Tesseract
- Integrated Checkpoint / Restart Feature (PDF)
- PDF Optional Content
- PDF Embedded Files
- PDF Redactions
- PDF Annotations
- PDF Form Fields
- PDF Page Labels
- Support Font Sub-Setting

Installation
--------------

Install or upgrade with pip:

.. code-block::

    pip install --upgrade pymupdf

Build and install from a local PyMuPDF source tree

- Clone the source code

.. code-block::

    git clone https://github.com/pymupdf/PyMuPDF.git

- Build and install

.. code-block::

    cd PyMuPDF && pip install .

To use the OCR features in PyMuPDF, Tesseract's language support data is required. Complete the following steps:

1. Locate Tesseract's language support folder:

   - Windows: ``C:/Program Files/Tesseract-OCR/tessdata``
   - Unix systems: ``/usr/share/tesseract-ocr/4.00/tessdata``

2. Set the environment variable ``TESSDATA_PREFIX``:

   - Windows: ``setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"``
   - Unix systems: ``declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata``
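
The two Tesseract setup steps can be checked from Python before running OCR. The following is a minimal sketch, not an official PyMuPDF recipe: the helper names ``find_tessdata`` and ``ocr_first_page`` are hypothetical, the fallback paths are simply the ones quoted in these notes, and the OCR call assumes a PyMuPDF release that is imported as ``pymupdf`` (older releases use ``fitz``) and whose ``Page.get_textpage_ocr`` accepts a ``tessdata`` argument.

```python
import os
from pathlib import Path

# Default locations quoted in these notes; adjust them for your installation.
DEFAULT_TESSDATA_DIRS = (
    "C:/Program Files/Tesseract-OCR/tessdata",
    "/usr/share/tesseract-ocr/4.00/tessdata",
)


def find_tessdata(env=None, candidates=DEFAULT_TESSDATA_DIRS):
    """Return the first existing tessdata directory, or None.

    TESSDATA_PREFIX is checked first; the candidate paths are only a fallback.
    """
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    paths = ([prefix] if prefix else []) + list(candidates)
    for path in paths:
        if Path(path).is_dir():
            return path
    return None


def ocr_first_page(pdf_path, language="eng"):
    """OCR page 0 of a PDF (hypothetical helper; needs PyMuPDF and Tesseract data)."""
    import pymupdf  # assumption: recent release; older ones use `import fitz`

    tessdata = find_tessdata()
    if tessdata is None:
        raise RuntimeError("tessdata folder not found; set TESSDATA_PREFIX")
    with pymupdf.open(pdf_path) as doc:
        page = doc[0]
        # full=True asks for OCR of the whole rendered page, not only its images.
        textpage = page.get_textpage_ocr(
            language=language, dpi=300, full=True, tessdata=tessdata
        )
        return page.get_text(textpage=textpage)
```

Calling ``find_tessdata()`` with no arguments reads the real environment. A ``None`` result suggests that step 2 has not taken effect in the current shell, or that the language data lives in a different folder.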