Using Tesseract

Key points excerpted from https://pymupdf.readthedocs.io/en/latest/about.html. See the site for details.

Features

  • Supports Multiple Document Formats
    • PDF, XPS, EPUB, MOBI, FB2, CBZ, SVG, TXT, Image

  • Implementation
    • Python and C

  • Render Document Pages

  • Write Text to PDF Page

  • Supports CJK characters

  • Extract Text

  • Extract Text as Markdown

  • Extract Tables

  • Extract Vector Graphics

  • Draw Vector Graphics (PDF)

  • Based on Existing, Mature Library
    • MuPDF

  • Automatic Repair of Damaged PDFs

  • Encrypted PDFs

  • Linearized PDFs

  • Incremental Updates

  • Integrates with Jupyter and IPython Notebooks

  • Joining / Merging PDF with other Document Types

  • OCR API for Seamless Integration with Tesseract

  • Integrated Checkpoint / Restart Feature (PDF)

  • PDF Optional Content

  • PDF Embedded Files

  • PDF Redactions

  • PDF Annotations

  • PDF Form Fields

  • PDF Page Labels

  • Support Font Sub-Setting

Installation

pip install --upgrade pymupdf

Build and install from a local PyMuPDF source tree

  • Clone the code

git clone https://github.com/pymupdf/PyMuPDF.git

  • Build and install

cd PyMuPDF && pip install .
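
After either install route, it can help to confirm that the package is visible to the interpreter you intend to use. The following is a minimal sketch using only the standard library; the helper name `pymupdf_status` is hypothetical and not part of PyMuPDF.

```python
from importlib import metadata


def pymupdf_status():
    """Report the installed PyMuPDF version, or say that it is missing."""
    try:
        # Looks up the installed distribution's metadata; PyMuPDF itself is not imported.
        return f"PyMuPDF {metadata.version('PyMuPDF')} is installed"
    except metadata.PackageNotFoundError:
        return "PyMuPDF is not installed"


print(pymupdf_status())
```

If this reports that the package is missing, the pip command was probably run against a different Python environment than the one executing the script.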

To use the OCR features in PyMuPDF, Tesseract's language support data is required. Complete the following steps:

  1. Locate Tesseract's language support folder:

    • Windows: C:/Program Files/Tesseract-OCR/tessdata

    • Unix systems: /usr/share/tesseract-ocr/4.00/tessdata

  2. Set the environment variable TESSDATA_PREFIX:

    • Windows: setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"

    • Unix systems: declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
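
The two steps above can be sketched in Python as follows. This is an illustrative example, not the library's prescribed workflow: `find_tessdata` and `ocr_first_page` are hypothetical helper names, the default folders are the ones listed above and may differ on your system, and the `import pymupdf` name assumes a recent PyMuPDF release (older releases are imported as `fitz`). The OCR step uses `Page.get_textpage_ocr` and passes the folder explicitly through its `tessdata` argument.

```python
import os
import sys
from pathlib import Path

# Default locations quoted in the steps above; real installs may differ.
DEFAULT_TESSDATA = {
    "win32": "C:/Program Files/Tesseract-OCR/tessdata",
    "other": "/usr/share/tesseract-ocr/4.00/tessdata",
}


def find_tessdata(env=None, platform=None):
    """Return TESSDATA_PREFIX if it is set, else the platform default."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    configured = env.get("TESSDATA_PREFIX")
    if configured:
        return configured
    key = "win32" if platform == "win32" else "other"
    return DEFAULT_TESSDATA[key]


def ocr_first_page(pdf_path, language="eng"):
    """OCR page 0 of pdf_path and return the recognized text."""
    import pymupdf  # imported lazily so find_tessdata works without PyMuPDF

    tessdata = find_tessdata()
    if not Path(tessdata).is_dir():
        raise FileNotFoundError(f"tessdata folder not found: {tessdata}")
    with pymupdf.open(pdf_path) as doc:
        page = doc[0]
        # full=True renders the whole page as an image before running OCR on it.
        textpage = page.get_textpage_ocr(
            language=language, dpi=300, full=True, tessdata=tessdata
        )
        return page.get_text(textpage=textpage)
```

A call such as `ocr_first_page("scan.pdf")` (the file name is a placeholder) would return the recognized text of the first page, provided Tesseract's language data is installed at the resolved location.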