Using Tesseract
====================================

The main points below are excerpted from https://pymupdf.readthedocs.io/en/latest/about.html. See that site for details.


Features
------------

- Supports Multiple Document Formats
    - PDF, XPS, EPUB, MOBI, FB2, CBZ, SVG, TXT, Image
- Implementation
    - Python and C 
- Render Document Pages
- Write Text to PDF Page 
- Supports CJK characters
- Extract Text
- Extract Text as Markdown
- Extract Tables
- Extract Vector Graphics
- Draw Vector Graphics (PDF)
- Based on Existing, Mature Library 
    - MuPDF 
- Automatic Repair of Damaged PDFs
- Encrypted PDFs
- Linearized PDFs
- Incremental Updates
- Integrates with Jupyter and IPython Notebooks
- Joining / Merging PDF with other Document Types
- OCR API for Seamless Integration with Tesseract
- Integrated Checkpoint / Restart Feature (PDF)
- PDF Optional Content
- PDF Embedded Files
- PDF Redactions	
- PDF Annotations
- PDF Form Fields
- PDF Page Labels
- Supports Font Sub-Setting


Installation
--------------

Install the latest release from PyPI:

.. code-block::
    
    pip install --upgrade pymupdf
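
After running the command above, it can be useful to confirm that the package is visible to the interpreter. The following is a minimal sketch using only the standard library. It assumes the distribution is named ``PyMuPDF`` and the importable module is ``pymupdf`` (older releases were imported as ``fitz``); the helper name ``pymupdf_status`` is hypothetical.

```python
import importlib.util
from importlib import metadata


def pymupdf_status(dist_name: str = "PyMuPDF", module_name: str = "pymupdf") -> str:
    """Return a short message saying whether the module can be imported."""
    # find_spec() returns None when no top-level module of that name is found.
    if importlib.util.find_spec(module_name) is None:
        return f"{module_name} is not installed"
    try:
        # Assumption: the installed distribution is registered as "PyMuPDF".
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = "unknown"
    return f"{module_name} {version} is installed"


print(pymupdf_status())
```

If the message reports that the module is not installed, check that ``pip`` and the Python interpreter you are running belong to the same environment.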


Build and install from a local PyMuPDF source tree

- Clone the source code

.. code-block::
    
    git clone https://github.com/pymupdf/PyMuPDF.git


- Build and install

.. code-block::
    
    cd PyMuPDF && pip install .


To use the OCR features in PyMuPDF, Tesseract's language support data is required. Complete the following steps:

1. Locate Tesseract's language support folder:

    - Windows: ``C:/Program Files/Tesseract-OCR/tessdata``
    - Unix systems: ``/usr/share/tesseract-ocr/4.00/tessdata``

2. Set the environment variable ``TESSDATA_PREFIX``:

    - Windows: ``setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"``
    - Unix systems: ``declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata``
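
The two steps above can also be checked from Python before running OCR. The sketch below is illustrative rather than the documented procedure: ``find_tessdata`` and ``ocr_first_page`` are hypothetical helper names, and the fallback paths are simply the defaults quoted above, which may differ on your system. The OCR helper assumes a PyMuPDF version whose ``Page.get_textpage_ocr`` accepts a ``tessdata`` argument.

```python
import os
from pathlib import Path
from typing import Iterable, Optional

# Default locations quoted in the steps above; adjust them for your install.
DEFAULT_TESSDATA_DIRS = (
    "C:/Program Files/Tesseract-OCR/tessdata",
    "/usr/share/tesseract-ocr/4.00/tessdata",
)


def find_tessdata(candidates: Iterable[str] = DEFAULT_TESSDATA_DIRS) -> Optional[str]:
    """Return TESSDATA_PREFIX if set, else the first candidate directory that exists."""
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix:
        return prefix
    for candidate in candidates:
        if Path(candidate).is_dir():
            return candidate
    return None


def ocr_first_page(pdf_path: str, language: str = "eng") -> str:
    """OCR the first page of a PDF (needs PyMuPDF and Tesseract language data)."""
    import pymupdf  # imported lazily so find_tessdata() works without PyMuPDF

    tessdata = find_tessdata()
    if tessdata is None:
        raise RuntimeError("No tessdata folder found; set TESSDATA_PREFIX")
    with pymupdf.open(pdf_path) as doc:
        page = doc[0]
        # full=True asks for OCR of the whole page rather than only its images.
        textpage = page.get_textpage_ocr(
            language=language, dpi=300, full=True, tessdata=tessdata
        )
        return page.get_text("text", textpage=textpage)
```

Calling ``find_tessdata()`` on its own is a quick way to see which folder would be used; ``ocr_first_page("scan.pdf")`` (with a file name of your choosing) then runs OCR against that folder.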