PyMuPDF
====================================

Key points excerpted from https://pymupdf.readthedocs.io/en/latest/about.html. See the site for full details.


Features
------------

- Supports Multiple Document Formats
    - PDF, XPS, EPUB, MOBI, FB2, CBZ, SVG, TXT, Image
- Implementation
    - Python and C 
- Render Document Pages
- Write Text to PDF Page 
- Supports CJK characters
- Extract Text
- Extract Text as Markdown
- Extract Tables
- Extract Vector Graphics
- Draw Vector Graphics (PDF)
- Based on Existing, Mature Library 
    - MuPDF 
- Automatic Repair of Damaged PDFs
- Encrypted PDFs
- Linearized PDFs
- Incremental Updates
- Integrates with Jupyter and IPython Notebooks
- Joining / Merging PDF with other Document Types
- OCR API for Seamless Integration with Tesseract
- Integrated Checkpoint / Restart Feature (PDF)
- PDF Optional Content
- PDF Embedded Files
- PDF Redactions
- PDF Annotations
- PDF Form Fields
- PDF Page Labels
- Support Font Sub-Setting


Installation
--------------

Install the latest release with pip

.. code-block::
    
    pip install --upgrade pymupdf


Build and install from a local PyMuPDF source tree

- Clone the source code

.. code-block::
    
    git clone https://github.com/pymupdf/PyMuPDF.git


- Build and install

.. code-block::
    
    cd PyMuPDF && pip install .


To use the OCR features of PyMuPDF, Tesseract's language support data is required. Complete the following steps:

1. Locate Tesseract's language support folder (typical locations shown):

    - Windows: C:/Program Files/Tesseract-OCR/tessdata
    - Unix systems: /usr/share/tesseract-ocr/4.00/tessdata

2. Set the environment variable TESSDATA_PREFIX:

    - Windows: setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"
    - Unix systems: declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
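
Before running OCR it can help to confirm that the variable is set correctly. The following is a minimal sketch using only the standard library, not part of PyMuPDF; the helper name ``find_tessdata`` is hypothetical, and the check assumes that a usable folder holds at least one ``*.traineddata`` language file.

```python
import os
from pathlib import Path


def find_tessdata(env=None):
    """Return the folder named by TESSDATA_PREFIX, or None if it looks unusable."""
    env = os.environ if env is None else env
    prefix = env.get("TESSDATA_PREFIX")
    if not prefix:
        return None  # the variable is not set
    folder = Path(prefix)
    # Assumption: a usable folder contains at least one *.traineddata file.
    if folder.is_dir() and any(folder.glob("*.traineddata")):
        return folder
    return None


if __name__ == "__main__":
    folder = find_tessdata()
    print("tessdata folder:", folder if folder else "not found, check TESSDATA_PREFIX")
```

Passing a plain dict as ``env`` makes the check easy to try without touching the real environment.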



Basics
--------------

.. code-block::

    import pymupdf
    doc = pymupdf.open("a.pdf") # open a document    

- Extract Text

.. code-block::

    import pymupdf

    doc = pymupdf.open("a.pdf") # open a document
    out = open("output.txt", "wb") # create a text output
    for page in doc: # iterate the document pages
        text = page.get_text().encode("utf8") # get plain text (is in UTF-8)
        out.write(text) # write text of page
        out.write(bytes((12,))) # write page delimiter (form feed 0x0C)
    out.close()    
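
The loop above writes a form feed (0x0C) after every page, so the output file can be split back into pages later. A minimal sketch with the standard library only; the helper name ``split_pages`` is hypothetical, and it assumes the page text itself contains no form feed characters.

```python
from pathlib import Path


def split_pages(raw: bytes) -> list[str]:
    """Split text written with a form feed after every page back into pages."""
    pages = raw.decode("utf8").split("\f")
    # The writer emits a delimiter after the last page too, so drop the empty tail.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


if __name__ == "__main__":
    path = Path("output.txt")  # file name used in the example above
    if path.exists():
        for number, text in enumerate(split_pages(path.read_bytes()), start=1):
            print(f"page {number}: {len(text)} characters")
```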


- Text contained in images can also be extracted (OCR).

.. code-block::

    tp = page.get_textpage_ocr()
    text = page.get_text(textpage=tp)
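
The two calls above can be wrapped in a small helper that also sets the OCR language and resolution. This is a sketch, not the documented recipe: the helper name ``ocr_page_text`` and the default of 300 dpi are assumptions, while ``language``, ``dpi`` and ``full`` are parameters of ``Page.get_textpage_ocr()``.

```python
def ocr_page_text(page, language="eng", dpi=300, full=False):
    """Return the text of a page, using Tesseract OCR for image content."""
    # full=False OCRs only the images shown on the page; full=True OCRs the whole page.
    textpage = page.get_textpage_ocr(language=language, dpi=dpi, full=full)
    return page.get_text(textpage=textpage)
```

For a page mixing English and Korean, a call such as ``ocr_page_text(page, language="eng+kor")`` should work, provided both language files are present in the tessdata folder.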

- Many more examples are available, such as extracting text from a specific area of a page or extracting tables from a document.
    - https://pymupdf.readthedocs.io/en/latest/recipes-text.html#recipestext

- Text can also be extracted in Markdown format.
    - https://pymupdf.readthedocs.io/en/latest/rag.html#rag-outputting-as-md
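
As a rough illustration of the region and table extraction mentioned above, the sketch below uses the ``clip`` argument of ``Page.get_text()`` and ``Page.find_tables()``. The helper names, the file name and the rectangle are placeholders chosen for this example, and ``find_tables()`` needs a reasonably recent PyMuPDF release.

```python
def text_in_region(page, rect):
    """Return plain text restricted to rect (a pymupdf.Rect or a 4-tuple)."""
    return page.get_text("text", clip=rect)


def tables_as_rows(page):
    """Return every detected table as a list of rows (lists of cell values)."""
    finder = page.find_tables()
    return [table.extract() for table in finder.tables]


def demo(path="a.pdf"):
    """Print a region and all tables of the first page (placeholder file name)."""
    import pymupdf  # imported lazily so the helpers above load without it

    doc = pymupdf.open(path)
    page = doc[0]
    print(text_in_region(page, pymupdf.Rect(0, 0, 300, 200)))
    for rows in tables_as_rows(page):
        print(rows)
```

Calling ``demo("a.pdf")`` runs both helpers on the first page of the file.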


- Extract images from a PDF 

.. code-block::

    import pymupdf

    doc = pymupdf.open("test.pdf") # open a document

    for page_index in range(len(doc)): # iterate over pdf pages
        page = doc[page_index] # get the page
        image_list = page.get_images()

        # print the number of images found on the page
        if image_list:
            print(f"Found {len(image_list)} images on page {page_index}")
        else:
            print("No images found on page", page_index)

        for image_index, img in enumerate(image_list, start=1): # enumerate the image list
            xref = img[0] # get the XREF of the image
            pix = pymupdf.Pixmap(doc, xref) # create a Pixmap

            if pix.n - pix.alpha > 3: # CMYK: convert to RGB first
                pix = pymupdf.Pixmap(pymupdf.csRGB, pix)

            pix.save("page_%s-image_%s.png" % (page_index, image_index)) # save the image as png
            pix = None
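
The example above converts every image to PNG through a Pixmap. As an alternative sketch, ``Document.extract_image()`` returns the image bytes in their embedded format together with a file extension, so no conversion takes place. The helper name ``save_original_images`` and the ``out_dir`` parameter are assumptions made for this example.

```python
from pathlib import Path


def save_original_images(doc, page_index, out_dir="."):
    """Write each image of one page in its embedded format; return the file paths."""
    saved = []
    page = doc[page_index]
    for image_index, img in enumerate(page.get_images(), start=1):
        xref = img[0]  # the first entry is the image's xref
        info = doc.extract_image(xref)  # dict holding raw bytes and a file extension
        target = Path(out_dir) / f"page_{page_index}-image_{image_index}.{info['ext']}"
        target.write_bytes(info["image"])
        saved.append(target)
    return saved
```

Unlike the Pixmap route, this does no colorspace conversion, so a CMYK image stays CMYK.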

- Extract vector Graphics

.. code-block::

    doc = pymupdf.open("some.file")
    page = doc[0]
    paths = page.get_drawings()    

- How to Extract Drawings 
    - https://pymupdf.readthedocs.io/en/latest/recipes-drawing-and-graphics.html#recipesdrawingandgraphics-extract-drawings
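
``page.get_drawings()`` returns a list of dictionaries, one per path, whose ``"items"`` entry lists the drawing commands. A minimal sketch for getting an overview of a page's vector graphics; the helper name ``summarize_drawings`` is hypothetical.

```python
from collections import Counter


def summarize_drawings(paths):
    """Count the drawing commands in the list returned by page.get_drawings()."""
    counts = Counter()
    for path in paths:
        for item in path["items"]:
            # item[0] names the command: "l" line, "c" curve, "re" rectangle, "qu" quad
            counts[item[0]] += 1
    return counts
```

With the ``paths`` variable from the example above, ``summarize_drawings(paths)`` gives the number of lines, curves, rectangles and quads on the page.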


- Merge PDF files

.. code-block::

    import pymupdf

    doc_a = pymupdf.open("a.pdf") # open the 1st document
    doc_b = pymupdf.open("b.pdf") # open the 2nd document

    doc_a.insert_pdf(doc_b) # merge the docs
    doc_a.save("a+b.pdf") # save the merged document with a new filename    


- Merge PDF files with other types of file

.. code-block::

    import pymupdf

    doc_a = pymupdf.open("a.pdf") # open the 1st document
    doc_b = pymupdf.open("b.svg") # open the 2nd document

    doc_a.insert_file(doc_b) # merge the docs
    doc_a.save("a+b.pdf") # save the merged document with a new filename


- Document.insert_pdf() and Document.insert_file() make it easy to merge PDFs. Given open PDF documents, you can copy a page range from one document to another. You can choose where the copied pages are inserted, reverse the page order, and change the page rotation.
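
The page range, insertion point, reversal and rotation options map onto keyword arguments of ``Document.insert_pdf()``. The wrapper below is only a sketch: the name ``copy_pages`` and its parameters are chosen for this example, while ``from_page``, ``to_page``, ``start_at`` and ``rotate`` are the arguments of ``insert_pdf()``.

```python
def copy_pages(target, source, first, last, insert_at=-1, rotate=-1, reverse=False):
    """Copy pages first..last (0-based, inclusive) from source into target."""
    if reverse:
        # insert_pdf copies back to front when from_page is greater than to_page
        first, last = last, first
    target.insert_pdf(
        source,
        from_page=first,
        to_page=last,
        start_at=insert_at,  # -1 appends the pages at the end
        rotate=rotate,       # -1 keeps the rotation of each source page
    )
```

For example, ``copy_pages(doc_a, doc_b, 0, 2, insert_at=0, reverse=True)`` would place the first three pages of ``doc_b`` in reverse order at the front of ``doc_a``.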


- Add a watermark to a PDF

.. code-block::

    import pymupdf

    doc = pymupdf.open("document.pdf") # open a document

    for page_index in range(len(doc)): # iterate over pdf pages
        page = doc[page_index] # get the page

        # insert an image watermark from a file name to fit the page bounds
        page.insert_image(page.bound(),filename="watermark.png", overlay=False)

    doc.save("watermarked-document.pdf") # save the document with a new filename


- Add an image to a PDF


.. code-block::
    
    import pymupdf

    doc = pymupdf.open("document.pdf") # open a document

    for page_index in range(len(doc)): # iterate over pdf pages
        page = doc[page_index] # get the page

        # insert an image logo from a file name at the top left of the document
        page.insert_image(pymupdf.Rect(0,0,50,50),filename="my-logo.png")

    doc.save("logo-document.pdf") # save the document with a new filename


- Rotate a PDF page

.. code-block::

    import pymupdf

    doc = pymupdf.open("test.pdf") # open document
    page = doc[0] # get the 1st page of the document
    page.set_rotation(90) # rotate the page
    doc.save("rotated-page-1.pdf")
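
``set_rotation()`` sets an absolute angle. To turn a page relative to its current orientation, the current value can be read from ``page.rotation`` first. A minimal sketch; the helper name ``rotate_by`` is hypothetical.

```python
def rotate_by(page, delta):
    """Rotate a page by delta degrees relative to its current rotation."""
    if delta % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees")
    new_rotation = (page.rotation + delta) % 360
    page.set_rotation(new_rotation)
    return new_rotation
```

A negative ``delta`` turns the page counter-clockwise, for example ``rotate_by(page, -90)``.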

- Crop a PDF page

.. code-block::

    import pymupdf

    doc = pymupdf.open("test.pdf") # open document
    page = doc[0] # get the 1st page of the document
    page.set_cropbox(pymupdf.Rect(100, 100, 400, 400)) # set a cropbox for the page
    doc.save("cropped-page-1.pdf")

- Attach a file

.. code-block::

    import pymupdf

    doc = pymupdf.open("test.pdf") # open main document
    attachment = pymupdf.open("my-attachment.pdf") # open document you want to attach

    page = doc[0] # get the 1st page of the document
    point = pymupdf.Point(100, 100) # create the point where you want to add the attachment
    attachment_data = attachment.tobytes() # get the document byte data as a buffer

    # add the file annotation with the point, data and the file name
    file_annotation = page.add_file_annot(point, attachment_data, "attachment.pdf")

    doc.save("document-with-attachment.pdf") # save the document


The examples above are only a selection of the basic features; there are many more. See the online documentation.


PyMuPDF, LLM & RAG 
-----------------------

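- Integrating PyMuPDF into Large Language Model (LLM) frameworks and overall RAG solutions provides the fastest and most reliable way to deliver document data.
- Several well-known LLM solutions have their own interfaces to PyMuPDF.

- If you need to export to Markdown:
   - Try PyMuPDF4LLM.

- Integration with LangChain
    - Integrating directly through LangChain's dedicated loader is simple: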

.. code-block::

    from langchain_community.document_loaders import PyMuPDFLoader
    loader = PyMuPDFLoader("example.pdf")
    data = loader.load()    

- Integration with LlamaIndex
    - Preparing data for chunking
        - Markdown output from PyMuPDF4LLM can be used to prepare the data for chunking.

.. code-block::

    # Convert the document to Markdown
    import pymupdf4llm
    md_text = pymupdf4llm.to_markdown("input.pdf")

    # Write the text to a file in UTF-8 encoding
    import pathlib
    pathlib.Path("output.md").write_bytes(md_text.encode())


- Using the Markdown output with LangChain

.. code-block::

    import pymupdf4llm
    from langchain.text_splitter import MarkdownTextSplitter

    # Get the MD text
    md_text = pymupdf4llm.to_markdown("input.pdf")  # get markdown for all pages

    splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)

    documents = splitter.create_documents([md_text])  # list of chunked documents
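
To make the idea of chunking concrete, the sketch below packs Markdown paragraphs greedily into chunks of limited size, using only the standard library. It is an illustration of the concept and not LangChain's ``MarkdownTextSplitter`` algorithm; the function name ``split_markdown`` and the default ``chunk_size`` are assumptions.

```python
def split_markdown(md_text, chunk_size=200):
    """Greedily pack blank-line separated blocks into chunks of limited length."""
    chunks = []
    current = ""
    for block in md_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        candidate = current + "\n\n" + block if current else block
        if current and len(candidate) > chunk_size:
            chunks.append(current)  # the current chunk is full, start a new one
            current = block
        else:
            # Note: a single block longer than chunk_size is kept whole.
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Applied to the ``md_text`` produced by ``pymupdf4llm.to_markdown()``, this returns a list of strings that could be passed on to an embedding step.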