PyMuPDF

Key points excerpted from https://pymupdf.readthedocs.io/en/latest/about.html. See the site for details.

Features

Supports Multiple Document Formats
- PDF, XPS, EPUB, MOBI, FB2, CBZ, SVG, TXT, Image
Implementation
- Python and C
Render Document Pages
Write Text to PDF Page
Supports CJK characters
Extract Text
Extract Text as Markdown
Extract Tables
Extract Vector Graphics
Draw Vector Graphics (PDF)
Based on Existing, Mature Library
- MuPDF
Automatic Repair of Damaged PDFs
Encrypted PDFs
Linearized PDFs
Incremental Updates
Integrates with Jupyter and IPython Notebooks
Joining / Merging PDF with other Document Types
OCR API for Seamless Integration with Tesseract
Integrated Checkpoint / Restart Feature (PDF)
PDF Optional Content
PDF Embedded Files
PDF Redactions
PDF Annotations
PDF Form Fields
PDF Page Labels
Support Font Sub-Setting

Installation

pip install --upgrade pymupdf

Build and install from a local PyMuPDF source tree

Clone the source code

git clone https://github.com/pymupdf/PyMuPDF.git

Build and install

cd PyMuPDF && pip install .

To use the OCR features in PyMuPDF, Tesseract's language support data is required. Complete the following steps:

Locate Tesseract's language support folder. Typical locations:
- Windows: C:/Program Files/Tesseract-OCR/tessdata
- Unix systems: /usr/share/tesseract-ocr/4.00/tessdata
Set the environment variable TESSDATA_PREFIX:
- Windows: setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"
- Unix systems: declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata
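
Before calling the OCR functions, it can help to confirm from Python that the variable is actually visible to the interpreter. The following is a minimal sketch using only the standard library; `find_tessdata` is a hypothetical helper name introduced here for illustration, not part of PyMuPDF.

```python
import os
from pathlib import Path


def find_tessdata(env=None):
    """Return the folder named by TESSDATA_PREFIX, or None if unset or missing.

    `env` defaults to os.environ; passing a plain dict makes the check easy to test.
    (Hypothetical helper, not a PyMuPDF function.)
    """
    env = os.environ if env is None else env
    value = env.get("TESSDATA_PREFIX")
    if not value:
        return None
    folder = Path(value)
    return folder if folder.is_dir() else None


if __name__ == "__main__":
    folder = find_tessdata()
    if folder is None:
        print("TESSDATA_PREFIX is not set or does not point to a folder")
    else:
        print("Tesseract language data folder:", folder)
```

On Windows, a value set with setx is only seen by processes started afterwards, so the check should be run from a newly opened terminal.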

Basics

import pymupdf
doc = pymupdf.open("a.pdf") # open a document

Extract Text

import pymupdf

doc = pymupdf.open("a.pdf") # open a document
out = open("output.txt", "wb") # create a text output
for page in doc: # iterate the document pages
    text = page.get_text().encode("utf8") # get plain text (is in UTF-8)
    out.write(text) # write text of page
    out.write(bytes((12,))) # write page delimiter (form feed 0x0C)
out.close()
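
Because every page is followed by a form feed, the output can later be split back into per-page text. A minimal sketch, assuming the data was written by the loop above; `split_pages` is a hypothetical helper name, and the sample bytes stand in for the contents of output.txt.

```python
def split_pages(raw: bytes) -> list[str]:
    """Split UTF-8 text that has a form feed (0x0C) after every page."""
    pages = raw.decode("utf8").split("\f")
    # The writer emits a delimiter after the last page as well, which
    # leaves one empty trailing element; drop it.
    if pages and pages[-1] == "":
        pages.pop()
    return pages


# Stand-in for the bytes of output.txt (assumption: two short pages).
sample = "first page\fsecond page\f".encode("utf8")
for number, text in enumerate(split_pages(sample), start=1):
    print(f"page {number}: {len(text)} characters")
```

For a real file, the bytes returned by `pathlib.Path("output.txt").read_bytes()` could be passed in instead of `sample`.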

Text contained in images can also be extracted through OCR.

tp = page.get_textpage_ocr()
text = page.get_text(textpage=tp)

Many more examples are available, such as extracting text from a specific area or extracting tables from a document.
- https://pymupdf.readthedocs.io/en/latest/recipes-text.html#recipestext
Text can also be extracted in Markdown format.
- https://pymupdf.readthedocs.io/en/latest/rag.html#rag-outputting-as-md

Extract images from a PDF

import pymupdf

doc = pymupdf.open("test.pdf") # open a document

for page_index in range(len(doc)): # iterate over pdf pages
    page = doc[page_index] # get the page
    image_list = page.get_images()

    # print the number of images found on the page
    if image_list:
        print(f"Found {len(image_list)} images on page {page_index}")
    else:
        print("No images found on page", page_index)

    for image_index, img in enumerate(image_list, start=1): # enumerate the image list
        xref = img[0] # get the XREF of the image
        pix = pymupdf.Pixmap(doc, xref) # create a Pixmap

        if pix.n - pix.alpha > 3: # CMYK: convert to RGB first
            pix = pymupdf.Pixmap(pymupdf.csRGB, pix)

        pix.save("page_%s-image_%s.png" % (page_index, image_index)) # save the image as png
        pix = None

Extract vector Graphics

doc = pymupdf.open("some.file")
page = doc[0]
paths = page.get_drawings()

How to Extract Drawings
- https://pymupdf.readthedocs.io/en/latest/recipes-drawing-and-graphics.html#recipesdrawingandgraphics-extract-drawings

Merging PDF files

import pymupdf

doc_a = pymupdf.open("a.pdf") # open the 1st document
doc_b = pymupdf.open("b.pdf") # open the 2nd document

doc_a.insert_pdf(doc_b) # merge the docs
doc_a.save("a+b.pdf") # save the merged document with a new filename

Merging PDF files with other file types

import pymupdf

doc_a = pymupdf.open("a.pdf") # open the 1st document
doc_b = pymupdf.open("b.svg") # open the 2nd document

doc_a.insert_file(doc_b) # merge the docs
doc_a.save("a+b.pdf") # save the merged document with a new filename

PDFs can be merged easily with Document.insert_pdf() and Document.insert_file(). Given open PDF documents, you can copy a range of pages from one document to another. You can choose where the copied pages are inserted, reverse the page order, and change the page rotation.

Adding a watermark to a PDF

import pymupdf

doc = pymupdf.open("document.pdf") # open a document

for page_index in range(len(doc)): # iterate over pdf pages
    page = doc[page_index] # get the page

    # insert an image watermark from a file name to fit the page bounds
    page.insert_image(page.bound(),filename="watermark.png", overlay=False)

doc.save("watermarked-document.pdf") # save the document with a new filename

Adding an image to a PDF

import pymupdf

doc = pymupdf.open("document.pdf") # open a document

for page_index in range(len(doc)): # iterate over pdf pages
    page = doc[page_index] # get the page

    # insert an image logo from a file name at the top left of the document
    page.insert_image(pymupdf.Rect(0,0,50,50),filename="my-logo.png")

doc.save("logo-document.pdf") # save the document with a new filename

Rotating a PDF page

import pymupdf

doc = pymupdf.open("test.pdf") # open document
page = doc[0] # get the 1st page of the document
page.set_rotation(90) # rotate the page
doc.save("rotated-page-1.pdf")

Cropping a PDF page

import pymupdf

doc = pymupdf.open("test.pdf") # open document
page = doc[0] # get the 1st page of the document
page.set_cropbox(pymupdf.Rect(100, 100, 400, 400)) # set a cropbox for the page
doc.save("cropped-page-1.pdf")

Attaching files

import pymupdf

doc = pymupdf.open("test.pdf") # open main document
attachment = pymupdf.open("my-attachment.pdf") # open document you want to attach

page = doc[0] # get the 1st page of the document
point = pymupdf.Point(100, 100) # create the point where you want to add the attachment
attachment_data = attachment.tobytes() # get the document byte data as a buffer

# add the file annotation with the point, data and the file name
file_annotation = page.add_file_annot(point, attachment_data, "attachment.pdf")

doc.save("document-with-attachment.pdf") # save the document

These are only a few of the basic features. There are many more; see the online documentation.

PyMuPDF, LLM & RAG

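According to the PyMuPDF documentation, integrating PyMuPDF into Large Language Model (LLM) frameworks and complete RAG solutions is the fastest and most reliable way to supply document data.
Several well-known LLM solutions have their own interfaces to PyMuPDF.
If you need to export to Markdown:
- Try PyMuPDF4LLM
Integration with LangChain
- Integrating directly through LangChain's dedicated loader is simple: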

from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader("example.pdf")
data = loader.load()

Integration with LlamaIndex
- Preparing data for chunking

  PyMuPDF can output Markdown (via PyMuPDF4LLM) to prepare a document for chunking.

# Convert the document to Markdown
import pymupdf4llm
md_text = pymupdf4llm.to_markdown("input.pdf")

# Write the text to a file in UTF-8 encoding
import pathlib
pathlib.Path("output.md").write_bytes(md_text.encode())
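
To make the idea of chunking concrete, the sketch below splits Markdown text at heading lines using only the standard library. It is a simplified stand-in written for illustration, not how any particular framework splits text; `chunk_by_heading` and the sample string are hypothetical.

```python
import re


def chunk_by_heading(md_text: str) -> list[str]:
    """Split Markdown into chunks that each start at a heading line.

    Text before the first heading, if any, becomes its own chunk.
    Simplification: a '#' line inside a fenced code block is also
    treated as a heading.
    """
    chunks: list[str] = []
    current: list[str] = []
    for line in md_text.splitlines():
        # An ATX heading is one to six '#' characters followed by a space.
        if re.match(r"#{1,6} ", line) and current:
            chunks.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current).strip())
    return [chunk for chunk in chunks if chunk]


# Hypothetical sample standing in for the md_text produced above.
sample = "intro\n\n# Title\ntext\n\n## Section\nmore text\n"
for chunk in chunk_by_heading(sample):
    print(repr(chunk))
```

Splitting at headings keeps each chunk aligned with a section of the document, which is the main reason Markdown is a convenient intermediate format for this step.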

Using the Markdown output in LangChain

import pymupdf4llm
from langchain.text_splitter import MarkdownTextSplitter

# Get the MD text
md_text = pymupdf4llm.to_markdown("input.pdf")  # get markdown for all pages

splitter = MarkdownTextSplitter(chunk_size=40, chunk_overlap=0)

documents = splitter.create_documents([md_text])  # keep the resulting chunks