pdfnaut
Warning
This library is at a very early stage of development. It has only been tested with a small set of documents known to be spec-compliant.
pdfnaut aims to become a PDF processor for Python – a library capable of reading and writing PDF documents.
pdfnaut currently works best for low-level scenarios. A high-level reader (PdfDocument) is provided, although it is still a work in progress.
Features
Low level, typed PDF manipulation
Encryption (AES/ARC4)
Document building/serialization
Install
pdfnaut can be installed from PyPI:
python3 -m pip install pdfnaut
python -m pip install pdfnaut
Important
While pdfnaut supports encryption with ARC4 and AES, it does not include their implementations by default. You must either supply your own or, preferably, install a supported package like pycryptodome that can provide these.
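Before opening an encrypted document, it can help to confirm that a crypto package is actually importable. The following is a minimal sketch using only the standard library. It relies on the fact that pycryptodome installs under the top-level package name `Crypto`. The helper name `has_crypto_backend` is hypothetical and is not part of pdfnaut. The check only shows that the package is present, not how pdfnaut uses it.

```python
from importlib.util import find_spec


def has_crypto_backend() -> bool:
    """Return True if the `Crypto` package (installed by pycryptodome) is importable.

    Hypothetical helper for illustration; not part of pdfnaut.
    """
    return find_spec("Crypto") is not None


if has_crypto_backend():
    print("Crypto package found")
else:
    print("Crypto package not found; try: python -m pip install pycryptodome")
```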
Examples
The low-level API, seen in the example below, illustrates how pdfnaut can be used to inspect PDFs and retrieve information. Each PDF has a different structure, so knowledge of that structure is needed.
from pdfnaut import PdfParser

with open("tests/docs/sample.pdf", "rb") as doc:
    pdf = PdfParser(doc.read())

pdf.parse()

# Get the pages object from the trailer
root = pdf.resolve_reference(pdf.trailer["Root"])
page_tree = pdf.resolve_reference(root["Pages"])

# Get the contents of the first page
page = pdf.resolve_reference(page_tree["Kids"][0])
page_stream = pdf.resolve_reference(page["Contents"])
print(page_stream.decompress())
The high-level API currently provides some abstraction for PdfParser. Notably, it includes a helper property for accessing pages called flattened_pages.
from pdfnaut import PdfDocument

pdf = PdfDocument.from_filename("tests/docs/sample.pdf")
first_page = list(pdf.flattened_pages)[0]

if "Contents" in first_page:
    first_page_stream = pdf.resolve_reference(first_page["Contents"])
    print(first_page_stream.decompress())
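To show why a flattened view of pages is useful: in a PDF page tree, a node of type Pages can list other Pages nodes among its Kids, so the first kid of the root (as taken in the low-level example) is not guaranteed to be a page itself. The sketch below walks such a tree depth-first using plain dictionaries. It is an illustration of the idea, not pdfnaut's implementation of flattened_pages. The names `iter_pages`, `objects`, and the `Label` key are made up for this example, and references are modeled as integer keys rather than real PDF references.

```python
from __future__ import annotations

from typing import Any, Callable, Iterator


def iter_pages(
    node: dict[str, Any], resolve: Callable[[Any], dict[str, Any]]
) -> Iterator[dict[str, Any]]:
    """Yield leaf page dictionaries from a page tree in document order.

    `resolve` turns a reference into the dictionary it points to
    (an assumption standing in for something like resolve_reference).
    """
    if node.get("Type") == "Pages":
        # Intermediate node: descend into each kid in order
        for kid in node.get("Kids", []):
            yield from iter_pages(resolve(kid), resolve)
    else:
        # Leaf node: an actual page
        yield node


# Toy object table; "Label" is a hypothetical key used only to tell pages apart
objects = {
    1: {"Type": "Pages", "Kids": [2, 5], "Count": 3},
    2: {"Type": "Pages", "Kids": [3, 4], "Count": 2},
    3: {"Type": "Page", "Label": "first"},
    4: {"Type": "Page", "Label": "second"},
    5: {"Type": "Page", "Label": "third"},
}

pages = list(iter_pages(objects[1], objects.__getitem__))
print([p["Label"] for p in pages])  # ['first', 'second', 'third']
```

Here the first kid of the root (object 2) is itself a Pages node, which is the case that indexing `Kids[0]` directly would mishandle.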