PDF Document Reference

pdfnaut provides a high-level interface for working with PDFs in the form of PdfDocument.

class pdfnaut.document.PdfDocument

Bases: PdfParser

A PDF document that can be read and written to.

In essence, it is a wrapper around PdfParser intended for users who want to work with a document through high-level interfaces rather than the parser's low-level methods.

__init__(data: bytes, *, strict: bool = False) None
access_level

The current access level of the document. It may be any of the values in PermsAcquired:

  • Owner (2): Full access to the document. If the document is not encrypted, this is the default value.

  • User (1): Access to the document under restrictions.

  • None (0): Document is currently encrypted.
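Because the three levels are ordered numerically, they can be compared with each other. The sketch below uses a hypothetical stand-in enum, AccessLevel, that only mirrors the numeric values listed above. It is not pdfnaut's PermsAcquired type, whose member names and import path are not shown on this page.

```python
from enum import IntEnum


class AccessLevel(IntEnum):
    """Hypothetical stand-in mirroring the documented values (not pdfnaut's type)."""

    NONE = 0   # document is still encrypted
    USER = 1   # opened with restrictions
    OWNER = 2  # full access (also the default for unencrypted documents)


def describe(level: int) -> str:
    """Return a short human-readable summary of an access level."""
    level = AccessLevel(level)
    if level is AccessLevel.OWNER:
        return "full access"
    if level is AccessLevel.USER:
        return "restricted access"
    return "encrypted, no access yet"
```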

property access_permissions: UserAccessPermissions | None

User access permissions relating to the document, if any.

See UserAccessPermissions for details.

build_xref_map(subsections: list[PdfXRefSubsection]) dict[tuple[int, int], PdfXRefEntry]

Creates a dictionary mapping references to XRef entries in the document.

property catalog: PdfDictionary

The document catalog representing the root of the document’s object hierarchy, including references to the page tree, outlines, destinations, and other core elements in a PDF document.

For details on the contents of the document catalog, see ISO 32000-2:2020 § 7.7.2 “Document catalog dictionary”.

copy_metadata(direction: MetadataCopyDirection) None

Reconciles the document's metadata sources by copying data from one source to the other, in the provided direction.

A PDF may store document metadata in either the document information (DocInfo) dictionary or in XMP. This function ensures that the two sources are equivalent by using the metadata mapping described in Reconciling PDF metadata.

If the metadata source to copy to does not exist, it will be created; otherwise, it will be overwritten. ValueError is raised if the source to copy from does not exist.
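The copy rules above can be sketched with two plain dictionaries standing in for the DocInfo dictionary and the XMP packet. Everything here is an assumption made for illustration: the function name reconcile, the direction strings, and the three-entry key mapping are hypothetical, and the actual mapping is the one given in Reconciling PDF metadata.

```python
from __future__ import annotations

# Illustrative subset of a DocInfo-to-XMP key mapping (assumption: the real
# mapping is described in "Reconciling PDF metadata" and may differ).
DOCINFO_TO_XMP = {
    "Title": "dc:title",
    "Author": "dc:creator",
    "Producer": "pdf:Producer",
}


def reconcile(doc_info: dict | None, xmp: dict | None, direction: str):
    """Copy mapped fields between two dicts standing in for DocInfo and XMP.

    direction is "doc_info_to_xmp" or "xmp_to_doc_info" (hypothetical names).
    Returns the (doc_info, xmp) pair after copying.
    """
    if direction == "doc_info_to_xmp":
        if doc_info is None:
            raise ValueError("no DocInfo dictionary to copy from")
        xmp = {} if xmp is None else xmp  # create the target if it is missing
        for info_key, xmp_key in DOCINFO_TO_XMP.items():
            if info_key in doc_info:
                xmp[xmp_key] = doc_info[info_key]  # overwrite existing values
    elif direction == "xmp_to_doc_info":
        if xmp is None:
            raise ValueError("no XMP metadata to copy from")
        doc_info = {} if doc_info is None else doc_info
        for info_key, xmp_key in DOCINFO_TO_XMP.items():
            if xmp_key in xmp:
                doc_info[info_key] = xmp[xmp_key]
    else:
        raise ValueError(f"unknown direction: {direction!r}")
    return doc_info, xmp
```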

decrypt(password: str) PermsAcquired

Decrypts this document through the Standard security handler using the provided password.

The standard security handler may specify two passwords: an owner password and a user password. The owner password grants full access to the PDF, while the user password grants access according to the permissions specified in the document.

When the document is decrypted successfully, the object cache is cleared to make way for the new objects in decrypted form.

Returns:

A value specifying the permissions acquired by password.

Return type:

PermsAcquired
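A minimal usage sketch follows. It relies only on members documented on this page (from_filename, has_encryption, decrypt) and on the import path implied by the class's qualified name. The helper name open_with_password is hypothetical.

```python
from pathlib import Path


def open_with_password(path: Path, password: str):
    """Load a PDF and, if it is encrypted, try to decrypt it with password."""
    # Imported lazily so this sketch can be read without pdfnaut installed.
    from pdfnaut.document import PdfDocument

    pdf = PdfDocument.from_filename(path)
    if pdf.has_encryption:
        acquired = pdf.decrypt(password)
        # acquired is a PermsAcquired value: None (0), User (1), or Owner (2).
        print(f"decrypt() returned {acquired!r}")
    return pdf
```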

property doc_info: Info | None

The Info entry of the document trailer which includes the document-level information described in ISO 32000-2:2020 § 14.3.3 “Document information dictionary”.

Some documents may specify a metadata stream rather than a DocInfo dictionary. Such metadata can be accessed using PdfDocument.xmp_info.

PDF 2.0 deprecated all keys of the DocInfo dictionary except for CreationDate and ModDate.

property extensions: ExtensionMap | None

Developer-defined extensions to this document. This feature was introduced in ISO 32000-1 (PDF 1.7). See ExtensionMap for details.

property flattened_pages: Generator[Page, None, None]

A generator suitable for iterating over the pages of a PDF.
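To illustrate what flattening means here, the sketch below walks a page tree depth-first and yields only the leaf pages. It is not pdfnaut's implementation: the tree is modeled with plain dictionaries whose Kids are already resolved, and the Label key is a hypothetical marker added so the order is visible.

```python
from typing import Any, Dict, Iterator


def flatten_pages(node: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
    """Yield the leaf pages of a page tree in document order (depth-first)."""
    if node.get("Type") == "Pages":
        # Intermediate node: recurse into each child in order.
        for kid in node.get("Kids", []):
            yield from flatten_pages(kid)
    else:
        # Leaf node: an actual page.
        yield node


tree = {
    "Type": "Pages",
    "Kids": [
        {"Type": "Page", "Label": "1"},
        {
            "Type": "Pages",
            "Kids": [
                {"Type": "Page", "Label": "2"},
                {"Type": "Page", "Label": "3"},
            ],
        },
    ],
}
labels = [page["Label"] for page in flatten_pages(tree)]
```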

classmethod from_filename(path: str | Path, *, strict: bool = False) PdfDocument

Loads a PDF document from a file path.

get_merged_xrefs() dict[tuple[int, int], PdfXRefEntry]

Combines all XRef updates in the document into a cross-reference mapping that includes all entries.
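One way to picture the merge, assuming updates are listed most recent first (as the updates attribute is described on this page): apply them from oldest to newest so that newer entries replace older ones. The strings used as values are placeholders for PdfXRefEntry objects, and merge_xrefs is a hypothetical name.

```python
from typing import Dict, List, Tuple

Ref = Tuple[int, int]  # (object number, generation number)


def merge_xrefs(updates: List[Dict[Ref, str]]) -> Dict[Ref, str]:
    """Merge per-update maps, given most recent first, so newer entries win."""
    merged: Dict[Ref, str] = {}
    for update in reversed(updates):  # iterate oldest first
        merged.update(update)  # newer updates overwrite older entries
    return merged


updates = [
    {(1, 0): "in use @ 900"},                        # newest update
    {(1, 0): "in use @ 15", (2, 0): "in use @ 80"},  # original file
]
merged = merge_xrefs(updates)
```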

get_object(reference: PdfReference | tuple[int, int], cache: bool = True) PdfObject | PdfStream | PdfNull | FreeObject | Any

Resolves a reference into the indirect object it points to.

Parameters:
  • reference (PdfReference | tuple[int, int]) – A PdfReference object or a tuple of two integers representing, in order, the object number and the generation number.

  • cache (bool, optional) –

    Whether to interact with the object store when resolving references. Defaults to True.

    When True, the parser will read entries from the object store and write new ones if they are not present. If False, the parser will always fetch new entries and will not write to the object store.

    Note that the object store will be accessed regardless of the value of cache if the object is new and is not included in the xref table.

Returns:

The object the reference resolves to.

If the reference is invalid (i.e. does not exist), returns PdfNull. If the object referred to is a free object, returns FreeObject.
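The effect of the cache flag can be sketched with a toy resolver. This is an assumption-laden illustration, not pdfnaut's code: the Resolver class and its load callback are hypothetical, and the special cases described above (new objects missing from the xref table, PdfNull, FreeObject) are left out.

```python
from typing import Any, Callable, Dict, Tuple

Ref = Tuple[int, int]  # (object number, generation number)


class Resolver:
    """Toy resolver illustrating the documented cache flag."""

    def __init__(self, load: Callable[[Ref], Any]) -> None:
        self._load = load  # hypothetical function that parses an object
        self.store: Dict[Ref, Any] = {}

    def get_object(self, ref: Ref, cache: bool = True) -> Any:
        if cache and ref in self.store:
            return self.store[ref]  # served from the object store
        obj = self._load(ref)  # parsed again whenever the store is bypassed
        if cache:
            self.store[ref] = obj  # written only when caching is enabled
        return obj


calls = []
resolver = Resolver(lambda ref: calls.append(ref) or f"object {ref[0]}")
resolver.get_object((1, 0))               # first access: loads and stores
resolver.get_object((1, 0))               # cache hit: no new load
resolver.get_object((1, 0), cache=False)  # bypasses the store: loads again
```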

property has_encryption: bool

Whether this document includes encryption.

header_version

The document’s PDF version as seen in the header.

This value should be used if no Version entry exists in the document catalog or if the header’s version is newer. Otherwise, use the Version entry.

property language: str | None

A language identifier that shall specify the natural language for all text in the document except where overridden by language specifications for structure elements or marked content.

See ISO 32000-2:2020 § 14.9.2 “Natural language specification” for details.

If this entry is absent or invalid, the language shall be considered unknown.

lookup_xref_start() int

Scans through the PDF until it finds the XRef offset, then returns it.
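As a rough illustration of what is being located, the sketch below finds the offset that follows the last startxref keyword in a byte string. How pdfnaut actually scans the file is not described here; the regular expression, the function name find_startxref, and the sample bytes are assumptions.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the byte offset that follows the last startxref keyword."""
    matches = re.findall(rb"startxref\s+(\d+)", data)
    if not matches:
        raise ValueError("no startxref keyword found")
    return int(matches[-1])  # the last occurrence belongs to the newest update


sample = (
    b"%PDF-1.7\n1 0 obj\nnull\nendobj\n"
    b"startxref\n116\n%%EOF\n"
    b"startxref\n542\n%%EOF\n"
)
offset = find_startxref(sample)
```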

property mark_info: MarkInfo | None

Information pertaining to the document’s conformance to tagged PDF conventions.

See MarkInfo for details.

classmethod new() PdfDocument

Creates a blank PDF document.

new_outline() None

Creates an empty outline tree.

objects

A mapping of objects present in the document.

property outline: OutlineTree | None

The outline tree including a hierarchy of outline items or bookmarks used for document-level navigation.

property outline_tree: PdfDictionary | None

The document’s outline tree including what is commonly referred to as bookmarks. See ISO 32000-2:2020 § 12.3.3 “Document outline” for details.

property page_layout: Literal['SinglePage', 'OneColumn', 'TwoColumnLeft', 'TwoColumnRight', 'TwoPageLeft', 'TwoPageRight']

The page layout to use when opening the document. May be one of the following values:

  • SinglePage: Display one page at a time (default).

  • OneColumn: Display the pages in one column.

  • TwoColumnLeft: Display the pages in two columns, with odd-numbered pages on the left.

  • TwoColumnRight: Display the pages in two columns, with odd-numbered pages on the right.

  • TwoPageLeft: Display the pages two at a time, with odd-numbered pages on the left (PDF 1.5).

  • TwoPageRight: Display the pages two at a time, with odd-numbered pages on the right (PDF 1.5).

property page_mode: Literal['UseNone', 'UseOutlines', 'UseThumbs', 'FullScreen', 'UseOC', 'UseAttachments']

Value specifying how the document shall be displayed when opened:

  • UseNone: Neither document outline nor thumbnail images visible (default).

  • UseOutlines: Document outline visible.

  • UseThumbs: Thumbnail images visible.

  • FullScreen: Full-screen mode, with no menu bar, window controls, or any other window visible.

  • UseOC: Optional content group panel visible (PDF 1.5).

  • UseAttachments: Attachments panel visible (PDF 1.6).

property page_tree: PdfDictionary

The document’s page tree described in ISO 32000-2:2020 § 7.7.3 “Page Tree”.

PdfDocument.pages should be preferred in typical usage.

property pages: PageList

The page list in the document.

parse(start_xref: int | None = None) None

Parses the entire document.

It begins by parsing the most recent XRef table and trailer. If this trailer points to a previous XRef, this function is called again with a start_xref offset until no more XRefs are found.

It also sets up the Standard security handler for use in case the document is encrypted.

Parameters:

start_xref (int, optional) – The offset where the most recent XRef can be found. If no offset is provided, this function will attempt to locate one.

parse_compressed_xref() PdfXRefSection

Parses a compressed cross-reference stream which includes both the XRef table and information from the PDF trailer as described in ISO 32000-2:2020 § 7.5.8 “Cross-reference streams”.

parse_header() str

Parses the %PDF-n.m header that is expected to be at the start of a PDF file.
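The header format can be matched with a short regular expression. This is a sketch under the assumption that the header sits at byte 0. The function name read_header_version is hypothetical and says nothing about how pdfnaut handles malformed or offset headers.

```python
import re

HEADER_RE = re.compile(rb"%PDF-(\d+\.\d+)")


def read_header_version(data: bytes) -> str:
    """Extract the n.m version from a %PDF-n.m header at the start of data."""
    match = HEADER_RE.match(data)  # match() anchors at the start of the bytes
    if match is None:
        raise ValueError("data does not start with a %PDF-n.m header")
    return match.group(1).decode("ascii")
```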

parse_indirect_object(xref_entry: InUseXRefEntry, reference: PdfReference | None) PdfObject | PdfStream

Parses an indirect object that is not stored within an object stream; that is, an object referred to directly by an xref_entry and a reference.

parse_simple_trailer() PdfDictionary

Parses the PDF’s standard trailer which is used to quickly locate other cross reference tables and special objects.

The trailer is separate if the XRef table is standard (uncompressed). Otherwise it is part of the XRef object.

parse_simple_xref() list[PdfXRefSubsection]

Parses a standard, uncompressed XRef table of the format described in ISO 32000-2:2020 § 7.5.4 “Cross-Reference table”.

If startxref points to an XRef object, parse_compressed_xref() should be called instead.
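For orientation, an uncompressed XRef table consists of the xref keyword followed by subsections, each introduced by a "first-object count" line and followed by one entry per object. The sketch below parses that layout from text into plain tuples. It is a simplified illustration with hypothetical names, not pdfnaut's parser, and it does not build PdfXRefSubsection objects.

```python
from typing import List, Tuple

# (object number, offset or next free object, generation, "n" or "f")
Entry = Tuple[int, int, int, str]


def parse_xref_table(text: str) -> List[Entry]:
    """Parse a classic xref table given as text, returning one tuple per entry."""
    lines = [line.strip() for line in text.strip().splitlines()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the xref keyword")
    entries: List[Entry] = []
    i = 1
    while i < len(lines) and lines[i] != "trailer":
        # Subsection header: first object number and number of entries.
        first, count = (int(part) for part in lines[i].split())
        for n in range(count):
            offset, generation, kind = lines[i + 1 + n].split()
            entries.append((first + n, int(offset), int(generation), kind))
        i += 1 + count  # skip the header and its entries
    return entries


table = """xref
0 2
0000000000 65535 f
0000000017 00000 n
5 1
0000000120 00000 n
trailer
"""
entries = parse_xref_table(table)
```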

parse_stream(xref_entry: InUseXRefEntry, extent: int) bytes

Parses the contents of a PDF stream at xref_entry.

extent specifies the number of bytes the stream is expected to have.

parse_xref_and_trailer() PdfXRefSection

Parses both the cross-reference table and the PDF trailer.

PDFs may include a typical uncompressed XRef table (and hence separate XRefs and trailers) or an XRef stream that combines both.

property pdf_version: str

The version of the PDF standard implemented by this document.

To determine the PDF version, the /Version entry in the catalog is checked. If it is absent, the version specified in the header is returned. If both are present, the later of the two is returned, as determined by lexicographical comparison.
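The selection rule can be sketched as follows. The function name effective_version is hypothetical, and the sketch takes both versions as plain strings rather than reading them from a document.

```python
from typing import Optional


def effective_version(header_version: str, catalog_version: Optional[str]) -> str:
    """Pick the later of the header and catalog versions by string comparison."""
    if catalog_version is None:
        return header_version
    # Plain string comparison: "1.7" < "2.0" because "1" < "2".
    return max(header_version, catalog_version)
```

Lexicographical comparison gives the expected result for the published PDF versions (1.0 through 1.7, and 2.0) because each has a single-digit major and minor number. A hypothetical version such as "1.10" would sort before "1.7".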

save(filepath: str | Path | IO[bytes]) None

Saves the contents of this parser to filepath.

filepath may be either a string containing a path, a pathlib.Path instance, or a byte stream (that is, any class implementing IO[bytes]).
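Since a byte stream is accepted, a document can be serialized in memory. The sketch below uses only PdfDocument.new() and save() as documented on this page. The helper name save_to_memory is hypothetical, and the sketch assumes that save() leaves the stream open after writing.

```python
import io


def save_to_memory() -> bytes:
    """Create a blank document and serialize it into an in-memory buffer."""
    # Imported lazily so this sketch can be read without pdfnaut installed.
    from pdfnaut.document import PdfDocument

    pdf = PdfDocument.new()
    buffer = io.BytesIO()
    pdf.save(buffer)  # save() accepts a str, a pathlib.Path, or a byte stream
    return buffer.getvalue()  # assumes save() did not close the buffer
```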

security_handler

The document’s standard security handler, if any, as specified in the Encrypt dictionary of the PDF trailer.

This field being set indicates that a supported security handler was used for encryption. If not set, the parser will not attempt to decrypt this document.

trailer

The most recent trailer in the PDF document.

For details on the contents of the trailer, see ISO 32000-2:2020 § 7.5.5 “File Trailer”.

updates: list[PdfXRefSection]

A list of all incremental updates present in the document (most recent update first).

property viewer_preferences: ViewerPreferences | None

Settings controlling how a PDF reader shall display a document on the screen. If this value is absent, the PDF reader should choose its own default preferences.

See ViewerPreferences for details.

property xmp_info: XmpMetadata | None

The /Metadata entry of the document catalog which includes document-level metadata stored as XMP.

xref: dict[tuple[int, int], PdfXRefEntry]

A cross-reference mapping combining the entries of all XRef tables present in the document.

The key is a tuple of two integers: object number and generation number. The value is any of the 3 types of XRef entries (free, in use, compressed).

This attribute reflects the state of the XRef table when the document was first loaded. Assume read-only.