PDF Parser Reference

pdfnaut.cos.parser.MapObject: TypeAlias = 'PdfObject | PdfStream | FreeObject'

class pdfnaut.cos.parser.ObjectMap

Bases: UserDict[int, PdfObject | PdfStream | FreeObject]

A mapping of object numbers to either object references, in-use objects or free objects.

Object references included in ObjectMap.unresolved are items that have not been requested yet. Once an object is requested, it is removed from the unresolved set and added to the map as is.

Free objects are indicated with the FreeObject class.
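The lazy resolution described above can be pictured with a small stand-alone sketch. It does not import pdfnaut and is not its implementation: `LazyMap`, `load`, and `calls` are hypothetical names used only to show how an unresolved set and a mapping can work together.

```python
from collections import UserDict
from typing import Callable


class LazyMap(UserDict):
    """Toy mapping that resolves entries on first access (not pdfnaut's ObjectMap)."""

    def __init__(self, known: set, loader: Callable[[int], object]) -> None:
        super().__init__()
        self.unresolved = set(known)  # object numbers that were never requested
        self._loader = loader         # hypothetical callback that parses one object

    def __getitem__(self, obj_num: int) -> object:
        if obj_num in self.unresolved:
            # First request: load the object, store it, and mark it as resolved.
            self.data[obj_num] = self._loader(obj_num)
            self.unresolved.discard(obj_num)
        return self.data[obj_num]


calls = []  # records every object number the loader was asked for


def load(obj_num: int) -> str:
    calls.append(obj_num)
    return f"object {obj_num}"


objects = LazyMap({1, 2, 3}, load)
first = objects[2]   # triggers the loader
second = objects[2]  # served from the map; the loader is not called again
```

After the two lookups, object 2 has left the unresolved set while objects 1 and 3 remain untouched.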

__init__(pdf: PdfParser) → None

add(pdf_object: PdfObject | PdfStream) → PdfReference[PdfObject | PdfStream]: Adds a new pdf_object to the map. Returns its reference.

delete(obj_num: int) → MapObject | None: Deletes object with number obj_num. Returns the object if it exists, otherwise returns None.

fill() → None: Fills the object map with the items available in the PDF’s xref table.

free(obj_num: int) → None: Marks object with number obj_num as a free object.

get_next_ref() → PdfReference: Creates a new reference based on the current object number in the map.

initial_reference_map: dict[int, tuple[int, int]]: A mapping of object numbers to reference tuples for the initial entries made when the object map is filled.

unresolved: A set of unresolved object numbers (objects that have not been requested or cached yet).

class pdfnaut.cos.parser.ObjectStream

Bases: object

A mapping of object numbers to PDF objects representing an object stream (see ISO 32000-2:2020 § 7.5.7 “Object Streams”).

Parameters:

pdf (PdfParser) – The PDF parser or document to which this object stream belongs.
stream (PdfStream) – The stream being represented by this object.
stream_objnum (int) – The object number of this stream within the PDF document.

__init__(pdf: PdfParser, stream: PdfStream, stream_objnum: int) → None

get_object(index: int, *, cache: bool = True) → PdfObject

Gets an object at a specified index inside an object stream.

Parameters:

index (int) – The index of an object within the stream.
cache (bool, optional, keyword only) –
Whether to access or write to the object store (by default, True).

If True, this method will always retrieve from and write objects to the object store if possible. If False, this method will always retrieve objects from the contents of the stream.

parse_indices() → list[tuple[int, int]]

Parses the object stream’s indices.

The indices are a list of 2-element pairs specifying, in order, the object number of an item within the stream and the object’s location within the stream relative to the offset in the /First key.
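The index layout described above can be illustrated with a stand-alone sketch that works on already-decoded stream bytes. It is not pdfnaut's code: the function names and the `n` and `first` parameters (standing in for the stream's /N and /First entries) are assumptions made for this example.

```python
def parse_index_pairs(decoded: bytes, n: int, first: int) -> list:
    """Read n (object number, relative offset) pairs from the area before /First."""
    numbers = [int(token) for token in decoded[:first].split()]
    if len(numbers) < 2 * n:
        raise ValueError("object stream header is shorter than /N requires")
    return [(numbers[i], numbers[i + 1]) for i in range(0, 2 * n, 2)]


def object_bytes(decoded: bytes, n: int, first: int, index: int) -> bytes:
    """Slice out the raw bytes of the object at position `index`."""
    pairs = parse_index_pairs(decoded, n, first)
    start = first + pairs[index][1]
    # An object ends where the next one starts, or at the end of the stream.
    end = first + pairs[index + 1][1] if index + 1 < n else len(decoded)
    return decoded[start:end].strip()


# Two objects: number 11 at relative offset 0, number 12 at relative offset 10.
header = b"11 0 12 10 "
body = b"<</A 1>>  (hello)"
decoded = header + body
pairs = parse_index_pairs(decoded, 2, len(header))
```

Here `len(header)` plays the role of /First, so each offset in `pairs` is measured from the start of `body`.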

to_stream() → PdfStream: Returns a PdfStream representing the contents of this object stream.

class pdfnaut.cos.parser.PdfParser

Bases: object

A parser that can completely parse a PDF document.

It consumes the PDF’s cross-reference tables and trailers. It merges the tables into a single one and provides an interface to individually parse each indirect object using PdfTokenizer.

Parameters:

data (bytes) – The document to be processed.
strict (bool, optional, keyword only) – Whether to warn or fail on issues caused by non-spec-compliance. Defaults to False.

__init__(data: bytes, *, strict: bool = False) → None

build_xref_map(subsections: list[PdfXRefSubsection]) → dict[tuple[int, int], PdfXRefEntry]: Creates a dictionary mapping references to XRef entries in the document.

decrypt(password: str) → PermsAcquired

Decrypts this document through the Standard security handler using the provided password.

The standard security handler may specify two passwords: an owner password and a user password. The owner password grants full access to the PDF, while the user password grants access according to the permissions specified in the document.

When the document is decrypted successfully, the object cache is cleared to make way for the new objects in decrypted form.

Returns:

A value specifying the permissions acquired by password.

If the document is not encrypted, defaults to PermsAcquired.OWNER. If the document was not decrypted, defaults to PermsAcquired.NONE.

Return type:

PermsAcquired

get_merged_xrefs() → dict[tuple[int, int], PdfXRefEntry]: Combines all XRef updates in the document into a cross-reference mapping that includes all entries.
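As a rough illustration of what combining updates means, the sketch below merges plain dictionaries keyed by (object number, generation). It assumes the updates arrive most recent first and that a newer entry replaces an older one with the same key; `merge_updates` and the string values are hypothetical stand-ins, not pdfnaut's entry types.

```python
def merge_updates(updates: list) -> dict:
    """Merge xref updates given most recent first; newer entries win."""
    merged = {}
    for update in reversed(updates):  # walk from the oldest update to the newest
        merged.update(update)         # a newer entry overwrites the same key
    return merged


updates = [
    # Most recent update: object 1 was rewritten and object 3 was added.
    {(1, 0): "in use @ 900", (3, 0): "in use @ 1200"},
    # Original table.
    {(1, 0): "in use @ 15", (2, 0): "in use @ 80"},
]
merged = merge_updates(updates)
```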

get_object(reference: PdfReference[T], cache: bool = True) → T

get_object(reference: tuple[int, int], cache: bool = True) → PdfObject | PdfStream | PdfNull | FreeObject

Resolves a reference into the indirect object it points to.

Parameters:

reference (PdfReference | tuple[int, int]) – A PdfReference object or a tuple of two integers representing, in order, the object number and the generation number.
cache (bool, optional) –
Whether to interact with the object store when resolving references. Defaults to True.

When True, the parser will read entries from the object store and write new ones if they are not present. If False, the parser will always fetch new entries and will not write to the object store.

Note that the object store will be accessed regardless of the value of cache if the object is new and is not included in the xref table.

Returns:

The object the reference resolves to.

If the reference is invalid (i.e. does not exist), returns PdfNull. If the object referred to is a free object, returns FreeObject.

header_version

The document’s PDF version as seen in the header.

This value should be used if no Version entry exists in the document catalog or if the header’s version is newer. Otherwise, use the Version entry.
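The selection rule above can be written as a short stand-alone helper. This is only a sketch: `effective_version` is a hypothetical function, and it assumes both versions are available as plain strings such as "1.7".

```python
from typing import Optional


def effective_version(header_version: str, catalog_version: Optional[str]) -> str:
    """Pick the header version unless the catalog's Version entry is newer."""

    def as_tuple(version: str) -> tuple:
        # "1.7" -> (1, 7), so versions compare numerically rather than as text.
        return tuple(int(part) for part in version.split("."))

    if catalog_version is None:
        return header_version
    return max(header_version, catalog_version, key=as_tuple)
```

For instance, a header of "1.4" combined with a catalog entry of "1.7" yields "1.7", while a header of "2.0" takes precedence over the same catalog entry.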

lookup_xref_start() → int: Scans through the PDF until it finds the XRef offset then returns it.
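One simple way to locate this offset is to look for the last `startxref` keyword and read the integer that follows it. The sketch below takes that approach on a tiny hand-built file; `find_startxref` and `sample` are hypothetical, and pdfnaut's own scanning strategy may differ.

```python
import re


def find_startxref(data: bytes) -> int:
    """Return the integer that follows the last `startxref` keyword."""
    matches = re.findall(rb"startxref\s+(\d+)", data)
    if not matches:
        raise ValueError("no startxref keyword found")
    return int(matches[-1])


# A minimal file layout; the xref keyword starts at byte 45.
sample = (
    b"%PDF-1.7\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"xref\n0 2\n"
    b"0000000000 65535 f \n"
    b"0000000009 00000 n \n"
    b"trailer\n<< /Size 2 /Root 1 0 R >>\n"
    b"startxref\n45\n%%EOF\n"
)
offset = find_startxref(sample)
```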

objects: A mapping of objects present in the document.

parse(start_xref: int | None = None) → None

Parses the entire document.

It begins by parsing the most recent XRef table and trailer. If this trailer points to a previous XRef, this function is called again with a start_xref offset until no more XRefs are found.

It also sets up the Standard security handler for use in case the document is encrypted.

Parameters:

start_xref (int, optional) – The offset where the most recent XRef can be found. If no offset is provided, this function will attempt to locate one.

parse_compressed_xref() → PdfXRefSection: Parses a compressed cross-reference stream which includes both the XRef table and information from the PDF trailer as described in ISO 32000-2:2020 § 7.5.8 “Cross-reference streams”.
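To give a feel for the binary layout of a cross-reference stream, the sketch below splits decoded stream data into fixed-width, big-endian fields according to a /W-style width triple. It is a simplified illustration, not pdfnaut's parser: `decode_xref_stream` is hypothetical, it handles a single subsection starting at `first_obj`, and it assumes a zero-width type field defaults to 1 and other zero-width fields default to 0.

```python
def decode_xref_stream(data: bytes, widths: tuple, first_obj: int = 0) -> list:
    """Return (object number, type, field 2, field 3) tuples from decoded data."""
    entry_size = sum(widths)
    if entry_size == 0:
        raise ValueError("field widths must not all be zero")
    entries = []
    for n, pos in enumerate(range(0, len(data) - entry_size + 1, entry_size)):
        fields = []
        cursor = pos
        for i, width in enumerate(widths):
            raw = data[cursor:cursor + width]
            cursor += width
            if width == 0:
                fields.append(1 if i == 0 else 0)  # assumed defaults
            else:
                fields.append(int.from_bytes(raw, "big"))
        entries.append((first_obj + n, *fields))
    return entries


# Widths 1, 2, 1: a free entry, an in-use entry at offset 17, and an
# entry stored at index 0 of object stream 5.
raw = bytes([0, 0, 0, 255, 1, 0, 17, 0, 2, 0, 5, 0])
entries = decode_xref_stream(raw, (1, 2, 1))
```

The type field distinguishes the three kinds of entries: 0 for free, 1 for in use, and 2 for objects stored in an object stream.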

parse_header() → str: Parses the %PDF-n.m header that is expected to be at the start of a PDF file.
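A minimal stand-alone sketch of reading such a header is shown below. `read_header_version` is a hypothetical helper that requires the header at byte 0, which is stricter than some real-world readers; it is not pdfnaut's implementation.

```python
import re


def read_header_version(data: bytes) -> str:
    """Extract "n.m" from a leading %PDF-n.m header."""
    match = re.match(rb"%PDF-(\d+\.\d+)", data)
    if match is None:
        raise ValueError("missing %PDF-n.m header")
    return match.group(1).decode("ascii")


version = read_header_version(b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n")
```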

parse_indirect_object(xref_entry: InUseXRefEntry, reference: PdfReference | None) → PdfObject | PdfStream: Parses an indirect object that is not stored within an object stream, that is, an object referred to directly by an xref_entry and a reference.

parse_simple_trailer() → PdfDictionary

Parses the PDF’s standard trailer which is used to quickly locate other cross reference tables and special objects.

The trailer is separate if the XRef table is standard (uncompressed). Otherwise it is part of the XRef object.

parse_simple_xref() → list[PdfXRefSubsection]

Parses a standard, uncompressed XRef table of the format described in ISO 32000-2:2020 § 7.5.4 “Cross-Reference table”.

If startxref points to an XRef object, parse_compressed_xref() should be called instead.
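For orientation, the sketch below reads a small uncompressed table given as text. It is a simplified stand-in rather than pdfnaut's parser: `parse_xref_table` is hypothetical, it returns a plain dictionary instead of subsection objects, and it does no validation beyond checking the leading `xref` keyword.

```python
def parse_xref_table(text: str) -> dict:
    """Map (object number, generation) to (entry kind, offset)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines or lines[0] != "xref":
        raise ValueError("expected the xref keyword")
    entries = {}
    i = 1
    while i < len(lines) and lines[i] != "trailer":
        # Subsection header: first object number and entry count.
        first, count = (int(part) for part in lines[i].split())
        i += 1
        for obj_num in range(first, first + count):
            offset, generation, kind = lines[i].split()
            entries[(obj_num, int(generation))] = (kind, int(offset))
            i += 1
    return entries


table = parse_xref_table(
    "xref\n"
    "0 3\n"
    "0000000000 65535 f \n"
    "0000000017 00000 n \n"
    "0000000081 00000 n \n"
    "trailer\n"
)
```

In each entry, `f` marks a free object and `n` marks an object in use.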

parse_stream(xref_entry: InUseXRefEntry, extent: int) → bytes

Parses the contents of a PDF stream at xref_entry.

extent specifies the number of bytes the stream is expected to have.

parse_xref_and_trailer() → PdfXRefSection

Parses both the cross-reference table and the PDF trailer.

PDFs may include a typical uncompressed XRef table (and hence separate XRefs and trailers) or an XRef stream that combines both.

save(filepath: str | Path | IO[bytes]) → None

Saves the contents of this parser to filepath.

filepath may be either a string containing a path, a pathlib.Path instance, or a byte stream (that is, any class implementing IO[bytes]).
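Accepting either a path or an open byte stream is a common pattern. The sketch below shows one way such a dispatch could look; `write_bytes` is a hypothetical helper and says nothing about how pdfnaut serializes a document.

```python
import io
from pathlib import Path
from typing import IO, Union


def write_bytes(target: Union[str, Path, IO[bytes]], payload: bytes) -> None:
    """Write payload to a filesystem path or to an already-open byte stream."""
    if isinstance(target, (str, Path)):
        Path(target).write_bytes(payload)  # path-like: open, write, close
    else:
        target.write(payload)              # stream: the caller keeps ownership


buffer = io.BytesIO()
write_bytes(buffer, b"%PDF-1.7\n")
```

Writing to an in-memory `io.BytesIO` like this is convenient when the output should not touch the disk.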

security_handler

The document’s standard security handler, if any, as specified in the Encrypt dictionary of the PDF trailer.

This field being set indicates that a supported security handler was used for encryption. If not set, the parser will not attempt to decrypt this document.

trailer

The most recent trailer in the PDF document.

For details on the contents of the trailer, see ISO 32000-2:2020 § 7.5.5 “File Trailer”.

updates: list[PdfXRefSection]: A list of all incremental updates present in the document (most recent update first).

xref: dict[tuple[int, int], PdfXRefEntry]

A cross-reference mapping combining the entries of all XRef tables present in the document.

The key is a tuple of two integers: the object number and the generation number. The value is one of the three types of XRef entries (free, in use, compressed).

This attribute reflects the state of the XRef table when the document was first loaded. Assume read-only.

class pdfnaut.cos.parser.PermsAcquired

Bases: IntEnum

Permissions acquired after opening or decrypting a document.

NONE = 0: No permissions acquired; the document is still encrypted.

OWNER = 2: Owner permissions (all permissions).

USER = 1: User permissions within the limits specified by the security handler.

__new__(value)
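Because the members are integers, they can be compared directly. The stand-alone sketch below mirrors the documented values in a local enum so it runs without pdfnaut; the `is_unlocked` helper is hypothetical.

```python
from enum import IntEnum


class PermsAcquired(IntEnum):
    """Local stand-in mirroring the values documented above."""

    NONE = 0
    USER = 1
    OWNER = 2


def is_unlocked(perms: PermsAcquired) -> bool:
    """True when at least user-level access was acquired."""
    return perms >= PermsAcquired.USER


levels = sorted(PermsAcquired)  # ordered by integer value
```

Ordering by value places NONE first and OWNER last, so a check such as `perms >= PermsAcquired.USER` covers both user and owner access.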