COS Objects Reference

The PDF 2.0 specification defines the following basic object types:

pdfnaut Object Mapping

PDF Object                                 Python Object
-----------------------------------------  -------------
Booleans (true/false)                      bool
Integers (123)                             int
Real numbers (123.456)                     float
Literal strings ((hello world))            bytes
Hexadecimal strings (<616263>)             PdfHexString
Names (/Type)                              PdfName
Arrays ([1 2 3])                           PdfArray
Dictionaries (<< /Type /Catalog ... >>)    PdfDictionary
Streams                                    PdfStream
Null                                       PdfNull
Indirect references (1 0 R)                PdfReference
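To make the mapping concrete, the following minimal sketch writes a few plain Python values using basic PDF syntax. It does not use pdfnaut: `Name` and `sketch_serialize` are hypothetical stand-ins, `None` stands in for the null object, real numbers are left out, and names are assumed to need no `#xx` escapes.

```python
class Name(str):
    """Hypothetical stand-in for a PDF name; not pdfnaut's PdfName."""


def sketch_serialize(value) -> bytes:
    """Writes a Python value using basic PDF object syntax (illustrative only)."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return b"true" if value else b"false"
    if isinstance(value, int):
        return str(value).encode("ascii")
    if isinstance(value, Name):
        # Assumption: the name contains only regular characters.
        return b"/" + value.encode("ascii")
    if isinstance(value, bytes):
        # Literal strings escape backslashes and parentheses.
        escaped = (
            value.replace(b"\\", b"\\\\").replace(b"(", b"\\(").replace(b")", b"\\)")
        )
        return b"(" + escaped + b")"
    if isinstance(value, list):
        return b"[" + b" ".join(sketch_serialize(item) for item in value) + b"]"
    if isinstance(value, dict):
        pairs = b" ".join(
            b"/" + key.encode("ascii") + b" " + sketch_serialize(item)
            for key, item in value.items()
        )
        return b"<< " + pairs + b" >>"
    if value is None:
        return b"null"
    raise TypeError(f"unsupported type: {type(value).__name__}")


catalog = sketch_serialize({"Type": Name("Catalog"), "Kids": [1, True, b"a(b"]})
```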

The spec also defines general-purpose data structures built from the basic object types.

  • Strings are divided into:

    • ASCII strings.

    • Byte strings: hex strings or literal strings containing binary data.

    • PDFDocEncoded strings.

    • Text strings: encoded in PDFDocEncoding, UTF-16BE, or UTF-8. UTF-8 support was introduced in PDF 2.0.

  • Dates: implemented via encode_iso8824() and parse_iso8824().

  • The following data structures do not currently have a dedicated type:

    • File specifications

    • Functions

    • Name trees

    • Number trees

    • Rectangles

    • Text streams
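For the date strings mentioned in the list above, ISO 32000-2:2020 § 7.9.4 defines the form D:YYYYMMDDHHmmSSOHH'mm, where everything after the year is optional. The sketch below parses that form with the standard library only. It is not the implementation behind parse_iso8824(); `sketch_parse_pdf_date` is a hypothetical name, and returning a naive datetime when no offset is given is an assumption of this sketch.

```python
import re
from datetime import datetime, timedelta, timezone

# Every field after the year is optional; the offset is O HH ' mm with
# optional apostrophes, where O is one of "+", "-" or "Z".
_DATE_RE = re.compile(
    r"^D:(?P<year>\d{4})(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<off_h>\d{2})'?(?P<off_m>\d{2})?'?)?)?$"
)


def sketch_parse_pdf_date(text: str) -> datetime:
    match = _DATE_RE.match(text)
    if match is None:
        raise ValueError(f"not a PDF date string: {text!r}")
    parts = match.groupdict()

    sign = parts["sign"]
    if sign is None:
        tz = None  # assumption: no offset yields a naive datetime
    elif sign == "Z":
        tz = timezone.utc
    else:
        offset = timedelta(
            hours=int(parts["off_h"] or 0), minutes=int(parts["off_m"] or 0)
        )
        tz = timezone(offset if sign == "+" else -offset)

    # Missing month and day default to 1; missing time fields default to 0.
    return datetime(
        int(parts["year"]),
        int(parts["month"] or 1),
        int(parts["day"] or 1),
        int(parts["hour"] or 0),
        int(parts["minute"] or 0),
        int(parts["second"] or 0),
        tzinfo=tz,
    )
```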

Base Objects

pdfnaut.cos.objects.base.PdfObject

alias of bool | int | float | bytes | PdfArray | PdfDictionary | PdfHexString | PdfName | PdfReference | PdfNull

class pdfnaut.cos.objects.base.PdfComment[source]

Bases: object

A comment introduced by the presence of the percent sign (%) outside a string or inside a content stream. Comments have no syntactical meaning and shall be interpreted as whitespace (see ISO 32000-2:2020 § 7.2.4 “Comments”).

__init__(value: bytes) None
value: bytes

The value of this comment.

class pdfnaut.cos.objects.base.PdfHexString[source]

Bases: object

A string of characters encoded in hexadecimal, useful for including arbitrary binary data in a PDF (see ISO 32000-2:2020 § 7.3.4.3 “Hexadecimal Strings”).

__init__(raw: bytes) None
classmethod from_raw(data: bytes) Self[source]

Creates a hexadecimal string from data.

raw: bytes

The hex value of the string.

property value: bytes

The decoded value of the hex string.
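The relationship between the raw hex digits and the decoded value can be sketched with the standard library. This is an illustration of the encoding described in § 7.3.4.3, not pdfnaut's code; the function names are hypothetical. Per the specification, whitespace between digits is ignored and a missing final digit is treated as 0.

```python
import binascii

# The six whitespace characters defined by the PDF specification.
PDF_WHITESPACE = b"\x00\t\n\x0c\r "


def sketch_hex_to_bytes(raw: bytes) -> bytes:
    """Decodes the digits found between < and > into the bytes they represent."""
    digits = bytes(b for b in raw if b not in PDF_WHITESPACE)
    if len(digits) % 2:
        digits += b"0"  # an odd digit count implies a trailing 0
    return binascii.unhexlify(digits)


def sketch_bytes_to_hex(data: bytes) -> bytes:
    """Encodes arbitrary bytes as hex digits suitable for a hex string."""
    return binascii.hexlify(data)
```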

class pdfnaut.cos.objects.base.PdfInlineImage[source]

Bases: object

A PDF inline image within a content stream (see ISO 32000-2:2020 § 8.9.7 “Inline images”).

__init__(details: PdfDictionary, raw: bytes) None
details: PdfDictionary

Details about the inline image.

raw: bytes

The raw contents of the inline image.

class pdfnaut.cos.objects.base.PdfName[source]

Bases: Generic[T]

An atomic symbol uniquely defined by a sequence of 8-bit characters (see ISO 32000-2:2020 § 7.3.5 “Name Objects”).

__init__(value: T) None
value: T

The value of this name.

class pdfnaut.cos.objects.base.PdfNull[source]

Bases: object

A PDF ‘null’ object, distinct from all other PDF objects (see ISO 32000-2:2020 § 7.3.9 “Null Object”).

class pdfnaut.cos.objects.base.PdfOperator[source]

Bases: object

A PDF operator within a content stream (see ISO 32000-2:2020 § 7.8.2 “Content streams”).

__init__(name: bytes, args: list[PdfObject] | list[PdfInlineImage]) None
args: list[PdfObject] | list[PdfInlineImage]

The arguments or operands provided to this operator.

name: bytes

The name of this operator.

class pdfnaut.cos.objects.base.PdfReference[source]

Bases: Generic[T]

A reference to a PDF indirect object (see ISO 32000-2:2020 § 7.3.10 “Indirect objects”).

__init__(object_number: int, generation: int) None
generation: int

The generation of the object being referenced.

get() T[source]

Returns the object this reference points to. Raises PdfResolutionError if the reference cannot be resolved.

object_number: int

The object number of the object being referenced.

with_resolver(resolver: Callable[[PdfReference], T]) Self[source]

Sets a resolution method resolver for this reference.
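The resolver pattern can be illustrated with a standalone stand-in. `SketchReference` below is hypothetical and only mirrors the shape of the methods documented here; it raises a plain LookupError when no resolver is attached, which is a choice of this sketch.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class SketchReference:
    """Hypothetical stand-in for an indirect reference; not pdfnaut's class."""

    object_number: int
    generation: int
    _resolver: Optional[Callable[["SketchReference"], Any]] = field(
        default=None, repr=False, compare=False
    )

    def with_resolver(self, resolver):
        # Store the callable and return self so calls can be chained.
        self._resolver = resolver
        return self

    def get(self):
        if self._resolver is None:
            raise LookupError("no resolver attached to this reference")
        return self._resolver(self)


# A toy object table keyed by (object number, generation).
objects = {(1, 0): {"Type": "Catalog"}}
ref = SketchReference(1, 0).with_resolver(
    lambda r: objects[(r.object_number, r.generation)]
)
resolved = ref.get()
```

Keeping the resolver as a callable means the reference itself stays a small value object, while the document that owns the object table decides how lookups happen.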

pdfnaut.cos.objects.base.encode_text_string(text: str, *, utf8: bool = False) bytes[source]

Encodes a text string in either PDFDocEncoding or UTF-16BE. PDFDocEncoding is tried first; UTF-16BE is used only if text cannot be represented in PDFDocEncoding.

If utf8 is True, text will be encoded in UTF-8 as the fallback instead of UTF-16BE. Note that UTF-8 text strings are a PDF 2.0 feature which may not be supported by all PDF processors.
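The fallback order can be sketched as follows. Python's standard library has no PDFDocEncoding codec, so this sketch only takes the single-byte path for printable ASCII, which PDFDocEncoding shares; that restriction, the function name, and the byte-order-mark prefixes (taken from the specification) are assumptions of the sketch rather than a description of pdfnaut's internals.

```python
def sketch_encode_text_string(text: str, *, utf8: bool = False) -> bytes:
    # Printable ASCII has the same byte values in PDFDocEncoding, so it is
    # used here as a conservative stand-in for the full encoding.
    if all(0x20 <= ord(char) <= 0x7E for char in text):
        return text.encode("ascii")
    if utf8:
        # PDF 2.0: UTF-8 text strings start with the bytes EF BB BF.
        return b"\xef\xbb\xbf" + text.encode("utf-8")
    # UTF-16BE text strings start with the bytes FE FF.
    return b"\xfe\xff" + text.encode("utf-16-be")
```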

pdfnaut.cos.objects.base.parse_text_string(encoded: PdfHexString | bytes) str[source]

Parses a text string as described in ISO 32000-2:2020 § 7.9.2.2 “Text string type”.

Text strings may be encoded in PDFDocEncoding, UTF-16BE, or (PDF 2.0) UTF-8. The two Unicode encodings are indicated by a byte-order mark at the beginning (FE FF for UTF-16BE and EF BB BF for UTF-8). PDFDocEncoded strings have no such mark.
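Detecting the encoding from the leading bytes can be sketched like this. The function name is hypothetical, and because the standard library has no PDFDocEncoding codec, the unmarked case is decoded as Latin-1. That is only an approximation: the two encodings agree on printable ASCII but differ for some other byte values.

```python
def sketch_parse_text_string(encoded: bytes) -> str:
    if encoded.startswith(b"\xfe\xff"):
        # UTF-16BE: drop the two-byte mark, then decode.
        return encoded[2:].decode("utf-16-be")
    if encoded.startswith(b"\xef\xbb\xbf"):
        # UTF-8 (PDF 2.0): drop the three-byte mark, then decode.
        return encoded[3:].decode("utf-8")
    # Approximation: Latin-1 stands in for PDFDocEncoding here.
    return encoded.decode("latin-1")
```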

Stream Objects

class pdfnaut.cos.objects.stream.PdfStream[source]

Bases: object

A sequence of bytes that may be of unlimited length. Objects with a large amount of data like images or fonts are usually represented by streams (see ISO 32000-2:2020 § 7.3.8 “Stream objects”).

__init__(details: PdfDictionary[str, bool | int | float | bytes | PdfArray | PdfDictionary | PdfHexString | PdfName | PdfReference | PdfNull], raw: bytes, _crypt_params: PdfDictionary[str, Any] = <factory>) None
classmethod create(raw: bytes, details: PdfDictionary | None = None, crypt_params: PdfDictionary | None = None) Self[source]

Creates a stream from unencoded data raw applying the filter(s) specified in details. The length of the encoded output will automatically be appended to details.

Raises pdfnaut.exceptions.PdfFilterError if a filter used is unsupported.

decode() bytes[source]

Returns the decoded contents of the stream. If no filter is defined, it returns the original contents.

Raises pdfnaut.exceptions.PdfFilterError if a filter used is unsupported.
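The encode and decode round trip can be illustrated for a single filter using zlib, which implements the compression behind FlateDecode. The sketch uses a plain dict in place of a stream dictionary and raises ValueError for other filters; both choices, and the function names, are assumptions made for illustration.

```python
import zlib


def sketch_create(raw: bytes, details=None):
    """Encodes raw per the Filter entry and records the encoded length."""
    details = dict(details or {})
    filter_name = details.get("Filter")
    if filter_name is None:
        encoded = raw
    elif filter_name == "FlateDecode":
        encoded = zlib.compress(raw)
    else:
        raise ValueError(f"unsupported filter: {filter_name}")
    # Length describes the encoded bytes, not the original data.
    details["Length"] = len(encoded)
    return details, encoded


def sketch_decode(details, encoded: bytes) -> bytes:
    """Reverses sketch_create; data without a filter passes through unchanged."""
    filter_name = details.get("Filter")
    if filter_name is None:
        return encoded
    if filter_name == "FlateDecode":
        return zlib.decompress(encoded)
    raise ValueError(f"unsupported filter: {filter_name}")


content = b"BT /F1 12 Tf (Hello) Tj ET\n" * 10
details, encoded = sketch_create(content, {"Filter": "FlateDecode"})
```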

details: PdfDictionary[str, bool | int | float | bytes | PdfArray | PdfDictionary | PdfHexString | PdfName | PdfReference | PdfNull]

The stream extent dictionary as described in ISO 32000-2:2020 § 7.3.8.2 “Stream extent”.

modify(raw: bytes) None[source]

Modifies this stream in place by encoding the raw data according to the parameters specified in the stream’s extent.

raw: bytes

The raw data in the stream.

Container Objects

class pdfnaut.cos.objects.containers.PdfArray[source]

Bases: UserList[ArrVal]

A heterogeneous collection of sequentially arranged items (see ISO 32000-2:2020 § 7.3.6 “Array objects”).

PdfArray is effectively a Python list. The main difference from a typical list is that PdfArray automatically resolves references when indexing.

The underlying data in unresolved form is stored in PdfArray.data.
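The idea of resolving on index while keeping the unresolved items in .data can be sketched with collections.UserList. `SketchRef` and `ResolvingList` are hypothetical stand-ins, and the sketch handles integer indexes only; it is not how PdfArray is written.

```python
from collections import UserList


class SketchRef:
    """Hypothetical reference that already knows its target."""

    def __init__(self, target):
        self.target = target

    def get(self):
        return self.target


class ResolvingList(UserList):
    def __getitem__(self, index):
        item = self.data[index]
        # Follow references on access; other values are returned as-is.
        return item.get() if isinstance(item, SketchRef) else item


kids = ResolvingList([SketchRef({"Type": "Page"}), 42])
first_page = kids[0]        # the resolved dictionary
unresolved = kids.data[0]   # still the SketchRef instance
```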

class pdfnaut.cos.objects.containers.PdfDictionary[source]

Bases: UserDict[DictKey, DictVal]

An associative table containing pairs of objects or entries where each entry is composed of a key which is a name object and a value which is any PDF object (see ISO 32000-2:2020 § 7.3.7 “Dictionary objects”).

PdfDictionary is effectively a Python dictionary. Its keys are strings and its values are any PDF object. The main difference from a typical dictionary is that PdfDictionary automatically resolves references on key access.

The underlying data in unresolved form is stored in PdfDictionary.data.
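A comparable sketch for key access, built on collections.UserDict, is shown below. The names are hypothetical and the stand-in reference simply carries its target; the point is only that lookups go through a resolving step while .data keeps the stored reference.

```python
from collections import UserDict


class SketchRef:
    """Hypothetical reference that already knows its target."""

    def __init__(self, target):
        self.target = target

    def get(self):
        return self.target


class ResolvingDict(UserDict):
    def __getitem__(self, key):
        value = super().__getitem__(key)
        # Follow references on key access; other values are returned as-is.
        return value.get() if isinstance(value, SketchRef) else value


trailer = ResolvingDict({"Root": SketchRef({"Type": "Catalog"}), "Size": 4})
root = trailer["Root"]           # the resolved dictionary
stored = trailer.data["Root"]    # still the SketchRef instance
```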

XRef Objects

pdfnaut.cos.objects.xref.PdfXRefEntry

alias of FreeXRefEntry | InUseXRefEntry | CompressedXRefEntry

class pdfnaut.cos.objects.xref.CompressedXRefEntry[source]

Bases: object

A Type 2 or compressed entry. Compressed entries refer to objects stored within an object stream.

__init__(objstm_number: int, index_within: int) None
index_within: int

The index of the object within the object stream.

objstm_number: int

The object number of the object stream containing this object.

class pdfnaut.cos.objects.xref.FreeXRefEntry[source]

Bases: object

A Type 0 (f) or free entry. Free entries are entries not currently in use and are members of the linked list of free objects.

__init__(next_free_object: int, gen_if_used_again: int) None
gen_if_used_again: int

The generation to apply to an object if this entry is used again.

next_free_object: int

The object number of the next free object in the linked list.
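Following the linked list of free objects can be sketched with a plain mapping. Per the specification, object 0 is the head of the list and the last free entry links back to object 0. The function name and the dict-based input are assumptions of this sketch.

```python
def sketch_free_object_numbers(next_free: dict) -> list:
    """Walks the free list.

    next_free maps each free object number to its next_free_object value.
    """
    chain = []
    current = next_free[0]  # object 0 is the head of the list
    while current != 0:     # the tail links back to object 0
        if current in chain:
            raise ValueError("free list contains a cycle")
        chain.append(current)
        current = next_free[current]
    return chain
```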

class pdfnaut.cos.objects.xref.InUseXRefEntry[source]

Bases: object

A Type 1 (n) or in-use entry. In-use entries refer to the objects stored in a document.

__init__(offset: int, generation: int) None
generation: int

The generation of the object.

offset: int

The byte offset of the object in the file (starting after the %PDF marker).

class pdfnaut.cos.objects.xref.PdfXRefSection[source]

Bases: object

A cross-reference section in an XRef table representing an incremental update.

Each section comprises one or more subsections containing XRef entries.

__init__(subsections: list[PdfXRefSubsection], trailer: PdfDictionary) None
subsections: list[PdfXRefSubsection]

The subsections that make up this XRef section.

trailer: PdfDictionary

The trailer dictionary specified within this XRef section.

class pdfnaut.cos.objects.xref.PdfXRefSubsection[source]

Bases: object

A cross-reference subsection in an XRef section.

__init__(first_obj_number: int, count: int, entries: list[PdfXRefEntry]) None
count: int

The number of entries in this subsection.

entries: list[PdfXRefEntry]

The entries contained in this subsection.

first_obj_number: int

The object number of the first entry in this subsection. Entries are numbered consecutively from this value, so the entry at index i has object number first_obj_number + i.
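How object numbers follow from the subsection header can be seen by parsing a classic cross-reference subsection, where a header line gives the first object number and the entry count, and each entry line holds two numeric fields and a type keyword (n or f). This is a simplified sketch with a hypothetical function name that returns plain tuples rather than pdfnaut's entry classes.

```python
def sketch_parse_subsection(text: str):
    """Returns (first_obj_number, count, entries).

    Each entry is (object number, kind, first field, second field). For "n"
    entries the fields are the byte offset and generation; for "f" entries
    they are the next free object number and the generation if reused.
    """
    lines = text.strip().splitlines()
    first_obj_number, count = (int(part) for part in lines[0].split())
    entries = []
    for index, line in enumerate(lines[1 : count + 1]):
        field_one, field_two, kind = line.split()
        # Object numbers are implicit: they count up from the header value.
        entries.append((first_obj_number + index, kind, int(field_one), int(field_two)))
    return first_obj_number, count, entries


parsed = sketch_parse_subsection(
    "3 2\n"
    "0000000017 00000 n \n"
    "0000000000 00001 f \n"
)
```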