PDF Tokenizer Reference

class pdfnaut.cos.tokenizer.ContentStreamTokenizer

Bases: object

A tokenizer designed to consume the contents within a content stream.

This tokenizer relies on PdfTokenizer to parse common tokens but has special handling for the operators inside a content stream.

__init__(contents: bytes) → None

get_next_token() → PdfOperator | PdfComment | None

Consumes the next token.

Returns a PdfOperator or a PdfComment if a token was consumed, or None if the end of the data has been reached.

parse_inline_image() → PdfOperator

Parses an inline image.

Inline images are an alternative to image XObjects designed for embedding small images in a content stream.

Returns an EI (“end image”) operator with a PdfInlineImage as its only operand.

class pdfnaut.cos.tokenizer.PdfTokenizer

Bases: object

A tokenizer designed to consume individual objects that do not depend on a cross-reference table. It is used by PdfParser for this purpose.

This tokenizer consumes basic objects such as arrays and dictionaries. Indirect objects and streams depend on an XRef table and hence are not sequentially parsable. It is not intended to parse these items but rather the objects stored within them.

Parameters: data (bytes) – The contents to be parsed.

__init__(data: bytes) → None

consume(n: int = 1) → bytes: Consumes and returns n characters.

consume_while(callback: Callable[[bytes], bool], *, limit: int = -1) → bytes: Consumes while callback returns True for an input character. If specified, it will only consume up to limit characters.

property done: bool: Whether the parser has reached the end of data.
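The cursor-style methods above (consume, consume_while and done) can be illustrated with a small stand-alone sketch. ByteCursor is a hypothetical class written for this example, not pdfnaut's implementation; it only mirrors the documented signatures, and treating limit=-1 as "no limit" is an assumption drawn from the default value.

```python
from typing import Callable


class ByteCursor:
    """Hypothetical stand-in for a tokenizer cursor (not pdfnaut's class)."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.position = 0

    @property
    def done(self) -> bool:
        # True once every byte has been consumed.
        return self.position >= len(self.data)

    def consume(self, n: int = 1) -> bytes:
        # Return the next n bytes and advance past them.
        chunk = self.data[self.position : self.position + n]
        self.position += len(chunk)
        return chunk

    def consume_while(
        self, callback: Callable[[bytes], bool], *, limit: int = -1
    ) -> bytes:
        # Advance one byte at a time while the callback accepts the byte.
        # Assumption: limit=-1 means "no limit".
        start = self.position
        while not self.done and callback(self.data[self.position : self.position + 1]):
            if limit != -1 and self.position - start >= limit:
                break
            self.position += 1
        return self.data[start : self.position]


cursor = ByteCursor(b"12345 rest")
digits = cursor.consume_while(bytes.isdigit)  # stops at the space
capped = ByteCursor(b"12345").consume_while(bytes.isdigit, limit=2)
```

Passing a one-byte bytes object to the callback matches the documented Callable[[bytes], bool] type, which is why built-in predicates such as bytes.isdigit can be used directly.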

Parses and returns the token at the current position.

Parameters: parse_references (bool, optional, keyword only) – Whether to parse indirect references. Disabling this is intended for content streams, where indirect references are disallowed.

matches(keyword: bytes) → bool: Checks whether keyword starts at the current position.

parse_array() → PdfArray: Parses a PDF array which represents a sequence of heterogeneous objects.

parse_comment() → PdfComment: Parses a PDF comment. Comments have no syntactical meaning.

parse_dictionary() → PdfDictionary

Parses a dictionary object.

In a PDF, dictionary keys are name objects and dictionary values are any object or reference. This parser maps name objects to strings in this context.

parse_hex_string() → PdfHexString: Parses a hexadecimal string. Hexadecimal strings usually include arbitrary binary data. If the string contains an odd number of hexadecimal digits, the final digit is assumed to be followed by 0.
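The odd-length rule can be shown with a short stand-alone sketch. decode_hex_string is a hypothetical helper written for this example (it is not part of pdfnaut) and operates on the text between the enclosing < and >.

```python
def decode_hex_string(body: bytes) -> bytes:
    """Decode the text between '<' and '>' of a PDF hexadecimal string."""
    # PDF whitespace between hex digits is ignored.
    digits = bytes(b for b in body if b not in b"\x00\t\n\x0c\r ")
    # With an odd number of digits, the final digit is padded with 0.
    if len(digits) % 2:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


even = decode_hex_string(b"901FA3")  # three full bytes
odd = decode_hex_string(b"901FA")    # the trailing "A" is read as "A0"
```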

parse_kv_map_until(delimiter: bytes) → PdfDictionary

Parses a dictionary-like object starting at the current position, that is, an object composed of keys that are name objects and values that are any object.

The delimiter parameter specifies where this dictionary should end. The common ending (and default value) is “>>” for dictionary objects. However, this also accommodates inline images, whose ID operator can be used as the delimiter.

parse_literal_string() → bytes: Parses a literal string. Literal strings may be composed entirely of ASCII or may include arbitrary binary data. They may also include escape sequences and octal values (\ddd).
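As an illustration of the escape sequences and octal values mentioned above, here is a minimal sketch of how the body of a literal string might be unescaped. unescape_literal and ESCAPES are hypothetical names for this example, not pdfnaut's code. The function handles only the text between the outer parentheses and follows the escape rules of the PDF specification; locating the closing parenthesis and balancing nested parentheses are left out.

```python
ESCAPES = {
    ord("n"): b"\n", ord("r"): b"\r", ord("t"): b"\t",
    ord("b"): b"\b", ord("f"): b"\f",
    ord("("): b"(", ord(")"): b")", ord("\\"): b"\\",
}


def unescape_literal(body: bytes) -> bytes:
    """Resolve escape sequences in the text between the outer parentheses."""
    out = bytearray()
    i = 0
    while i < len(body):
        byte = body[i]
        if byte != ord("\\"):
            out.append(byte)
            i += 1
            continue
        i += 1  # step past the backslash
        if i >= len(body):
            break  # a trailing lone backslash is dropped
        nxt = body[i]
        if nxt in ESCAPES:
            out += ESCAPES[nxt]
            i += 1
        elif nxt in b"01234567":
            # Octal escape: one to three digits; high-order overflow is ignored.
            j = i
            while j < len(body) and j - i < 3 and body[j] in b"01234567":
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif nxt in b"\r\n":
            # A backslash followed by an EOL marker is a line continuation.
            i += 2 if body[i : i + 2] == b"\r\n" else 1
        else:
            # Unknown escape: the backslash is ignored, the character is kept.
            out.append(nxt)
            i += 1
    return bytes(out)


escaped = unescape_literal(rb"a\(b\)\053\n")        # \053 is octal for "+"
joined = unescape_literal(b"line\\\ncontinued")     # backslash + newline
```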

parse_name() → PdfName: Parses a name – a uniquely defined atomic symbol introduced with a slash and ending before a delimiter or whitespace.
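The "ending before a delimiter or whitespace" rule can be sketched as follows. read_name, DELIMITERS and WHITESPACE are hypothetical names for this example and not pdfnaut's API. The character sets come from the PDF specification, and #xx escapes inside names are deliberately not handled here.

```python
from typing import Tuple

DELIMITERS = b"()<>[]{}/%"
WHITESPACE = b"\x00\t\n\x0c\r "


def read_name(data: bytes, position: int = 0) -> Tuple[bytes, int]:
    """Return the name starting at position and the offset just past it."""
    if data[position : position + 1] != b"/":
        raise ValueError("a name must start with a slash")
    end = position + 1
    # The name runs until the next delimiter or whitespace byte.
    while end < len(data) and data[end] not in DELIMITERS + WHITESPACE:
        end += 1
    return data[position + 1 : end], end


by_space = read_name(b"/Type /Page")
by_bracket = read_name(b"/Kids[1 0 R]")
```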

parse_numeric() → int | float

Parses a numeric object.

PDF has two types of numbers: integers (40, -30) and real numbers (3.14). The range and precision of these numbers may depend on the machine used to process the PDF.
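The integer/real split can be illustrated with a minimal sketch. parse_number is a hypothetical helper for this example, not pdfnaut's method; it assumes the numeric token has already been isolated from the surrounding data.

```python
from typing import Union


def parse_number(token: bytes) -> Union[int, float]:
    """Convert an isolated PDF numeric token into an int or a float."""
    text = token.decode("ascii")
    # A decimal point marks a real number; PDF allows forms like "4." and "-.002".
    if "." in text:
        return float(text)
    return int(text)


whole = parse_number(b"-30")
real = parse_number(b"3.14")
```

Python integers have arbitrary precision and floats are IEEE 754 doubles, so in this sketch the range and precision limits mentioned above come from the host language.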

peek(n: int = 1) → bytes: Peeks n characters into data without advancing through the tokenizer.

peek_line() → bytes: Peeks from the current position until an EOL marker is found (not included in the output).

skip(n: int = 1) → None: Skips/advances n characters in the tokenizer.

skip_if_comment() → bool: Advances through a PDF comment in case one occurs at the current position. Returns whether a comment was skipped.

skip_if_matches(keyword: bytes) → bool: Advances len(keyword) characters if keyword starts at the current position. Returns whether the match was successful.

skip_next_eol(no_cr: bool = False) → None: Skips the next EOL marker if one is matched. If no_cr is True, a lone CR (\r) will not be treated as a newline.
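The effect of no_cr can be sketched with a stand-alone function. eol_length is a hypothetical helper for this example, not part of pdfnaut. It assumes the usual PDF EOL markers (CR LF, LF, or a lone CR) and reports how many bytes a caller would skip.

```python
def eol_length(data: bytes, position: int, no_cr: bool = False) -> int:
    """Return how many bytes the EOL marker at position occupies (0 if none)."""
    if data.startswith(b"\r\n", position):
        return 2
    if data.startswith(b"\n", position):
        return 1
    # A lone CR only counts as an EOL marker when no_cr is False.
    if data.startswith(b"\r", position) and not no_cr:
        return 1
    return 0


lenient = eol_length(b"\rdata", 0)
strict = eol_length(b"\rdata", 0, no_cr=True)
```

One place where the strict form matters is after the stream keyword, where the PDF specification requires CR LF or LF and excludes a lone CR, so that a leading CR byte in the stream data is not swallowed.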

skip_while(callback: Callable[[bytes], bool], *, limit: int = -1) → int: Skips while callback returns True for an input character. If specified, it will only skip limit characters. Returns how many characters were skipped.

skip_whitespace() → None: Advances through PDF whitespace.

try_parse_indirect(*, header: bool = False) → PdfReference | None

Attempts to parse an indirect reference of the form [obj] [gen] R or, if the header argument is true, an indirect object header of the form [obj] [gen] obj.

Returns the reference if one is found or None otherwise.
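The two forms can be illustrated with a regular-expression sketch. Reference and try_indirect are hypothetical names for this example and do not reflect how pdfnaut implements the method. As a simplification, the pattern uses \s for whitespace, which differs slightly from the PDF whitespace set (it omits NUL and includes vertical tab).

```python
import re
from typing import NamedTuple, Optional


class Reference(NamedTuple):
    object_number: int
    generation: int


def try_indirect(
    data: bytes, position: int = 0, *, header: bool = False
) -> Optional[Reference]:
    """Match '[obj] [gen] R' (or '[obj] [gen] obj' when header is True)."""
    keyword = rb"obj" if header else rb"R"
    # \b keeps "R" from matching the start of a longer keyword such as "RG".
    pattern = re.compile(rb"(\d+)\s+(\d+)\s+" + keyword + rb"\b")
    match = pattern.match(data, position)
    if match is None:
        return None
    return Reference(int(match.group(1)), int(match.group(2)))


reference = try_indirect(b"12 0 R")
object_header = try_indirect(b"12 0 obj", header=True)
not_a_reference = try_indirect(b"12 0 RG")  # two numbers followed by an operator
```

Returning None instead of raising lets a caller fall back to parsing the first number as a plain numeric object.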