PDF Tokenizer Reference
- class pdfnaut.cos.tokenizer.ContentStreamTokenizer
Bases: object
A tokenizer designed to consume the contents within a content stream.
This tokenizer relies on PdfTokenizer to parse common tokens but has special handling for the operators inside a content stream.
- get_next_token() -> PdfOperator | PdfComment | None
Consumes the next token.
The return value is either a PdfOperator or a PdfComment if a token was consumed, or None if the end of data has been reached.
- parse_inline_image() -> PdfOperator
Parses an inline image.
Inline images are an alternative to image XObjects designed for embedding small images in a content stream.
Returns an operator EI (for “end image”) with a PdfInlineImage as its first and only operand.
- class pdfnaut.cos.tokenizer.PdfTokenizer
Bases: object
A tokenizer designed to consume individual objects that do not depend on a cross-reference table. It is used by PdfParser for this purpose.
This tokenizer consumes basic objects such as arrays and dictionaries. Indirect objects and streams depend on an XRef table and hence are not sequentially parsable. The tokenizer is not intended to parse these items but rather the objects stored within them.
- Parameters:
data (bytes) – The contents to be parsed.
- consume_while(callback: Callable[[bytes], bool], *, limit: int = -1) -> bytes
Consumes while callback returns True for an input character. If limit is specified, it will only consume up to limit characters.
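The behavior described above can be sketched with a small standalone cursor. This is a minimal illustration, not pdfnaut's implementation: the `MiniCursor` class and its `position` attribute are hypothetical, and the sketch assumes the callback receives one byte at a time as a `bytes` object of length 1.

```python
from typing import Callable


class MiniCursor:
    """Hypothetical stand-in for a byte cursor; not pdfnaut's own class."""

    def __init__(self, data: bytes) -> None:
        self.data = data
        self.position = 0

    def peek(self, n: int = 1) -> bytes:
        # Look ahead without moving the cursor.
        return self.data[self.position : self.position + n]

    def consume_while(
        self, callback: Callable[[bytes], bool], *, limit: int = -1
    ) -> bytes:
        # Advance one byte at a time for as long as the callback accepts it.
        start = self.position
        while self.position < len(self.data):
            if limit != -1 and self.position - start >= limit:
                break  # the optional limit has been reached
            if not callback(self.peek()):
                break  # the callback rejected the next byte
            self.position += 1
        return self.data[start : self.position]


cursor = MiniCursor(b"12345 obj")
first_three = cursor.consume_while(lambda ch: ch.isdigit(), limit=3)
remaining = cursor.consume_while(lambda ch: ch.isdigit())
```

Here the first call stops because of the limit, while the second stops at the space that follows the digits.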
- get_next_token(*, parse_references: bool = True) -> bool | int | float | bytes | PdfArray | PdfDictionary | PdfHexString | PdfName | PdfReference | PdfNull | PdfComment | None
Parses and returns the token at the current position.
- Parameters:
parse_references (bool, optional, keyword only) – Whether to parse indirect references. This is intended for content streams where indirect references are disallowed.
- parse_array() -> PdfArray
Parses a PDF array which represents a sequence of heterogeneous objects.
- parse_comment() -> PdfComment
Parses a PDF comment. Comments have no syntactical meaning.
- parse_dictionary() -> PdfDictionary
Parses a dictionary object.
In a PDF, dictionary keys are name objects and dictionary values are any object or reference. This parser maps name objects to strings in this context.
- parse_hex_string() -> PdfHexString
Parses a hexadecimal string. Hexadecimal strings usually include arbitrary binary data. If the string contains an odd number of hexadecimal digits, the missing final digit is assumed to be 0.
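The padding rule can be illustrated with a short standalone sketch. The helper name `decode_hex_string` is hypothetical, it works on the digits between `<` and `>`, and it returns plain `bytes` rather than a PdfHexString; it is not how pdfnaut itself is written.

```python
def decode_hex_string(raw: bytes) -> bytes:
    """Decode the digits found between '<' and '>' of a PDF hexadecimal string."""
    # Whitespace between digits carries no meaning and is dropped.
    digits = bytes(ch for ch in raw if ch not in b" \t\r\n\f\x00")
    # With an odd number of digits, the final digit is treated as if followed by 0.
    if len(digits) % 2 != 0:
        digits += b"0"
    return bytes.fromhex(digits.decode("ascii"))


padded = decode_hex_string(b"901FA")
spaced = decode_hex_string(b"48 65 6C 6C 6F")
```

In this sketch `901FA` decodes to the same bytes as `901FA0`.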
- parse_kv_map_until(delimiter: bytes) -> PdfDictionary
Parses a dictionary-like object from the current position, that is, an object composed of keys that are name objects and values that are any object.
The delimiter parameter specifies where this dictionary should end. The common ending (and default value) is “>>” for dictionary objects. However, this also accommodates inline images, which have the ID operator that can be used as a delimiter.
- parse_literal_string() -> bytes
Parses a literal string. Literal strings may be composed entirely of ASCII or may include arbitrary binary data. They may also include escape sequences and octal values (\ddd).
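As a rough illustration of the escape handling, the sketch below decodes the body of a literal string (the bytes between the outer parentheses) following the escape rules of the PDF specification. The helper `unescape_literal` and the `ESCAPES` table are hypothetical names for this example and do not reflect pdfnaut's internals.

```python
# Single-character escapes, keyed by the byte that follows the backslash.
ESCAPES = {
    ord("n"): b"\n", ord("r"): b"\r", ord("t"): b"\t",
    ord("b"): b"\b", ord("f"): b"\f",
    ord("("): b"(", ord(")"): b")", ord("\\"): b"\\",
}
OCTAL_DIGITS = b"01234567"


def unescape_literal(body: bytes) -> bytes:
    """Resolve escape sequences in the body of a PDF literal string."""
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != 0x5C:  # anything other than a backslash is copied as is
            out.append(ch)
            i += 1
            continue
        i += 1  # step over the backslash
        if i >= len(body):
            break  # a trailing backslash is dropped
        nxt = body[i]
        if nxt in ESCAPES:
            out += ESCAPES[nxt]
            i += 1
        elif nxt in OCTAL_DIGITS:
            # \ddd takes one to three octal digits; overflow bits are discarded.
            j = i
            while j < len(body) and j - i < 3 and body[j] in OCTAL_DIGITS:
                j += 1
            out.append(int(body[i:j], 8) & 0xFF)
            i = j
        elif nxt in b"\r\n":
            # A backslash before an EOL marker continues the line.
            i += 1
            if nxt == 0x0D and i < len(body) and body[i] == 0x0A:
                i += 1
        else:
            # Unknown escape: the backslash is ignored.
            out.append(nxt)
            i += 1
    return bytes(out)


decoded = unescape_literal(rb"Hello\040World\051")
```

In the example, `\040` resolves to a space and `\051` to a closing parenthesis.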
- parse_name() -> PdfName
Parses a name – a uniquely defined atomic symbol introduced with a slash and ending before a delimiter or whitespace.
- parse_numeric() -> int | float
Parses a numeric object.
PDF has two types of numbers: integers (40, -30) and real numbers (3.14). The range and precision of these numbers may depend on the machine used to process the PDF.
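The integer/real distinction can be sketched as follows. The helper `parse_pdf_number` is a hypothetical name, it assumes the numeric token has already been isolated, and it is only an approximation of what a tokenizer might do.

```python
from typing import Union


def parse_pdf_number(token: bytes) -> Union[int, float]:
    """Convert an already isolated numeric token into an int or a float."""
    text = token.decode("ascii")
    # A decimal point marks a real number; PDF allows forms like "4." and "-.002".
    if "." in text:
        return float(text)
    return int(text)


integer_value = parse_pdf_number(b"-30")
real_value = parse_pdf_number(b"3.14")
```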
- peek(n: int = 1) -> bytes
Peeks n characters into data without advancing through the tokenizer.
- peek_line() -> bytes
Peeks from the current position until an EOL marker is found (not included in the output).
- skip_if_comment() -> bool
Advances through a PDF comment in case one occurs at the current position. Returns whether a comment was skipped.
- skip_if_matches(keyword: bytes) -> bool
Advances len(keyword) characters if keyword starts at the current position. Returns whether the match was successful.
- skip_next_eol(no_cr: bool = False) -> None
Skips the next EOL marker if one is matched. If no_cr is True, a CR (\r) on its own will not be treated as a newline.
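A minimal sketch of this EOL handling, written as a free function over a buffer and an offset: `skip_eol_at` is a hypothetical helper, not pdfnaut's method. One place where a flag like no_cr would matter is the `stream` keyword, which the PDF specification requires to be followed by CRLF or LF but not by a lone CR; whether pdfnaut uses it for that purpose is an assumption.

```python
def skip_eol_at(data: bytes, position: int, no_cr: bool = False) -> int:
    """Return the offset just past one EOL marker at `position`, if there is one."""
    if data.startswith(b"\r\n", position):
        return position + 2  # CRLF counts as a single marker
    if data.startswith(b"\n", position):
        return position + 1
    if data.startswith(b"\r", position) and not no_cr:
        return position + 1  # a lone CR is accepted only when no_cr is False
    return position  # nothing matched, so the offset is unchanged


after_crlf = skip_eol_at(b"\r\nendobj", 0)
after_lone_cr = skip_eol_at(b"\rendobj", 0, no_cr=True)
```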
- skip_while(callback: Callable[[bytes], bool], *, limit: int = -1) -> int
Skips while callback returns True for an input character. If limit is specified, it will only skip up to limit characters. Returns how many characters were skipped.
- try_parse_indirect(*, header: bool = False) -> PdfReference | None
Attempts to parse an indirect reference in the form [obj] [gen] R or, if the header argument is true, an indirect object header in the form [obj] [gen] obj.
Returns the reference if one is found or None otherwise.
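The two forms can be recognized with a pair of regular expressions, as in the sketch below. It is a simplified illustration only: `match_indirect` is a hypothetical helper that returns an `(obj, gen)` tuple instead of a PdfReference, and pdfnaut may not use regular expressions at all.

```python
import re
from typing import Optional, Tuple

# Two integers separated by whitespace, then the keyword. The trailing \b keeps
# "R" from matching the start of a longer operator such as "RG".
_REFERENCE = re.compile(rb"(\d+)\s+(\d+)\s+R\b")
_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def match_indirect(
    data: bytes, position: int = 0, *, header: bool = False
) -> Optional[Tuple[int, int]]:
    """Return (object number, generation) if the pattern starts at `position`."""
    pattern = _HEADER if header else _REFERENCE
    match = pattern.match(data, position)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


reference = match_indirect(b"12 0 R")
object_header = match_indirect(b"12 0 obj", header=True)
not_a_reference = match_indirect(b"12 0 RG")
```

Returning None when nothing matches lets a caller fall back to treating the first integer as an ordinary number.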