Working with Pages

Pages hold the content that makes up a PDF document. In a PDF, pages are stored in a tree structure known as the page tree. For simple documents, the tree is usually flat, but for larger documents, it may be split into multiple branches of leaf nodes for optimization purposes.
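
As a rough illustration of the difference between a flat and a nested page tree, the sketch below uses plain Python dicts in place of PDF dictionaries. It is not how pdfnaut implements its page list; the flatten_page_tree and page helpers and the "Label" key are hypothetical and exist only for this example. The "Type", "Kids", and "Count" keys mirror the entries that page tree nodes carry in a PDF.

```python
# Plain dicts stand in for PDF dictionaries; this is not pdfnaut's implementation.
def flatten_page_tree(node):
    """Yield leaf pages in document order using a depth-first traversal."""
    if node["Type"] == "Page":
        yield node
        return
    # Intermediate nodes ("Pages") list their children under "Kids".
    for kid in node["Kids"]:
        yield from flatten_page_tree(kid)


def page(label):
    # "Label" is a made-up key used only to tell the pages apart here.
    return {"Type": "Page", "Label": label}


# A flat tree: the root holds every page directly.
flat_tree = {"Type": "Pages", "Count": 3, "Kids": [page("p1"), page("p2"), page("p3")]}

# A nested tree: the same pages, with the first two grouped under a branch.
nested_tree = {
    "Type": "Pages",
    "Count": 3,
    "Kids": [
        {"Type": "Pages", "Count": 2, "Kids": [page("p1"), page("p2")]},
        page("p3"),
    ],
}

flat_labels = [p["Label"] for p in flatten_page_tree(flat_tree)]
nested_labels = [p["Label"] for p in flatten_page_tree(nested_tree)]
print(flat_labels == nested_labels)
```

Both trees describe the same three pages in the same order, which is why a flat view of the pages is usually all you need.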

The page object (represented as Page) contains information about the page’s contents, resources, and appearance.

Accessing Pages

Direct access to the page tree is possible via PdfDocument.page_tree. In typical usage, however, you will want to use PdfDocument.pages instead, which provides a page list that abstracts the tree structure into a flat collection of pages.

from pdfnaut import PdfDocument

pdf = PdfDocument.from_filename(r"tests/docs/usenix-example-paper.pdf")
print(pdf.pages)     # [<Page ...>, <Page ...>, ...]
print(pdf.pages[0])  # <Page mediabox=[0, 0, 612, 792] rotation=0>

The page list behaves mostly like any other Python sequence, so operations commonly performed on sequences should work the same way on a page list.

  • To access the first page of a PDF, you do pdf.pages[0].

  • To access the last page, you do pdf.pages[-1].

  • The length of the page list can be obtained via len(pdf.pages).

  • The page list also supports accessing items via slicing, so an operation such as pdf.pages[2:5] is allowed.
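
The operations above follow ordinary sequence semantics. The sketch below uses a plain list of strings as a stand-in for pdf.pages, so it runs without a PDF file; it only illustrates how the indices map to pages and does not exercise pdfnaut itself.

```python
# A plain list of strings stands in for pdf.pages.
pages = ["page 1", "page 2", "page 3", "page 4", "page 5", "page 6"]

first = pages[0]     # first page
last = pages[-1]     # last page
count = len(pages)   # number of pages
middle = pages[2:5]  # indices 2, 3, and 4: the third through fifth pages

print(first, last, count, middle)
```

Note that the slice end is exclusive, so pages[2:5] covers three pages, not four.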

Modifying Pages

Because Page inherits from PdfDictionary, you can modify its contents as you would any other mapping.

from pdfnaut import PdfDocument
from pdfnaut.cos.objects import PdfArray

pdf = PdfDocument.from_filename(r"tests/docs/usenix-example-paper.pdf")

pdf.pages[0]["CropBox"] = PdfArray([10, 10, 200, 200])

In this example, the CropBox entry is set so that the page is visually cropped to the rectangle that starts at position (10, 10) and ends at position (200, 200).

For common properties such as the page cropbox, you can use the available attributes in Page.

pdf.pages[0].cropbox = PdfArray([10, 10, 200, 200])

This performs the same action as in the previous example.
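
To see how large the cropped region is, recall that a PDF rectangle lists the lower-left corner followed by the upper-right corner. The sketch below works this out with a plain list rather than a PdfArray; the rect_size helper is hypothetical and not part of pdfnaut.

```python
def rect_size(rect):
    """Return (width, height) of a PDF rectangle [llx, lly, urx, ury]."""
    llx, lly, urx, ury = rect
    return (urx - llx, ury - lly)


# Same coordinates as the CropBox used in the examples.
crop_width, crop_height = rect_size([10, 10, 200, 200])
print(crop_width, crop_height)
```

The visible region is therefore 190 units wide and 190 units tall, not 200 by 200.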

Inserting Pages

Inserting and removing pages are among the most common operations performed when manipulating PDFs. Together, these operations are known as page assembly.

To insert pages into a PDF, you can use the PageList.append() and PageList.insert() methods.

from pdfnaut import PdfDocument
from pdfnaut.objects import Page

pdf = PdfDocument.from_filename(r"tests/docs/usenix-example-paper.pdf")

pdf.pages.append(Page(size=(595, 842)))

In the above example, a blank A4-size page is added to the end of the document.
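
The size (595, 842) comes from converting the A4 dimensions of 210 mm by 297 mm, assuming the size is given in the default PDF unit of 1/72 inch. The sketch below shows the arithmetic; mm_to_points is a hypothetical helper, not a pdfnaut function.

```python
POINTS_PER_INCH = 72   # default PDF user space unit is 1/72 inch
MM_PER_INCH = 25.4


def mm_to_points(mm):
    """Convert millimetres to PDF points, rounded to the nearest integer."""
    return round(mm / MM_PER_INCH * POINTS_PER_INCH)


a4_size = (mm_to_points(210), mm_to_points(297))
print(a4_size)
```

The same conversion can be used to derive page sizes for other paper formats.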

You may also insert pages from a different document.

from pdfnaut import PdfDocument

pdf1 = PdfDocument.from_filename(r"tests/docs/usenix-example-paper.pdf")
pdf2 = PdfDocument.from_filename(r"tests/docs/pdf2-incremental.pdf")

pdf1.pages.insert(2, pdf2.pages[0])

The example above inserts the first page of the second PDF at index 2, so it becomes the third page of the first PDF and sits before the page that was previously third.
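
Because indices start at zero, it is easy to be off by one when choosing the insertion point. The sketch below uses plain lists of strings as stand-ins for the two page lists, assuming PageList.insert() follows the same semantics as list.insert().

```python
# Plain lists stand in for pdf1.pages and pdf2.pages.
pages1 = ["A1", "A2", "A3", "A4"]
pages2 = ["B1", "B2"]

# Insert at index 2: the new item lands before the item currently at index 2.
pages1.insert(2, pages2[0])

print(pages1)
```

After the call, "B1" is the third item and the former third item, "A3", has moved to fourth place.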

Important

When importing pages from another document, some elements, such as form widgets and certain types of annotations, may not be preserved in working order because they either depend on the document itself or are defined at the document level rather than at the page level.

It is also possible to append multiple pages to a PDF using the PageList.extend() method.
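
As a sketch of what extending looks like, the example below again uses plain lists in place of page lists, assuming PageList.extend() accepts any iterable of pages in the way list.extend() does.

```python
# Plain lists stand in for the page lists of two documents.
pages1 = ["A1", "A2"]
pages2 = ["B1", "B2", "B3"]

# Append the last two pages of the second document in a single call.
pages1.extend(pages2[1:])

print(pages1)
```

This is equivalent to calling append() once per page, in order.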

Removing Pages

pdfnaut also allows removing pages via the PageList.pop() method.

from pdfnaut import PdfDocument

pdf = PdfDocument.from_filename(r"tests/docs/usenix-example-paper.pdf")

pdf.pages.pop(0)

The example above removes the first page from the document.

Removing pages via the del operation is also supported: del pdf.pages[n].
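
One thing to watch for when removing several pages is that the remaining pages shift down after each removal. The sketch below shows this with a plain list standing in for pdf.pages, assuming the page list follows list semantics for pop() and del.

```python
# A plain list stands in for pdf.pages.
pages = ["p1", "p2", "p3", "p4"]

removed = pages.pop(0)  # removes "p1"; the list is now ["p2", "p3", "p4"]
del pages[1]            # index 1 now refers to "p3", not "p2"

print(removed, pages)
```

When removing multiple pages by index, working from the highest index to the lowest avoids this shifting.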