Modifying PDF metadata

PDFs can store metadata in two ways:

  • the document information or “DocInfo” dictionary

  • metadata streams applicable to either an individual object or to the document itself.

The document information dictionary is specified in the /Info key of the PDF trailer. It provides document-level metadata and is easy to work with. PDF 2.0 has however deprecated most of its keys in favor of using metadata streams.

Metadata streams store metadata in XMP format and are the recommended way of applying object or document-level metadata.

pdfnaut can natively read from and write to both document information dictionaries and metadata streams.

Reading document information

The DocInfo dictionary can be accessed using the PdfDocument.doc_info attribute (see Info for a list of available fields).

Note that the DocInfo dictionary and all of its fields are optional (may be None).

from pdfnaut import PdfDocument

pdf = PdfDocument.from_filename("./tests/docs/sample.pdf")
# It is possible for a document to not have DocInfo
assert pdf.doc_info is not None, "No document information available."

print(pdf.doc_info.title) # None
print(pdf.doc_info.producer) # Nevrona Designs

Reading XMP metadata

Document-level XMP metadata can be accessed using the PdfDocument.xmp_info attribute. For information on available properties, see XmpMetadata.

from pdfnaut import PdfDocument

doc = PdfDocument.from_filename("./tests/docs/pdf2-incremental.pdf")
# It is possible for a document to not have XMP metadata
assert doc.xmp_info is not None, "No document XMP metadata available."

print(doc.xmp_info.dc_title["x-default"]) # A simple PDF 2.0 example file
print(doc.xmp_info.xmp_create_date) # 2017-05-24 10:30:11+00:00
print(doc.xmp_info.dc_creator)  # ["Datalogics Incorporated"]

Writing document information

Adding or modifying document information can be done by modifying the respective attributes in PdfDocument.doc_info. Read Info for information on each entry.

import datetime

from pdfnaut import PdfDocument
from pdfnaut.objects.trailer import Info

pdf = PdfDocument.from_filename("./tests/docs/sample.pdf")
if pdf.doc_info is None:
    # Creates a new DocInfo dictionary if not specified.
    pdf.doc_info = Info()

pdf.doc_info.title = "Sample PDF file"
pdf.doc_info.modify_date = datetime.datetime.now()

pdf.save("new-sample.pdf")

If you want to remove the DocInfo dictionary, simply set the doc_info attribute to None.

from pdfnaut import PdfDocument

pdf = PdfDocument.from_filename("./tests/docs/sample.pdf")
pdf.doc_info = None

pdf.save("new-sample.pdf")

This is also the case if you want to remove fields from the DocInfo dictionary. Setting a field to None or using del on the field will remove it.

Writing XMP metadata

Adding or modifying XMP metadata can be modified in a similar manner to modifying the DocInfo dictionary by modifying the respective attributes. Read XmpMetadata for information on each entry.

import datetime

from pdfnaut import PdfDocument
from pdfnaut.objects.xmp import XmpMetadata

pdf = PdfDocument.from_filename("./tests/docs/sample.pdf")
if pdf.xmp_info is None:
    # Creates a new XMP metadata object if not specified.
    pdf.xmp_info = XmpMetadata()

pdf.xmp_info.dc_title = "Sample PDF file"
pdf.xmp_info.xmp_modify_date = datetime.datetime.now()

pdf.save("new-sample.pdf")

If you want to remove the document-level XMP metadata, simply set the xmp_info attribute to None.

from pdfnaut import PdfDocument

doc = PdfDocument.from_filename("./tests/docs/sample.pdf")
doc.xmp_info = None

doc.save("new-sample.pdf")

Removing a field can be done by setting it to None or by using del.

Reconciling PDF metadata

As mentioned earlier, a PDF supports two types of metadata: DocInfo and XMP. Both sources should ideally be equivalent so that PDF processors can retrieve the same metadata regardless of the source they use to extract it.

PdfDocument provides the PdfDocument.copy_metadata() method which allows copying from one metadata source to another. This ensures that the data in both sources is equivalent.

from pdfnaut import MetadataCopyDirection, PdfDocument

pdf = PdfDocument.from_filename(r"tests/docs/pdf2-incremental.pdf")

print(pdf.doc_info)  # None
print(pdf.xmp_info)  # <XmpMetadata pdf_producer="Datalogics" [...]>

pdf.copy_metadata(MetadataCopyDirection.XMP_TO_DOC_INFO)
print(pdf.doc_info)  # Info(producer="Datalogics", ...)

Because the structure of the DocInfo dictionary and XMP differ, the mappings described in ISO 32000-2:2020 (PDF 2.0) § 14.3.3 are used. Note that not all metadata fields in XMP can be copied to DocInfo and vice versa. Only the standard properties are mapped.

XMP - DocInfo Mapping

XMP

DocInfo

pdf:Producer

Producer

pdf:Keywords

Keywords

pdf:PDFVersion

N/A [1]

pdf:Trapped

Trapped

xmp:CreatorTool

Creator

xmp:CreateDate

CreationDate

xmp:MetadataDate

ModDate

xmp:ModifyDate

ModDate

dc:title

Title [2]

dc:creator

Author [3]

dc:description

Subject [2]

dc:subject

N/A [1]

dc:rights

N/A [1]

dc:format

N/A [1]

Footnotes