PdfHandler

High-level utilities for inspecting and modifying PDF files.

This module exposes PdfHandler, a convenience wrapper around pikepdf and pdfminer for:

  • text extraction and word counting

  • encryption, decryption, and permission inspection

  • moving, deleting, and resizing PDFs

  • merging PDFs with optional separator pages

class pdfhandler.pdf_handler.PdfHandler(pdf_path)

Bases: object

Helper for common operations on a single PDF file.

The handler validates the input path on construction and then provides methods for:

  • extracting text and counting words

  • checking and changing encryption / permissions

  • moving, deleting, and resizing the file

  • merging PDFs and inserting separator pages

cp(new_path=None)

Copy the PDF to the specified location and return the Path of the copy.

Parameters:

new_path (str | Path | None, optional) – Destination path for the copy. If None, the copy is saved alongside the original PDF with "-copy" inserted between the stem and the suffix (for example, report.pdf becomes report-copy.pdf). Default: None.

Return type:

Path
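The default naming rule for the copy can be sketched with pathlib alone. This is only an illustration of the rule described above; default_copy_path is a hypothetical helper, not part of PdfHandler, and the real method may differ in detail.

```python
from pathlib import Path


def default_copy_path(pdf_path):
    """Return the sibling path with "-copy" between the stem and the suffix."""
    p = Path(pdf_path)
    # Keep the directory and the extension; only the file name changes.
    return p.with_name(f"{p.stem}-copy{p.suffix}")


# "report.pdf" becomes "report-copy.pdf" in the same directory.
print(default_copy_path("docs/report.pdf").name)  # report-copy.pdf
```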

decrypt(output=None, in_place=False, owner_password=None)

Decrypt the PDF if it is currently encrypted.

If in_place is False (recommended), a decrypted copy is saved to a new file; otherwise, the original file is overwritten. If the PDF is not encrypted, no changes are made.

Parameters:
  • output (str | Path | None, default None) – Destination path for the decrypted PDF. Ignored if in_place=True. If None, a new file is created with "-Decrypted" appended to the original name.

  • in_place (bool, default False) – Whether to overwrite the original file in place.

  • owner_password (str | None, default None) – The owner password used to unlock and decrypt the PDF.

Return type:

None

encrypt(output=None, in_place=False, password=None, owner_password=None)

Encrypt the PDF if it is not already encrypted.

This creates an encrypted version of the PDF using restrictive permissions by default. If in_place is False, the encrypted file is saved to a new path; otherwise, the original file is overwritten.

For fine-grained control over permissions, use save_pike_pdf() directly.

Parameters:
  • output (str | Path | None, default None) – Destination path for the encrypted PDF. Ignored if in_place=True. If None, a new file is created with "-Encrypted" appended to the original name.

  • in_place (bool, default False) – Whether to overwrite the original file in place.

  • password (str | None, default None) – The user password required to open the PDF. If None or empty, no password is required to view.

  • owner_password (str | None, default None) – The owner password used to set encryption and permissions.

Return type:

None

get_pdf_permissions()

Return the current permission settings of the PDF.

Returns:

A dictionary mapping permission names to boolean values. Keys include:

  • "extract"

  • "modify_annotation"

  • "modify_assembly"

  • "modify_form"

  • "modify_other"

  • "print_lowres"

  • "print_highres"

Return type:

dict[str, bool]

get_pdf_text(pages=None)

Extract text from the PDF, optionally from specific pages.

Parameters:

pages (PageNumberType, optional) –

Pages to extract text from. If None (default), all pages are included. Acceptable formats include:

  • a single int or str (e.g., 5 or "5")

  • a range as a str (e.g., "2-4")

  • a comma/space/"and"-delimited str (e.g., "1, 3 and 5-6")

  • a list of ints and/or strs (e.g., [1, "3", "5-7"])

Returns:

The extracted text as a single string. Returns an empty string if no text is found.

Return type:

str
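To make the accepted page formats concrete, here is a minimal sketch of how such a specification could be normalized into a sorted list of page numbers. parse_pages is a hypothetical helper written for illustration; it is an assumption about how the formats could be interpreted, not the parser that PdfHandler actually uses.

```python
import re


def parse_pages(pages):
    """Normalize a page specification into a sorted list of unique ints.

    Returns None when pages is None, which stands for "all pages".
    """
    if pages is None:
        return None
    # Treat a lone int or str as a one-element list.
    items = pages if isinstance(pages, (list, tuple)) else [pages]
    result = set()
    for item in items:
        # Tolerate spaces around the hyphen of a range ("2 - 4").
        text = re.sub(r"\s*-\s*", "-", str(item))
        # Split on commas, whitespace, and the word "and".
        for token in re.split(r"[,\s]+|\band\b", text):
            if not token:
                continue
            if "-" in token:
                start, end = (int(part) for part in token.split("-", 1))
                result.update(range(start, end + 1))
            else:
                result.add(int(token))
    return sorted(result)


print(parse_pages("1, 3 and 5-6"))    # [1, 3, 5, 6]
print(parse_pages([1, "3", "5-7"]))   # [1, 3, 5, 6, 7]
```

Collecting the numbers in a set means that overlapping entries such as "2-4, 3" yield each page only once.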

classmethod merge_pdfs(pdf0_path, pdf1_path, output_path, add_separator=False, separator_type='black')

Merge two PDF files, with the pages of the first file placed before those of the second.

Parameters:
  • pdf0_path (str | Path) – Path to the first PDF, which will appear first in the output.

  • pdf1_path (str | Path) – Path to the second PDF, which will appear after the first.

  • output_path (str | Path) – Path to save the merged output PDF.

  • add_separator (bool, default False) – If True, insert a separator page between the PDFs.

  • separator_type ({"black", "blank"}, default "black") –

    Type of separator page to insert:

    • "black" : a black bar (about 1 inch tall)

    • "blank" : a full blank page

Raises:

ValueError – If separator_type is not "black" or "blank".

Return type:

None

mv(dst)

Move the PDF to a new location and update the internal path.

Parameters:

dst (str | Path) – Destination path, including the filename and .pdf extension.

Return type:

None

pdf_is_encrypted()

Return whether the PDF is encrypted.

Returns:

True if the PDF is encrypted, False otherwise.

Return type:

bool

classmethod pdfs_are_duplicates(pdf0_path, pdf1_path)

Return whether two PDFs have identical extracted text content.

Text is extracted using pdfminer. Layout, formatting, and metadata differences are ignored.

Parameters:
  • pdf0_path (str | Path) – Path to the first PDF file.

  • pdf1_path (str | Path) – Path to the second PDF file.

Returns:

True if the extracted text from both PDFs is identical, False otherwise.

Return type:

bool

print_permissions()

Print encryption and permission status to the console.

Output is color-coded using colorama:

  • green for enabled permissions

  • red for disabled permissions

Return type:

None

resize(width, height, output_path=None)

Resize all pages in the PDF to the specified dimensions.

Parameters:
  • width (int) – Desired page width in points (1 inch = 72 points).

  • height (int) – Desired page height in points (1 inch = 72 points).

  • output_path (str | Path | None, default None) – Path to save the resized PDF. If None, a new file is created in the same directory with the name pattern {original_name}-{width}x{height}.pdf.

Raises:

ValueError – If output_path is provided and does not end with .pdf.

Return type:

None
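Because width and height are given in points, it can help to convert from inches first. The sketch below shows that conversion and the default output name. Both helpers are hypothetical, and it assumes that {original_name} refers to the file stem, which the description above does not state explicitly.

```python
from pathlib import Path

POINTS_PER_INCH = 72


def inches_to_points(inches):
    """Convert inches to whole PDF points (1 inch = 72 points)."""
    return round(inches * POINTS_PER_INCH)


def default_resized_path(pdf_path, width, height):
    """Build "{original_name}-{width}x{height}.pdf" next to the source file."""
    p = Path(pdf_path)
    # Assumption: {original_name} is the file name without its extension.
    return p.with_name(f"{p.stem}-{width}x{height}.pdf")


# US Letter is 8.5 x 11 inches.
width, height = inches_to_points(8.5), inches_to_points(11)
print(width, height)  # 612 792
print(default_resized_path("scan.pdf", width, height).name)  # scan-612x792.pdf
```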

rm()

Delete the PDF file from disk.

Return type:

None

save_pike_pdf(output, in_place=False, crypt_type=None, password=None, owner_password=None, extract=True, modify_annotation=True, modify_assembly=True, modify_form=True, modify_other=True, print_lowres=True, print_highres=True)

Save the PDF with optional encryption or decryption applied.

Parameters:
  • output (str | Path | None) – Destination for the saved file. Ignored if in_place is True. If None, a new file is saved with a suffix such as "-Encrypted" or "-Decrypted" depending on usage.

  • in_place (bool, default False) – If True, overwrites the original file. If False, creates a new file.

  • crypt_type (str | None, default None) –

    A preset encryption mode. Must be one of:

    • "decrypt" : disables encryption entirely

    • "encrypt" : enables encryption with all permissions set to False

    • "no_copy" : like "decrypt" but with extract permission set to False

    • None : uses the individual permission arguments below

  • password (str | None, default None) – User password for opening the encrypted PDF. If None or an empty string, no password is required to open.

  • owner_password (str | None, default None) – Owner password used to set permissions. A default value is used if this is None.

  • extract (bool, default True) – Whether users can extract text or images.

  • modify_annotation (bool, default True) – Whether users can modify annotations.

  • modify_assembly (bool, default True) – Whether users can rearrange pages or merge documents.

  • modify_form (bool, default True) – Whether users can fill in or edit form fields.

  • modify_other (bool, default True) – Whether users can make general modifications.

  • print_lowres (bool, default True) – Whether users can print in low resolution.

  • print_highres (bool, default True) – Whether users can print in high resolution.

Raises:

ValueError – If crypt_type is invalid or if the resolved output path is invalid.

Return type:

None
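The relationship between the crypt_type presets and the individual permission flags can be sketched as a small lookup. resolve_permissions is a hypothetical helper, not part of PdfHandler. It models only the permission flags, treats "decrypt" as granting every permission, and assumes that a preset takes precedence over the individual arguments; the actual method may resolve these differently.

```python
PERMISSION_NAMES = (
    "extract",
    "modify_annotation",
    "modify_assembly",
    "modify_form",
    "modify_other",
    "print_lowres",
    "print_highres",
)


def resolve_permissions(crypt_type=None, **flags):
    """Return a permission dict for a preset or for explicit flags."""
    unknown = set(flags) - set(PERMISSION_NAMES)
    if unknown:
        raise ValueError(f"unknown permission(s): {sorted(unknown)}")
    if crypt_type is None:
        # No preset: each flag defaults to True, as in the signature.
        return {name: bool(flags.get(name, True)) for name in PERMISSION_NAMES}
    if crypt_type == "decrypt":
        # Assumption: no encryption means nothing is restricted.
        return dict.fromkeys(PERMISSION_NAMES, True)
    if crypt_type == "encrypt":
        return dict.fromkeys(PERMISSION_NAMES, False)
    if crypt_type == "no_copy":
        # Everything allowed except text and image extraction.
        return {name: name != "extract" for name in PERMISSION_NAMES}
    raise ValueError(f"invalid crypt_type: {crypt_type!r}")


print(resolve_permissions("no_copy")["extract"])       # False
print(resolve_permissions(print_highres=False)["print_highres"])  # False
```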

word_count(pages=None)

Count the number of words in the PDF.

Parameters:

pages (PageNumberType, optional) – Pages to include in the word count. If None (default), all pages are included. See get_pdf_text() for accepted formats.

Returns:

The total number of words found on the specified pages.

Return type:

int
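The documentation does not say how a word is delimited. One plausible reading, shown here as a sketch rather than as the library's definition, is to count whitespace-separated tokens in the extracted text. count_words is a hypothetical helper.

```python
def count_words(text):
    """Count whitespace-delimited tokens in a string."""
    # str.split() with no arguments splits on any run of whitespace
    # and ignores leading and trailing whitespace.
    return len(text.split())


sample = "Page one text.\n\nPage two\ttext."
print(count_words(sample))  # 6
print(count_words(""))      # 0
```

Under this reading, punctuation attached to a word ("text.") does not create an extra token, and an empty extraction yields a count of 0.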