PyMuPDF IndexError: Fix For Merged PDF Outlines
Have you hit an IndexError when calling PyMuPDF's delete_page() on a PDF created by merging multiple documents? The error often traces back to how outlines (bookmarks) are carried over during the merge. This article explains the root cause, walks through reproducing the bug step by step, and looks at the PyMuPDF internals that lead to the error, so you can diagnose and potentially resolve it.
Understanding the IndexError
An IndexError: list index out of range occurs when code accesses a list element with an index outside the list's valid range. With PyMuPDF and merged PDFs, the error surfaces inside delete_page(). The source PDFs may each carry their own Table of Contents (TOC) or bookmarks, and merging them can leave the combined outline structurally inconsistent. As a result, delete_page() ends up with a mismatched count of outline entries and tries to access an index that does not exist, which raises the IndexError.
The Root Causes: Two Critical Bugs
The IndexError stems from a combination of two bugs within PyMuPDF's internal logic. Let's break down each bug to understand how they contribute to the problem:
Bug 1: Flawed Logic in JM_outline_xrefs()
The function JM_outline_xrefs() (located in src/__init__.py, around line 21052 in PyMuPDF version 1.26.7) is responsible for collecting the cross-reference (xref) numbers of the PDF's outline items. Its logic rests on a critical assumption: that the /Type key, which identifies the type of a PDF object, exists only on the root of the outline. That assumption does not hold for merged PDFs.
In merged PDFs, child bookmarks can incorrectly inherit the /Type /Outlines entry from the source PDFs during the merging process. This means that not only the root outline, but also individual bookmarks might have this /Type key. The problematic code snippet is:
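One way to check whether a given file shows this pattern is to walk the outline tree and flag every item below the root that carries a /Type key. The sketch below is only an illustration: typed_outline_items and its get_key parameter are hypothetical names, and the (type, value) tuple shape is assumed to mirror what PyMuPDF's Document.xref_get_key() returns.

```python
def typed_outline_items(get_key, root_xref):
    """Return xrefs of outline items below the root that carry a /Type key.

    get_key(xref, key) is assumed to return a (type, value) tuple such as
    ("xref", "15 0 R"), ("name", "/Outlines"), or ("null", "null") when
    the key is absent.
    """
    def ref(xref, key):
        # Resolve an indirect reference like "15 0 R" to its xref number.
        kind, value = get_key(xref, key)
        return int(value.split()[0]) if kind == "xref" else None

    flagged, seen = [], set()
    stack = [ref(root_xref, "First")]
    while stack:
        xref = stack.pop()
        # Follow the /Next chain; 'seen' guards against circular links.
        while xref is not None and xref not in seen:
            seen.add(xref)
            if get_key(xref, "Type")[0] != "null":
                flagged.append(xref)
            stack.append(ref(xref, "First"))  # visit children later
            xref = ref(xref, "Next")
    return flagged


# Toy object table standing in for a merged PDF (hypothetical data).
objs = {
    1: {"Type": ("name", "/Outlines"), "First": ("xref", "2 0 R")},  # root
    2: {"Next": ("xref", "3 0 R")},
    3: {"Type": ("name", "/Outlines"), "First": ("xref", "4 0 R")},  # stray /Type
    4: {},
}

def fake_get_key(xref, key):
    return objs[xref].get(key, ("null", "null"))

print(typed_outline_items(fake_get_key, 1))  # [3]
```

With PyMuPDF, get_key could be doc.xref_get_key, and root_xref could be parsed from doc.xref_get_key(doc.pdf_catalog(), "Outlines"), assuming the catalog stores /Outlines as an indirect reference. That wiring is not exercised here.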
if newxref in xrefs or mupdf.pdf_dict_get(thisobj, PDF_NAME('Type')).m_internal:
    break  # Stops when ANY /Type key is found
This code prematurely terminates the xref extraction process when it encounters any /Type key, even if it's on a child bookmark. Consequently, JM_outline_xrefs() might return an incomplete list of xrefs, or even an empty list ([]), if the first bookmark it encounters has the incorrect /Type entry. This is a crucial point to grasp, as this incomplete list of xrefs directly impacts the subsequent steps in the delete_page() function.
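The effect of that stop condition can be seen in a toy model. The sketch below is not PyMuPDF's code: walk_outline is a hypothetical pure-Python function that assumes a depth-first walk over /First and /Next links, and only the break condition is taken from the snippet above.

```python
def walk_outline(objs, xref, xrefs=None):
    """Collect outline xrefs depth-first; objs maps xref -> dict of PDF keys."""
    xrefs = [] if xrefs is None else xrefs
    while xref is not None:
        item = objs[xref]
        # Same stop condition as the quoted snippet: a repeated xref
        # or ANY /Type key ends the walk at this level.
        if xref in xrefs or "Type" in item:
            break
        xrefs.append(xref)
        if "First" in item:
            walk_outline(objs, item["First"], xrefs)  # descend into children
        xref = item.get("Next")  # move to the next sibling, if any
    return xrefs


# Four bookmarks: 10, 11 (with child 12), and 13.
clean = {
    10: {"Next": 11},
    11: {"First": 12, "Next": 13},
    12: {},
    13: {},
}
# Same outline, but bookmark 11 carries a stray /Type entry.
merged = {**clean, 11: {"First": 12, "Next": 13, "Type": "/Outlines"}}

print(walk_outline(clean, 10))   # [10, 11, 12, 13]
print(walk_outline(merged, 10))  # [10]
```

In this model a single stray /Type on bookmark 11 hides it and everything after it, so the collected xrefs no longer correspond one-to-one with the bookmarks in the file.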
Bug 2: Missing Bounds Check in delete_pages()
The delete_pages() function (located in src/__init__.py, around line 4434), which delete_page() calls internally, is where the IndexError ultimately manifests. This function iterates through the list of outline xrefs obtained from JM_outline_xrefs() and acts on the corresponding Table of Contents (TOC) entries. The core of the problem lies in this loop:
for i, xref in enumerate(self.get_outline_xrefs()):
    if toc[i][2] - 1 in frozen_numbers:  # No bounds check!
The critical flaw here is the absence of a bounds check. The code indexes toc[i] without verifying that i is a valid index into toc; it silently assumes that self.get_outline_xrefs() and toc have the same length and the same order. Once Bug 1 distorts the xref list, that assumption no longer holds. The access toc[i] raises IndexError as soon as i reaches len(toc), that is, whenever the loop yields more xrefs than toc has entries. (A shorter xref list would not raise here, but it would pair xrefs with the wrong TOC entries.) This unchecked access is the direct trigger for the error you're seeing.
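The unchecked access can be illustrated without PyMuPDF. In the sketch below, items_on_pages is a hypothetical helper, the toc entries assume the [level, title, page] layout with 1-based page numbers, and the data is made up. The guard is one possible defensive check, not the upstream fix.

```python
def items_on_pages(toc, xrefs, frozen_numbers):
    """Return outline xrefs whose TOC entry targets a page in frozen_numbers."""
    hits = []
    for i, xref in enumerate(xrefs):
        if i >= len(toc):
            break  # bounds check: no TOC entry exists for this xref
        if toc[i][2] - 1 in frozen_numbers:  # TOC pages are 1-based
            hits.append(xref)
    return hits


toc = [[1, "Part A", 1], [1, "Part B", 2]]  # two TOC entries
xrefs = [10, 11, 12]                        # three outline xrefs: a mismatch
frozen_numbers = frozenset({1})             # 0-based page numbers to delete

# The unguarded pattern fails once i reaches len(toc).
try:
    [x for i, x in enumerate(xrefs) if toc[i][2] - 1 in frozen_numbers]
except IndexError as exc:
    print("unguarded:", exc)

print("guarded:", items_on_pages(toc, xrefs, frozen_numbers))  # guarded: [11]
```

A guard like this only prevents the crash. If the two lists are out of step, the remaining pairs may still match xrefs to the wrong TOC entries, so the traversal itself also needs attention.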
Reproducing the Bug: A Step-by-Step Guide
To better understand the bug and verify potential fixes, it's essential to be able to reproduce it consistently. Here's a step-by-step guide:
- Obtain a Merged PDF: The key to reproducing this bug is a PDF created by merging multiple PDFs, especially ones that contain their own TOC or bookmarks. You can create such a PDF yourself with a PDF merging tool, or use a sample PDF that exhibits the issue (like the BOC-Agenda-Packet-November-25-2025_Redacted.pdf mentioned in the original bug report).
- Install PyMuPDF: If you haven't already, install PyMuPDF using pip:
  pip install pymupdf
- Run the Reproduction Code: Use the following Python code snippet, adapted from the original bug report, to reproduce the error:
import pymupdf

# Open a PDF created by merging multiple PDFs with TOC
try:
    doc = pymupdf.open(