Severity: High · Level: Intermediate · Fix time: 2-5 min

ValueError

builtins.ValueError

What this error means

PDF chunking fails when bytes or other non-text content is passed to a text splitter that accepts only str input.

Stack trace

traceback

Traceback (most recent call last):
  File "app.py", line 42, in <module>
    chunks = text_splitter.split_text(pdf_page_content)
  File "/usr/local/lib/python3.9/site-packages/langchain/text_splitter.py", line 78, in split_text
    raise ValueError("Input must be a string, got bytes instead")
ValueError: Input must be a string, got bytes instead

QUICK FIX

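If the extracted page content is a bytes object, decode it to a string with .decode('utf-8') before chunking. This only helps when the bytes hold encoded text; image data and compressed streams will not decode into readable text.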

Why it happens

PDF chunking tools expect text input but sometimes receive raw binary content extracted from PDFs, such as images or encoded streams. A text splitter can only process str input, so passing bytes raises an error instead of producing chunks.

Detection

Add input type checks before chunking to assert the content is a string, or log the type of content received to catch binary data before processing.
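A minimal sketch of such a check, assuming a hypothetical helper named `require_text` that runs just before the splitter call. The helper and the logger name are illustrative and are not part of LangChain or any PDF library.

```python
import logging

logger = logging.getLogger("pdf_chunking")  # hypothetical logger name


def require_text(content, source="page"):
    """Log the type of `content` and raise early if it is not a str."""
    logger.debug("%s content type: %s", source, type(content).__name__)
    if not isinstance(content, str):
        raise TypeError(f"{source}: expected str, got {type(content).__name__}")
    return content


# Usage: validate right before chunking
page_text = require_text("Plain extracted text", source="page 0")
```

Raising early with the source name in the message points at the page that produced binary data, which is usually easier to act on than an error raised inside the splitter.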

Causes & fixes

Extracted PDF page content is raw bytes instead of a decoded text string

✓ Fix

Decode the bytes to a string (for example, as UTF-8) before passing the content to the chunking or text-splitting function

Using a PDF extraction method that returns raw bytes for images or embedded objects

✓ Fix

Filter out or skip non-text PDF elements before chunking, or use a PDF parser that extracts only text content

Passing the entire PDF page object or raw stream instead of its text attribute

✓ Fix

Access and pass only the text attribute of the PDF page object to the chunking function
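The three fixes above can be combined into a single normalization step. The sketch below assumes a hypothetical list of extracted elements that may be `str`, `bytes`, or objects exposing a `text` attribute. The `to_text` helper and the `FakePage` class are illustrative only and do not belong to any PDF library.

```python
from dataclasses import dataclass


@dataclass
class FakePage:
    """Hypothetical stand-in for a parser's page object with a text attribute."""
    text: str


def to_text(element):
    """Return the element as str, or None if it holds no usable text."""
    if isinstance(element, str):
        return element
    if isinstance(element, (bytes, bytearray)):
        try:
            return bytes(element).decode("utf-8")
        except UnicodeDecodeError:
            return None  # likely an image or compressed stream: skip it
    text = getattr(element, "text", None)
    return text if isinstance(text, str) else None


elements = [
    "Intro paragraph",
    b"Encoded caf\xc3\xa9",   # UTF-8 encoded text
    b"\x89PNG\r\n\x1a\n",     # binary header, not valid UTF-8
    FakePage("Page body"),
]
texts = [t for t in (to_text(e) for e in elements) if t is not None]
print(texts)
```

Bytes that happen to decode cleanly are not guaranteed to be meaningful text, so this filter is a first line of defense rather than a full content check.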

Code: broken vs fixed

Broken - triggers the error

python

from langchain.text_splitter import RecursiveCharacterTextSplitter

# pdf_reader is assumed to be an existing PDF reader object created elsewhere
pdf_page_content = pdf_reader.pages[0].extract_text()  # returns bytes in some cases
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
chunks = text_splitter.split_text(pdf_page_content)  # ValueError: Input must be a string, got bytes instead

Fixed - works correctly

python

from langchain.text_splitter import RecursiveCharacterTextSplitter

# pdf_reader is assumed to be an existing PDF reader object created elsewhere
raw_content = pdf_reader.pages[0].extract_text()
# Normalize to str: decode only when the extractor returned bytes
if isinstance(raw_content, bytes):
    pdf_page_content = raw_content.decode('utf-8')
else:
    pdf_page_content = raw_content
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
chunks = text_splitter.split_text(pdf_page_content)  # input is now a str
print(chunks)

The fix decodes the page content from bytes to a string (assuming UTF-8 encoding) before passing it to the text splitter, which prevents the ValueError.

⚠ Workaround

Wrap the chunking call in try/except ValueError. If the input turns out to be bytes, decode it to a string and retry the chunking operation; otherwise re-raise the error.
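A minimal sketch of this retry pattern. The `split_text` function here is a hypothetical stand-in that mimics the error from the stack trace so the example runs without LangChain; in real code you would pass `text_splitter.split_text` as the `splitter` argument instead.

```python
def split_text(text, chunk_size=10):
    """Hypothetical stand-in splitter: fixed-size chunks, rejects non-str input."""
    if not isinstance(text, str):
        raise ValueError(f"Input must be a string, got {type(text).__name__} instead")
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


def split_with_retry(splitter, content, encoding="utf-8"):
    """Call splitter(content); on ValueError, decode bytes once and retry."""
    try:
        return splitter(content)
    except ValueError:
        if not isinstance(content, (bytes, bytearray)):
            raise  # unrelated ValueError: do not mask it
        return splitter(bytes(content).decode(encoding))


chunks = split_with_retry(split_text, b"binary input becomes text")
print(chunks)
```

Re-raising when the input is not bytes keeps the wrapper from hiding other ValueErrors that the splitter may raise for unrelated reasons.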

✓ Prevention

Use PDF parsers that extract clean text strings only, and validate input types before chunking to ensure no binary data is processed.

Python 3.9+ · langchain-core >=0.1.0 · tested on 0.2.x

Verified 2026-04
