Workflow Beginner easy · 5 min decision_step

split_documents() vs split_text()

What you will learn

Decide whether to split raw text or pre-loaded document objects, based on what metadata you need to preserve for retrieval.

Step 2: Document Preparation: after loading documents, when you chunk them for indexing in a vector store

Why this matters

Using the wrong splitter loses document metadata (filename, source URL, page numbers) that your retrieval chain needs to cite sources. Users see incomplete or confusing results when they ask where information came from.

Explanation

split_text() chunks a raw string into a list of smaller strings. It is the simplest call, but its output carries no context: you lose track of which document each chunk came from. split_documents() operates on Document objects (which carry metadata like source, page, author) and copies each document's metadata onto every chunk it produces. In RAG, users expect to see "this answer came from page 5 of earnings_report.pdf." If you use split_text() on extracted text, you can't tell them that.

The decision is simple: if you have Document objects with metadata, use split_documents(). If you only have a raw string and don't care about sources, use split_text(). Most RAG workflows start with documents (PDFs, web pages, files), so split_documents() is the standard choice. split_text() appears when you're prototyping with hardcoded strings or when metadata is genuinely irrelevant.

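Both use the same underlying chunking logic (chunk size, overlap): split_documents() runs it on each Document's page_content separately. Given the same text, the chunks are identical. The differences are that metadata is preserved and that a chunk never spans two Document objects, whereas split_text() on concatenated text can produce chunks that straddle a page boundary.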
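To make that relationship concrete, here is a simplified sketch of the idea: a document-level splitter runs the text-level splitter on each document and copies that document's metadata onto every resulting chunk. The `Doc` class and both functions below are hypothetical stand-ins written for illustration, not LangChain's implementation (the real recursive splitter also prefers to break on separators rather than at fixed offsets).

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split_text(text, chunk_size, chunk_overlap):
    """Naive fixed-window chunking; returns plain strings with no metadata."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - chunk_overlap
    last_start = max(len(text) - chunk_overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


def split_documents(docs, chunk_size, chunk_overlap):
    """Split each document on its own and copy its metadata to every chunk."""
    chunks = []
    for doc in docs:
        for piece in split_text(doc.page_content, chunk_size, chunk_overlap):
            chunks.append(Doc(piece, dict(doc.metadata)))
    return chunks


pages = [
    Doc("A" * 25, {"source": "earnings_report.pdf", "page": 5}),
    Doc("B" * 12, {"source": "earnings_report.pdf", "page": 6}),
]
chunks = split_documents(pages, chunk_size=10, chunk_overlap=2)
for chunk in chunks:
    print(chunk.metadata["page"], chunk.page_content)
```

In this toy run, page 5 yields three chunks and page 6 yields two, and each chunk contains text from exactly one page.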

Code

python

# pip install langchain langchain-text-splitters

from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

# Example 1: split_documents() with metadata preservation
print("=== split_documents() ===")
docs = [
    Document(
        page_content="The revenue in Q1 2024 was 1.5 billion dollars. The margin improved by 3%. Growth was driven by cloud services.",
        metadata={"source": "earnings_report.pdf", "page": 5}
    ),
    Document(
        page_content="Operating expenses were reduced by 12%. Cloud infrastructure costs decreased due to optimization.",
        metadata={"source": "earnings_report.pdf", "page": 6}
    )
]

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)

chunks_with_metadata = splitter.split_documents(docs)
for i, chunk in enumerate(chunks_with_metadata):
    print(f"Chunk {i}: {chunk.page_content[:50]}...")
    print(f"  Metadata: {chunk.metadata}\n")

# Example 2: split_text() without metadata
print("\n=== split_text() ===")
raw_text = "The revenue in Q1 2024 was 1.5 billion dollars. The margin improved by 3%. Growth was driven by cloud services. Operating expenses were reduced by 12%."

chunks_no_metadata = splitter.split_text(raw_text)
for i, chunk in enumerate(chunks_no_metadata):
    print(f"Chunk {i}: {chunk[:50]}...")
    print(f"  Type: {type(chunk)} (no metadata attached)\n")

Output

Exact chunk boundaries may vary slightly between library versions.

=== split_documents() ===
Chunk 0: The revenue in Q1 2024 was 1.5 billion dollars. Th...
  Metadata: {'source': 'earnings_report.pdf', 'page': 5}

Chunk 1: was driven by cloud services....
  Metadata: {'source': 'earnings_report.pdf', 'page': 5}

Chunk 2: Operating expenses were reduced by 12%. Cloud infr...
  Metadata: {'source': 'earnings_report.pdf', 'page': 6}

=== split_text() ===
Chunk 0: The revenue in Q1 2024 was 1.5 billion dollars. Th...
  Type: <class 'str'> (no metadata attached)

Chunk 1: was driven by cloud services. Operating expenses w...
  Type: <class 'str'> (no metadata attached)

Note that split_documents() keeps page 5 and page 6 in separate chunks, while split_text() on the combined string produces a chunk that mixes text from both pages.

Your options

Recommended

split_documents()

You loaded files, PDFs, or URLs and have Document objects with metadata (source, page_number, author, etc.). This is the standard RAG path.

Pros

Preserves all metadata in every chunk. Answers can cite their source ('Answer from sales_deck.pdf, slide 12'). Retriever can filter by source. Production-ready for attribution.

Cons

Requires Document objects as input; slightly more setup if starting from raw strings.

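# pip install langchain-community  (provides TextLoader)
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader

# Assumes document.txt exists in the working directory
loader = TextLoader('document.txt')
docs = loader.load()  # list of Document objects; metadata includes 'source'
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_documents(docs)  # each chunk keeps its document's metadata
print(f"Chunk 0 metadata: {chunks[0].metadata}")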

split_text()

You only have a raw string (hardcoded text, API response, manual input). Metadata is not needed or doesn't exist. Prototyping or simple text processing.

Pros

Simpler for one-off strings. No Document wrapper needed. Fast to test an idea.

Cons

Loses all source information. Breaks retrieval chains that need to cite sources. Not suitable for production RAG.

from langchain_text_splitters import RecursiveCharacterTextSplitter

raw_text = """This is a long document..."""
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
chunks = splitter.split_text(raw_text)
print(f"Number of chunks: {len(chunks)}")

Validation step

Run this check: `if hasattr(chunks[0], 'metadata') and chunks[0].metadata: print('✓ Metadata preserved')`. If the first chunk is a string, you used split_text() and lost metadata. If it's a Document object with non-empty metadata, you correctly used split_documents().
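The one-line check only looks at the first chunk. A slightly fuller sketch, assuming you want every chunk to carry a `source` key, might look like the following. The helper name `find_metadata_problems` and the `required_keys` parameter are hypothetical, and `SimpleNamespace` objects stand in for Document chunks so the example runs without any library installed.

```python
from types import SimpleNamespace


def find_metadata_problems(chunks, required_keys=("source",)):
    """Return human-readable problems; an empty list means the check passed."""
    problems = []
    for i, chunk in enumerate(chunks):
        if isinstance(chunk, str):
            # Plain strings are what split_text() returns
            problems.append(f"chunk {i}: plain string (split_text() output?)")
            continue
        metadata = getattr(chunk, "metadata", None) or {}
        missing = [key for key in required_keys if key not in metadata]
        if missing:
            problems.append(f"chunk {i}: missing metadata keys {missing}")
    return problems


# Stand-ins for real chunks: one good, one with empty metadata, one raw string
good = SimpleNamespace(page_content="Revenue grew.", metadata={"source": "report.pdf", "page": 5})
empty = SimpleNamespace(page_content="Costs fell.", metadata={})

for problem in find_metadata_problems([good, empty, "raw text"]):
    print(problem)
```

Because the helper relies only on a `metadata` attribute, the same function should work on the list returned by split_documents().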

At scale

At scale, a 1000-page PDF can produce thousands of chunks, and well over 10,000 with small chunk sizes. If you use split_text() and then try to reconstruct source information from chunk content, the mapping will be unreliable, and nothing will raise an error to tell you so. With split_documents(), metadata stays attached whether you have 100 chunks or 100,000. Test with real file volumes before deploying.
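For a rough sense of chunk volume before you index anything, a back-of-the-envelope estimate can help. The sketch below assumes fixed-size windows and roughly 3,000 characters per page, which is an illustrative figure rather than a measured one. Real splitters break on separators, so actual counts will differ somewhat.

```python
import math


def estimate_chunks_per_document(doc_chars, chunk_size, chunk_overlap):
    """Approximate chunk count for one document using fixed-size windows."""
    if doc_chars <= 0:
        return 0
    if doc_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap
    return math.ceil((doc_chars - chunk_overlap) / step)


PAGES = 1000
CHARS_PER_PAGE = 3000  # assumption for illustration; measure your own files

# Loaders typically return one Document per page, so estimate per page
for chunk_size, chunk_overlap in [(1000, 200), (300, 50)]:
    per_page = estimate_chunks_per_document(CHARS_PER_PAGE, chunk_size, chunk_overlap)
    print(f"chunk_size={chunk_size}, overlap={chunk_overlap}: ~{per_page * PAGES} chunks")
```

Under these assumptions the estimate is about 4,000 chunks at chunk_size=1000 and about 12,000 at chunk_size=300.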

Rollback plan

If you chose split_text() and lost metadata, reload your documents and call split_documents() instead. You cannot reliably reverse-engineer which source a chunk came from after the fact. Start over with the Document objects.

Debug symptoms

Retrieval results work but have no source/page information. User asks 'where did you find that?' and you have no answer.

Diagnosis

Used split_text() instead of split_documents(). Chunks are plain strings with no metadata attached.

Fix

Switch to split_documents(). Reload your source files as Document objects (using loaders like TextLoader, PDFPlumberLoader, etc.) before splitting.

Splitting fails with an error like `AttributeError: 'str' object has no attribute 'page_content'`.

Diagnosis

Used split_documents() but the input was not Document objects; it was plain strings.

Fix

Wrap your strings in Document objects: `docs = [Document(page_content=text, metadata={}) for text in texts]` before calling split_documents().

Retrieval chain returns chunks but metadata is empty dict or None.

Diagnosis

Used split_documents() correctly, but the source Document objects had empty metadata dicts to begin with.

Fix

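Ensure the loader or Document constructor populates metadata. Loaders such as PDFPlumberLoader normally set fields like source and page when you call `loader.load()`, so inspect `docs[0].metadata` before splitting. If you build Document objects yourself, set metadata explicitly: `Document(page_content=..., metadata={'source': 'file.pdf', 'page': 1})`.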

Production upgrade path

Tutorial version: use split_documents() and rely on the loader's default metadata.

Production version:

(1) Validate that metadata is non-empty on every chunk after the split: `assert all(chunk.metadata for chunk in chunks), 'Metadata lost!'`

(2) Add custom metadata fields that your retrieval chain filters on (e.g., doc_type, author, date_uploaded).

(3) Store chunk_id (unique identifier for each chunk) in metadata for audit logs.

(4) Log the count of chunks per document to detect silent data loss.

Example: `metadata={'source': 'earnings_report.pdf', 'page': 5, 'chunk_id': 'doc_0_chunk_12', 'indexed_at': '2026-04-15T10:30:00Z'}`.
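A minimal sketch of the chunk_id and chunk-count steps is shown below. The helper name `add_chunk_ids` and the `source#chunk-N` id format are hypothetical choices, and `SimpleNamespace` objects stand in for the Document chunks that split_documents() would return.

```python
from collections import Counter
from types import SimpleNamespace


def add_chunk_ids(chunks):
    """Attach a per-source sequential chunk_id in place; return chunks per source."""
    seen = Counter()
    for chunk in chunks:
        source = chunk.metadata.get("source", "unknown")
        chunk.metadata["chunk_id"] = f"{source}#chunk-{seen[source]}"
        seen[source] += 1
    return seen


# Stand-ins for chunks produced by split_documents()
chunks = [
    SimpleNamespace(page_content="Revenue grew.", metadata={"source": "earnings_report.pdf", "page": 5}),
    SimpleNamespace(page_content="Margins improved.", metadata={"source": "earnings_report.pdf", "page": 5}),
    SimpleNamespace(page_content="Slide notes.", metadata={"source": "sales_deck.pdf", "page": 12}),
]

counts = add_chunk_ids(chunks)

# Fail loudly if any chunk has empty metadata
assert all(chunk.metadata for chunk in chunks), "Metadata lost!"

# Log chunks per source; a source with zero or unexpectedly few chunks is a red flag
for source, count in counts.items():
    print(f"{source}: {count} chunks")
print(chunks[1].metadata["chunk_id"])
```

Comparing the logged counts against the list of files you intended to index is a cheap way to notice a document that was skipped or came back empty.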

Common gotcha

Many tutorials show split_text() for simplicity. You copy that pattern, it works in a notebook, then your production app fails because the retriever has no source information to show users. By then you've indexed thousands of chunks with no metadata. Always use split_documents() unless you have a specific reason not to.

Experienced dev note

In production RAG systems, metadata is your audit trail. Regulatory compliance (finance, healthcare) often requires attribution: 'this answer came from document X, version Y, at timestamp Z.' split_documents() is not optional for these use cases. Even for non-regulated systems, users trust answers more when they see sources. The marginal cost of preserving metadata (essentially zero) is far smaller than the risk of deploying without it. One more thing: metadata is hard to change once chunks are indexed. Most vector stores require updating or re-indexing the affected chunks, so a mistake at split time can mean re-indexing everything. Get it right the first time.

Check your understanding

You have a PDF with 10 pages. You load it, split it with split_documents(), index it in Chroma, then retrieve the top chunk for a query. What information should that chunk contain besides the text itself, and why does your user care?

Answer hint

The chunk should have metadata like source ('report.pdf') and page (e.g., 5). Users care because they want to verify the answer and re-read the original context, or cite the source in their own work.
