Code Intermediate medium · 6 min

What an ingestion pipeline solves

What you will learn

An ingestion pipeline transforms raw documents into indexed, queryable vectors by automating parsing, chunking, embedding, and storage in a single reproducible workflow.

Why this matters

Without a pipeline, you manually chain document loading → text splitting → embedding → storage. The resulting code is brittle, hard to reproduce, and awkward to modify when you want to swap a component (e.g., switching embedding models or vector stores). Pipelines solve the 'integration nightmare' of RAG systems.

Skip if: Your documents are already indexed in a vector store and you only query them, or you are doing a one-off batch index that will never be reprocessed. In any project where documents will be re-ingested, avoid ad-hoc manual chaining: it tends to grow into spaghetti code.

Explanation

What it is: An ingestion pipeline in LlamaIndex is a declarative workflow that takes raw documents and outputs nodes ready for retrieval. It applies the transformations you list (typically chunking and embedding) and, if you attach a vector store, inserts the results into it. It's the ordered chain of steps between "files on disk" and "queryable index".

How it works mechanically: You load documents with a reader (e.g., SimpleDirectoryReader), then define an ordered list of transformations (e.g., SentenceSplitter → OpenAIEmbedding) and, optionally, a destination such as PineconeVectorStore. Each transformation takes a list of nodes and returns a list of nodes, passing its output to the next. When you run the pipeline, it executes the transformations in the order listed. The key insight: the pipeline itself is data-agnostic and reusable. You write it once, run it on different document sets, or swap out components without touching the orchestration logic.
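
The mechanics can be sketched without any library. The following is a toy stand-in, not the LlamaIndex API: Node, split_sentences, fake_embed, and run_pipeline are hypothetical names, and the "embedding" is a two-number vector computed locally rather than a model call. It only illustrates that a pipeline is an ordered list of transformations, each taking nodes and returning nodes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a chunk of text plus its embedding."""
    text: str
    embedding: Optional[List[float]] = None
    metadata: dict = field(default_factory=dict)


Transformation = Callable[[List[Node]], List[Node]]


def split_sentences(nodes: List[Node]) -> List[Node]:
    """Toy chunker: one node per sentence, metadata copied from the parent."""
    out = []
    for node in nodes:
        for sentence in node.text.split(". "):
            sentence = sentence.strip().rstrip(".")
            if sentence:
                out.append(Node(text=sentence, metadata=dict(node.metadata)))
    return out


def fake_embed(nodes: List[Node]) -> List[Node]:
    """Toy embedder: a 2-dimensional vector (char count, word count), no API call."""
    for node in nodes:
        node.embedding = [float(len(node.text)), float(len(node.text.split()))]
    return nodes


def run_pipeline(docs: List[Node], transformations: List[Transformation]) -> List[Node]:
    """Apply each transformation in order, feeding its output to the next."""
    nodes = docs
    for transform in transformations:
        nodes = transform(nodes)
    return nodes


docs = [Node(text="Pipelines chain steps. Each step returns nodes.",
             metadata={"source": "a.txt"})]
nodes = run_pipeline(docs, [split_sentences, fake_embed])
print(len(nodes), nodes[0].text, nodes[0].embedding)
# prints: 2 Pipelines chain steps [21.0, 3.0]
```

Swapping the chunker or the embedder means replacing one entry in the list passed to run_pipeline; the loop that orchestrates the steps does not change.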

When to use it: Use pipelines for any RAG system where documents may be re-indexed, where you need reproducibility, or where multiple team members need to ingest data consistently. For quick prototypes with static data, a manual chain is acceptable; for anything shipping to production or with frequent document updates, pipelines are non-negotiable.

Analogy

A factory assembly line. Raw materials (documents) enter, travel through stations (parser → chunker → embedder), and exit as finished products (indexed vectors). You design the line once, then feed it different raw materials without redesigning the stations.

Code

Illustrative only - not runnable without a valid API key

python

import os

from llama_index.core import Document
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# Placeholder only: export a real key in your shell instead of hard-coding one.
os.environ.setdefault("OPENAI_API_KEY", "sk-test-key")

# Three short in-memory documents stand in for files loaded from disk.
docs = [
    Document(text="LlamaIndex is a framework for building RAG applications."),
    Document(text="Ingestion pipelines automate the document-to-vector workflow."),
    Document(text="You can chain multiple nodes in a pipeline for complex workflows."),
]

# Transformations run in list order: split into chunks, then embed each chunk.
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512, chunk_overlap=20),
        OpenAIEmbedding(model="text-embedding-3-small"),
    ]
)

# run() returns the transformed nodes; no vector store is attached here.
nodes = pipeline.run(documents=docs)

print(f"Created {len(nodes)} nodes")
print(f"First node text: {nodes[0].get_content()[:80]}...")
print(f"First node embedding length: {len(nodes[0].embedding)}")
print(f"First node has metadata: {nodes[0].metadata is not None}")

Output

Created 3 nodes
First node text: LlamaIndex is a framework for building RAG applications....
First node embedding length: 1536
First node has metadata: True

What just happened?

The pipeline took 3 raw documents and turned them into 3 nodes (no splitting occurred because each document is far shorter than the 512-token chunk_size), embedded each node using OpenAI's text-embedding-3-small model (producing 1536-dimensional vectors), and returned a list of nodes with embeddings attached. Each node retained the original text and metadata from the source document. The pipeline executed transformations in order: SentenceSplitter first, then OpenAIEmbedding.

Common gotcha

Developers often assume that calling pipeline.run(documents=docs) stores data in a vector store automatically. As configured above, it doesn't: the pipeline returns nodes with embeddings and persists nothing. To store them, pass a vector_store to the IngestionPipeline constructor, build a VectorStoreIndex from the returned nodes, or insert them into your vector store manually. By default the pipeline is the transformation layer, not the persistence layer. This distinction trips up people migrating from the old GPTVectorStoreIndex.from_documents() pattern, which did both.
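
The split between transforming and persisting can be made concrete with a toy example. This is a sketch, not LlamaIndex code: InMemoryStore, embed_nodes, and the dict-based node shape are hypothetical. The transformation step only returns data; nothing reaches the store until an explicit insert, which is also a natural place to reject nodes that were never embedded.

```python
from typing import Dict, List


def embed_nodes(texts: List[str]) -> List[dict]:
    """Toy transformation layer: returns nodes, stores nothing."""
    return [
        {"id": f"node-{i}", "text": text, "embedding": [float(len(text))]}
        for i, text in enumerate(texts)
    ]


class InMemoryStore:
    """Hypothetical persistence layer: holds vectors keyed by node id."""

    def __init__(self) -> None:
        self.vectors: Dict[str, List[float]] = {}

    def add(self, nodes: List[dict]) -> None:
        for node in nodes:
            if node.get("embedding") is None:
                raise ValueError(f"{node['id']} has no embedding; add an embedding step")
            self.vectors[node["id"]] = node["embedding"]


store = InMemoryStore()
nodes = embed_nodes(["first chunk", "second chunk"])
print(len(store.vectors))  # 0: transforming did not persist anything
store.add(nodes)           # persistence is a separate, explicit step
print(len(store.vectors))  # 2
```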

Error recovery

MissingOpenAIKeyError

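OpenAIEmbedding needs an API key, which it reads from the OPENAI_API_KEY environment variable. Set it before instantiating OpenAIEmbedding, either by exporting it in your shell or with os.environ['OPENAI_API_KEY'] = 'sk-...', or pass api_key= to the constructor. The exact exception class and message can vary by library version.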

TypeError: object of type 'NoneType' has no len()

A node came out of the pipeline without an embedding (node.embedding is None), so len(nodes[0].embedding) fails. This happens if you forgot to add an embedding transformation. Add OpenAIEmbedding() or another embedding model to the transformations list.

TypeError: 'IngestionPipeline' object is not callable

You tried to call the pipeline object like a function. Use pipeline.run(documents=docs), not pipeline(docs).

Experienced dev note

The real power of pipelines emerges when you need to re-index after document updates or A/B test different chunking strategies. Many teams build a one-off ingestion script, then 6 months later realize they can't reproduce it or swap the embedding model because the logic is buried in Jupyter notebooks. Build the pipeline abstraction from day one: it costs a few extra lines of code up front and saves far more time in debugging later. Also: pipelines compose well with caching. An IngestionPipeline keeps an in-memory cache of transformation results by default, so re-running it on input it has already processed skips the expensive embedding step; call pipeline.persist() to keep that cache across runs. To avoid reprocessing unchanged documents when only part of the data is new, attach a docstore to the pipeline.
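
Why caching makes re-runs cheap can be sketched as follows. This illustrates the general idea (key each result on a hash of the input plus the transformation's configuration) in a simplified per-chunk form; it is not LlamaIndex's actual cache implementation, and slow_embed, cache_key, cached_embed, and the model-name strings are hypothetical.

```python
import hashlib
from typing import Dict, List

call_log: List[str] = []  # records every text that reaches the "expensive" embedder


def slow_embed(text: str) -> List[float]:
    """Stand-in for a paid embedding API call."""
    call_log.append(text)
    return [float(len(text))]


def cache_key(text: str, model_name: str) -> str:
    """Same text and same model config give the same key; change either and it differs."""
    return hashlib.sha256(f"{model_name}\n{text}".encode("utf-8")).hexdigest()


def cached_embed(
    texts: List[str], model_name: str, cache: Dict[str, List[float]]
) -> List[List[float]]:
    """Embed each text, calling slow_embed only on a cache miss."""
    vectors = []
    for text in texts:
        key = cache_key(text, model_name)
        if key not in cache:
            cache[key] = slow_embed(text)
        vectors.append(cache[key])
    return vectors


cache: Dict[str, List[float]] = {}
cached_embed(["alpha", "beta"], "model-a", cache)           # 2 real calls
cached_embed(["alpha", "beta", "gamma"], "model-a", cache)  # only "gamma" is new
cached_embed(["alpha"], "model-b", cache)                   # different model: cache miss
print(len(call_log))  # 4
```

Including the model name in the key is the important design choice: switching embedding models must invalidate old entries, otherwise stale vectors would be served silently.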

Check your understanding

You have 10,000 documents already indexed in Pinecone with OpenAI embeddings. Your team wants to switch to a different embedding model (e.g., Cohere) for better performance. Explain what parts of your ingestion pipeline you would change and what you would NOT need to change.

Answer hint

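A correct answer identifies that you'd replace the OpenAIEmbedding transformation with a CohereEmbedding transformation in the transformations list, re-run the pipeline over all 10,000 documents, and re-insert into Pinecone. Because vectors from different models are not comparable and may differ in dimension, the old vectors must be replaced (typically in a fresh index) rather than mixed with the new ones, and queries must be embedded with the new model too. The document loading and chunking logic stays the same: only the embedding transformation changes. This is why pipelines are powerful: components can be swapped without changing the orchestration logic.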
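
One practical detail behind the re-insert step: a vector index is created with a fixed dimension, so vectors from a model with a different output size cannot simply be written over the old ones. The sketch below uses a hypothetical FixedDimIndex class and made-up 3- and 4-dimensional vectors to show the failure mode; it is not the Pinecone client API.

```python
from typing import Dict, List


class FixedDimIndex:
    """Hypothetical vector index whose dimension is fixed at creation time."""

    def __init__(self, dimension: int) -> None:
        self.dimension = dimension
        self.vectors: Dict[str, List[float]] = {}

    def upsert(self, node_id: str, vector: List[float]) -> None:
        if len(vector) != self.dimension:
            raise ValueError(f"expected dimension {self.dimension}, got {len(vector)}")
        self.vectors[node_id] = vector


old_index = FixedDimIndex(dimension=3)
old_index.upsert("doc-1", [0.1, 0.2, 0.3])   # vector from the old model

new_model_vector = [0.4, 0.5, 0.6, 0.7]      # the new model emits 4 dimensions
try:
    old_index.upsert("doc-1", new_model_vector)
    mismatch = None
except ValueError as err:
    mismatch = str(err)
print(mismatch)  # expected dimension 3, got 4

# Re-embedding everything into a fresh index sized for the new model works.
new_index = FixedDimIndex(dimension=len(new_model_vector))
new_index.upsert("doc-1", new_model_vector)
```

Even when two models happen to share a dimension, their vector spaces differ, so the same rule applies: re-embed the whole corpus rather than mixing vectors.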

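VERSION Before 0.10.0, LlamaIndex shipped as a single llama-index package and import paths differed (for example, from llama_index.ingestion import IngestionPipeline). From 0.10.0 onward, including the current 0.12.x, the core lives in llama-index-core and integrations such as the OpenAI embeddings and the Pinecone vector store are separate packages, using the llama_index.core.* imports shown here. If working with older code, update the imports and keep the transformations=[...] pattern.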

Next, you'll learn how to connect an ingestion pipeline directly to a VectorStoreIndex and set up retrieval from the indexed nodes.
