How to chunk code for RAG
Direct answer
Use text splitting techniques to divide code into logical, token-limited chunks for RAG, ensuring each chunk fits model context windows and preserves semantic boundaries.
Setup
Install
pip install langchain openai
Env vars
OPENAI_API_KEY
Imports
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI
Examples
in
def add(a, b):
    return a + b

print(add(2, 3))
out
['def add(a, b):\n    return a + b\n\nprint(add(2, 3))']
in
class Calculator:
    def add(self, a, b):
        return a + b

    def subtract(self, a, b):
        return a - b

calc = Calculator()
print(calc.add(5, 7))
out
["class Calculator:\n    def add(self, a, b):\n        return a + b\n", "    def subtract(self, a, b):\n        return a - b\n\ncalc = Calculator()\nprint(calc.add(5, 7))"]
in
# Large script with multiple functions and classes spanning 2000 tokens
out
["Chunk 1 with first 1000 tokens", "Chunk 2 with next 1000 tokens"]
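The splitting behavior shown above can be sketched without LangChain. The following toy splitter is a simplified stand-in for RecursiveCharacterTextSplitter (no overlap handling): it tries the coarsest separator first and recurses into finer separators only for pieces that still exceed the budget:

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Toy recursive splitter: split on the coarsest separator first,
    recursing into finer separators only for oversized pieces."""
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # The piece itself may still be too big: recurse with finer separators
            if len(piece) > chunk_size:
                chunks.extend(recursive_split(piece, chunk_size, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks

code = "def add(a, b):\n    return a + b\n\nprint(add(2, 3))"
chunks = recursive_split(code, chunk_size=40)
# The blank line between the function and the call becomes the split point
```

Because the blank-line separator is tried first, function bodies stay intact and the split lands between logical blocks, mirroring the chunk boundaries in the examples above.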
Integration steps
- Import a text splitter like RecursiveCharacterTextSplitter from LangChain.
- Load your code as a string and configure the splitter with chunk size and overlap.
- Split the code into chunks respecting token limits and semantic boundaries.
- Use these chunks as documents for embedding or retrieval in your RAG pipeline.
- Query the retriever with user input to fetch relevant code chunks.
- Pass retrieved chunks plus query to the LLM for augmented generation.
Full code
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI

# Load your code as a string (example code snippet)
code_text = '''
class Calculator:
    def add(self, a, b):
        return a + b

    def subtract(self, a, b):
        return a - b

calc = Calculator()
print(calc.add(5, 7))
'''

# Initialize the text splitter with chunk size and overlap
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,   # measured in characters, not tokens
    chunk_overlap=50,
    separators=["\n\n", "\n", " "]
)

# Split the code into chunks
chunks = splitter.split_text(code_text)

# Print chunks for inspection
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1} (length {len(chunk)} chars):\n{chunk}\n{'-'*40}")

# Example: Initialize OpenAI client for RAG retrieval or embedding
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Normally, you'd embed chunks and build a vector store here for retrieval
# This example only shows chunking logic output
Chunk 1 (length 142 chars):
class Calculator:
    def add(self, a, b):
        return a + b

    def subtract(self, a, b):
        return a - b
----------------------------------------
Chunk 2 (length 41 chars):
calc = Calculator()
print(calc.add(5, 7))
----------------------------------------
API trace
Request
{"model": "gpt-4o", "messages": [{"role": "user", "content": "Retrieve relevant code chunks for query"}]}
Response
{"choices": [{"message": {"content": "Relevant code chunk text..."}}], "usage": {"total_tokens": 150}}
Extract
response.choices[0].message.content
Variants
Streaming chunk output ›
Use streaming when processing large code chunks to provide incremental output and better UX.
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI

code_text = '''def foo():\n    pass\n\n# More code...'''
splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=30)
chunks = splitter.split_text(code_text)

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
for chunk in chunks:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Process this code chunk:\n" + chunk}],
        stream=True
    )
    # Streamed responses yield incremental deltas, not full messages
    for event in response:
        if event.choices and event.choices[0].delta.content:
            print(event.choices[0].delta.content, end='')
Async chunking and retrieval ›
Use async when integrating chunking with concurrent API calls for efficiency.
import os
import asyncio
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import AsyncOpenAI

async def chunk_and_query():
    code_text = 'def async_func():\n    pass\n'
    splitter = RecursiveCharacterTextSplitter(chunk_size=400, chunk_overlap=40)
    chunks = splitter.split_text(code_text)
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    # Fire off all chunk analyses concurrently, then gather the results
    tasks = [
        client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": f"Analyze this code chunk:\n{chunk}"}]
        )
        for chunk in chunks
    ]
    for response in await asyncio.gather(*tasks):
        print(response.choices[0].message.content)

asyncio.run(chunk_and_query())
Alternative model for cost efficiency ›
Use smaller models like gpt-4o-mini for cheaper, faster chunk summarization when high accuracy is not critical.
import os
from langchain.text_splitter import RecursiveCharacterTextSplitter
from openai import OpenAI

code_text = 'def cheap_model_func():\n    return True\n'
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(code_text)

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
for chunk in chunks:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Summarize this code chunk:\n{chunk}"}]
    )
    print(response.choices[0].message.content)
Performance
Latency: ~800ms for gpt-4o non-streaming calls
Cost: ~$0.002 per 500 tokens for gpt-4o
Rate limits: Tier 1: 500 RPM / 30K TPM
- Use chunk overlap sparingly to reduce redundant tokens.
- Trim comments or non-essential code before chunking to save tokens.
- Batch multiple chunks in one request if model context allows.
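The comment-trimming tip above can be sketched in plain Python. This hypothetical trim_comments helper drops full-line # comments and collapses the blank runs they leave behind; a robust version would use Python's tokenize module to avoid stripping # inside string literals:

```python
def trim_comments(code: str) -> str:
    """Remove full-line '#' comments and collapse blank lines.
    Naive sketch: does not handle '#' inside string literals."""
    kept = []
    for line in code.splitlines():
        if line.strip().startswith("#"):
            continue  # drop comment-only lines
        kept.append(line)
    # Collapse runs of blank lines left behind by removed comments
    out, prev_blank = [], False
    for line in kept:
        blank = not line.strip()
        if blank and prev_blank:
            continue
        out.append(line)
        prev_blank = blank
    return "\n".join(out)

code = "# setup\n\n\ndef f():\n    return 1  # inline comments are kept\n"
slim = trim_comments(code)
```

Run this before splitting so the token budget is spent on code the retriever can actually match against, not commentary.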
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard chunking with gpt-4o | ~800ms | ~$0.002 per 500 tokens | High accuracy RAG |
| Streaming chunk output | Varies, faster perceived | Similar | Large codebases, better UX |
| Async chunking | Improved throughput | Similar | Concurrent processing |
| Using gpt-4o-mini | ~400ms | ~$0.0005 per 500 tokens | Cost-sensitive summarization |
Quick tip
Use code-aware splitters: LangChain's RecursiveCharacterTextSplitter.from_language(Language.PYTHON, ...) splits on language-specific separators (class and function boundaries for Python), preserving logical blocks while respecting size limits.
Common mistake
Splitting code arbitrarily by fixed character count without semantic awareness leads to broken code chunks and poor retrieval results.
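To see why this matters, compare a naive fixed-width split with a line-boundary-aware one on the same snippet (an illustrative sketch): the fixed-width cut lands mid-token, while packing whole lines up to the same budget keeps statements intact:

```python
code = "def add(a, b):\n    return a + b\n"

# Naive: cut every 20 characters, regardless of structure
naive = [code[i:i + 20] for i in range(0, len(code), 20)]

# Boundary-aware: greedily pack whole lines up to the same 20-char budget
aware, current = [], ""
for line in code.splitlines(keepends=True):
    if current and len(current) + len(line) > 20:
        aware.append(current)
        current = ""
    current += line
if current:
    aware.append(current)
```

The naive split severs `return` across two chunks, so neither chunk embeds (or retrieves) well; the line-aware split yields one complete statement per chunk.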