Prompt compression techniques
Quick answer
Prompt compression techniques reduce the token length of inputs to large language models by methods such as token pruning, semantic embeddings, and retrieval-augmented generation (RAG). These approaches optimize API usage costs and improve response efficiency by minimizing redundant or verbose input data.
PREREQUISITES
- Python 3.8+
- OpenAI API key (free tier works)
- pip install "openai>=1.0"
Setup
Install the openai Python package and set your API key as an environment variable to interact with the OpenAI API for prompt compression experiments.
pip install openai

output

Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
Step by step
This example demonstrates a simple token pruning technique by truncating a long prompt before sending it to the gpt-4o model. It also shows how to obtain a semantic embedding of the prompt: a fixed-length vector that can act as a compressed representation for retrieval.
import os
from openai import OpenAI
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
# Original long prompt
long_prompt = """\
In this document, we discuss the detailed architecture of transformer models, including attention mechanisms, positional encodings, and feed-forward layers. The goal is to optimize the input prompt length to reduce token usage and cost.
"""
# Token pruning: truncate the prompt (approximated here by a 500-character
# cut, roughly 100-125 tokens at ~4 characters per token)
truncated_prompt = long_prompt[:500]  # Adjust length as needed
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": truncated_prompt}],
)
print("Response with truncated prompt:", response.choices[0].message.content)
# Embeddings for semantic compression
embedding_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=long_prompt,
)
embedding_vector = embedding_response.data[0].embedding
print(f"Embedding vector length: {len(embedding_vector)}")

output
Response with truncated prompt: Transformer models use attention mechanisms to weigh input tokens dynamically, enabling efficient context understanding.
Embedding vector length: 1536
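The character-based cut above can split a word in half. A slightly safer sketch prunes to an approximate token budget and backs up to a word boundary; the function name and the ~4-characters-per-token heuristic are assumptions of mine (for exact counts, use a tokenizer such as tiktoken).

```python
def truncate_to_token_budget(text: str, max_tokens: int, chars_per_token: int = 4) -> str:
    """Prune text to an approximate token budget, cutting at a word boundary.

    Assumes roughly `chars_per_token` characters per token, which is a common
    rule of thumb for English text, not an exact count.
    """
    budget_chars = max_tokens * chars_per_token
    if len(text) <= budget_chars:
        return text
    cut = text[:budget_chars]
    # Avoid splitting mid-word: back up to the last whitespace if possible
    last_space = cut.rfind(" ")
    return cut[:last_space] if last_space > 0 else cut

long_text = "word " * 1000  # stand-in for a long prompt
compressed = truncate_to_token_budget(long_text, max_tokens=100)
print(len(long_text), "->", len(compressed), "characters")
```

The compressed string can then be passed as the `content` of the user message exactly as in the example above.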
Common variations
- Use asynchronous calls with asyncio for non-blocking prompt compression workflows.
- Apply retrieval-augmented generation (RAG) by storing embeddings in a vector database like FAISS and retrieving relevant compressed context.
- Experiment with different models like gpt-4o-mini for lower-cost inference with compressed prompts.
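The RAG variation can be sketched without a vector database: keep (chunk, embedding) pairs in memory and retrieve the closest chunk by cosine similarity, sending only that chunk to the model instead of the full document. The toy 3-dimensional vectors below stand in for real vectors from client.embeddings.create, and the helper names are mine, not part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_vec, store, k=1):
    """store: list of (chunk_text, embedding) pairs; returns the k closest chunks."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Toy embeddings standing in for real model output
store = [
    ("attention mechanisms", [1.0, 0.0, 0.0]),
    ("positional encodings", [0.0, 1.0, 0.0]),
    ("feed-forward layers", [0.0, 0.0, 1.0]),
]
query = [0.9, 0.1, 0.0]  # pretend-embedding of a question about attention
print(retrieve_top_k(query, store, k=1))
```

With real embeddings the vectors are 1536-dimensional (for text-embedding-3-small), and a library like FAISS replaces the linear scan once the store grows large.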
import asyncio
import os

from openai import AsyncOpenAI

async def async_truncate_prompt():
    # Use the async client; the synchronous OpenAI client cannot be awaited
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    long_prompt = """Detailed explanation of prompt compression techniques for cost optimization in LLMs."""
    truncated_prompt = long_prompt[:300]
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": truncated_prompt}],
    )
    print("Async response:", response.choices[0].message.content)

asyncio.run(async_truncate_prompt())

output
Async response: Prompt compression reduces token usage by summarizing or encoding input efficiently, saving API costs.
Troubleshooting
- If responses are incomplete or cut off, increase max_tokens or adjust prompt length.
- Embedding vector length may vary by model; ensure compatibility with your vector store.
- For API rate limits, implement exponential backoff retries.
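The backoff suggestion can be sketched as a small wrapper; `with_backoff` is a name I've introduced, not part of the openai library. In production you would catch the library's specific openai.RateLimitError rather than a broad Exception.

```python
import random
import time

def with_backoff(fn, max_retries=5, base_delay=1.0, retryable=(Exception,)):
    """Call fn, retrying on retryable errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise  # give up after the final attempt
            # Delay doubles each attempt: base, 2*base, 4*base, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1)
            time.sleep(delay)

# Usage sketch (assumes `client` from the setup above):
# result = with_backoff(
#     lambda: client.chat.completions.create(
#         model="gpt-4o", messages=[{"role": "user", "content": truncated_prompt}]
#     )
# )
```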
Key Takeaways
- Truncate or prune prompts to reduce token count and lower API costs.
- Use embeddings to semantically compress prompts for efficient retrieval and generation.
- Combine prompt compression with retrieval-augmented generation for scalable cost optimization.