How to · Intermediate · 4 min read

Prompt compression techniques

Quick answer
Prompt compression techniques reduce the token length of inputs to large language models through methods such as token pruning, semantic embeddings, and retrieval-augmented generation (RAG). These approaches lower API costs and latency by stripping redundant or verbose input before it reaches the model.

PREREQUISITES

  • Python 3.8+
  • OpenAI API key (free tier works)
  • pip install "openai>=1.0" (quote the version spec so the shell does not treat >= as a redirect)

Setup

Install the openai Python package and set your API key as an environment variable to interact with the OpenAI API for prompt compression experiments.

bash
pip install openai
output
Collecting openai
  Downloading openai-1.x.x-py3-none-any.whl (xx kB)
Installing collected packages: openai
Successfully installed openai-1.x.x
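
With the package installed, export the key so the client can read it from the environment, as the setup step describes. The key value below is a placeholder, not a real credential:

```shell
# Set your API key for the current shell session (placeholder value shown)
export OPENAI_API_KEY="sk-your-key-here"

# Confirm it is visible to child processes such as the Python interpreter
echo "Key is ${#OPENAI_API_KEY} characters long"
```

Add the export line to your shell profile (e.g. ~/.bashrc) if you want it to persist across sessions.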

Step by step

This example demonstrates a simple token pruning technique by truncating a long prompt before sending it to the gpt-4o model. It then computes an embedding of the prompt: a fixed-length vector that can stand in for the full text in semantic retrieval workflows.

python
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# Original long prompt
long_prompt = """\
In this document, we discuss the detailed architecture of transformer models, including attention mechanisms, positional encodings, and feed-forward layers. The goal is to optimize the input prompt length to reduce token usage and cost.
"""

# Token pruning: truncate the prompt, approximating tokens by characters
# (roughly 4 characters per token, so 400 characters ≈ 100 tokens; for this
# short demo prompt the slice may not remove anything)
truncated_prompt = long_prompt[:400]  # adjust length as needed

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": truncated_prompt}]
)
print("Response with truncated prompt:", response.choices[0].message.content)

# Embeddings for semantic compression
embedding_response = client.embeddings.create(
    model="text-embedding-3-small",
    input=long_prompt
)
embedding_vector = embedding_response.data[0].embedding
print(f"Embedding vector length: {len(embedding_vector)}")
output
Response with truncated prompt: Transformer models use attention mechanisms to weigh input tokens dynamically, enabling efficient context understanding.
Embedding vector length: 1536

Common variations

  • Use asynchronous calls with asyncio for non-blocking prompt compression workflows.
  • Apply retrieval-augmented generation (RAG) by storing embeddings in a vector database like FAISS and retrieving relevant compressed context.
  • Experiment with different models like gpt-4o-mini for lower-cost inference with compressed prompts.
python
import asyncio
import os

from openai import AsyncOpenAI

async def async_truncate_prompt():
    # AsyncOpenAI (not the synchronous OpenAI client) is required for `await`
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"])
    long_prompt = """Detailed explanation of prompt compression techniques for cost optimization in LLMs."""
    truncated_prompt = long_prompt[:300]
    response = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": truncated_prompt}]
    )
    print("Async response:", response.choices[0].message.content)

asyncio.run(async_truncate_prompt())
output
Async response: Prompt compression reduces token usage by summarizing or encoding input efficiently, saving API costs.
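
The RAG variation can be sketched without a full vector database: embed each stored chunk once, then at query time retrieve only the most similar chunk and send it as compressed context. The sketch below uses hand-written toy vectors in place of real embeddings (which you would obtain from client.embeddings.create as shown earlier); the chunk names and vector values are illustrative only:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy stand-ins for real embedding vectors of three stored chunks
chunk_vectors = {
    "attention mechanisms": [0.9, 0.1, 0.0],
    "positional encodings": [0.1, 0.9, 0.0],
    "feed-forward layers":  [0.0, 0.1, 0.9],
}

query_vector = [0.8, 0.2, 0.1]  # would come from embedding the user query

# Retrieve only the most similar chunk as the compressed context
best_chunk = max(chunk_vectors, key=lambda k: cosine(query_vector, chunk_vectors[k]))
print(best_chunk)  # → attention mechanisms
```

A vector store like FAISS replaces the max-over-dict with an approximate nearest-neighbor index, but the retrieval logic is the same.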

Troubleshooting

  • If responses are incomplete or cut off, increase max_tokens or adjust prompt length.
  • Embedding vector length may vary by model; ensure compatibility with your vector store.
  • For API rate limits, implement exponential backoff retries.

Key Takeaways

  • Truncate or prune prompts to reduce token count and lower API costs.
  • Use embeddings to semantically compress prompts for efficient retrieval and generation.
  • Combine prompt compression with retrieval-augmented generation for scalable cost optimization.
Verified 2026-04 · gpt-4o, gpt-4o-mini, text-embedding-3-small