Code beginner · 3 min read

How to use Ollama for embeddings in Python

Direct answer
Use the Ollama Python client's embeddings endpoint with an embedding model such as nomic-embed-text, then read the returned vector from the response's embedding field.

Setup

Install
bash
pip install ollama
ollama pull nomic-embed-text
Pull an embedding model once before the first call; the Ollama server must also be running locally.
Imports
python
import ollama

Examples

In: Generate embeddings for the text: 'OpenAI develops advanced AI models.'
Out: [0.123, -0.045, 0.987, ...] # vector of floats representing the embedding
In: Get embeddings for 'Python is great for AI development.'
Out: [0.234, -0.056, 0.876, ...]
In: Embedding for empty string ''
Out: model-dependent; an empty prompt may return an arbitrary vector or an error, so validate inputs before embedding

Integration steps

  1. Install the Ollama Python SDK.
  2. Import the Ollama module.
  3. Call the embeddings endpoint with the input text as the prompt.
  4. Read the embedding vector from the response's embedding field.
  5. Use the embedding vector for downstream tasks like similarity search or classification.

Full code

python
import ollama

# Define the text to embed
text_to_embed = "OpenAI develops advanced AI models."

# Call the embeddings endpoint (pull the model first: ollama pull nomic-embed-text)
response = ollama.embeddings(
    model="nomic-embed-text",
    prompt=text_to_embed
)

# Extract the embedding vector from the response
embedding_vector = response["embedding"]

print("Embedding vector:", embedding_vector)
output
Embedding vector: [0.123456, -0.045678, 0.987654, ...]
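Step 5 of the integration mentions similarity search; a minimal cosine-similarity sketch over two already-retrieved vectors (pure Python, no Ollama call needed; the short sample vectors are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors; in practice pass response["embedding"] values
v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.1, 0.4]
print(round(cosine_similarity(v1, v1), 4))  # identical vectors score 1.0
print(round(cosine_similarity(v1, v2), 4))
```

Scores close to 1.0 mean the texts are semantically similar; this is the core operation behind vector search.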

API trace

Request
json
{"model": "nomic-embed-text", "prompt": "OpenAI develops advanced AI models."}
Response
json
{"embedding": [0.123456, -0.045678, 0.987654, ...]}
Extract: response['embedding']
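The same trace can be reproduced without the SDK as a plain HTTP POST to the local server's embeddings endpoint (this sketch assumes Ollama's default port 11434; the helper name embed_via_http is mine):

```python
import json
import urllib.request

def embed_via_http(text, model="nomic-embed-text", host="http://localhost:11434"):
    """POST to the local Ollama embeddings endpoint and return the vector."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        host + "/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["embedding"]

# Requires a running Ollama server:
# vector = embed_via_http("OpenAI develops advanced AI models.")
```

Useful when you want embeddings from a language or environment without an Ollama SDK.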

Variants

Chunked embeddings for large texts

The embeddings endpoint does not stream, so for very long inputs split the text into chunks and embed each one separately; every vector is available as soon as its call returns.

python
import ollama

text = "A long document that exceeds the model's context window. " * 100
chunk_size = 500  # characters; tune to your model's context window

chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
vectors = []
for chunk in chunks:
    response = ollama.embeddings(model="nomic-embed-text", prompt=chunk)
    vectors.append(response["embedding"])
    print("Embedded chunk of", len(chunk), "characters")
Async embeddings call

Use async calls to embed multiple texts concurrently or integrate into async web frameworks.

python
import asyncio
from ollama import AsyncClient

async def get_embedding():
    # AsyncClient exposes the same endpoints as the sync module-level API
    response = await AsyncClient().embeddings(
        model="nomic-embed-text",
        prompt="Async embedding call example."
    )
    print("Async embedding vector:", response["embedding"])

asyncio.run(get_embedding())
Alternative model for smaller embeddings

Use a smaller embedding model to reduce latency and memory use when top accuracy is not critical.

python
import ollama

text = "Generate smaller embeddings for faster processing."
response = ollama.embeddings(
    model="all-minilm",  # compact embedding model from the Ollama library
    prompt=text
)
print("Small embedding vector:", response["embedding"])

Performance

Latency: ~300 ms per embedding call for typical short text (hardware-dependent)
Cost: Free; Ollama runs locally
Rate limits: None; throughput is bounded by your hardware
  • Keep input text concise to reduce token usage.
  • Batch multiple texts into one request if your SDK version supports it.
  • Use smaller embedding models for less critical tasks.
Approach | Latency | Cost/call | Best for
Standard embedding call | ~300 ms | Free | General-purpose embeddings
Chunked embeddings | Varies with chunk count | Free | Large texts
Async embedding calls | ~300 ms per call, concurrent | Free | High throughput or async apps
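The batching tip above can be sketched as a small helper; embed_batch and fake_embed are my names, and the stand-in function only illustrates the shape so the example runs without a server (newer ollama-python versions also expose a batch-capable embed endpoint that accepts a list of inputs; check your version):

```python
def embed_batch(texts, embed_fn):
    """Embed each text with a single-text embedding function.

    embed_fn would wrap a real call, e.g.
    lambda t: ollama.embeddings(model="nomic-embed-text", prompt=t)["embedding"]
    """
    return [embed_fn(t) for t in texts]

# Stand-in embed function so the example runs without a server:
fake_embed = lambda t: [float(len(t)), 0.0]
print(embed_batch(["hi", "hello"], fake_embed))  # → [[2.0, 0.0], [5.0, 0.0]]
```

Swapping the stand-in for a real Ollama call keeps the calling code unchanged when you later move to true single-request batching.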

Quick tip

Normalize embedding vectors to unit length after retrieval so that dot-product search behaves like cosine similarity.
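A minimal L2-normalization sketch for the tip above (pure Python; swap in your retrieved vector):

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return vector  # leave zero vectors untouched
    return [x / norm for x in vector]

normalized = l2_normalize([3.0, 4.0])
print(normalized)  # → [0.6, 0.8]
```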

Common mistake

Beginners often look for an API key, but Ollama runs locally and needs none; when a call fails, the usual cause is a server that is not running or a model that has not been pulled, not authentication.

Verified 2026-04