Code beginner · 3 min read

How to use Ollama for embeddings in Python

Direct answer
Use the Ollama Python client's embeddings endpoint with an embedding model such as nomic-embed-text, then read the returned vector from the response's embedding field.

Setup

Install
bash
pip install ollama
ollama pull nomic-embed-text
Pull an embedding model once before the first call; the Ollama server must also be running locally.
Imports
python
import ollama

Examples

In: Generate embeddings for the text: 'OpenAI develops advanced AI models.'
Out: [0.123, -0.045, 0.987, ...] # vector of floats representing the embedding
In: Get embeddings for 'Python is great for AI development.'
Out: [0.234, -0.056, 0.876, ...]
In: Embedding for empty string ''
Out: model-dependent; an empty prompt may return an arbitrary vector or an error, so validate inputs before embedding

Integration steps

  1. Install the Ollama Python SDK.
  2. Import the Ollama module.
  3. Call the embeddings endpoint with the input text as the prompt.
  4. Read the embedding vector from the response's embedding field.
  5. Use the embedding vector for downstream tasks like similarity search or classification.

Full code

python
import ollama

# Define the text to embed
text_to_embed = "OpenAI develops advanced AI models."

# Call the embeddings endpoint (pull the model first: ollama pull nomic-embed-text)
response = ollama.embeddings(
    model="nomic-embed-text",
    prompt=text_to_embed
)

# Extract the embedding vector from the response
embedding_vector = response["embedding"]

print("Embedding vector:", embedding_vector)
output
Embedding vector: [0.123456, -0.045678, 0.987654, ...]
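Step 5 of the integration mentions similarity search; a minimal cosine-similarity sketch over two already-retrieved vectors (pure Python, no Ollama call needed; the short sample vectors are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors; in practice pass response["embedding"] values
v1 = [0.1, 0.3, 0.5]
v2 = [0.2, 0.1, 0.4]
print(round(cosine_similarity(v1, v1), 4))  # identical vectors score 1.0
print(round(cosine_similarity(v1, v2), 4))
```

Scores close to 1.0 mean the texts are semantically similar; this is the core operation behind vector search.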

API trace

Request
json
{"model": "nomic-embed-text", "prompt": "OpenAI develops advanced AI models."}
Response
json
{"embedding": [0.123456, -0.045678, 0.987654, ...]}
Extract: response['embedding']
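The same trace can be reproduced without the SDK as a plain HTTP POST to the local server's embeddings endpoint (this sketch assumes Ollama's default port 11434; the helper name embed_via_http is mine):

```python
import json
import urllib.request

def embed_via_http(text, model="nomic-embed-text", host="http://localhost:11434"):
    """POST to the local Ollama embeddings endpoint and return the vector."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    request = urllib.request.Request(
        host + "/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["embedding"]

# Requires a running Ollama server:
# vector = embed_via_http("OpenAI develops advanced AI models.")
```

Useful when you want embeddings from a language or environment without an Ollama SDK.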

Variants

Chunked embeddings for large texts

The embeddings endpoint does not stream, so for very long inputs split the text into chunks and embed each one separately; every vector is available as soon as its call returns.

python
import ollama

text = "A long document that exceeds the model's context window. " * 100
chunk_size = 500  # characters; tune to your model's context window

chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
vectors = []
for chunk in chunks:
    response = ollama.embeddings(model="nomic-embed-text", prompt=chunk)
    vectors.append(response["embedding"])
    print("Embedded chunk of", len(chunk), "characters")
Async embeddings call

Use async calls to embed multiple texts concurrently or integrate into async web frameworks.

python
import asyncio
from ollama import AsyncClient

async def get_embedding():
    # AsyncClient exposes the same endpoints as the sync module-level API
    response = await AsyncClient().embeddings(
        model="nomic-embed-text",
        prompt="Async embedding call example."
    )
    print("Async embedding vector:", response["embedding"])

asyncio.run(get_embedding())
Alternative model for smaller embeddings

Use a smaller embedding model to reduce latency and memory use when top accuracy is not critical.

python
import ollama

text = "Generate smaller embeddings for faster processing."
response = ollama.embeddings(
    model="all-minilm",  # compact embedding model from the Ollama library
    prompt=text
)
print("Small embedding vector:", response["embedding"])

Performance

Latency: ~300 ms per embedding call for typical short text (hardware-dependent)
Cost: Free; Ollama runs locally
Rate limits: None; throughput is bounded by your hardware
  • Keep input text concise to reduce token usage.
  • Batch multiple texts into one request if your SDK version supports it.
  • Use smaller embedding models for less critical tasks.
Approach | Latency | Cost/call | Best for
Standard embedding call | ~300 ms | Free | General-purpose embeddings
Chunked embeddings | Varies with chunk count | Free | Large texts
Async embedding calls | ~300 ms per call, concurrent | Free | High throughput or async apps
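The batching tip above can be sketched as a small helper; embed_batch and fake_embed are my names, and the stand-in function only illustrates the shape so the example runs without a server (newer ollama-python versions also expose a batch-capable embed endpoint that accepts a list of inputs; check your version):

```python
def embed_batch(texts, embed_fn):
    """Embed each text with a single-text embedding function.

    embed_fn would wrap a real call, e.g.
    lambda t: ollama.embeddings(model="nomic-embed-text", prompt=t)["embedding"]
    """
    return [embed_fn(t) for t in texts]

# Stand-in embed function so the example runs without a server:
fake_embed = lambda t: [float(len(t)), 0.0]
print(embed_batch(["hi", "hello"], fake_embed))  # → [[2.0, 0.0], [5.0, 0.0]]
```

Swapping the stand-in for a real Ollama call keeps the calling code unchanged when you later move to true single-request batching.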

Quick tip

Normalize embedding vectors to unit length after retrieval so that dot-product search behaves like cosine similarity.
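A minimal L2-normalization sketch for the tip above (pure Python; swap in your retrieved vector):

```python
import math

def l2_normalize(vector):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        return vector  # leave zero vectors untouched
    return [x / norm for x in vector]

normalized = l2_normalize([3.0, 4.0])
print(normalized)  # → [0.6, 0.8]
```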

Common mistake

Beginners often look for an API key, but Ollama runs locally and needs none; when a call fails, the usual cause is a server that is not running or a model that has not been pulled, not authentication.

Verified 2026-04