Code beginner · 3 min read

How to use ctransformers in Python

Direct answer
Install the ctransformers package, load a quantized GGML/GGUF model with AutoModelForCausalLM.from_pretrained(), and call the returned model object with your prompt to generate text.

Setup

Install
bash
pip install ctransformers
Imports
python
from ctransformers import AutoModelForCausalLM

Examples

In: Hello, how are you?
Out: I'm doing great, thank you! How can I assist you today?
In: Write a Python function to add two numbers.
Out: def add(a, b): return a + b
In: Explain the concept of recursion.
Out: Recursion is a programming technique where a function calls itself to solve smaller instances of a problem until a base case is reached.
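The pairs above can be reproduced by looping prompts through a loaded model. In this sketch, generate is a stand-in returning canned answers so the loop shape is clear; with a real model it would simply call llm(prompt):

```python
def generate(prompt: str) -> str:
    # Stand-in for a loaded ctransformers model call, e.g. llm(prompt)
    canned = {
        "Hello, how are you?": "I'm doing great, thank you! How can I assist you today?",
        "Explain the concept of recursion.": "Recursion is a programming technique where a "
        "function calls itself to solve smaller instances of a problem until a base case is reached.",
    }
    return canned.get(prompt, "...")

for prompt in ["Hello, how are you?", "Explain the concept of recursion."]:
    print("In: ", prompt)
    print("Out:", generate(prompt))
```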

Integration steps

  1. Install ctransformers with pip.
  2. Import AutoModelForCausalLM from ctransformers.
  3. Load a model with AutoModelForCausalLM.from_pretrained(), passing model_type when it cannot be inferred from the repo.
  4. Call the model object with your prompt to get a completion string.
  5. Process and display the returned text.

Full code

python
from ctransformers import AutoModelForCausalLM

# Load a quantized chat model from the Hugging Face Hub
# (downloaded and cached on first use)
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGML", model_type="llama"
)

# Define the prompt
prompt = "Hello, how are you?"

# Generate a completion; the model object is callable and returns a str
response = llm(prompt, max_new_tokens=100)

# Print the model's response
print("Model response:", response)
output
Model response: I'm doing great, thank you! How can I assist you today?

API trace

ctransformers runs in-process, so there is no HTTP request or JSON payload; the "trace" is an ordinary Python call.

Request
python
llm("Hello, how are you?", max_new_tokens=100)
Response
python
"I'm doing great, thank you! How can I assist you today?"
Extract: the call returns a plain str, so no field access is needed.
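ctransformers can cut generation at stop sequences for you via the stop parameter, e.g. llm(prompt, stop=["&lt;/s&gt;"]). The equivalent post-processing on an already-returned string is a short helper; the default stop strings below are illustrative, not fixed by the library:

```python
def truncate_at_stop(text: str, stops=("</s>", "\n\nUser:")) -> str:
    # Cut the generated text at the first stop sequence found, then trim whitespace
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text.strip()
```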

Variants

Streaming response version

Use streaming when you want to display the model's output token-by-token for better user experience on long responses.

python
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGML", model_type="llama"
)
prompt = "Tell me a joke."

# Pass stream=True to get a generator that yields text as it is produced
for token in llm(prompt, stream=True):
    print(token, end='', flush=True)
print()
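When streaming, you usually want the full response afterwards as well. A minimal sketch of collecting chunks while printing them; the fake iterator stands in for llm(prompt, stream=True):

```python
def consume_stream(token_iter):
    # Print each streamed chunk as it arrives while keeping a copy,
    # then return the assembled response
    pieces = []
    for token in token_iter:
        print(token, end='', flush=True)
        pieces.append(token)
    print()
    return ''.join(pieces)

# Stand-in for llm("Tell me a joke.", stream=True), which yields text chunks
fake_stream = iter(["Why ", "did ", "the ", "chicken..."])
full_response = consume_stream(fake_stream)
```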
Async version

ctransformers has no native async API; wrap the blocking call with asyncio.to_thread to integrate inference into asynchronous applications or to handle multiple concurrent requests without blocking the event loop.

python
import asyncio
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGML", model_type="llama"
)

async def main():
    prompt = "Explain quantum computing in simple terms."
    # Run the blocking call in a worker thread so the event loop stays responsive
    response = await asyncio.to_thread(llm, prompt)
    print("Async response:", response)

asyncio.run(main())
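The same asyncio.to_thread pattern scales to several prompts at once with asyncio.gather. Here slow_generate is a stand-in for a blocking llm(prompt) call so the sketch runs without a model:

```python
import asyncio
import time

def slow_generate(prompt: str) -> str:
    # Stand-in for a blocking llm(prompt) call
    time.sleep(0.05)
    return f"reply to: {prompt}"

async def run_all(prompts):
    # Each blocking call runs in its own worker thread; gather awaits them all
    tasks = [asyncio.to_thread(slow_generate, p) for p in prompts]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all(["a", "b", "c"]))
print(results)
```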
Alternative model usage

Use a larger quantized model such as a Llama-2-13B chat build when you need more advanced reasoning or domain-specific knowledge; larger models need correspondingly more RAM.

python
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-chat-GGML", model_type="llama"
)
prompt = "Summarize the latest AI trends."
response = llm(prompt, max_new_tokens=200)
print("Summary:", response)
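Generation behavior is tuned with keyword arguments at call time; max_new_tokens, temperature, top_p, and repetition_penalty are all accepted by ctransformers. A sketch of bundling defaults so every call shares them (fake_llm in the usage note stands in for a loaded model):

```python
# Default generation settings; all four are keyword arguments ctransformers accepts
GEN_KWARGS = dict(max_new_tokens=200, temperature=0.7, top_p=0.95, repetition_penalty=1.1)

def generate(llm, prompt: str, **overrides) -> str:
    # Merge per-call overrides over the defaults and invoke the model
    kwargs = {**GEN_KWARGS, **overrides}
    return llm(prompt, **kwargs)
```

Usage: generate(llm, "Summarize the latest AI trends.", temperature=0.2) lowers the temperature for that one call while keeping the other defaults.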

Performance

Latency: depends on model size, quantization, and hardware; expect anywhere from hundreds of milliseconds to several seconds per completion on CPU
Cost: free when running locally; there are no cloud usage fees
Rate limits: none enforced; throughput is bounded only by your hardware
  • Keep prompts concise to reduce token usage and latency.
  • Reuse loaded model instances instead of reloading for each request.
  • Use streaming to start displaying output before full completion.
Approach | Latency | Cost/call | Best for
Standard call | model-dependent | Free (local) | Simple synchronous calls
Streaming call | same total, first tokens sooner | Free (local) | Long responses with better UX
Async wrapper | model-dependent | Free (local) | Concurrent or async apps

Quick tip

Load a model once with from_pretrained() and reuse the object for multiple prompts; loading the weights is by far the most expensive step.
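One way to enforce this is a cached loader; load_model below is a stub standing in for AutoModelForCausalLM.from_pretrained, which is the expensive call in practice:

```python
from functools import lru_cache

def load_model(repo_id: str):
    # Stand-in for AutoModelForCausalLM.from_pretrained(repo_id);
    # the real call downloads and loads the weights
    return object()

@lru_cache(maxsize=None)
def get_model(repo_id: str):
    # Each repo id is loaded at most once per process; later calls reuse the instance
    return load_model(repo_id)
```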

Common mistake

Beginners often omit model_type when loading a model file whose architecture ctransformers cannot infer from the repo; pass it explicitly (e.g. model_type='llama') to avoid load errors.
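A hypothetical helper that guesses model_type from a filename before falling back to asking the user; the hint-to-type mapping here is illustrative and not part of ctransformers:

```python
def guess_model_type(filename: str) -> str:
    # Map common model-file naming hints to ctransformers model_type values
    name = filename.lower()
    for hint, mtype in [("llama", "llama"), ("gpt2", "gpt2"),
                        ("falcon", "falcon"), ("mpt", "mpt")]:
        if hint in name:
            return mtype
    raise ValueError(f"cannot infer model_type from {filename!r}; pass it explicitly")
```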

Verified 2026-04 · ctransformers