Code beginner · 3 min read

How to use ctransformers in Python

Direct answer
Install the ctransformers package, load a quantized GGML/GGUF model with AutoModelForCausalLM.from_pretrained(), and call the returned model object with your prompt to generate text.

Setup

Install
bash
pip install ctransformers
Imports
python
from ctransformers import AutoModelForCausalLM

Examples

In: Hello, how are you?
Out: I'm doing great, thank you! How can I assist you today?
In: Write a Python function to add two numbers.
Out: def add(a, b): return a + b
In: Explain the concept of recursion.
Out: Recursion is a programming technique where a function calls itself to solve smaller instances of a problem until a base case is reached.
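The pairs above can be reproduced by looping prompts through a loaded model. In this sketch, generate is a stand-in returning canned answers so the loop shape is clear; with a real model it would simply call llm(prompt):

```python
def generate(prompt: str) -> str:
    # Stand-in for a loaded ctransformers model call, e.g. llm(prompt)
    canned = {
        "Hello, how are you?": "I'm doing great, thank you! How can I assist you today?",
        "Explain the concept of recursion.": "Recursion is a programming technique where a "
        "function calls itself to solve smaller instances of a problem until a base case is reached.",
    }
    return canned.get(prompt, "...")

for prompt in ["Hello, how are you?", "Explain the concept of recursion."]:
    print("In: ", prompt)
    print("Out:", generate(prompt))
```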

Integration steps

  1. Install ctransformers with pip.
  2. Import AutoModelForCausalLM from ctransformers.
  3. Load a model with AutoModelForCausalLM.from_pretrained(), passing model_type when it cannot be inferred from the repo.
  4. Call the model object with your prompt to get a completion string.
  5. Process and display the returned text.

Full code

python
from ctransformers import AutoModelForCausalLM

# Load a quantized chat model from the Hugging Face Hub
# (downloaded and cached on first use)
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGML", model_type="llama"
)

# Define the prompt
prompt = "Hello, how are you?"

# Generate a completion; the model object is callable and returns a str
response = llm(prompt, max_new_tokens=100)

# Print the model's response
print("Model response:", response)
output
Model response: I'm doing great, thank you! How can I assist you today?

API trace

ctransformers runs in-process, so there is no HTTP request or JSON payload; the "trace" is an ordinary Python call.

Request
python
llm("Hello, how are you?", max_new_tokens=100)
Response
python
"I'm doing great, thank you! How can I assist you today?"
Extract: the call returns a plain str, so no field access is needed.
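ctransformers can cut generation at stop sequences for you via the stop parameter, e.g. llm(prompt, stop=["&lt;/s&gt;"]). The equivalent post-processing on an already-returned string is a short helper; the default stop strings below are illustrative, not fixed by the library:

```python
def truncate_at_stop(text: str, stops=("</s>", "\n\nUser:")) -> str:
    # Cut the generated text at the first stop sequence found, then trim whitespace
    for stop in stops:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text.strip()
```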

Variants

Streaming response version

Use streaming when you want to display the model's output token-by-token for better user experience on long responses.

python
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGML", model_type="llama"
)
prompt = "Tell me a joke."

# Pass stream=True to get a generator that yields text as it is produced
for token in llm(prompt, stream=True):
    print(token, end='', flush=True)
print()
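When streaming, you usually want the full response afterwards as well. A minimal sketch of collecting chunks while printing them; the fake iterator stands in for llm(prompt, stream=True):

```python
def consume_stream(token_iter):
    # Print each streamed chunk as it arrives while keeping a copy,
    # then return the assembled response
    pieces = []
    for token in token_iter:
        print(token, end='', flush=True)
        pieces.append(token)
    print()
    return ''.join(pieces)

# Stand-in for llm("Tell me a joke.", stream=True), which yields text chunks
fake_stream = iter(["Why ", "did ", "the ", "chicken..."])
full_response = consume_stream(fake_stream)
```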
Async version

ctransformers has no native async API; wrap the blocking call with asyncio.to_thread to integrate inference into asynchronous applications or to handle multiple concurrent requests without blocking the event loop.

python
import asyncio
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-Chat-GGML", model_type="llama"
)

async def main():
    prompt = "Explain quantum computing in simple terms."
    # Run the blocking call in a worker thread so the event loop stays responsive
    response = await asyncio.to_thread(llm, prompt)
    print("Async response:", response)

asyncio.run(main())
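The same asyncio.to_thread pattern scales to several prompts at once with asyncio.gather. Here slow_generate is a stand-in for a blocking llm(prompt) call so the sketch runs without a model:

```python
import asyncio
import time

def slow_generate(prompt: str) -> str:
    # Stand-in for a blocking llm(prompt) call
    time.sleep(0.05)
    return f"reply to: {prompt}"

async def run_all(prompts):
    # Each blocking call runs in its own worker thread; gather awaits them all
    tasks = [asyncio.to_thread(slow_generate, p) for p in prompts]
    return await asyncio.gather(*tasks)

results = asyncio.run(run_all(["a", "b", "c"]))
print(results)
```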
Alternative model usage

Use a larger quantized model such as a Llama-2-13B chat build when you need more advanced reasoning or domain-specific knowledge; larger models need correspondingly more RAM.

python
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-chat-GGML", model_type="llama"
)
prompt = "Summarize the latest AI trends."
response = llm(prompt, max_new_tokens=200)
print("Summary:", response)
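Generation behavior is tuned with keyword arguments at call time; max_new_tokens, temperature, top_p, and repetition_penalty are all accepted by ctransformers. A sketch of bundling defaults so every call shares them (fake_llm in the usage note stands in for a loaded model):

```python
# Default generation settings; all four are keyword arguments ctransformers accepts
GEN_KWARGS = dict(max_new_tokens=200, temperature=0.7, top_p=0.95, repetition_penalty=1.1)

def generate(llm, prompt: str, **overrides) -> str:
    # Merge per-call overrides over the defaults and invoke the model
    kwargs = {**GEN_KWARGS, **overrides}
    return llm(prompt, **kwargs)
```

Usage: generate(llm, "Summarize the latest AI trends.", temperature=0.2) lowers the temperature for that one call while keeping the other defaults.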

Performance

Latency: depends on model size, quantization, and hardware; expect anywhere from hundreds of milliseconds to several seconds per completion on CPU
Cost: free when running locally; there are no cloud usage fees
Rate limits: none enforced; throughput is bounded only by your hardware
  • Keep prompts concise to reduce token usage and latency.
  • Reuse loaded model instances instead of reloading for each request.
  • Use streaming to start displaying output before full completion.
Approach | Latency | Cost/call | Best for
Standard call | model-dependent | Free (local) | Simple synchronous calls
Streaming call | same total, first tokens sooner | Free (local) | Long responses with better UX
Async wrapper | model-dependent | Free (local) | Concurrent or async apps

Quick tip

Load a model once with from_pretrained() and reuse the object for multiple prompts; loading the weights is by far the most expensive step.
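One way to enforce this is a cached loader; load_model below is a stub standing in for AutoModelForCausalLM.from_pretrained, which is the expensive call in practice:

```python
from functools import lru_cache

def load_model(repo_id: str):
    # Stand-in for AutoModelForCausalLM.from_pretrained(repo_id);
    # the real call downloads and loads the weights
    return object()

@lru_cache(maxsize=None)
def get_model(repo_id: str):
    # Each repo id is loaded at most once per process; later calls reuse the instance
    return load_model(repo_id)
```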

Common mistake

Beginners often omit model_type when loading a model file whose architecture ctransformers cannot infer from the repo; pass it explicitly (e.g. model_type='llama') to avoid load errors.
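A hypothetical helper that guesses model_type from a filename before falling back to asking the user; the hint-to-type mapping here is illustrative and not part of ctransformers:

```python
def guess_model_type(filename: str) -> str:
    # Map common model-file naming hints to ctransformers model_type values
    name = filename.lower()
    for hint, mtype in [("llama", "llama"), ("gpt2", "gpt2"),
                        ("falcon", "falcon"), ("mpt", "mpt")]:
        if hint in name:
            return mtype
    raise ValueError(f"cannot infer model_type from {filename!r}; pass it explicitly")
```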

Verified 2026-04 · ctransformers