How to use ctransformers in python
client.model.load(), and calling model.chat() with your prompt.Setup
pip install ollama import ollama Examples
Integration steps
- Install the Ollama Python SDK with pip.
- Import ollama and initialize the client (no API key needed).
- Load the ctransformers model using ollama.chat(model='ctransformers').
- Call the model's chat method with your prompt to get a response.
- Process and display the output from the model.
Full code
import ollama
# Load the ctransformers model
model = ollama.chat(model='ctransformers')
# Define the prompt
prompt = "Hello, how are you?"
# Get the chat completion from the model
response = model.chat(prompt)
# Print the model's response
print("Model response:", response.text) Model response: I'm doing great, thank you! How can I assist you today?
API trace
{"model": "ctransformers", "prompt": "Hello, how are you?"} {"text": "I'm doing great, thank you! How can I assist you today?", "usage": {"tokens": 15}} response.textVariants
Streaming response version ›
Use streaming when you want to display the model's output token-by-token for better user experience on long responses.
import ollama
model = ollama.chat(model='ctransformers')
prompt = "Tell me a joke."
# Stream the response tokens as they arrive
for token in model.chat_stream(prompt):
print(token, end='', flush=True)
print() Async version ›
Use async calls to integrate model inference into asynchronous applications or to handle multiple concurrent requests efficiently.
import asyncio
import ollama
async def main():
model = ollama.chat(model='ctransformers')
prompt = "Explain quantum computing in simple terms."
response = await model.chat_async(prompt)
print("Async response:", response.text)
asyncio.run(main()) Alternative model usage ›
Use a different model like 'llama-3.1-70b' when you need more advanced reasoning or domain-specific knowledge.
import ollama
model = ollama.chat(model='llama-3.1-70b')
prompt = "Summarize the latest AI trends."
response = model.chat(prompt)
print("Summary:", response.text) Performance
- Keep prompts concise to reduce token usage and latency.
- Reuse loaded model instances instead of reloading for each request.
- Use streaming to start displaying output before full completion.
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard chat call | ~500ms | Free (local) | Simple synchronous calls |
| Streaming chat | ~500ms + stream | Free (local) | Long responses with better UX |
| Async chat | ~500ms | Free (local) | Concurrent or async apps |
Quick tip
Always load the ctransformers model once and reuse it for multiple prompts to reduce latency and improve efficiency.
Common mistake
Beginners often try to set an OLLAMA_API_KEY environment variable, but Ollama requires no authentication and runs locally.