How to use the LM Studio API with Python
Direct answer
LM Studio serves loaded models through a local, OpenAI-compatible API (default base URL http://localhost:1234/v1). Point the openai Python client at that URL and call client.chat.completions.create() with your model name and messages.
Setup
Install
pip install openai
Imports
from openai import OpenAI
Examples
In: Hello, how are you?
Out: I'm doing great, thanks for asking! How can I assist you today?
In: Summarize the benefits of AI.
Out: AI improves efficiency, automates tasks, enhances decision-making, and enables new innovations across industries.
In: (empty prompt)
Out: Please provide a prompt to generate a response.
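The empty-prompt case above is client-side validation rather than server behavior; a minimal sketch of that guard (the respond function and its fallback reply are illustrative assumptions):

```python
def respond(prompt: str) -> str:
    """Return a validation message for empty input instead of calling the server."""
    if not prompt.strip():
        return "Please provide a prompt to generate a response."
    # A real implementation would forward the prompt to the LM Studio server here.
    return f"(model reply to: {prompt})"

print(respond(""))  # → Please provide a prompt to generate a response.
```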
Integration steps
- Install the openai Python package with pip.
- Start the LM Studio local server (Developer tab in the app); it listens on http://localhost:1234 by default and performs no authentication, so no real API key is needed.
- Create an OpenAI client with base_url="http://localhost:1234/v1" and call client.chat.completions.create() with the model name and your messages.
- Read the generated text from response.choices[0].message.content.
Full code
from openai import OpenAI

# Point the client at LM Studio's local server; the API key can be any placeholder string
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Define the prompt and model (use the identifier of a model loaded in LM Studio)
prompt = "Hello, how are you?"
model = "llama2"

# Call the LM Studio chat completions endpoint
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
)

# Print the generated response
print("Response:", response.choices[0].message.content)

Output
Response: I'm doing great, thanks for asking! How can I assist you today?
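Because the server speaks plain HTTP, the same call can be made without any SDK. A standard-library sketch, assuming LM Studio's default port; build_payload, extract_content, and chat are illustrative helper names:

```python
import json
import urllib.request

BASE_URL = "http://localhost:1234/v1"  # LM Studio's default local address

def build_payload(model: str, prompt: str) -> dict:
    """Build the JSON body for a /v1/chat/completions request."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def extract_content(response_json: dict) -> str:
    """Pull the generated text out of an OpenAI-style response."""
    return response_json["choices"][0]["message"]["content"]

def chat(model: str, prompt: str) -> str:
    """POST the payload to the local server and return the reply text."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(model, prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_content(json.load(resp))
```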
API trace
Request
POST http://localhost:1234/v1/chat/completions
{"model": "llama2", "messages": [{"role": "user", "content": "Hello, how are you?"}]}
Response
{"choices": [{"message": {"content": "I'm doing great, thanks for asking! How can I assist you today?"}}]}
Extract
response.choices[0].message.content
Variants
Streaming response ›
Use streaming to display partial results immediately, which improves the user experience for long outputs.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
prompt = "Tell me a story about a robot."
model = "llama2"

# Stream the response tokens as they arrive
stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
Async version ›
Use async calls to handle multiple concurrent requests efficiently in asynchronous applications.
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

async def main():
    prompt = "Explain quantum computing in simple terms."
    model = "llama2"
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print("Response:", response.choices[0].message.content)

asyncio.run(main())
Alternative model ›
Use a chat-tuned model for conversational or dialogue-based tasks.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")
prompt = "Write a poem about spring."
model = "llama2-chat"  # use the identifier of a chat-tuned model loaded in LM Studio
response = client.chat.completions.create(model=model, messages=[{"role": "user", "content": prompt}])
print("Response:", response.choices[0].message.content)
Performance
Latency: ~500 ms to 1 s per request, depending on model size, hardware, and prompt length.
Cost: LM Studio runs models locally, so no per-request usage costs apply.
Rate limits: none; the server runs on your own hardware.
- Keep prompts concise to reduce token usage.
- Use smaller or more heavily quantized models for faster responses.
- Cache frequent queries to avoid repeated calls.
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard sync call | ~500ms-1s | Free | Simple synchronous use cases |
| Streaming response | Starts immediately, total ~1s | Free | Long outputs with better UX |
| Async call | ~500ms-1s | Free | Concurrent requests in async apps |
Quick tip
LM Studio's local server performs no authentication; the openai client requires an api_key argument, but any placeholder string (for example "lm-studio") works, and no environment variable needs to be set.
Common mistake
Beginners often forget to override <code>base_url</code>, so their requests go to api.openai.com instead of the local server. Always pass <code>base_url="http://localhost:1234/v1"</code> when constructing the client; no real API key is needed.
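When a request fails, first check that the server is actually up. A quick way is to list the models the server exposes via the OpenAI-compatible /v1/models endpoint (model_ids and list_models are illustrative helper names; the port is LM Studio's default):

```python
import json
import urllib.request

def model_ids(models_json: dict) -> list:
    """Extract model identifiers from an OpenAI-style /v1/models response."""
    return [m["id"] for m in models_json.get("data", [])]

def list_models(base_url: str = "http://localhost:1234/v1") -> list:
    """Return the identifiers of models the local LM Studio server exposes."""
    with urllib.request.urlopen(f"{base_url}/models") as resp:
        return model_ids(json.load(resp))
```

If list_models() raises a connection error, the server is not running or is listening on a different port.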