Code beginner · 3 min read

How to use the LM Studio API with Python

Direct answer
LM Studio serves an OpenAI-compatible API from a local server (http://localhost:1234/v1 by default), so use the openai Python client: point it at that base URL and call client.chat.completions.create() with your model name and messages.

Setup

Install
bash
pip install openai
Imports
python
from openai import OpenAI

Before making requests, start LM Studio's local server (from the app's Developer tab, or with the lms server start CLI command).

Examples

In: Hello, how are you?
Out: I'm doing great, thanks for asking! How can I assist you today?
In: Summarize the benefits of AI.
Out: AI improves efficiency, automates tasks, enhances decision-making, and enables new innovations across industries.
In:
Out: Please provide a prompt to generate a response.

Integration steps

  1. Install the openai Python package with pip.
  2. Start LM Studio's local server; it listens on http://localhost:1234 by default and requires no real API key.
  3. Create an OpenAI client with base_url="http://localhost:1234/v1" and call client.chat.completions.create() with the model name and your messages.
  4. Read the reply text from response.choices[0].message.content.

Full code

python
from openai import OpenAI

# Point the OpenAI client at LM Studio's local server
client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

# Define the prompt and model (use the identifier of a model loaded in LM Studio)
prompt = "Hello, how are you?"
model = "llama2"

# Call the LM Studio API
response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
)

# Print the generated response
print("Response:", response.choices[0].message.content)
output
Response: I'm doing great, thanks for asking! How can I assist you today?

API trace

Request
json
{"model": "llama2", "messages": [{"role": "user", "content": "Hello, how are you?"}]}
Response
json
{"choices": [{"message": {"content": "I'm doing great, thanks for asking! How can I assist you today?"}}]}
Extract: response.choices[0].message.content (the Python client returns objects, so use attribute access; in the raw JSON the path is choices[0].message.content)
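The trace above can be exercised without the client library or a running server. This sketch (a plain-Python illustration, not LM Studio code) builds the same request body and extracts the reply from a response dict shaped like the trace:

```python
import json

# Build the same request body shown in the trace
request_body = {
    "model": "llama2",
    "messages": [{"role": "user", "content": "Hello, how are you?"}],
}
payload = json.dumps(request_body)  # what gets POSTed to /v1/chat/completions

# A response shaped like the trace above
response_data = {
    "choices": [
        {"message": {"content": "I'm doing great, thanks for asking! How can I assist you today?"}}
    ]
}

# Extraction path: choices[0].message.content
reply = response_data["choices"][0]["message"]["content"]
print(reply)
```

Working with the raw JSON this way makes it clear why the extraction path is the same whether you call the server with the openai client or a plain HTTP library.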

Variants

Streaming response

Use streaming to display partial results immediately for better user experience with long outputs.

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

prompt = "Tell me a story about a robot."
model = "llama2"

# Stream the response tokens as they arrive
stream = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
Async version

Use async calls to handle multiple concurrent requests efficiently in asynchronous applications.

python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

async def main():
    prompt = "Explain quantum computing in simple terms."
    model = "llama2"
    response = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    print("Response:", response.choices[0].message.content)

asyncio.run(main())
Alternative model

Use a chat-tuned model such as llama2-chat (or whichever chat model you have loaded in LM Studio) for conversational or dialogue-based tasks.

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="lm-studio")

prompt = "Write a poem about spring."
model = "llama2-chat"

response = client.chat.completions.create(
    model=model,
    messages=[{"role": "user", "content": prompt}],
)
print("Response:", response.choices[0].message.content)

Performance

Latency: ~500 ms to 1 s per request, depending on model size and prompt length
Cost: LM Studio runs models on your own machine, so no per-call usage costs apply.
Rate limits: none; the server runs locally on your hardware.
  • Keep prompts concise to reduce token usage.
  • Use smaller or more heavily quantized models for faster responses.
  • Cache frequent queries to avoid repeated calls.
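Caching frequent queries can be as simple as memoizing a wrapper function. A minimal sketch using functools.lru_cache, where ask_model is a hypothetical stand-in for the actual API call:

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def ask_model(prompt: str) -> str:
    # Hypothetical stand-in for a client.chat.completions.create() call;
    # the real request would go here. Repeated prompts hit the cache instead.
    return f"reply to: {prompt}"

ask_model("Hello")                  # computed on the first call
ask_model("Hello")                  # served from the cache, no second request
print(ask_model.cache_info().hits)  # → 1
```

Note that lru_cache only works for exact repeats of hashable arguments; prompts that differ by even one character miss the cache.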
Approach | Latency | Cost/call | Best for
Standard sync call | ~500 ms-1 s | Free | Simple synchronous use cases
Streaming response | Starts immediately, total ~1 s | Free | Long outputs with better UX
Async call | ~500 ms-1 s | Free | Concurrent requests in async apps

Quick tip

LM Studio's local server requires no real API key; the openai client insists on one, so pass any placeholder string (e.g. api_key="lm-studio").

Common mistake

Beginners often hunt for an LM Studio API key to export as an environment variable, but the local server performs no authentication; the api_key you pass to the client is an unused placeholder.

Verified 2026-04 · llama2, llama2-chat