Code beginner · 3 min read

How to use DeepEval in Python

Direct answer
Use the DeepEval workflow in Python by pointing the OpenAI-compatible client at the DeepSeek endpoint (base_url="https://api.deepseek.com"), sending your model output and reference text in a chat completion request, then parsing the returned evaluation score and feedback from the response.

Setup

Install
bash
pip install openai
Env vars
DEEPSEEK_API_KEY
Imports
python
from openai import OpenAI
import os

Examples

In: Evaluate model output 'The cat sat on the mat.' against reference 'A cat is sitting on a mat.'
Out: Score: 0.92, Feedback: Output is semantically close to the reference.
In: Evaluate model output 'The quick brown fox jumps.' against reference 'A fast fox leaps over the lazy dog.'
Out: Score: 0.75, Feedback: Output captures some meaning but misses details.
In: Evaluate model output '' (empty) against reference 'Hello world!'
Out: Score: 0.0, Feedback: Output is empty, no content to evaluate.
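The "Score: …, Feedback: …" strings above are plain text, so you will usually want to split them into a numeric score and a feedback message. A minimal sketch with a regex; `parse_evaluation` is a hypothetical helper name, and it assumes the model actually replies in this exact format:

```python
import re

def parse_evaluation(text: str) -> tuple[float, str]:
    """Split a 'Score: X, Feedback: Y' string into (score, feedback)."""
    match = re.match(r"Score:\s*([0-9.]+),\s*Feedback:\s*(.*)", text)
    if not match:
        raise ValueError(f"Unexpected evaluation format: {text!r}")
    return float(match.group(1)), match.group(2)

score, feedback = parse_evaluation(
    "Score: 0.92, Feedback: Output is semantically close to the reference."
)
```

Raising on an unexpected format is deliberate: a silent default score would hide cases where the model ignored the requested reply format.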

Integration steps

  1. Install the OpenAI Python SDK and set the DEEPSEEK_API_KEY environment variable.
  2. Import the OpenAI client and initialize it with your API key from os.environ and base_url="https://api.deepseek.com".
  3. Prepare the evaluation request by including the model output and reference text in the messages array.
  4. Call the DeepEval model endpoint using client.chat.completions.create with model='deepseek-chat'.
  5. Extract the evaluation score and feedback from the response's choices[0].message.content.
  6. Use or display the evaluation results as needed in your application.

Full code

python
from openai import OpenAI
import os

# Initialize the OpenAI client against the DeepSeek endpoint
client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")

# Define model output and reference
model_output = "The cat sat on the mat."
reference_text = "A cat is sitting on a mat."

# Prepare messages for DeepEval
messages = [
    {"role": "user", "content": f"Evaluate this output: '{model_output}' against reference: '{reference_text}'"}
]

# Call DeepEval via deepseek-chat model
response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages
)

# Extract evaluation result
evaluation = response.choices[0].message.content
print("Evaluation Result:", evaluation)
output
Evaluation Result: Score: 0.92, Feedback: Output is semantically close to the reference.

API trace

Request
json
{"model": "deepseek-chat", "messages": [{"role": "user", "content": "Evaluate this output: 'The cat sat on the mat.' against reference: 'A cat is sitting on a mat.'"}]}
Response
json
{"choices": [{"message": {"content": "Score: 0.92, Feedback: Output is semantically close to the reference."}}], "usage": {"total_tokens": 45}}
Extract: response.choices[0].message.content

Variants

Streaming evaluation output

Use streaming when you want to display evaluation results progressively for better user experience.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")

model_output = "The cat sat on the mat."
reference_text = "A cat is sitting on a mat."

messages = [{"role": "user", "content": f"Evaluate this output: '{model_output}' against reference: '{reference_text}'"}]

response = client.chat.completions.create(
    model="deepseek-chat",
    messages=messages,
    stream=True
)

for chunk in response:
    # In openai>=1.0 the delta is an object, not a dict; content may be None
    print(chunk.choices[0].delta.content or "", end="")
print()
Async evaluation call

Use async calls to integrate DeepEval in applications requiring concurrency or non-blocking behavior.

python
import asyncio
from openai import AsyncOpenAI
import os

async def evaluate_async():
    # AsyncOpenAI provides awaitable versions of the same methods
    client = AsyncOpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")
    model_output = "The cat sat on the mat."
    reference_text = "A cat is sitting on a mat."
    messages = [{"role": "user", "content": f"Evaluate this output: '{model_output}' against reference: '{reference_text}'"}]
    response = await client.chat.completions.create(
        model="deepseek-chat",
        messages=messages
    )
    print("Async Evaluation Result:", response.choices[0].message.content)

asyncio.run(evaluate_async())
Alternative model for evaluation

Use the deepseek-reasoner model for more complex or reasoning-based evaluation tasks.

python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"], base_url="https://api.deepseek.com")

model_output = "The cat sat on the mat."
reference_text = "A cat is sitting on a mat."

messages = [{"role": "user", "content": f"Evaluate this output: '{model_output}' against reference: '{reference_text}'"}]

response = client.chat.completions.create(
    model="deepseek-reasoner",
    messages=messages
)

print("Evaluation with Reasoner model:", response.choices[0].message.content)

Performance

Latency: ~900ms for deepseek-chat non-streaming evaluation
Cost: ~$0.0015 per 100 tokens evaluated
Rate limits: Tier 1: 400 RPM / 25K TPM
  • Keep evaluation prompts concise to reduce token usage.
  • Batch multiple evaluations in one request when possible.
  • Avoid sending unnecessary context to minimize cost.
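Batching several evaluations into one request means one prompt listing all output/reference pairs. A minimal sketch; `build_batch_prompt` is a hypothetical helper, and the numbered-list format is an assumption about what the model will follow:

```python
def build_batch_prompt(pairs):
    """Combine several (output, reference) pairs into a single evaluation prompt."""
    lines = [
        "Evaluate each output against its reference. "
        "Reply with one 'Score: X, Feedback: ...' line per item, in order."
    ]
    for i, (output, reference) in enumerate(pairs, start=1):
        lines.append(f"{i}. Output: '{output}' Reference: '{reference}'")
    return "\n".join(lines)

prompt = build_batch_prompt([
    ("The cat sat on the mat.", "A cat is sitting on a mat."),
    ("The quick brown fox jumps.", "A fast fox leaps over the lazy dog."),
])
```

The resulting string goes into a single user message, so you pay the instruction overhead once instead of per evaluation.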
Approach | Latency | Cost/call | Best for
Standard call | ~900ms | ~$0.0015 | Simple synchronous evaluation
Streaming call | ~900ms (progressive) | ~$0.0015 | Real-time UI feedback
Async call | ~900ms | ~$0.0015 | Concurrent or non-blocking apps
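If you run close to the rate limits above, requests can fail with a 429. A small generic backoff wrapper sketch; `with_backoff` is a hypothetical helper, and the commented usage assumes the openai SDK's `RateLimitError` exception:

```python
import time

def with_backoff(call, retries=3, base_delay=1.0, retryable=(Exception,)):
    """Retry a zero-argument callable with exponential backoff between attempts."""
    for attempt in range(retries):
        try:
            return call()
        except retryable:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * 2 ** attempt)

# Hypothetical wiring against the evaluation call:
# response = with_backoff(
#     lambda: client.chat.completions.create(model="deepseek-chat", messages=messages),
#     retryable=(openai.RateLimitError,),
# )
```

Keeping the wrapper independent of the client makes it trivial to unit-test with a fake callable.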

Quick tip

Always format your evaluation prompt so it names both the output and the reference explicitly, and pin the reply format (e.g. "Score: <0-1>, Feedback: ...") so the result is consistent and easy to parse.
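One way to keep that formatting consistent is to centralize it in a small builder. A sketch; `build_eval_prompt` is a hypothetical helper name, and the pinned reply format is an assumption matching the examples earlier in this article:

```python
def build_eval_prompt(model_output: str, reference_text: str) -> str:
    """Name both texts explicitly and pin the reply format for easy parsing."""
    return (
        f"Evaluate this output: '{model_output}' against reference: '{reference_text}'. "
        "Reply exactly as: Score: <0-1>, Feedback: <one sentence>."
    )

prompt = build_eval_prompt("The cat sat on the mat.", "A cat is sitting on a mat.")
```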

Common mistake

Beginners often forget to set the DEEPSEEK_API_KEY environment variable, or omit base_url="https://api.deepseek.com" when creating the client, causing authentication or model-not-found errors.
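A cheap guard against the missing-key case is to fail fast with an actionable message before creating the client. A minimal sketch; `require_api_key` is a hypothetical helper name:

```python
import os

def require_api_key(name: str = "DEEPSEEK_API_KEY") -> str:
    """Fail fast with a clear message instead of an opaque auth error later."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set; export it before creating the client.")
    return key
```

Then initialize with `OpenAI(api_key=require_api_key(), base_url="https://api.deepseek.com")` so misconfiguration surfaces at startup rather than on the first request.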

Verified 2026-04 · deepseek-chat, deepseek-reasoner