How to run Mistral with Hugging Face in Python
Direct answer
Use the Hugging Face Transformers library to load and run a Mistral model in Python: call AutoModelForCausalLM.from_pretrained with the Mistral model ID, then generate text with pipeline or model.generate.
Setup
Install
pip install transformers torch
Imports
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
Examples
In: Generate a greeting with Mistral
Out: Hello! How can I assist you today?
In: Complete the sentence: "The future of AI is"
Out: The future of AI is incredibly promising, with advancements in natural language understanding and generation.
In: Edge case: Generate text with empty prompt
Out: (empty)
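The empty-prompt edge case is easiest to handle before the model is ever called. Below is a minimal sketch; `safe_generate` and its `generator` argument are hypothetical names standing in for any Transformers text-generation pipeline:

```python
def safe_generate(generator, prompt, **gen_kwargs):
    # Guard against empty or whitespace-only prompts: with nothing to
    # condition on, the model would produce unguided text, so fail fast.
    if not prompt or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    outputs = generator(prompt, **gen_kwargs)
    return outputs[0]["generated_text"]
```

With the pipeline built in the full code section, `safe_generate(text_generator, "Hello")` behaves like a direct call but rejects empty input.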
Integration steps
- Install the transformers and torch libraries
- Import AutoTokenizer and AutoModelForCausalLM from transformers
- Load the Mistral model and tokenizer using from_pretrained with the model ID
- Create a text generation pipeline or use model.generate for inference
- Pass your prompt text to generate output
- Process and display the generated text
Full code
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
# Load the Mistral model and tokenizer from Hugging Face Hub
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
# Create a text generation pipeline
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
# Define prompt
prompt = "The future of AI is"
# Generate text
outputs = text_generator(prompt, max_length=50, do_sample=True, temperature=0.7)
# Print generated text
print(outputs[0]['generated_text'])
Output
The future of AI is incredibly promising, with advancements in natural language understanding and generation.
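Note that text-generation pipelines return the prompt plus the continuation in generated_text. You can pass return_full_text=False to the pipeline call to get only the new text, or strip the prompt yourself; `strip_prompt` below is a hypothetical helper:

```python
def strip_prompt(generated_text, prompt):
    # text-generation output echoes the prompt; drop it to keep only
    # the model's continuation.
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text
```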
API trace
Request
{"model": "mistralai/Mistral-7B-v0.1", "inputs": "The future of AI is", "parameters": {"max_length": 50, "do_sample": true, "temperature": 0.7}}
Response
[{"generated_text": "The future of AI is incredibly promising, with advancements in natural language understanding and generation."}]
Extract
outputs[0]['generated_text']
Variants
Streaming generation with TextStreamer ›
Use when you want token-by-token output as it is generated, or to integrate with custom streaming logic.
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Stream decoded tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7, streamer=streamer)
# The full decoded output is still available afterwards
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Async inference with Hugging Face and asyncio ›
Use for concurrent or asynchronous applications where you want non-blocking calls. model.generate is synchronous, so run it in a worker thread (e.g. asyncio.to_thread) to avoid blocking the event loop, and load the model once rather than on every call.
import asyncio
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

async def generate_async(prompt):
    # Off-load the blocking generate call to a worker thread
    return await asyncio.to_thread(generate, prompt)

async def main():
    text = await generate_async("What is the capital of France?")
    print(text)

asyncio.run(main())
Use the instruction-tuned Mistral variant ›
Use when your prompts are phrased as requests or questions; Mistral-7B-Instruct-v0.1 is the same size as the base model but fine-tuned to follow instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
# Instruct models expect the Mistral chat format ([INST] ... [/INST])
prompt = "[INST] Summarize the benefits of AI. [/INST]"
outputs = text_generator(prompt, max_new_tokens=40)
print(outputs[0]['generated_text'])
Performance
Latency: ~2-5 seconds per 50 tokens on a single high-end GPU
Cost: depends on cloud GPU usage; model inference itself is free on local hardware
Rate limits: none when running locally; cloud-hosted endpoints may have limits
- Use max_new_tokens to cap generated tokens and reduce latency (max_length also counts the prompt)
- Use do_sample=False for deterministic outputs to save compute
- Batch multiple prompts to maximize GPU utilization
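The batching tip can be sketched as follows. Splitting prompts into fixed-size batches is plain Python; the commented lines show how a batch might be fed to the tokenizer and model, assuming padding is configured (e.g. tokenizer.pad_token = tokenizer.eos_token, since Mistral's tokenizer has no pad token by default):

```python
def chunk_prompts(prompts, batch_size):
    # Split a list of prompts into fixed-size batches so each
    # model.generate call processes several sequences at once.
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

# For each batch, a padded tokenizer call lets generate run them together:
#   inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
#   outputs = model.generate(**inputs, max_new_tokens=50)
```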
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard pipeline | ~3s per 50 tokens | Free on local GPU | Simple text generation |
| Streaming generation | ~token latency | Free on local GPU | Real-time token streaming |
| Async inference | ~3s per 50 tokens | Free on local GPU | Concurrent requests |
| Instruct variant (Mistral-7B-Instruct) | ~2-5s per 50 tokens | Free on local GPU | Instruction-following prompts |
Quick tip
Use device_map="auto" and torch_dtype=torch.float16 to optimize Mistral model loading on GPUs.
Common mistake
Forgetting to set device_map="auto" causes the model to load on CPU only, leading to slow inference.
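A quick way to catch this mistake is to check where the loaded model's weights actually live. `report_device` is a hypothetical helper that works with any PyTorch model (with device_map="auto" the model may be sharded; the first parameter's device is a simple heuristic):

```python
def report_device(model):
    # Return the device of the model's first parameter: 'cuda:0' (or
    # similar) means the weights are on GPU; 'cpu' means the slow path.
    return str(next(iter(model.parameters())).device)
```

After from_pretrained(..., device_map="auto") on a GPU machine, report_device(model) should start with "cuda".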