Code intermediate · 3 min read

How to run Mistral with Hugging Face in Python

Direct answer
Use the Hugging Face Transformers library to load Mistral in Python: call AutoModelForCausalLM.from_pretrained with the Mistral model ID, then generate text with a pipeline or model.generate.

Setup

Install
bash
pip install transformers torch accelerate  # accelerate is required for device_map="auto"
Imports
python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

Examples

In: Generate a greeting with Mistral
Out: Hello! How can I assist you today?
In: Complete the sentence: "The future of AI is"
Out: The future of AI is incredibly promising, with advancements in natural language understanding and generation.
In: Edge case: generate text with an empty prompt
Out: (empty or unpredictable output)
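Since an empty prompt yields empty or unpredictable output, it is worth rejecting such input before invoking the model. A minimal sketch using a hypothetical `validate_prompt` helper (not part of Transformers):

```python
def validate_prompt(prompt: str) -> str:
    """Reject empty or whitespace-only prompts before generation."""
    if not prompt or not prompt.strip():
        raise ValueError("Prompt must be a non-empty string")
    return prompt.strip()

print(validate_prompt("  The future of AI is  "))  # The future of AI is
```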

Integration steps

  1. Install the transformers and torch libraries
  2. Import AutoTokenizer and AutoModelForCausalLM from transformers
  3. Load the Mistral model and tokenizer using from_pretrained with the model ID
  4. Create a text generation pipeline or use model.generate for inference
  5. Pass your prompt text to generate output
  6. Process and display the generated text

Full code

python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

# Load the Mistral model and tokenizer from Hugging Face Hub
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

# Create a text generation pipeline
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# Define prompt
prompt = "The future of AI is"

# Generate text
outputs = text_generator(prompt, max_length=50, do_sample=True, temperature=0.7)

# Print generated text
print(outputs[0]['generated_text'])
output
The future of AI is incredibly promising, with advancements in natural language understanding and generation.

API trace

Request
json
{"model": "mistralai/Mistral-7B-v0.1", "inputs": "The future of AI is", "parameters": {"max_length": 50, "do_sample": true, "temperature": 0.7}}
Response
json
[{"generated_text": "The future of AI is incredibly promising, with advancements in natural language understanding and generation."}]
Extract: outputs[0]['generated_text']
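The response is a JSON list of objects, each carrying a generated_text field, so extraction is a single index-and-key lookup. A sketch against the response shape shown in the trace above (the response string here is illustrative):

```python
import json

# Response body in the shape returned by a text-generation endpoint
response_body = '[{"generated_text": "The future of AI is incredibly promising."}]'

outputs = json.loads(response_body)
generated = outputs[0]["generated_text"]
print(generated)  # The future of AI is incredibly promising.
```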

Variants

Streaming generation with TextStreamer

Use when you want more control over token-by-token generation or to integrate with custom streaming logic.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# TextStreamer prints tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7, streamer=streamer)

text = tokenizer.decode(outputs[0], skip_special_tokens=True)  # full text, if needed afterwards
Async inference with Hugging Face and asyncio

Use for concurrent or asynchronous applications where you want non-blocking calls.

python
import asyncio
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load once at module level so each call does not reload the 7B weights
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def generate_blocking(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

async def generate_async(prompt):
    # model.generate is synchronous; run it in a worker thread so the event loop stays free
    return await asyncio.to_thread(generate_blocking, prompt)

async def main():
    text = await generate_async("What is the capital of France?")
    print(text)

asyncio.run(main())
Use the instruction-tuned Mistral variant

Use when you want a model tuned to follow instructions: Mistral-7B-Instruct-v0.1 is the same size as the base model, but responds much better to direct prompts and questions.

python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
# Mistral instruct models expect the [INST] ... [/INST] chat format
prompt = "[INST] Summarize the benefits of AI. [/INST]"
outputs = text_generator(prompt, max_new_tokens=40)
print(outputs[0]['generated_text'])
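Rather than hand-writing the [INST] tags, prefer the tokenizer's apply_chat_template method, which also inserts special tokens for you. A pure-Python sketch of the single-turn instruct format, for illustration only:

```python
def format_mistral_instruct(user_message: str) -> str:
    """Wrap a single user turn in Mistral's [INST] ... [/INST] instruct format.

    Sketch only; prefer tokenizer.apply_chat_template, which also handles
    special tokens and multi-turn conversations.
    """
    return f"[INST] {user_message} [/INST]"

prompt = format_mistral_instruct("Summarize the benefits of AI.")
print(prompt)  # [INST] Summarize the benefits of AI. [/INST]
```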

Performance

Latency: ~2-5 seconds per 50 tokens on a single high-end GPU
Cost: depends on cloud GPU usage; model inference itself is free on local hardware
Rate limits: none when running locally; cloud-hosted endpoints may have limits
  • Use max_length to limit output tokens and reduce latency
  • Use do_sample=False for deterministic outputs to save compute
  • Batch multiple prompts to maximize GPU utilization
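The batching tip above amounts to passing a list of prompts to the pipeline (it accepts a batch_size argument). A pure-Python sketch of chunking prompts into batches; the helper name and batch size are illustrative:

```python
def chunk_prompts(prompts, batch_size):
    """Split a list of prompts into fixed-size batches for GPU-friendly inference."""
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

prompts = ["Prompt A", "Prompt B", "Prompt C", "Prompt D", "Prompt E"]
batches = chunk_prompts(prompts, batch_size=2)
print(batches)  # [['Prompt A', 'Prompt B'], ['Prompt C', 'Prompt D'], ['Prompt E']]
```

Each batch can then be passed to the pipeline in one call, keeping the GPU saturated instead of idling between single prompts.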
| Approach | Latency | Cost/call | Best for |
| --- | --- | --- | --- |
| Standard pipeline | ~3 s per 50 tokens | Free on local GPU | Simple text generation |
| Streaming generation | Per-token latency | Free on local GPU | Real-time token streaming |
| Async inference | ~3 s per 50 tokens | Free on local GPU | Concurrent requests |
| Instruct variant | ~1.5 s per 50 tokens | Free on local GPU | Instruction-following prompts |

Quick tip

Use device_map="auto" and torch_dtype=torch.float16 to optimize Mistral model loading on GPUs.

Common mistake

Omitting device_map="auto" loads the model on CPU by default, which makes 7B-scale inference extremely slow.

Verified 2026-04 · mistralai/Mistral-7B-v0.1, mistralai/Mistral-7B-Instruct-v0.1