How to run Mistral with Hugging Face in Python
Direct answer
Use the Hugging Face Transformers library to load and run a Mistral model in Python: call AutoModelForCausalLM.from_pretrained with the Mistral model ID, then generate text with pipeline or model.generate.
Setup
Install
pip install transformers torch
Imports
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
Examples
In: Generate a greeting with Mistral
Out: Hello! How can I assist you today?
In: Complete the sentence: "The future of AI is"
Out: The future of AI is incredibly promising, with advancements in natural language understanding and generation.
In: Edge case: Generate text with empty prompt
Out: (empty)
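The empty-prompt edge case is easiest to handle before the model is ever called. Below is a minimal sketch; `safe_generate` and its `generator` argument are hypothetical names standing in for any Transformers text-generation pipeline:

```python
def safe_generate(generator, prompt, **gen_kwargs):
    # Guard against empty or whitespace-only prompts: with nothing to
    # condition on, the model would produce unguided text, so fail fast.
    if not prompt or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    outputs = generator(prompt, **gen_kwargs)
    return outputs[0]["generated_text"]
```

With the pipeline built in the full code section, `safe_generate(text_generator, "Hello")` behaves like a direct call but rejects empty input.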
Integration steps
- Install the transformers and torch libraries
- Import AutoTokenizer and AutoModelForCausalLM from transformers
- Load the Mistral model and tokenizer using from_pretrained with the model ID
- Create a text generation pipeline or use model.generate for inference
- Pass your prompt text to generate output
- Process and display the generated text
Full code
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
# Load the Mistral model and tokenizer from Hugging Face Hub
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
# Create a text generation pipeline
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
# Define prompt
prompt = "The future of AI is"
# Generate text
outputs = text_generator(prompt, max_length=50, do_sample=True, temperature=0.7)
# Print generated text
print(outputs[0]['generated_text'])
Output
The future of AI is incredibly promising, with advancements in natural language understanding and generation.
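Note that text-generation pipelines return the prompt plus the continuation in generated_text. You can pass return_full_text=False to the pipeline call to get only the new text, or strip the prompt yourself; `strip_prompt` below is a hypothetical helper:

```python
def strip_prompt(generated_text, prompt):
    # text-generation output echoes the prompt; drop it to keep only
    # the model's continuation.
    if generated_text.startswith(prompt):
        return generated_text[len(prompt):].lstrip()
    return generated_text
```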
API trace
Request
{"model": "mistralai/Mistral-7B-v0.1", "inputs": "The future of AI is", "parameters": {"max_length": 50, "do_sample": true, "temperature": 0.7}}
Response
[{"generated_text": "The future of AI is incredibly promising, with advancements in natural language understanding and generation."}]
Extract
outputs[0]['generated_text']
Variants
Streaming generation with TextStreamer ›
Use when you want token-by-token output as it is generated, or to integrate with custom streaming logic.
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer
import torch
model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
prompt = "Explain quantum computing in simple terms."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Stream decoded tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7, streamer=streamer)
# The full decoded output is still available afterwards
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Async inference with Hugging Face and asyncio ›
Use for concurrent or asynchronous applications where you want non-blocking calls. model.generate is synchronous, so run it in a worker thread (e.g. asyncio.to_thread) to avoid blocking the event loop, and load the model once rather than on every call.
import asyncio
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_name = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

def generate(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, temperature=0.7)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

async def generate_async(prompt):
    # Off-load the blocking generate call to a worker thread
    return await asyncio.to_thread(generate, prompt)

async def main():
    text = await generate_async("What is the capital of France?")
    print(text)

asyncio.run(main())
Use the instruction-tuned Mistral variant ›
Use when your prompts are phrased as requests or questions; Mistral-7B-Instruct-v0.1 is the same size as the base model but fine-tuned to follow instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
model_name = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")
text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
# Instruct models expect the Mistral chat format ([INST] ... [/INST])
prompt = "[INST] Summarize the benefits of AI. [/INST]"
outputs = text_generator(prompt, max_new_tokens=40)
print(outputs[0]['generated_text'])
Performance
Latency: ~2-5 seconds per 50 tokens on a single high-end GPU
Cost: depends on cloud GPU usage; model inference itself is free on local hardware
Rate limits: none when running locally; cloud-hosted endpoints may have limits
- Use max_new_tokens to cap generated tokens and reduce latency (max_length also counts the prompt)
- Use do_sample=False for deterministic outputs to save compute
- Batch multiple prompts to maximize GPU utilization
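The batching tip can be sketched as follows. Splitting prompts into fixed-size batches is plain Python; the commented lines show how a batch might be fed to the tokenizer and model, assuming padding is configured (e.g. tokenizer.pad_token = tokenizer.eos_token, since Mistral's tokenizer has no pad token by default):

```python
def chunk_prompts(prompts, batch_size):
    # Split a list of prompts into fixed-size batches so each
    # model.generate call processes several sequences at once.
    return [prompts[i:i + batch_size] for i in range(0, len(prompts), batch_size)]

# For each batch, a padded tokenizer call lets generate run them together:
#   inputs = tokenizer(batch, return_tensors="pt", padding=True).to(model.device)
#   outputs = model.generate(**inputs, max_new_tokens=50)
```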
| Approach | Latency | Cost/call | Best for |
|---|---|---|---|
| Standard pipeline | ~3s per 50 tokens | Free on local GPU | Simple text generation |
| Streaming generation | ~token latency | Free on local GPU | Real-time token streaming |
| Async inference | ~3s per 50 tokens | Free on local GPU | Concurrent requests |
| Instruct variant (Mistral-7B-Instruct) | ~2-5s per 50 tokens | Free on local GPU | Instruction-following prompts |
Quick tip
Use device_map="auto" and torch_dtype=torch.float16 to optimize Mistral model loading on GPUs.
Common mistake
Forgetting to set device_map="auto" causes the model to load on CPU only, leading to slow inference.
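A quick way to catch this mistake is to check where the loaded model's weights actually live. `report_device` is a hypothetical helper that works with any PyTorch model (with device_map="auto" the model may be sharded; the first parameter's device is a simple heuristic):

```python
def report_device(model):
    # Return the device of the model's first parameter: 'cuda:0' (or
    # similar) means the weights are on GPU; 'cpu' means the slow path.
    return str(next(iter(model.parameters())).device)
```

After from_pretrained(..., device_map="auto") on a GPU machine, report_device(model) should start with "cuda".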