What is the MMLU benchmark?
MMLU (Massive Multitask Language Understanding) is a benchmark that evaluates large language models on 57 diverse academic and professional subjects to measure their knowledge and reasoning abilities. It tests models with multiple-choice questions requiring factual recall and problem-solving across domains such as history, math, and law.
How it works
MMLU tests language models by presenting them with multiple-choice questions from 57 subjects, including STEM, humanities, and professional fields. Each question requires the model to select the correct answer from several options, assessing both factual knowledge and reasoning skills. Think of it as a comprehensive exam that covers a wide curriculum to evaluate a model's general intelligence and domain expertise.
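The evaluation loop described above can be sketched in a few lines: present each question with its lettered options, collect the model's choice, and report the fraction answered correctly. This is a minimal sketch; the example questions, gold answers, and the `always_a` stub "model" below are hypothetical stand-ins, not part of the real MMLU dataset.

```python
def score(examples, pick_answer):
    """Return the fraction of questions a model answers correctly."""
    correct = 0
    for ex in examples:
        # pick_answer returns the chosen letter, e.g. "A"
        if pick_answer(ex["question"], ex["options"]) == ex["answer"]:
            correct += 1
    return correct / len(examples)

# Hypothetical MMLU-style items, each with one correct lettered option.
examples = [
    {"question": "What is the derivative of sin(x)?",
     "options": ["A) cos(x)", "B) -cos(x)", "C) sin(x)", "D) -sin(x)"],
     "answer": "A"},
    {"question": "2 + 2 = ?",
     "options": ["A) 3", "B) 4", "C) 5", "D) 22"],
     "answer": "B"},
]

# A stub "model" that always picks A, for illustration only.
always_a = lambda question, options: "A"

print(score(examples, always_a))  # 0.5 (one of two correct)
```

In a real harness, `pick_answer` would wrap an API call or a local model; accuracy is then aggregated per subject and across all 57 subjects.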
Concrete example
Here is a simplified example of how to query a model on an MMLU-style question using the OpenAI API with gpt-4o-mini:
from openai import OpenAI
import os

# Read the API key from the environment rather than hardcoding it.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

question = "What is the derivative of sin(x)?"
options = ["A) cos(x)", "B) -cos(x)", "C) sin(x)", "D) -sin(x)"]

# Build a single prompt listing the question and its lettered options.
prompt = f"Question: {question}\nOptions:\n" + "\n".join(options) + "\nAnswer:"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)
print("Model answer:", response.choices[0].message.content.strip())
# Example output: Model answer: A) cos(x)
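Model replies are free-form text, so a scoring harness needs to extract the chosen letter before comparing it to the gold answer. One simple, hedged approach (a sketch, not the official MMLU extraction rule) is to take the first standalone A-D letter in the reply:

```python
import re

def extract_choice(reply):
    """Pull the first standalone A-D letter from a free-form model reply."""
    match = re.search(r"\b([ABCD])\b", reply)
    return match.group(1) if match else None

print(extract_choice("Model answer: A) cos(x)"))  # A
print(extract_choice("The answer is (C)."))       # C
print(extract_choice("not sure"))                 # None
```

Published MMLU results often sidestep this parsing problem by comparing the log-probabilities the model assigns to the four option letters directly, which avoids ambiguous free-form replies.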
When to use it
Use MMLU to benchmark and compare language models on broad knowledge and reasoning capabilities across many domains. It is ideal for evaluating generalist models intended for academic, professional, or multi-domain tasks. Avoid using it for narrow domain-specific evaluation or tasks requiring open-ended generation, as it focuses on multiple-choice accuracy.
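When comparing models this way, a common headline number is the unweighted mean of per-subject accuracies. The sketch below uses made-up accuracy values and hypothetical model names purely to illustrate the comparison:

```python
# Hypothetical per-subject accuracies for two models (illustrative only).
results = {
    "model_a": {"history": 0.81, "math": 0.62, "law": 0.74},
    "model_b": {"history": 0.77, "math": 0.70, "law": 0.69},
}

def macro_average(per_subject):
    """Unweighted mean accuracy across subjects."""
    return sum(per_subject.values()) / len(per_subject)

for name, scores in results.items():
    print(f"{name}: {macro_average(scores):.3f}")
```

Note that a macro average weights every subject equally regardless of how many questions it contains; averaging over all questions instead (a micro average) can give a slightly different ranking.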
Key terms
| Term | Definition |
|---|---|
| MMLU | Massive Multitask Language Understanding benchmark for LLM knowledge and reasoning |
| Multiple-choice | Question format with several answer options, only one correct |
| Domain | Specific academic or professional subject area tested |
| Reasoning | Ability to logically solve problems beyond memorization |
Key Takeaways
- MMLU tests LLMs on 57 diverse subjects with multiple-choice questions.
- It measures both factual knowledge and reasoning ability across domains.
- Use MMLU for broad, multi-domain model evaluation, not narrow tasks.