Concept · Intermediate · 3 min read

What is the MMLU benchmark?

Quick answer
MMLU (Massive Multitask Language Understanding) is a benchmark that evaluates language models on multiple-choice questions across 57 academic and professional subjects, from history and law to STEM, to measure their broad knowledge and reasoning capabilities.

How it works

The MMLU benchmark works by presenting a language model with multiple-choice questions spanning 57 subjects, including humanities, STEM, and professional fields. Think of it like a giant academic exam where each subject is a different test section. The model must select the correct answer from several options, testing both factual knowledge and reasoning.

Imagine a student taking a standardized test with sections on math, history, and law. The MMLU benchmark measures how well the AI "student" performs across all these subjects, revealing its generalist capabilities rather than expertise in just one area.
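The scoring this implies can be sketched in a few lines of Python. The question set and model answers below are invented stand-ins for illustration, not items from the actual MMLU dataset:

```python
# A minimal sketch of MMLU-style scoring over a tiny hand-written
# question set (illustrative only, not real MMLU items).
questions = [
    {"subject": "math", "question": "What is 7 * 8?",
     "choices": {"A": "54", "B": "56", "C": "58", "D": "64"}, "answer": "B"},
    {"subject": "history", "question": "In which year did World War II end?",
     "choices": {"A": "1943", "B": "1944", "C": "1945", "D": "1946"}, "answer": "C"},
    {"subject": "law", "question": "What does 'habeas corpus' protect against?",
     "choices": {"A": "Double jeopardy", "B": "Unlawful detention",
                 "C": "Self-incrimination", "D": "Cruel punishment"}, "answer": "B"},
]

# Stand-in for real model calls: hard-coded predictions per subject.
model_predictions = {"math": "B", "history": "C", "law": "A"}

def score(questions, predictions):
    """Return overall accuracy and per-subject accuracy."""
    per_subject = {}
    for q in questions:
        correct = predictions[q["subject"]] == q["answer"]
        hits, total = per_subject.get(q["subject"], (0, 0))
        per_subject[q["subject"]] = (hits + int(correct), total + 1)
    overall = sum(h for h, _ in per_subject.values()) / len(questions)
    return overall, {s: h / t for s, (h, t) in per_subject.items()}

overall, by_subject = score(questions, model_predictions)
print(f"Overall accuracy: {overall:.2f}")  # 2 of 3 correct -> 0.67
```

The real benchmark does exactly this bookkeeping at scale: one accuracy per subject, plus an aggregate across all 57.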

Concrete example

Here is a simplified example of an MMLU-style question and how a model might be evaluated:

```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

# An MMLU-style multiple-choice question. Asking for just the letter
# makes the reply easy to score.
question = (
    "Which of the following is a prime number?\n"
    "A) 15\nB) 17\nC) 21\nD) 27\n"
    "Answer with only the letter of the correct choice."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question}],
)

answer = response.choices[0].message.content.strip()
print(f"Model answer: {answer}")
```

output (exact wording may vary between runs):

```
Model answer: B
```

When to use it

Use the MMLU benchmark when you want to evaluate a language model's broad academic knowledge and reasoning across many domains, especially for research or model comparison. It is not suitable for testing conversational skills or domain-specific fine-tuned models focused on narrow tasks.

For example, use MMLU to benchmark a new general-purpose LLM or to compare models' capabilities in professional and academic knowledge.
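Such a comparison often reduces per-subject accuracies to a macro-average per model. The numbers below are hypothetical, purely to illustrate the bookkeeping:

```python
# Hypothetical per-subject accuracies for two models
# (illustrative numbers, not real MMLU results).
scores = {
    "model-a": {"history": 0.72, "law": 0.61, "physics": 0.55},
    "model-b": {"history": 0.68, "law": 0.70, "physics": 0.66},
}

# Macro-average: the unweighted mean of per-subject accuracies,
# so small subjects count as much as large ones.
for model, by_subject in scores.items():
    macro_avg = sum(by_subject.values()) / len(by_subject)
    print(f"{model}: macro-average accuracy {macro_avg:.3f}")
    # e.g. model-b: macro-average accuracy 0.680
```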

Key terms

  • MMLU: the Massive Multitask Language Understanding benchmark, which evaluates LLMs across 57 academic subjects.
  • Multiple-choice questions: questions with several answer options where only one is correct.
  • Language model: an AI model trained to understand and generate human language.
  • Reasoning: the ability to apply logic and knowledge to answer questions correctly.

Key Takeaways

  • MMLU tests language models on a wide range of academic subjects using multiple-choice questions.
  • It measures both factual knowledge and reasoning skills across 57 diverse domains.
  • Use MMLU to benchmark generalist LLMs, not for conversational or narrowly specialized tasks.
Verified 2026-04 · gpt-4o