Concept · Intermediate · 3 min read

What is the MMLU benchmark?

Quick answer
The MMLU (Massive Multitask Language Understanding) benchmark evaluates large language models on 57 diverse academic and professional subjects to measure their knowledge and reasoning abilities. It tests models with multiple-choice questions requiring factual recall and problem-solving across domains like history, math, and law.

How it works

MMLU tests language models with multiple-choice questions drawn from 57 subjects spanning STEM, the humanities, social sciences, and professional fields. For each question, the model must select the correct answer from four options, which probes both factual knowledge and reasoning. Think of it as a comprehensive exam covering a wide curriculum, used to gauge a model's breadth of knowledge and domain expertise.
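
The evaluation loop described above can be sketched in a few lines. This is a minimal illustration, not the official MMLU harness; the items, field names, and the stand-in predictions are all placeholders:

```python
# Minimal sketch of MMLU-style scoring: each item has a question,
# four lettered options, and a gold answer letter; accuracy is the
# fraction of items the model answers correctly.
items = [
    {"question": "What is the derivative of sin(x)?",
     "options": ["cos(x)", "-cos(x)", "sin(x)", "-sin(x)"],
     "answer": "A"},
    {"question": "Which gas makes up most of Earth's atmosphere?",
     "options": ["Oxygen", "Nitrogen", "Argon", "Carbon dioxide"],
     "answer": "B"},
]

def format_prompt(item):
    """Render one item as a question, lettered options, and an 'Answer:' cue."""
    letters = "ABCD"
    lines = [f"Question: {item['question']}", "Options:"]
    lines += [f"{l}) {o}" for l, o in zip(letters, item["options"])]
    lines.append("Answer:")
    return "\n".join(lines)

def score(predictions, items):
    """Accuracy: fraction of predicted letters matching the gold letters."""
    correct = sum(p == it["answer"] for p, it in zip(predictions, items))
    return correct / len(items)

# With a stand-in "model" that always predicts "A":
print(score(["A", "A"], items))  # 0.5
```

In a real run, the predictions list would come from sending each formatted prompt to the model under test.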

Concrete example

Here is a simplified example of how to query a model on an MMLU-style question using the OpenAI API with gpt-4o-mini:

```python
from openai import OpenAI
import os

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

question = "What is the derivative of sin(x)?"
options = ["A) cos(x)", "B) -cos(x)", "C) sin(x)", "D) -sin(x)"]

# Build an MMLU-style prompt: question, lettered options, "Answer:" cue.
prompt = f"Question: {question}\nOptions:\n" + "\n".join(options) + "\nAnswer:"

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

print("Model answer:", response.choices[0].message.content.strip())
```

Output:

```
Model answer: A) cos(x)
```
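
In practice the model's reply is free-form text, so benchmark harnesses typically extract the option letter before comparing it to the gold answer. One simple approach, shown here as a sketch (the helper name and regex are illustrative, not part of any official harness):

```python
import re

def extract_choice(reply):
    """Return the first standalone option letter (A-D) in a reply, or None."""
    m = re.search(r"\b([ABCD])\b", reply)
    return m.group(1) if m else None

print(extract_choice("A) cos(x)"))         # A
print(extract_choice("The answer is B."))  # B
```

More robust setups avoid parsing altogether by comparing the model's log-probabilities for the four letter tokens, but letter extraction is a common lightweight alternative.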

When to use it

Use MMLU to benchmark and compare language models on broad knowledge and reasoning capabilities across many domains. It is ideal for evaluating generalist models intended for academic, professional, or multi-domain tasks. Avoid using it for narrow domain-specific evaluation or tasks requiring open-ended generation, as it focuses on multiple-choice accuracy.
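
Because MMLU spans many subjects, scores are usually reported per subject and then averaged so that large subjects do not dominate. A macro-average over subjects can be sketched as follows (the subject names and per-item results are illustrative):

```python
from collections import defaultdict

# Per-item results tagged with their subject (illustrative data).
results = [
    ("high_school_mathematics", True),
    ("high_school_mathematics", False),
    ("professional_law", True),
    ("professional_law", True),
    ("world_history", False),
]

# Group correctness flags by subject.
by_subject = defaultdict(list)
for subject, correct in results:
    by_subject[subject].append(correct)

# Per-subject accuracy, then an unweighted mean across subjects.
per_subject = {s: sum(v) / len(v) for s, v in by_subject.items()}
macro = sum(per_subject.values()) / len(per_subject)
print(f"macro-average accuracy: {macro:.3f}")
```

This equal weighting per subject is why a model's headline MMLU number can differ from its raw accuracy over all questions pooled together.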

Key terms

MMLU: Massive Multitask Language Understanding, a benchmark for LLM knowledge and reasoning
Multiple-choice: Question format with several answer options, only one of which is correct
Domain: A specific academic or professional subject area tested
Reasoning: The ability to logically solve problems beyond memorization

Key Takeaways

  • MMLU tests LLMs on 57 diverse subjects with multiple-choice questions.
  • It measures both factual knowledge and reasoning ability across domains.
  • Use MMLU for broad, multi-domain model evaluation, not narrow tasks.
Verified 2026-04 · gpt-4o-mini