What is beam search in language models
Beam search is a decoding algorithm that allows language models to generate sequences by keeping track of the top k most probable candidates at each step, rather than just the single most likely token. This strikes a balance between greedy decoding and exhaustive search, improving output quality without excessive computation.
How it works
Beam search works by maintaining a fixed number (beam width) of the most probable partial sequences at each step of generation. Instead of choosing only the single best next token (greedy search), it expands all candidates in the beam and keeps the top k sequences based on cumulative probability. This is like exploring multiple paths in a maze simultaneously, pruning less promising routes early to efficiently find the best overall path.
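Formally, each candidate is ranked by its cumulative log-probability, the sum of the per-token log-probabilities along the sequence; this is the standard formulation rather than anything implementation-specific:

$$
\text{score}(y_{1:t}) = \sum_{i=1}^{t} \log P(y_i \mid y_{<i})
$$

Working in log space turns products of probabilities into sums, which avoids numerical underflow on long sequences.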
Concrete example
Suppose a language model predicts the next-token log probabilities at one step as follows. The Python snippet below walks through a single beam-search step:
```python
beam_width = 2

# Current beams with their cumulative log probabilities
beams = [(['I'], -0.1), (['You'], -0.2)]

# Next-token log probabilities for each beam (token: log prob)
next_tokens = {
    'I':   {'am': -0.3, 'like': -1.0, 'love': -0.5},
    'You': {'are': -0.2, 'like': -0.4, 'love': -1.2},
}

# Expand every beam with every candidate next token
all_candidates = []
for seq, score in beams:
    last_word = seq[-1]
    for token, token_score in next_tokens[last_word].items():
        candidate = (seq + [token], score + token_score)
        all_candidates.append(candidate)

# Select the top beam_width candidates (higher log prob is better)
all_candidates.sort(key=lambda x: x[1], reverse=True)
beams = all_candidates[:beam_width]

for seq, score in beams:
    print(f"Sequence: {' '.join(seq)}, Score: {score}")
```

Output (the two survivors tie at -0.4; Python's stable sort keeps 'I am' first):

```
Sequence: I am, Score: -0.4
Sequence: You are, Score: -0.4
```
When to use it
Use beam search when you want higher-quality generated text than greedy decoding but cannot afford the computational cost of exhaustive search. It is ideal for tasks like machine translation, summarization, and dialogue generation where multiple plausible outputs exist. Avoid beam search if you need very fast, low-latency generation or if diversity in outputs is more important than likelihood.
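As one practical illustration, the Hugging Face transformers library exposes beam search through the num_beams argument of generate. A minimal sketch, assuming the publicly available t5-small translation checkpoint:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Assumes the publicly available t5-small checkpoint
tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

inputs = tokenizer("translate English to German: The house is small.",
                   return_tensors="pt")

# num_beams > 1 switches generate() from greedy decoding to beam search;
# early_stopping=True ends the search once num_beams finished candidates exist
outputs = model.generate(**inputs, num_beams=4, early_stopping=True,
                         max_new_tokens=40)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Setting num_beams=1 recovers greedy decoding, which is the faster choice when latency matters more than output quality.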
Key terms
| Term | Definition |
|---|---|
| Beam width | Number of candidate sequences kept at each step during beam search. |
| Greedy search | Decoding method selecting the single most probable token at each step. |
| Heuristic search | An approach that explores a subset of possibilities to find a good solution efficiently. |
| Cumulative probability | The combined probability of a sequence of tokens, often computed in log space for numerical stability (see the sketch after this table). |
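To see why cumulative probabilities are usually handled in log space, consider this small self-contained check: multiplying many small probabilities underflows to zero in double precision, while summing their logarithms stays stable.

```python
import math

probs = [1e-5] * 100

# Multiplying 100 probabilities of 1e-5 gives 1e-500, far below the
# smallest positive double (~1e-308), so the product underflows to 0.0
product = 1.0
for p in probs:
    product *= p
print(product)      # 0.0

# Summing log probabilities keeps the same ranking information intact
log_score = sum(math.log(p) for p in probs)
print(log_score)    # ≈ -1151.29
```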
Key takeaways
- Beam search balances quality and efficiency by exploring multiple candidate sequences simultaneously.
- Increasing beam width can improve output quality, but computation cost grows roughly linearly with the width and the quality gains diminish for large beams.
- Use beam search for tasks requiring coherent, high-probability text generation like translation or summarization.