What is positional encoding in transformers?
Positional encoding is a method for adding information about the order of tokens in a sequence, which transformers need because they process input tokens in parallel without any inherent notion of order. It encodes each token's position as a vector that is added to the token's embedding, allowing the model to capture sequence order.
How it works
Transformers process all tokens simultaneously, unlike RNNs that read tokens sequentially. This parallelism means the model has no built-in sense of token order. Positional encoding solves this by creating a unique vector for each token position and adding it to the token's embedding vector. This combined vector carries both the token's meaning and its position.
Think of it like adding a colored tag to each word in a sentence to mark its place. The model then learns patterns not just from the words but also from their positions.
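A minimal sketch of that combination step, assuming a toy setup with random stand-in embeddings (a real model would use learned embeddings looked up from a vocabulary):

```python
import numpy as np

# Hypothetical setup: 4 tokens, embedding dimension 8.
seq_len, d_model = 4, 8
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(seq_len, d_model))  # stand-in for learned embeddings

# Sinusoidal positional encoding: even dimensions use sin, odd use cos,
# with the frequency shared between each sin/cos pair.
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model)[None, :]
angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# The model's actual input: meaning + position, added element-wise.
model_input = token_embeddings + pe
print(model_input.shape)  # (4, 8)
```

Because the two vectors are simply summed, the rest of the network sees a single input per token that encodes both what the token is and where it sits.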
Concrete example
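The sinusoidal scheme from the original Transformer paper assigns position $pos$ and dimension index $i$ the values

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),$$

so each sin/cos pair oscillates at its own frequency, from fast (low dimensions) to slow (high dimensions).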
Here is a simplified Python example generating sinusoidal positional encodings as used in the original Transformer paper:
```python
import numpy as np

def get_positional_encoding(seq_len, d_model):
    # pos: column of position indices; i: row of dimension indices
    pos = np.arange(seq_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads = pos * angle_rates
    # apply sin to even indices, cos to odd indices
    pos_encoding = np.zeros(angle_rads.shape)
    pos_encoding[:, 0::2] = np.sin(angle_rads[:, 0::2])
    pos_encoding[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return pos_encoding

# Example usage
seq_len = 5
embedding_dim = 6
pos_encoding = get_positional_encoding(seq_len, embedding_dim)
print(np.round(pos_encoding, 3))
```

Output:

```
[[ 0.     1.     0.     1.     0.     1.   ]
 [ 0.841  0.54   0.046  0.999  0.002  1.   ]
 [ 0.909 -0.416  0.093  0.996  0.004  1.   ]
 [ 0.141 -0.99   0.139  0.99   0.006  1.   ]
 [-0.757 -0.654  0.185  0.983  0.009  1.   ]]
```

Each row is one position's encoding; note how the first column pair oscillates quickly with position while later pairs change slowly.
When to use it
Use positional encoding whenever you apply transformer models to sequence data like text, audio, or time series, where token order matters. It is essential because transformers lack recurrence or convolution to capture order.
Do not use positional encoding if your model or task inherently encodes order (e.g., RNNs) or if the sequence order is irrelevant.
Key terms
| Term | Definition |
|---|---|
| Transformer | A neural network architecture that processes input tokens in parallel using self-attention. |
| Positional encoding | Vectors added to token embeddings to provide information about token positions in a sequence. |
| Token embedding | A vector representation of a token's meaning. |
| Sinusoidal encoding | A type of positional encoding using sine and cosine functions of different frequencies. |
Key takeaways
- Transformers require positional encoding to understand token order since they process tokens in parallel.
- Positional encodings are added to token embeddings to combine meaning and position information.
- Sinusoidal positional encoding is a common, fixed method that generalizes to sequences longer than seen during training.
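The last takeaway can be checked directly: each position's vector depends only on its own index, not on the sequence length, so encodings computed for a longer sequence agree with a shorter one on shared positions. A quick sketch reusing the same function as above:

```python
import numpy as np

def get_positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads = pos * angle_rates
    pos_encoding = np.zeros(angle_rads.shape)
    pos_encoding[:, 0::2] = np.sin(angle_rads[:, 0::2])
    pos_encoding[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return pos_encoding

pe_short = get_positional_encoding(5, 6)
pe_long = get_positional_encoding(50, 6)
# The first 5 rows of the longer encoding are identical to the shorter one:
print(np.allclose(pe_short, pe_long[:5]))  # True
```

This is why a model trained with sinusoidal encodings can, at least in principle, be run on sequences longer than any it saw during training.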