What is positional encoding in transformers?
Positional encoding is a method for adding information about the order of tokens in a sequence, which transformers need because they process input tokens in parallel without any inherent notion of order. It encodes each token's position as a vector that is added to the token's embedding, allowing the model to capture sequence order.
How it works
Transformers process all tokens simultaneously, unlike RNNs that read tokens sequentially. This parallelism means the model has no built-in sense of token order. Positional encoding solves this by creating a unique vector for each token position and adding it to the token's embedding vector. This combined vector carries both the token's meaning and its position.
Think of it like adding a colored tag to each word in a sentence to mark its place. The model then learns patterns not just from the words but also from their positions.
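A minimal sketch of that combination step, assuming a toy setup with random stand-in embeddings (a real model would use learned embeddings looked up from a vocabulary):

```python
import numpy as np

# Hypothetical setup: 4 tokens, embedding dimension 8.
seq_len, d_model = 4, 8
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(seq_len, d_model))  # stand-in for learned embeddings

# Sinusoidal positional encoding: even dimensions use sin, odd use cos,
# with the frequency shared between each sin/cos pair.
pos = np.arange(seq_len)[:, None]
i = np.arange(d_model)[None, :]
angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
pe = np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# The model's actual input: meaning + position, added element-wise.
model_input = token_embeddings + pe
print(model_input.shape)  # (4, 8)
```

Because the two vectors are simply summed, the rest of the network sees a single input per token that encodes both what the token is and where it sits.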
Concrete example
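The sinusoidal scheme from the original Transformer paper assigns position $pos$ and dimension index $i$ the values

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right),$$

so each sin/cos pair oscillates at its own frequency, from fast (low dimensions) to slow (high dimensions).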
Here is a simplified Python example generating sinusoidal positional encodings as used in the original Transformer paper:
```python
import numpy as np

def get_positional_encoding(seq_len, d_model):
    # pos: column of position indices; i: row of dimension indices
    pos = np.arange(seq_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads = pos * angle_rates
    # apply sin to even indices, cos to odd indices
    pos_encoding = np.zeros(angle_rads.shape)
    pos_encoding[:, 0::2] = np.sin(angle_rads[:, 0::2])
    pos_encoding[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return pos_encoding

# Example usage
seq_len = 5
embedding_dim = 6
pos_encoding = get_positional_encoding(seq_len, embedding_dim)
print(np.round(pos_encoding, 3))
```

Output:

```
[[ 0.     1.     0.     1.     0.     1.   ]
 [ 0.841  0.54   0.046  0.999  0.002  1.   ]
 [ 0.909 -0.416  0.093  0.996  0.004  1.   ]
 [ 0.141 -0.99   0.139  0.99   0.006  1.   ]
 [-0.757 -0.654  0.185  0.983  0.009  1.   ]]
```

Each row is one position's encoding; note how the first column pair oscillates quickly with position while later pairs change slowly.
When to use it
Use positional encoding whenever you apply transformer models to sequence data like text, audio, or time series, where token order matters. It is essential because transformers lack recurrence or convolution to capture order.
Do not use positional encoding if your model or task inherently encodes order (e.g., RNNs) or if the sequence order is irrelevant.
Key terms
| Term | Definition |
|---|---|
| Transformer | A neural network architecture that processes input tokens in parallel using self-attention. |
| Positional encoding | Vectors added to token embeddings to provide information about token positions in a sequence. |
| Token embedding | A vector representation of a token's meaning. |
| Sinusoidal encoding | A type of positional encoding using sine and cosine functions of different frequencies. |
Key takeaways
- Transformers require positional encoding to understand token order since they process tokens in parallel.
- Positional encodings are added to token embeddings to combine meaning and position information.
- Sinusoidal positional encoding is a common, fixed method that generalizes to sequences longer than seen during training.
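The last takeaway can be checked directly: each position's vector depends only on its own index, not on the sequence length, so encodings computed for a longer sequence agree with a shorter one on shared positions. A quick sketch reusing the same function as above:

```python
import numpy as np

def get_positional_encoding(seq_len, d_model):
    pos = np.arange(seq_len)[:, np.newaxis]
    i = np.arange(d_model)[np.newaxis, :]
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    angle_rads = pos * angle_rates
    pos_encoding = np.zeros(angle_rads.shape)
    pos_encoding[:, 0::2] = np.sin(angle_rads[:, 0::2])
    pos_encoding[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return pos_encoding

pe_short = get_positional_encoding(5, 6)
pe_long = get_positional_encoding(50, 6)
# The first 5 rows of the longer encoding are identical to the shorter one:
print(np.allclose(pe_short, pe_long[:5]))  # True
```

This is why a model trained with sinusoidal encodings can, at least in principle, be run on sequences longer than any it saw during training.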