If you have used ChatGPT, Claude, or any modern language model, you have used a transformer. It is the neural network architecture introduced in the 2017 paper "Attention Is All You Need," and within two years it had replaced almost everything that came before it in natural language processing. Understanding it does not require heavy mathematics. It requires understanding one idea well: attention.
The problem before transformers
Before 2017, the dominant way to process text was the recurrent neural network, or RNN, and its more capable variant the LSTM. These models read a sentence one word at a time, left to right, holding a running summary of what they had seen so far in a hidden state.
This had two structural problems. First, because each step depended on the hidden state produced by the step before it, the work within a sequence could not be parallelized. You could not process word ten until you had processed words one through nine. On modern hardware built for parallel computation, this was painfully slow. Second, the running summary tended to lose information. By the time the model reached the end of a long passage, the beginning had faded. Connecting a pronoun in the last sentence to the noun it referred to five sentences earlier was exactly the kind of thing these models struggled with.
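To make the sequential bottleneck concrete, here is a minimal sketch of a plain RNN forward pass in NumPy. The function name `rnn_forward`, the sizes, and the random weights are illustrative assumptions, not taken from any particular model or library.

```python
import numpy as np

def rnn_forward(inputs, W_x, W_h, b):
    """Run a plain RNN over a sequence, one step at a time.

    inputs: (seq_len, d_in) array, one row per word vector.
    Returns the hidden state after every step, shape (seq_len, d_hidden).
    """
    h = np.zeros(W_h.shape[0])  # the running summary starts empty
    states = []
    for x_t in inputs:  # step t cannot start until step t-1 has produced h
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

# Illustrative sizes and random weights (assumptions for this sketch).
rng = np.random.default_rng(0)
seq_len, d_in, d_hidden = 10, 4, 8
W_x = rng.normal(scale=0.5, size=(d_hidden, d_in))
W_h = rng.normal(scale=0.5, size=(d_hidden, d_hidden))
b = np.zeros(d_hidden)

words = rng.normal(size=(seq_len, d_in))
states = rnn_forward(words, W_x, W_h, b)
print(states.shape)  # (10, 8)
```

The `for` loop is the whole problem: every iteration reads the `h` written by the previous one, so the ten steps cannot run side by side, and everything the model knows about word one has to survive nine rounds of overwriting.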
For anyone building with NLP, this meant every task was a bespoke project with its own narrow model. Nothing generalized, and nothing scaled cleanly.
The key idea: attention
The transformer threw out sequential reading entirely. Instead of processing words one at a time, it looks at every word in a passage at once and learns, for each word, which other words it should pay attention to.
Take the sentence "the animal did not cross the street because it was too tired." To resolve what "it" refers to, the model needs to connect "it" to "animal." Attention lets the model do this directly, in a single step, regardless of how far apart the two words are. Every word can attend to every other word, and the model learns which of those connections matter.
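One common way to compute this is scaled dot-product attention, the formulation used in the original paper. The sketch below is a single-head version in NumPy. The query, key, and value projections (`W_q`, `W_k`, `W_v`) are not described above, and here they are random stand-ins for what a real model would learn, as are the word vectors in `X`.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention.

    X: (seq_len, d_model), one row per word.
    Returns (output, weights); weights[i, j] is how much word i attends to word j.
    """
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # every word scored against every other word
    weights = softmax(scores)        # each row is a distribution that sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
tokens = "the animal did not cross the street because it was too tired".split()
d_model = 16  # illustrative size
X = rng.normal(size=(len(tokens), d_model))  # stand-in for learned embeddings
W_q, W_k, W_v = (
    rng.normal(scale=d_model ** -0.5, size=(d_model, d_model)) for _ in range(3)
)

out, weights = self_attention(X, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (12, 16) (12, 12)
```

The row of `weights` for "it" holds one number per word in the sentence. With random projections those numbers are meaningless. Training is what would push the weight on "animal" up. Note also that nothing in the computation depends on how far apart two words are, and that all rows are computed in one matrix product rather than in a loop.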
This one change has three large consequences. Training becomes parallelizable, because all words are processed together. Long-range connections stop breaking, because any two words are connected in a single step no matter how far apart they are. And the architecture scales, meaning that making the model bigger and training it on more text reliably makes it better, which had not been dependably true before.
The transformer block
A transformer is built by stacking the same block many times, with each copy learning its own weights. Each block has two main parts, and a couple of supporting pieces that keep training stable.
The flow through a single block looks like this:
- Self-attention. Each word gathers context from every other word in the sequence. This is the part that mixes information across positions.
- Add & norm. The block adds the attention output back to its input (a residual connection) and normalizes the result. More on why below.
- Feed-forward network. A two-layer network applied to each position independently. If attention decides what to look at, the feed-forward layer decides what to do with it.
- Add & norm again, closing the second residual connection.
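The four steps above can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not a production implementation: it uses a single attention head, random weights, a ReLU feed-forward layer, and a layer norm with no learned scale or shift, and it leaves out position encoding and masking. The helper names and the `p` parameter dictionary are hypothetical. The order (normalize after each residual add) follows the list above. Many later models normalize before each sub-layer instead.

```python
import numpy as np

def softmax(scores):
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize each position to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, p):
    """Single-head attention: mixes information across positions."""
    q, k, v = x @ p["W_q"], x @ p["W_k"], x @ p["W_v"]
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (weights @ v) @ p["W_o"]

def feed_forward(x, p):
    """Two-layer network applied to every position independently."""
    hidden = np.maximum(0.0, x @ p["W_1"] + p["b_1"])  # ReLU
    return hidden @ p["W_2"] + p["b_2"]

def transformer_block(x, p):
    x = layer_norm(x + self_attention(x, p))  # self-attention, then add & norm
    x = layer_norm(x + feed_forward(x, p))    # feed-forward, then add & norm
    return x

def init_params(rng, d_model, d_ff):
    """Random weights for one block (a trained model would learn these)."""
    def w(rows, cols):
        return rng.normal(scale=rows ** -0.5, size=(rows, cols))
    return {
        "W_q": w(d_model, d_model), "W_k": w(d_model, d_model),
        "W_v": w(d_model, d_model), "W_o": w(d_model, d_model),
        "W_1": w(d_model, d_ff), "b_1": np.zeros(d_ff),
        "W_2": w(d_ff, d_model), "b_2": np.zeros(d_model),
    }

rng = np.random.default_rng(0)
seq_len, d_model, d_ff, n_layers = 6, 16, 64, 4  # illustrative sizes
x = rng.normal(size=(seq_len, d_model))
layers = [init_params(rng, d_model, d_ff) for _ in range(n_layers)]

for p in layers:  # the same block, applied N times, each with its own weights
    x = transformer_block(x, p)
print(x.shape)  # (6, 16)
```

The block takes a `(seq_len, d_model)` array and returns one of the same shape, which is what makes it stackable: the output of one block is a valid input to the next.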
Why the residual connections matter
The "add" in "add & norm" is the residual connection: the block adds its input back to its output before passing it on. This sounds minor but it is what makes deep transformers trainable at all. Without it, stacking dozens of blocks causes the training signal to vanish as it propagates back through the layers. The residual gives that signal a clean path to flow through, so a model can be 12, 48, or 96 blocks deep and still learn.
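A small numerical experiment can illustrate the mechanism. The sketch below uses a deliberately simplified model: a stack of purely linear layers with small random weights, with no attention, nonlinearity, or normalization. It pushes a unit-length gradient backward through the stack, once without residual connections and once with them. The depth, width, and weight scale are arbitrary choices for the demo, and the exact numbers depend on them.

```python
import numpy as np

def backprop_norm(weights, residual):
    """Norm of a unit gradient after flowing backward through a stack of linear layers.

    Plain layer:    y = W x      -> Jacobian W
    Residual layer: y = x + W x  -> Jacobian I + W
    """
    d = weights[0].shape[0]
    grad = np.ones(d) / np.sqrt(d)  # unit-length gradient at the top of the stack
    for W in reversed(weights):     # backward pass visits the last layer first
        jacobian = np.eye(d) + W if residual else W
        grad = jacobian.T @ grad
    return np.linalg.norm(grad)

rng = np.random.default_rng(0)
d, depth = 16, 48
# Small random weights; the 0.1 scale is an arbitrary choice for this demo.
weights = [rng.normal(scale=0.1 / np.sqrt(d), size=(d, d)) for _ in range(depth)]

plain = backprop_norm(weights, residual=False)
skip = backprop_norm(weights, residual=True)
print(f"without residual: {plain:.3e}")
print(f"with residual:    {skip:.3e}")
```

With these settings the gradient without residuals should shrink by many orders of magnitude across the 48 layers, while the residual version should stay close to its original size. The identity term in `I + W` is the "clean path": even when `W` contributes almost nothing, the gradient still passes through unchanged.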
Why "N layers"
One block captures one round of "look around, then process." Stacking the block N times lets the model build progressively richer representations: early layers tend to capture surface patterns like grammar, while later layers tend to capture meaning and relationships. The "N" is a design choice. Small models use a handful of blocks; the largest use around a hundred or more.
Why it took over
Within about two years of the paper, essentially every serious NLP system was rebuilt on transformers. The reason was not that any single task got a little better. It was that one architecture, scaled up and trained on enough text, turned out to do nearly every language task well. The bet that "bigger reliably means better" became the operating thesis of the entire field, and it held. GPT, BERT, T5, LLaMA, Claude, Gemini: different recipes, same underlying block.
Where to go from here
Once the transformer block makes sense, the rest of the modern stack is a series of refinements on top of it: how attention is computed efficiently, how position is encoded, how the model is trained and aligned, how it generates text at inference time. Each of those is its own concept, and each is one card in our deck.
Learn the whole stack, one card at a time
The LLM Flashcards are 180 hand-drawn cards covering the full picture: attention variants, tokenization, training, RAG, agents, inference, and more. The transformer card above is one of them.