Suppose you have a transformer that handles tokens of dimension d, and suppose it has ten tokens in the input. The residual stream that’s updated at each layer then has dimensions 10 × d. We can think of this as 10 arrows (vectors) pointing toward 10 points in semantic space. As they move from layer to layer, the model computes subtle “corrections” that get added to the vectors. If we animated the input passing through the transformer, we would see the arrows stretched and rotated and sheared and bent each time they pass from one layer to the next.
But we have to stop and think about where these corrections come from. Because of the attention mechanism, each token’s correction is built out of the other tokens’ vectors. The arrows exchange information.
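To make the shapes concrete, here is a minimal numpy sketch of that picture, with a single attention head and random placeholder weights (W_Q, W_K, W_V and the dimension d = 16 are made up for illustration, not taken from any real model):

```python
import numpy as np

n_tokens, d = 10, 16        # ten tokens, embedding dimension d (arbitrary here)
rng = np.random.default_rng(0)

# the residual stream: 10 arrows, each of dimension d
X = rng.normal(size=(n_tokens, d))

# placeholder projection weights for a single attention head
W_Q = rng.normal(size=(d, d)) / np.sqrt(d)
W_K = rng.normal(size=(d, d)) / np.sqrt(d)
W_V = rng.normal(size=(d, d)) / np.sqrt(d)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# how strongly each token listens to every other token
scores = Q @ K.T / np.sqrt(d)          # shape (10, 10)
weights = softmax(scores)

# the "corrections": for each token, a weighted mix of the other tokens' value vectors
corrections = weights @ V              # shape (10, d)

# the residual stream is nudged, not replaced
X = X + corrections
```

A real transformer adds layer norms, multiple heads, and an MLP block on top of this, but the core move is the same: each layer adds a correction that is a weighted blend of what the other tokens are broadcasting.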
I wonder if we might analogize this to the interaction of particles in quantum mechanics. The attention mechanism is like a vertex in a Feynman diagram where tokens are exchanging bosons. The key and query vectors determine a sort of strength of interaction and the value vector is the packet of information exchanged.
Before interaction, tokens contain semantic ambiguity (as when the word “bank” could mean river or money), a superposition of meanings. The attention mechanism – and the use of a word in a sentence – is a measurement that “forces” the word toward a particular semantic state. As the tokens move through the model, the arrows are wiggling around, trying to find a low-energy state that satisfies all the constraints on how a speaker and listener of this language use these words.
There is, I think, a wrinkle we need to ponder – in physics, Newton’s third law says that for every action there is an equal and opposite reaction. A pulls on B, B pulls on A. But in a transformer the attention of A on B may not be the same as the attention of B on A. I think this happens because query and key are separate projections, so, as a rule, A’s query dotted with B’s key is not the same as B’s query dotted with A’s key. The attention matrix is not symmetric.
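A tiny sketch of that asymmetry, again with random placeholder weights: because the query and key projections are different matrices, the score A gives to B will almost never equal the score B gives to A.

```python
import numpy as np

d = 16
rng = np.random.default_rng(1)
x_A, x_B = rng.normal(size=d), rng.normal(size=d)    # two token vectors
W_Q = rng.normal(size=(d, d)) / np.sqrt(d)           # separate query projection...
W_K = rng.normal(size=(d, d)) / np.sqrt(d)           # ...and key projection

score_A_on_B = (x_A @ W_Q) @ (x_B @ W_K)   # how strongly A attends to B
score_B_on_A = (x_B @ W_Q) @ (x_A @ W_K)   # how strongly B attends to A
print(score_A_on_B, score_B_on_A)          # almost surely different: no third law here
```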
So maybe we have to think of semantics not so much in terms of particles as in terms of fields. Each token creates a field in the surrounding semantic space, and each token has antennae tuned to pick up signals from that space. The field a token emits can be asymmetric with respect to the signals it picks up. The upshot is that attention is not a force or a movement of objects but a measurement event, something like a quantum observer who receives information about an event without changing the event (hmm?). In “The king sits on the throne,” the vector for “throne” reads the field around “king” and adjusts itself, but “king” is not pushed back by being read.
I’m reminded that in actual transformer math (at least in decoder-only models like GPT) the subsequent tokens are masked, so that a token can only attend to the tokens that come before it. Information flows forward through the sentence, never backward.
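A sketch of that masking, with placeholder scores: positions after the current token get a score of minus infinity before the softmax, so each token can only pick up signals from the tokens to its left.

```python
import numpy as np

n_tokens = 10
rng = np.random.default_rng(2)
scores = rng.normal(size=(n_tokens, n_tokens))   # raw attention scores (placeholder values)

# causal mask: token i may only attend to tokens j <= i
future = np.triu(np.ones((n_tokens, n_tokens), dtype=bool), k=1)
scores[future] = -np.inf

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

weights = softmax(scores)
# each row sums to 1, and weights[i, j] == 0 whenever j > i:
# "king" never hears from "throne", which comes later in the sentence
```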
Somewhere in the above are some interesting relations between some sort of least action principle in physics, the dynamics of inference, and the phenomenology of coherence. Now I feel like I have to go and try to actually understand that free energy principle that keeps popping up in conversations about meaning and the universe and all that.
