
Is Attention Obsolete?

Meet Grassmann Flows: A Geometric, Attention‑Free Path Beyond Transformers



Grassmann Flows are a new “attention‑free” way to build models like Transformers, using geometry instead of the usual self‑attention matrix, while still handling long sequences efficiently and (potentially) more interpretably.

Self‑attention sits at the heart of Transformers: for every token in a sequence, the model looks at every other token and decides how much to “attend” to it. That sounds elegant, but it causes three big issues:

Quadratic cost

For a sequence of length L, attention builds an L×L matrix that compares every pair of positions.

  • Memory and computation grow like L².

  • Very long inputs (long documents, code bases, videos) become expensive to process.
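A quick back-of-the-envelope sketch makes the quadratic growth concrete. The sizes below are illustrative, not from the paper:

```python
# Bytes needed to store ONE L x L attention score matrix
# (single head, float32). Doubling L quadruples the memory.

def attention_map_bytes(L: int, dtype_bytes: int = 4) -> int:
    """Memory for one L x L score matrix."""
    return L * L * dtype_bytes

for L in (1_024, 4_096, 16_384):
    print(L, attention_map_bytes(L) / 2**20, "MiB")
    # 1024 -> 4 MiB, 4096 -> 64 MiB, 16384 -> 1024 MiB
```

And that is one head in one layer; a full model multiplies this by heads × layers.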

Huge hidden complexity

Attention is not just one matrix; it is:

  • Many heads per layer

  • Many layers per model

So across the network, you get a massive “cloud” of attention tensors that is hard to summarize or reason about.

Poor interpretability

Visualizing one attention map is easy. Understanding all heads across all layers in a single coherent story is almost impossible. The paper behind Grassmann Flows argues that this tensor lifting into huge high‑dimensional spaces is a major reason large Transformers feel opaque.

So the key question becomes: do we really need attention, or do we just need some powerful way for token representations to evolve and interact over the sequence?

Core idea: geometry instead of attention

Grassmann Flows answer that question with: “you don’t need attention, you need a good geometric evolution of the hidden states.”​

At a high level, the model:

  1. Works on a sequence of hidden vectors (like a Transformer).

  2. Does not build an L×L attention matrix.

  3. Instead, it:

    • Compresses token vectors into a smaller space.

    • Looks at local pairs of tokens.

    • Treats each pair as a 2‑dimensional subspace of that smaller space.

    • Represents that subspace using a special geometric encoding (Plücker coordinates).

    • Feeds this geometric information back into the token states with a mixing block.​

Because it works on local windows and reduced dimensions, the overall cost scales linearly with sequence length (for fixed window and reduced dimension), rather than quadratically like standard attention.
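To see the asymptotic difference, here is a hedged operation-count sketch (the constants `d`, `w`, `r` are illustrative choices, not the paper's): full attention does work proportional to L² · d, while a local-window geometric mixer does work proportional to L · w · r² for fixed window w and reduced dimension r.

```python
# Illustrative (not measured) operation counts.

def attention_ops(L: int, d: int) -> int:
    # every pair of positions, a d-dimensional dot product each
    return L * L * d

def local_geometric_ops(L: int, w: int, r: int) -> int:
    # each token: w local pairs, roughly r*r work per pair
    return L * w * r * r

d, w, r = 512, 8, 64
for L in (1_000, 2_000, 4_000):
    print(L, attention_ops(L, d), local_geometric_ops(L, w, r))
```

Doubling L quadruples the first count but only doubles the second.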

A gentle intro to Grassmann manifolds

To understand the name Grassmann Flows, we only need one basic idea: the model manipulates subspaces, not just vectors.

  • A subspace is like a “flat slice” through a vector space. For example, a 2‑D plane inside 3‑D space.

  • A Grassmann manifold Gr(k,r) is the set of all k‑dimensional subspaces in an r‑dimensional space.​

    • Example: Gr(2,3) is “all 2‑D planes through the origin in 3‑D space.”

In Grassmann Flows:

  • For each local pair of token vectors (after reduction), the model thinks:
    “These two vectors span a 2‑D subspace in an r‑dimensional space.”

  • That subspace is a point on the Grassmann manifold Gr(2,r).​

To work with these subspaces inside a neural net, the model uses Plücker coordinates:

  • Given two vectors that span a plane, Plücker coordinates encode that plane as a point in a higher‑dimensional space in a way that respects its geometric structure.​

  • This gives a fixed‑size vector representation of the “relationship” between two tokens, not just the tokens themselves.

So instead of learning a big attention score between each pair, the model learns to use geometric features derived from the subspace spanned by the pair.
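The Plücker encoding is simple enough to compute by hand. A minimal NumPy sketch (my own illustration, not the paper's code): for two vectors u, v in R^r, the coordinates are p_ij = u_i·v_j − u_j·v_i for i < j, giving r(r−1)/2 numbers. Crucially, any other pair of vectors spanning the *same* plane yields the same coordinates up to a scalar, so this really encodes the subspace, not the particular vectors:

```python
import numpy as np

def plucker(u: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Plücker coordinates p_ij = u_i*v_j - u_j*v_i for i < j."""
    n = len(u)
    return np.array([u[i] * v[j] - u[j] * v[i]
                     for i in range(n) for j in range(i + 1, n)])

rng = np.random.default_rng(0)
u, v = rng.standard_normal(4), rng.standard_normal(4)

p1 = plucker(u, v)            # 4*(4-1)/2 = 6 coordinates for Gr(2, 4)
# A different basis of the SAME plane (u and u + 2v) gives coordinates
# that differ only by a scalar factor -> same point on the manifold:
p2 = plucker(u, u + 2 * v)
print(np.allclose(p2, 2 * p1))   # True: same plane, scaled coordinates
```

This scale-ambiguity is why Plücker coordinates are usually treated projectively (up to overall scale).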

How the Causal Grassmann layer works (in plain steps)

The proposed architecture follows a Transformer‑like encoder but replaces each self‑attention block with a Causal Grassmann mixing block. The block has three main stages:​

  1. Reduce token states

    • Each token’s hidden vector (dimension d) is linearly projected to a smaller space of dimension r.​

    • This is like compressing information to a more compact representation so later operations are cheaper.

  2. Build local subspaces (Grassmann part)

    • Take local pairs of tokens (for example, within a sliding window, and respecting causality so you do not peek into the future).​

    • For each pair:

      • Treat the two reduced vectors as spanning a 2‑D subspace in R^r.

      • Represent that subspace as Plücker coordinates, giving a vector in R^(r(r−1)/2).

    • These coordinates describe geometric relationships like “orientation” and “volume” of the span, capturing richer pairwise structure than a scalar attention score.

  3. Mix geometric features back

    • These Plücker vectors then go through a gated mixing block and a feed‑forward network.

    • Their outputs are fused back into the original token hidden states (in the large dimension d), similar in spirit to how attention outputs are added back in Transformers.​

Important aspects:

  • Causal: The layer is designed so each position only depends on current and past tokens, supporting autoregressive tasks.​

  • Attention‑free: At no point is an attention matrix formed or softmaxed.​

  • Linear complexity: For fixed reduced dimension and local window size, the cost grows linearly with sequence length L.​
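The three stages above can be sketched in a few lines of NumPy. This is a minimal toy illustration under my own assumptions (a causal window of just 2, a plain sigmoid gate, random weights), not the paper's actual block, but it shows the shape of the computation: no L×L matrix ever appears, and the loop is linear in L.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, r = 16, 32, 8                  # sequence length, model dim, reduced dim
p = r * (r - 1) // 2                 # Plücker dimension for Gr(2, r)

W_down = rng.standard_normal((d, r)) / np.sqrt(d)   # reduce d -> r
W_up   = rng.standard_normal((p, d)) / np.sqrt(p)   # fuse p -> d
W_gate = rng.standard_normal((p, p)) / np.sqrt(p)

def plucker(u, v):
    n = len(u)
    return np.array([u[i] * v[j] - u[j] * v[i]
                     for i in range(n) for j in range(i + 1, n)])

def causal_grassmann_block(X):
    """One attention-free block: reduce -> local Plücker -> gated fusion."""
    Z = X @ W_down                         # 1. reduce token states (L, r)
    P = np.zeros((len(X), p))
    for t in range(1, len(X)):             # 2. causal pair (t-1, t)
        P[t] = plucker(Z[t - 1], Z[t])     #    position 0 has no past
    gate = 1 / (1 + np.exp(-(P @ W_gate))) # 3. sigmoid gate (an assumption)
    return X + (gate * P) @ W_up           #    residual fusion back into d

X = rng.standard_normal((L, d))
Y = causal_grassmann_block(X)
print(Y.shape)                             # (16, 32)
```

Because position t only ever touches positions t−1 and t, editing a token can never change the outputs at earlier positions, which is exactly the causal property autoregressive training needs.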

Why this might be more interpretable

The paper stresses that its goal is not to kill attention, but to “de‑center” it. The authors argue that attention’s “tensor lifting” puts the model’s core behavior into an enormous high‑dimensional tensor space with no clean global invariants you can inspect.

Grassmann Flows attempt to fix this by:

  • Keeping the core evolution on a finite‑dimensional manifold (the Grassmann manifold) with clear geometric meaning.​

  • Working with Plücker coordinates, which are mathematically structured features of subspaces, not arbitrary parameters.​

  • Making it easier (in principle) to study invariants and patterns of how representations move along this manifold across layers.

This does not magically make everything interpretable, but it gives a cleaner mathematical object to analyze than a huge stack of attention maps.

How well does it work?

The paper and associated discussions report several encouraging results when comparing Grassmann Flows to Transformer baselines of similar size:

  • On Wikitext‑2 (a language modeling benchmark), the attention‑free Causal Grassmann architecture is competitive with Transformers at 13–18M parameters.

  • When plugged into a DistilBERT backbone as a classification head, Grassmann mixing slightly outperforms a Transformer‑based head on SNLI (a natural language inference task).

  • The architecture has linear asymptotic complexity in sequence length for fixed reduced dimension and window size, making it attractive for longer sequences.​

These are early‑stage results on modest scales, but they show that explicit self‑attention is not strictly necessary for good sequence modeling performance.

Relation to other “attention alternatives”

Grassmann Flows are part of a larger trend trying to move beyond vanilla softmax attention for efficiency and scalability:

  • Linear attention
    Many works approximate attention to get linear complexity, often by clever kernel tricks or reformulating attention as recurrent updates.​

  • State Space Models (SSMs) / Mamba‑style models
    These treat sequences as signals evolving in continuous time, using state‑space equations inspired by control theory to handle very long dependencies efficiently.​

Grassmann Flows are different because they:

  • Focus explicitly on geometric structure (Grassmann manifolds, Plücker coordinates) rather than just algebraic tricks.

  • Provide a concrete example of a geometrically motivated sequence model that competes with Transformers without using attention at all.​

So the big message is: attention is powerful, but not uniquely necessary; other mathematically grounded mechanisms can also drive strong sequence models.

What this means for the future

From a high‑level perspective, Grassmann Flows support three important ideas for future AI systems:

  • Attention is a tool, not a requirement
    The key is to have a rich way for token representations to interact over the sequence. Attention is one such mechanism; geometric flows on manifolds are another.​

  • Geometry can guide architecture design
    Instead of building models purely by intuition and trial‑and‑error, one can start from a well‑understood mathematical object (like a manifold) and design layers around its structure.​

  • Better scalability and interpretability are possible together
    By avoiding quadratic attention and using structured geometric features, Grassmann Flows aim to be both more scalable and easier to analyze than giant attention stacks.
