SubQ vs Transformer: Inside Subquadratic’s Sparse-Attention Architecture and 56x Speed Benchmarks

The transformer architecture has dominated large language model development since Google’s landmark 2017 paper, “Attention Is All You Need.” Nearly a decade later, a Miami-based startup called Subquadratic is making a pointed argument: that the transformer’s core mechanism is also its most significant liability, and that they have found a credible path around it.

The claim is bold. The evidence, while still incomplete, is becoming harder to dismiss.

Key Highlights

SubQ’s dynamic sparse attention targets only relevant token relationships to beat dense transformer scaling
Appen reports 56x speed over FlashAttention plus 12M-token context with near-perfect long-context retrieval
Cost claims suggest RULER 128 runs could drop from thousands of dollars to single digits with SubQ

The Bottleneck: Why Dense Attention Scales Badly

To understand what Subquadratic is attempting, it helps to understand precisely what makes transformer-based LLMs computationally expensive.

The central operation inside a transformer is dense attention. When a model processes text, it represents each token (a word or word fragment) as a vector of numbers and then scores every token’s vector against every other token’s vector. This all-pairs comparison is how the model captures relationships between words across a document.
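As a rough illustration of that all-pairs comparison, here is a minimal sketch of standard scaled dot-product attention in plain Python. The toy vectors and the function names are made up for illustration, and the sketch reuses the same vectors as queries, keys, and values, which real transformers do not do (they apply learned projections first).

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dense_attention(queries, keys, values):
    """Scaled dot-product attention that scores every query against every key.

    Returns the output vectors and the number of scores computed.
    """
    dim = len(keys[0])
    outputs = []
    scores_computed = 0
    for q in queries:
        scores = []
        for k in keys:
            scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(dim))
            scores_computed += 1
        weights = softmax(scores)
        # Each output is a weighted average of all value vectors.
        outputs.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return outputs, scores_computed

# Four toy tokens with 2-dimensional vectors (made-up numbers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
outputs, scores_computed = dense_attention(tokens, tokens, tokens)
print(scores_computed)  # 16 scores for 4 tokens: 4 * 4
```

The nested loop is the point: the inner loop runs once per key for every query, so the number of scores is the square of the token count.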

The cost of this operation does not scale linearly. It scales quadratically.

Double the number of tokens in a document, and you roughly quadruple the number of required computations. A 10,000-word document, treating each word as one token, contains roughly 50 million distinct token pairs (about 100 million if each ordered pair is counted), and every pair needs its own score. Extend that to book-length or multi-document inputs, and the computational load becomes prohibitive in both processing time and energy consumption. This is the quadratic bottleneck that gives Subquadratic its name.
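The arithmetic behind the 50 million figure can be checked in a few lines. This sketch assumes one token per word, which is a simplification, and the function names are illustrative only.

```python
def ordered_pairs(n_tokens):
    """Every token scored against every token, itself included."""
    return n_tokens * n_tokens

def distinct_pairs(n_tokens):
    """Each pair of different tokens counted once."""
    return n_tokens * (n_tokens - 1) // 2

for n in (10_000, 20_000, 1_000_000):
    print(f"{n:>9,} tokens: {ordered_pairs(n):>19,} ordered, "
          f"{distinct_pairs(n):>19,} distinct")
```

At 10,000 tokens the counts are 100,000,000 ordered pairs and 49,995,000 distinct pairs, and doubling to 20,000 tokens multiplies the ordered count by exactly four.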

The Architecture: Dynamic Sparse Attention

Subquadratic’s proposed solution is sparse attention — a mechanism that selects only a subset of token relationships to compute, rather than evaluating all of them exhaustively.

The underlying intuition is straightforward: not every word in a document is meaningfully related to every other word. Evaluating all possible pairings is computationally wasteful when most of those relationships carry negligible semantic weight.

Sparse attention is not a new idea. Researchers and engineers have explored variations of it for years, but none has produced a mechanism that matches dense attention’s performance on general language understanding tasks. Fixed-pattern approaches (for example, always comparing a token only to every fifth token) impose rigid structural assumptions that language does not respect.
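To make the fixed-pattern idea concrete, the sketch below builds one such pattern: each position attends to its near neighbours plus every fifth position. The `window` and `stride` parameters and the function name are assumptions chosen for illustration, not a description of any particular published model.

```python
def strided_pattern(n_tokens, window=1, stride=5):
    """For each query position, list the key positions a fixed pattern keeps.

    The pattern depends only on positions, never on what the tokens say.
    """
    pattern = []
    for i in range(n_tokens):
        keep = [
            j for j in range(n_tokens)
            if abs(i - j) <= window or j % stride == 0
        ]
        pattern.append(keep)
    return pattern

pattern = strided_pattern(10)
print(pattern[3])  # [0, 2, 3, 4, 5]
kept = sum(len(row) for row in pattern)
print(kept, "of", 10 * 10, "pairs kept")
```

Because the kept positions are decided before the text is seen, a relevant token that happens to fall outside the window and off the stride is simply never compared.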

Subquadratic’s claimed innovation is a dynamic selection mechanism. Rather than applying a predetermined pattern, SubQ calculates on the fly which token relationships matter for each specific input. The selection differs per document, per query, per context. The company declines to disclose the precise technical implementation (“that’s kind of where the secret sauce is,” says CTO Alex Whedon), but the architectural principle is clear: relevance-driven sparsity rather than structurally imposed sparsity.
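Since SubQ’s mechanism is undisclosed, the following is not Subquadratic’s method. It is a generic top-k sketch of content-dependent sparsity, with hypothetical names and toy vectors, meant only to show what "the selection differs per query" can look like in code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def topk_sparse_attention(queries, keys, values, k):
    """Attend only to the k highest-scoring keys for each query.

    Caveat: this sketch still scores every key before choosing, so it is
    not itself subquadratic. A practical system would need a cheaper way
    to pick candidates; how SubQ does that is not public.
    """
    dim = len(keys[0])
    outputs, selected = [], []
    for q in queries:
        scores = [dot(q, key) / math.sqrt(dim) for key in keys]
        # Selection depends on the content of q and the keys, not on position.
        keep = sorted(range(len(keys)), key=lambda j: scores[j], reverse=True)[:k]
        peak = max(scores[j] for j in keep)
        exps = {j: math.exp(scores[j] - peak) for j in keep}
        total = sum(exps.values())
        outputs.append([
            sum(exps[j] / total * values[j][c] for j in keep)
            for c in range(len(values[0]))
        ])
        selected.append(sorted(keep))
    return outputs, selected

queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0], [2.0], [3.0]]
outputs, selected = topk_sparse_attention(queries, keys, values, k=1)
print(selected)  # [[0], [1]]: each query picks a different key
```

The two queries select different keys from the same document, which is the contrast with a fixed pattern: the kept set is a function of the input.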

Whether this constitutes a genuine breakthrough or an incremental refinement remains the central open question.

The Benchmarks: What Appen’s Independent Evaluation Found

Subquadratic’s initial announcement in May 2026 was met with significant skepticism, largely because the company released only self-published test scores with limited methodological transparency. The comparison to Theranos circulated quickly on X.

One month later, the company published results from an independent evaluation conducted by Appen, a third-party model evaluation firm. The findings are notable across three dimensions.

Speed

In a raw throughput test measuring theoretical operational speed, Appen found SubQ to be 56 times faster than models using FlashAttention, a widely adopted, memory-efficient implementation of exact dense attention (it is not a sparse method). This is a baseline speed figure, not a task-performance metric, but it puts the architectural efficiency differential in concrete terms.

Coding Performance

On LiveCodeBench — a benchmark using competitive coding problems drawn from real programming contests — SubQ scored 89.7%. Appen’s director of generative AI research, Jeanine Sinanan-Singh, described this as “frontier-level performance in coding,” placing SubQ in the same competitive tier as leading models from OpenAI, Google DeepMind, and Anthropic on this specific task.

Long-Context Retrieval

SubQ operates with a context window of up to 12 million tokens — roughly twelve times the one-million-token ceiling of most current top-tier models. On the needle-in-a-haystack test, which measures a model’s ability to retrieve specific information buried within large volumes of text, Appen reported a 98% score at both six-million and twelve-million token context lengths. The evaluation report noted this as “sustaining near-perfect long-context retrieval at scales few models are tested at.”
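For readers unfamiliar with the needle-in-a-haystack setup, here is a minimal sketch of how such a test is typically assembled and scored. Appen’s actual harness is not described in the source, so every name here is hypothetical, and `toy_model` is a string-search stand-in for a real model call.

```python
FILLER = "The sky was grey that morning."
NEEDLE = "The secret code is 7421."

def build_haystack(filler, needle, n_sentences, depth):
    """Bury the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    sentences = [filler] * n_sentences
    sentences.insert(int(depth * n_sentences), needle)
    return " ".join(sentences)

def toy_model(prompt):
    """Stand-in for a model: finds the code by plain string search."""
    marker = "The secret code is "
    start = prompt.find(marker)
    if start == -1:
        return ""
    end = prompt.find(".", start)
    return prompt[start + len(marker):end]

def retrieval_accuracy(model, cases):
    """Fraction of prompts whose answer contains the expected string."""
    hits = sum(1 for prompt, expected in cases if expected in model(prompt))
    return hits / len(cases)

cases = [
    (build_haystack(FILLER, NEEDLE, 1000, depth), "7421")
    for depth in (0.0, 0.25, 0.5, 0.75, 1.0)
]
print(retrieval_accuracy(toy_model, cases))  # 1.0 for the toy stand-in
```

A real evaluation swaps `toy_model` for an API call and scales the filler to millions of tokens; a reported score such as 98% is this fraction averaged over many depths and needles.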

The Cost Differential: $8 vs. $2,600

Perhaps the most striking figure Subquadratic has put forward is not a benchmark score but a cost comparison.

According to CEO Justin Dangel, running Anthropic’s Opus 4.6 through RULER 128 (Nvidia’s benchmark for large-dataset information retrieval) costs approximately $2,600 per run, while running SubQ through the same test costs $8.

This figure cannot be independently verified at present, since SubQ is not yet widely available. It should be treated as a directional claim rather than a confirmed data point. But if it holds under scrutiny, the cost efficiency implications for enterprise use cases involving large document sets would be substantial.
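To put the claimed gap in proportion, the sketch below works only from the two figures Dangel quoted. The run count is a hypothetical chosen for illustration, and the result inherits the unverified status of the inputs.

```python
OPUS_COST_PER_RUN_USD = 2600  # Dangel's claim, not independently verified
SUBQ_COST_PER_RUN_USD = 8     # Dangel's claim, not independently verified

def cost_ratio(baseline, candidate):
    """How many times more expensive the baseline is per run."""
    return baseline / candidate

def savings(baseline, candidate, runs):
    """Total difference in spend over a number of runs."""
    return (baseline - candidate) * runs

ratio = cost_ratio(OPUS_COST_PER_RUN_USD, SUBQ_COST_PER_RUN_USD)
print(f"{ratio:.0f}x per run")  # 325x per run

HYPOTHETICAL_RUNS = 100  # illustrative only
print(savings(OPUS_COST_PER_RUN_USD, SUBQ_COST_PER_RUN_USD, HYPOTHETICAL_RUNS))  # 259200
```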

Where the Skepticism Remains Justified

Independent validation from Appen adds credibility, but it does not resolve all outstanding concerns. Several legitimate questions remain open.

Benchmark Coverage Is Narrow

Appen evaluated SubQ on a limited set of tests. Strong performance on LiveCodeBench and needle-in-a-haystack does not constitute comprehensive evidence of general capability. Benchmarks measure performance under specific, controlled conditions. Real-world deployment across diverse tasks is a different test entirely.

The Qwen Weight Inheritance Question

SubQ was not trained from scratch. Subquadratic used weights from Qwen, a Chinese open-source model, to bootstrap SubQ’s training — a common practice in the industry, but one that complicates the company’s stronger architectural claims. If the model’s language understanding capabilities derive substantially from Qwen’s pre-trained weights, then the claim of having “reinvented how LLMs work” requires more precise qualification.

Independent AI researcher Will Depue, formerly of OpenAI, put it directly: “They may have built something real and useful. But the public evidence does not yet justify the stronger claim that they have solved the quadratic attention bottleneck.”

Access Remains Restricted

Tens of thousands of users have reportedly signed up for early access, including over 500 enterprise customers. Very few have received it. Until SubQ is available for broad, independent testing, the evaluation picture remains controlled by the company and its selected partners.

What SubQ Is Actually Positioned to Do

It is worth being precise about the scope of Subquadratic’s claims, because the company itself is more measured in some respects than the surrounding coverage has been.

SubQ is not presented as a universal replacement for GPT-4o, Claude, or Gemini. It is positioned as a model optimized for two specific use cases: coding tasks and large-scale document analysis. These are domains where context window size and inference speed matter disproportionately — and where SubQ’s architectural advantages, if they hold, would translate directly into workflow value.

In a live demonstration, Whedon asked SubQ to reason across 400 documents simultaneously. It responded in seconds. Given the same task, Perplexity failed to load all of the documents. That is a meaningful capability gap for enterprise research, legal analysis, codebase review, and similar data-intensive workflows.

The Larger Architectural Argument

Subquadratic’s ambitions extend beyond SubQ as a product. The company’s founders believe that sparse attention represents the future direction of LLM architecture broadly.

“We don’t think anybody will be building on transformers in a few years,” says Dangel.

This is a significant claim, and one that the current evidence does not yet support at scale. The transformer architecture is deeply embedded in the infrastructure, tooling, and institutional knowledge of the AI industry. Displacing it would require not just a better mechanism, but a better mechanism that generalizes across tasks, scales reliably with compute, and integrates into existing development workflows.

What Subquadratic has demonstrated so far is that dynamic sparse attention can match dense attention on specific benchmarks while delivering substantial efficiency gains. That is a meaningful result. Whether it is the foundation of a new architectural era or a well-executed niche optimization remains to be determined.

Takeaway

SubQ’s 56x speed benchmark and 12-million-token context window are not marketing abstractions — they reflect a real architectural difference in how the model processes information. The independent Appen evaluation adds meaningful credibility to claims that were initially dismissed too quickly.

At the same time, the evidence base is still narrow. Restricted access, limited benchmark coverage, and the unresolved question of Qwen weight inheritance all warrant continued scrutiny. The honest assessment is that Subquadratic has built something that appears genuinely interesting and potentially significant for specific enterprise use cases — but has not yet made the case for the broader architectural revolution it is claiming.

For practitioners evaluating AI tools today, the practical question is simpler: if your workflows involve large-scale document retrieval, long-context reasoning, or high-volume coding tasks, SubQ belongs on your watchlist. When access opens more broadly, it will be worth testing directly. Until then, observe carefully — and reserve judgment on the grander claims.