<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0"><channel><title><![CDATA[Hiro Nakamura]]></title><description><![CDATA[Hiro Nakamura]]></description><link>https://hironakamura-ai.hashnode.dev</link><image><url>https://cdn.hashnode.com/res/hashnode/image/upload/v1593680282896/kNC7E8IR4.png</url><title>Hiro Nakamura</title><link>https://hironakamura-ai.hashnode.dev</link></image><generator>RSS for Node</generator><lastBuildDate>Thu, 18 Jun 2026 07:13:18 GMT</lastBuildDate><atom:link href="https://hironakamura-ai.hashnode.dev/rss.xml" rel="self" type="application/rss+xml"/><language><![CDATA[en]]></language><ttl>60</ttl><item><title><![CDATA[OpenRouter Fusion: the synthesis step does the work, not the panel]]></title><description><![CDATA[The headline from OpenRouter this week is the kind that makes you stop scrolling. A panel of three cheaper models, fused together, landed within one point of Claude Fable 5 on a hard research benchmar]]></description><link>https://hironakamura-ai.hashnode.dev/openrouter-fusion-the-synthesis-step-does-the-work-not-the-panel</link><guid isPermaLink="true">https://hironakamura-ai.hashnode.dev/openrouter-fusion-the-synthesis-step-does-the-work-not-the-panel</guid><dc:creator><![CDATA[Hiro Nakamura]]></dc:creator><pubDate>Wed, 17 Jun 2026 02:52:21 GMT</pubDate><content:encoded><![CDATA[<p>The headline from OpenRouter this week is the kind that makes you stop scrolling. A panel of three cheaper models, fused together, landed within one point of Claude Fable 5 on a hard research benchmark, at roughly half the cost. The interesting part sits underneath that number, and it changes how you should build with this.</p>
<p>OpenRouter's own breakdown says about three quarters of Fusion's gain comes from the synthesis step: the part where one model reads the candidate answers and writes a final one. Only about a quarter comes from running diverse models. So the panel of frontier models that everyone screenshots is the part carrying the least, and a single structured rewrite pass is carrying the most. That reframe is worth more than the leaderboard placement.</p>
<h2>What Fusion actually does</h2>
<p>When a request hits Fusion, OpenRouter sends your prompt to a panel of models in parallel, with web search and fetch enabled. A judge model reads every response and maps where the models agree and where they contradict each other. A synthesizer then writes the final answer grounded in that map. You pay for the sum of all those completions, not one model call.</p>
<pre><code class="language-mermaid">flowchart TB
  P[Your prompt] --&gt; A[Drafter A&lt;br/&gt;cheap model]
  P --&gt; B[Drafter B&lt;br/&gt;cheap model]
  P --&gt; C[Drafter C&lt;br/&gt;cheap model]
  A --&gt; J[Judge: map&lt;br/&gt;agreement and conflict]
  B --&gt; J
  C --&gt; J
  J --&gt; S[Synthesizer&lt;br/&gt;one strong model]
  S --&gt; O[Final answer]
</code></pre>
<p>The default router pulls from six models, and you can pin your own with presets or an <code>analysis_models</code> parameter. Most SDKs work by swapping the base URL, so wiring it up is a five-minute job. The cost and the behavior, though, are very different from a single call.</p>
<h2>The number that actually matters</h2>
<p>Here are the figures OpenRouter published on its internal set of 100 hard research tasks. A panel of Fable 5 and GPT-5.5, synthesized by Opus 4.8, scored about 69%. Solo Fable 5 scored about 65%. A budget panel of Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro, none of them frontier priced, came within one point of Fable 5 at roughly half the cost.</p>
<p>Then the attribution: roughly 75% of the gain came from the synthesis step, 25% from model diversity. If you have ever bolted a "review your previous answer" pass onto a single model and watched the quality tick up, you already know this effect. Fusion productizes it and runs the drafts in parallel. The diversity is a garnish. The second pass is the meal.</p>
<h2>Why a second pass helps, and when it does nothing</h2>
<p>The theory behind ensembling is decorrelated errors. If two models are wrong in different ways, a reader holding both drafts can triangulate toward the right one. That is real, but it has a hard precondition: the errors have to be independent, and the judge has to be able to tell which draft is right.</p>
<p>Both conditions break more often than the demo suggests. The Hacker News thread on Fusion is full of engineers who got marginal or zero gains. Frontier models trained on similar data tend to agree with each other, so a panel of them produces correlated answers, and the judge rubber-stamps the consensus instead of catching a mistake. One commenter put it bluntly: asking one model to judge another often just feels like turning up the temperature. The 75/25 split lines up with that. Diversity on its own is weak. The real lever is the synthesis prompt, and a vague "pick the best answer" instruction barely moves the needle.</p>
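<p>To make the independence precondition concrete, here is a minimal sketch under loud assumptions: answers are simply right or wrong, every drafter is right with the same probability, and the judge is a plain majority vote. That is a stand-in for illustration, not how Fusion's judge works, and the 0.65 is only borrowed from the solo score above as a plausible per-model accuracy.</p>

```python
from math import comb


def majority_vote_accuracy(p: float, n: int = 3) -> float:
    """Chance that a strict majority of n independent drafters is right.

    Assumes every drafter is right with probability p and that their
    errors are fully independent (the best case for a panel).
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


independent = majority_vote_accuracy(0.65)  # errors never overlap by design
correlated = 0.65  # errors fully shared: the panel behaves like one model
print(f"independent panel: {independent:.3f}, correlated panel: {correlated:.3f}")
```

<p>With independent errors, three 65% drafters vote their way to roughly 72%. With fully shared errors the vote adds nothing, which is the correlated-panel case the skeptics describe.</p>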
<h2>Where it pays off, and where it burns money</h2>
<p>The pattern that holds up in practice: Fusion helps on verifiable, decomposable work and stalls on ambiguous judgment. Research with citations, code checked against tests, a resume scored against a job description. In all of those the synthesizer has something concrete to verify, so the second pass adds signal. Architecture decisions, trading calls, anything that is mostly taste: the judge has nothing to check against and amplifies whatever priors the models already share. Several developers reported solo Fable beating fused Fable on exactly these fuzzy tasks.</p>
<p>So the only number that should drive your decision is whether the second pass helps on your own eval, not on a leaderboard. Treat the benchmark gap the way you treat any vendor metric: as a hint to run your own measurement. I went into how to catch that kind of quality movement on your own data in <a href="https://nakamurah.substack.com/p/your-ai-quality-score-dropped-overnight">this piece on knowing when your AI got worse</a>.</p>
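<p>If you want a starting point for that measurement, a paired comparison is usually enough. This is a sketch, assuming you already have a pass/fail grade per task for a solo run and a fused run over the same tasks; <code>paired_comparison</code> and the two lists are hypothetical names. It runs an exact two-sided sign test on the tasks where the two runs disagree.</p>

```python
from math import comb


def paired_comparison(solo: list[bool], fused: list[bool]) -> dict:
    """Compare two systems graded pass/fail on the same tasks, in the same order."""
    if len(solo) != len(fused):
        raise ValueError("both runs must cover the same tasks")
    fused_only = sum(f and not s for s, f in zip(solo, fused))
    solo_only = sum(s and not f for s, f in zip(solo, fused))
    n = fused_only + solo_only  # only disagreements carry information
    k = max(fused_only, solo_only)
    # Exact two-sided sign test: with no real difference,
    # each disagreement is a fair coin flip.
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2**n
    return {
        "fused_only": fused_only,
        "solo_only": solo_only,
        "p_value": min(1.0, 2 * tail),
    }


# Hypothetical grades for ten tasks.
solo_grades = [True] * 6 + [False] * 4
fused_grades = [True] * 9 + [False] * 1
print(paired_comparison(solo_grades, fused_grades))
```

<p>On a 100-task set, a four-point gap is a net difference of four tasks, so the per-task wins and losses tell you more than the two headline percentages do.</p>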
<h2>Build the cheap 75% yourself</h2>
<p>Because the synthesis prompt carries the gain, you can reproduce most of the result without the product and keep control of cost and latency. A couple of cheap drafters in parallel, one strong synthesizer, and an explicit rubric:</p>
<pre><code class="language-python">import asyncio
import os

import openai

# Assumes an OpenRouter key in the OPENROUTER_API_KEY environment variable.
# Without api_key, the SDK falls back to OPENAI_API_KEY.
client = openai.AsyncOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key=os.environ["OPENROUTER_API_KEY"],
)

DRAFTERS = ["google/gemini-3-flash", "moonshotai/kimi-k2.6", "deepseek/deepseek-v4-pro"]
SYNTH = "anthropic/claude-opus-4.8"

async def call(model, prompt):
    """Send one user message to one model and return the reply text."""
    r = await client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

async def fuse(question):
    # Drafts run concurrently, so this stage takes as long as the slowest drafter.
    drafts = await asyncio.gather(*(call(m, question) for m in DRAFTERS))
    joined = "\n\n".join(f"## Candidate {i+1}\n{d}" for i, d in enumerate(drafts))
    # The rubric is the active ingredient: agree, conflict, decide, ground.
    synth = (
        f"{len(drafts)} models answered the same question.\n"
        "1. List where they agree and where they conflict.\n"
        "2. For each conflict, decide which draft is right and say why.\n"
        "3. Write one final answer grounded in that check. Do not average them.\n\n"
        f"Question:\n{question}\n\n{joined}"
    )
    return await call(SYNTH, synth)
</code></pre>
<p>The active ingredient is the rubric in the synthesis prompt. Agree, conflict, decide, ground. Drop the "do not average them" line and the synthesizer tends to mush the drafts into a bland consensus, which is the exact failure mode the skeptics keep hitting. Putting cheap drafters under one strong synthesizer also matches the budget panel result: you spend the money on the step that reads everything, not on three premium drafts.</p>
<h2>The cost and latency you are signing up for</h2>
<p>Run the arithmetic before you ship this on a hot path. You pay for every panel call plus the judge plus the synthesizer. On Hacker News, people reported single prompts near a dollar, with runs roughly 4x the cost and 7x the latency of a single frontier call. Latency stacks in two stages: the slowest drafter, then the synthesis. And the synthesis call is not a cheap afterthought, because it reads every candidate, so its input is the sum of all the drafts. One user even found Opus billed silently as the judge when it had not been selected, so check what you are actually paying for.</p>
<p>This belongs behind an asynchronous boundary. A research endpoint, a batch job, an overnight report. It does not fit anything with keystroke latency, and it does not fit a high-QPS path where 4x cost compounds into a real bill fast.</p>
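<p>The arithmetic is short enough to write down. This sketch models the do-it-yourself shape from earlier (parallel drafters, then one synthesis call), not Fusion's own billing, which adds a judge call and web search on top. Every token count, price, and timing below is a hypothetical placeholder, so swap in your own.</p>

```python
from dataclasses import dataclass


@dataclass
class Call:
    input_tokens: int
    output_tokens: int
    usd_per_m_in: float   # hypothetical price per million input tokens
    usd_per_m_out: float  # hypothetical price per million output tokens
    seconds: float        # hypothetical wall-clock time for the call

    @property
    def cost(self) -> float:
        return (self.input_tokens * self.usd_per_m_in
                + self.output_tokens * self.usd_per_m_out) / 1_000_000


def fuse_budget(drafts: list[Call], synth: Call) -> tuple[float, float]:
    """Return (total USD, end-to-end seconds) for one fused request."""
    cost = sum(d.cost for d in drafts) + synth.cost
    # Drafters run in parallel, so only the slowest one counts toward latency.
    latency = max(d.seconds for d in drafts) + synth.seconds
    return cost, latency


PROMPT_TOKENS = 1_000
drafts = [Call(PROMPT_TOKENS, 2_000, 1.0, 2.0, s) for s in (10.0, 12.0, 8.0)]
# The synthesizer reads the prompt plus every draft, so its input is their sum.
synth = Call(PROMPT_TOKENS + sum(d.output_tokens for d in drafts), 1_500, 10.0, 30.0, 20.0)

cost, latency = fuse_budget(drafts, synth)
print(f"${cost:.3f} per request, {latency:.0f}s end to end")
```

<p>In this made-up example the synthesis call alone is most of the bill, because it pays premium input rates on every draft. That is the line item to watch when you add a fourth drafter.</p>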
<h2>What I would actually ship</h2>
<p>Fusion is a clean implementation of a good idea, and the honest version of the pitch is narrower than the headline. The gain is mostly a structured second pass, the second pass only earns its keep on tasks you can verify, and the bill scales with how many models you fan out to. Put cheap models in the panel, one strong model on synthesis, and a sharp rubric in the middle, and you keep most of the win at a fraction of the cost. Then prove it on your own eval before it touches production, because the only place the 69 versus 65 means anything is your task, not OpenRouter's.</p>
<p>Sources:</p>
<ul>
<li><p>OpenRouter Fusion: <a href="https://openrouter.ai/openrouter/fusion">https://openrouter.ai/openrouter/fusion</a></p>
</li>
<li><p>OpenRouter releases Fusion, combining multiple models to beat Claude Fable (GIGAZINE): <a href="https://gigazine.net/news/20260616-openrouter-ai-fusion/">https://gigazine.net/news/20260616-openrouter-ai-fusion/</a></p>
</li>
<li><p>Hacker News discussion on Fusion: <a href="https://news.ycombinator.com/item?id=48537641">https://news.ycombinator.com/item?id=48537641</a></p>
</li>
</ul>
]]></content:encoded></item><item><title><![CDATA[The real cost of running agents is the KV cache, not the tokens]]></title><description><![CDATA[When a new open-weight model lands and people say it is "cheap to run," they usually point at the price per million tokens. That number hides where the money actually goes once you put the model behin]]></description><link>https://hironakamura-ai.hashnode.dev/the-real-cost-of-running-agents-is-the-kv-cache-not-the-tokens</link><guid isPermaLink="true">https://hironakamura-ai.hashnode.dev/the-real-cost-of-running-agents-is-the-kv-cache-not-the-tokens</guid><category><![CDATA[# cybersecurity #artificial intelligence #machine learning #cyberthreats #phisingattacks #Data breach]]></category><category><![CDATA[large language models]]></category><category><![CDATA[performance]]></category><category><![CDATA[software architecture]]></category><dc:creator><![CDATA[Hiro Nakamura]]></dc:creator><pubDate>Mon, 15 Jun 2026 03:41:39 GMT</pubDate><enclosure url="https://cdn.hashnode.com/uploads/covers/6a2f45afd999fea302a500d0/bf4380dd-a68e-4979-b88e-a0ecc91fcbae.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When a new open-weight model lands and people say it is "cheap to run," they usually point at the price per million tokens. That number hides where the money actually goes once you put the model behind an agent. For a single chat turn the token price is a fair proxy. For an agent that carries a long context, calls tools in a loop, and keeps the conversation alive for minutes, the dominant cost is not the arithmetic of generating tokens. It is the memory you have to hold to generate them.</p>
<p>That memory is the KV cache, and understanding it changes how you budget, batch, and choose models.</p>
<h2>Decode is memory-bound, not compute-bound</h2>
<p>There are two phases in transformer inference. Prefill reads your prompt and is compute-heavy: it runs in parallel across all input tokens and keeps the GPU's math units busy. Decode generates one token at a time, and each step has to read the entire key/value cache for every layer before it can produce the next token.</p>
<p>The work per decode step is small. The data you move is not. On modern accelerators the matrix units can do far more arithmetic than the memory system can feed them, so decode spends most of its time waiting on memory bandwidth, not on FLOPs. This is why a model can look fast on a benchmark of short completions and then crawl the moment you hand it a 100k-token agent transcript.</p>
<pre><code class="language-mermaid">flowchart TB
  P[Prefill: read full prompt] --&gt;|compute-bound| K[Build KV cache]
  K --&gt; D1[Decode step 1]
  D1 --&gt;|read ALL KV| D2[Decode step 2]
  D2 --&gt;|read ALL KV| D3[Decode step 3]
  D3 --&gt; Dn[... every step re-reads the cache]
  Dn --&gt;|bandwidth-bound| OUT[Tokens out]
</code></pre>
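<p>A back-of-envelope floor makes the bandwidth argument concrete. This is a sketch, assuming a single request, a dense model whose weights are streamed once per decode step, and memory traffic as the only cost. The weight size, cache sizes, and bandwidth figure are hypothetical.</p>

```python
def decode_step_floor_seconds(weight_bytes: float, kv_cache_bytes: float,
                              bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step when memory traffic dominates.

    Each step has to stream the weights and the request's KV cache once.
    """
    return (weight_bytes + kv_cache_bytes) / bandwidth_bytes_per_s


WEIGHTS = 16e9    # hypothetical: 8B parameters at 2 bytes each
BANDWIDTH = 2e12  # hypothetical: 2 TB/s of memory bandwidth

for label, kv in [("short context", 1e9), ("long context", 13e9)]:
    ms = decode_step_floor_seconds(WEIGHTS, kv, BANDWIDTH) * 1e3
    print(f"{label}: at least {ms:.1f} ms per token")
```

<p>Batching amortizes the weight reads across requests, but each request brings its own cache, so as the batch grows the cache term is the one that keeps growing.</p>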
<h2>Why the cache is the bottleneck for agents</h2>
<p>The KV cache grows with three things at once: the number of tokens in context, the number of layers, and the number of attention heads that store their own keys and values. Head dimension and numeric precision set the constant in front. Multiply those together across a long agent session and the cache, not the weights, becomes the thing that decides how many requests fit on a GPU.</p>
<p>That has a direct business consequence. Two requests with the same token count can cost very differently if one holds 8k tokens of context and the other holds 80k. The longer one occupies far more memory for far longer, which means fewer concurrent requests per GPU and a higher real cost per useful answer. The token meter does not show this. Your infrastructure bill does.</p>
<p>For agents the effect compounds, because agents are long-context by design. Tool outputs, retrieved documents, and prior steps all stay resident so the model can reason over them. The agent that feels smart because it remembers everything is also the agent that is most expensive to keep alive.</p>
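<p>The standard accounting for cache size fits in one line: two tensors (keys and values) per layer, per KV head, per token. Here is a minimal sketch of it. The model shape below is hypothetical, chosen so the numbers come out round, and it assumes 2-byte values.</p>

```python
def kv_cache_bytes(tokens: int, layers: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Size of one request's KV cache. The leading 2 covers keys and values."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_value


# Hypothetical model shape: 32 layers, 8 KV heads, head dimension 128.
SHAPE = dict(layers=32, kv_heads=8, head_dim=128)
GIB = 2**30

for tokens in (8 * 1024, 80 * 1024):
    size = kv_cache_bytes(tokens, **SHAPE)
    print(f"{tokens:>6} tokens of context: {size / GIB:.0f} GiB of cache")
```

<p>For this shape, the 8k request holds about 1 GiB and the 80k request about 10 GiB, for as long as each stays alive. That gap is the concurrency you lose.</p>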
<h2>The techniques shrinking the footprint</h2>
<p>The research and systems work of the last couple of years is, viewed from one angle, a single campaign: make the KV cache smaller without making the model dumber. A few families are worth knowing because they change how you pick and serve models.</p>
<p>Grouped-query attention was the first broad win. Instead of every attention head keeping its own keys and values, heads share a smaller set of key/value groups. The model keeps most of its quality while the per-token cache drops by a large factor. Almost every recent open-weight model ships with this by default now.</p>
<p>Cross-layer KV sharing pushes the same idea up a dimension. Neighbouring layers reuse the keys and values computed below them instead of storing fresh ones at every layer. The cache stops scaling one-to-one with depth, which is exactly the axis that hurts deep models on long context.</p>
<p>Compressed and windowed attention attack the token axis. Sliding-window attention caches keys and values only for the most recent window of tokens, so older context is reachable only indirectly, or through the full-attention layers that some models interleave with windowed ones. Compression methods evict or merge low-value entries so the cache holds a fixed budget no matter how long the conversation runs. The trade is precision on distant context for a flat, predictable memory cost.</p>
<pre><code class="language-mermaid">flowchart LR
  A[Full KV: every head, every layer, every token] --&gt; B[Grouped-query: heads share KV]
  B --&gt; C[Cross-layer sharing: layers reuse KV]
  C --&gt; D[Windowed / compressed: bounded token budget]
  D --&gt; E[Flat, predictable memory per request]
</code></pre>
<p>None of these is free. Each trades some ability to attend perfectly to far-away tokens for a smaller, more predictable cache. The reason they keep shipping anyway is that the quality loss is usually small and the memory win is usually large, and memory is the constraint that actually limits throughput in production.</p>
<h2>How to read this as an engineer</h2>
<p>The practical takeaway is to stop treating context length as free real estate. If your agent's prompt has grown to tens of thousands of tokens because adding context was the easy fix, you are paying for that in memory on every decode step, for the whole life of the request.</p>
<p>Three habits follow from that. Measure cost per resolved task, not per token, because the cache cost lives in the gap between those two. Treat retrieved context as something to spend deliberately: a tight, relevant 6k often beats a lazy 60k on both quality and price. And when you compare models, look past the token price to the attention design, because a model with grouped-query attention and cross-layer KV sharing can be dramatically cheaper to serve at long context than a headline price would suggest.</p>
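<p>Cost per resolved task is easy to compute once you know how many requests fit on a GPU at your context length. A minimal sketch, assuming you self-host and pay by the GPU hour. The function name and all of the figures are hypothetical.</p>

```python
def cost_per_resolved_task(gpu_usd_per_hour: float, seconds_per_task: float,
                           concurrent_requests: int, resolve_rate: float) -> float:
    """GPU spend per task that actually got resolved.

    concurrent_requests is how many requests share one GPU, which the
    KV cache footprint of each request largely decides.
    """
    if concurrent_requests < 1 or not 0 < resolve_rate <= 1:
        raise ValueError("need at least one request and a resolve rate in (0, 1]")
    gpu_seconds_per_task = seconds_per_task / concurrent_requests
    return gpu_usd_per_hour / 3600 * gpu_seconds_per_task / resolve_rate


# Same hypothetical GPU and resolve rate, two context habits.
lean = cost_per_resolved_task(3.60, seconds_per_task=60, concurrent_requests=40, resolve_rate=0.8)
heavy = cost_per_resolved_task(3.60, seconds_per_task=90, concurrent_requests=4, resolve_rate=0.8)
print(f"lean context: ${lean:.4f} per resolved task, heavy context: ${heavy:.4f}")
```

<p>Neither figure shows up on a per-token price sheet. The difference comes almost entirely from how many requests the cache lets you pack onto one GPU.</p>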
<h2>Where this goes next</h2>
<p>My read is that the next round of "cheaper" open-weight models will win less on raw quality and more on serving efficiency at long context, and that the marketing will keep pointing at token price while the engineering quietly moves to the KV cache. Expect more models that are explicitly co-designed with their serving stack, where the attention pattern and the cache budget are chosen together rather than bolted on afterwards. The teams that internalise this early will run the same agents their competitors run, on a fraction of the hardware.</p>
<p>The token price is the sticker. The KV cache is the engine. When you are building agents, watch the engine.</p>
<p>Sources: GIGAZINE, recent developments in LLM architectures including KV sharing, mHC, and compressed attention (2026-06-14): <a href="https://gigazine.net/news/20260614-recent-developments-in-llm-architectures/">https://gigazine.net/news/20260614-recent-developments-in-llm-architectures/</a></p>
]]></content:encoded></item></channel></rss>