PFlash Wants to Kill the Four-Minute First Token

Most local LLM performance discourse is still stuck on tokens per second. That is useful right up until you paste a 128K prompt into a model and spend four minutes staring at nothing.

That latency problem helps explain why a popular post on r/LocalLLaMA landed better than the usual benchmark chest-thumping. Luce's new open-source PFlash project claims it can cut time to first token for Qwen3.6-27B at 128K context on a single RTX 3090 from about 257 seconds in llama.cpp to 24.8 seconds. If that number holds up outside the project's own writeup, the real headline is not just "10x faster." It is that long-context local inference might not have to feel broken.

The system is public, the code is public, and the design is concrete. Luce published PFlash inside its lucebox-hub repository under the MIT license. The implementation sits in front of a quantized Qwen3.6-27B Q4_K_M target and uses a much smaller Qwen3-0.6B drafter to score which prompt spans matter. The heavy model then prefills only the selected spans, not the whole prompt. Luce says the pipeline runs as a single C++/CUDA process on a 24 GB RTX 3090, with no Python, Triton, or PyTorch in the runtime path.
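To make the score-then-select idea concrete, here is a minimal Python sketch of the selection step alone. It is an illustration under stated assumptions, not Luce's implementation: the real runtime is C++/CUDA, and the function name select_spans, the fixed span length, the mean-score ranking, and the budget rule are all hypothetical choices made for this example. It assumes the drafter has already produced one importance score per prompt token.

```python
from typing import List, Tuple


def select_spans(scores: List[float], span_len: int, keep_ratio: float) -> List[Tuple[int, int]]:
    """Pick the highest-scoring fixed-size spans within a token budget.

    Hypothetical helper: `scores` stands in for per-token importance
    values from a small drafter model. Returns sorted (start, end)
    ranges, end exclusive, with touching ranges merged.
    """
    if not scores:
        return []
    n = len(scores)
    budget = max(1, round(n * keep_ratio))  # tokens we are allowed to keep

    # Chop the prompt into fixed-size spans and rank each by mean score.
    ranked = []
    for start in range(0, n, span_len):
        end = min(start + span_len, n)
        mean = sum(scores[start:end]) / (end - start)
        ranked.append((mean, start, end))
    ranked.sort(key=lambda s: (-s[0], s[1]))  # best first, ties by position

    chosen = []
    used = 0
    for _, start, end in ranked:
        # Always keep the single best span; after that, respect the budget.
        if chosen and used + (end - start) > budget:
            continue
        chosen.append((start, end))
        used += end - start

    # Restore prompt order and merge spans that touch each other.
    chosen.sort()
    merged: List[Tuple[int, int]] = []
    for start, end in chosen:
        if merged and merged[-1][1] == start:
            merged[-1] = (merged[-1][0], end)
        else:
            merged.append((start, end))
    return merged
```

In this sketch the expensive model would then prefill only the tokens inside the returned ranges. How a real system handles position information for the skipped tokens is a separate design question that the sketch does not address.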

That last part makes this more interesting than another research-paper summary. Speculative prefill is not new. The original SpecPrefill paper and later Cross-Family Speculative Prefill work already argued that a lightweight model can guess which prompt tokens deserve the expensive model's attention. The new claim here is narrower and more practical: take that idea, wire it into a GGUF-heavy local stack, fit it onto a commodity 24 GB card, and make it useful in the place where local users actually feel pain: first-token latency.

The published numbers are specific. In Luce's README, 64K context drops from 134.95 seconds to 13.5 seconds, and 128K drops from roughly 257 seconds to 24.8 seconds. The project says it kept only 5 percent of source tokens, about 2.6K spans from a 128K prompt, while still preserving Needle in a Haystack retrieval on the measured cases. The implementation also spells out the machinery instead of hand-waving it away: a daemon-resident drafter, custom CUDA kernels for scoring and sparse forward passes, block-sparse attention from MIT Han Lab's BSA project, and VRAM juggling through park and unpark commands because the drafter and target cannot comfortably coexist in 24 GB during the whole request.
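As a quick sanity check on the "10x" framing, the reported timings can be turned into ratios directly. The snippet below uses only the four numbers quoted above from Luce's README; the helper name speedup is just for this example, and 257.0 stands in for the "roughly 257 seconds" figure.

```python
def speedup(baseline_s: float, optimized_s: float) -> float:
    """Ratio of baseline time to optimized time (higher is better)."""
    if optimized_s <= 0:
        raise ValueError("optimized time must be positive")
    return baseline_s / optimized_s


# Project-reported time to first token, in seconds: (llama.cpp, PFlash).
reported = {
    "64K": (134.95, 13.5),
    "128K": (257.0, 24.8),
}

for label, (baseline, optimized) in reported.items():
    print(f"{label}: {speedup(baseline, optimized):.2f}x")
```

On these figures the ratio is just under 10x at 64K and a little above 10x at 128K, so the headline number is a fair rounding of what the project itself reports.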

That is also where the caveats start.

The repo's quality story is still narrow. The headline evidence is Needle in a Haystack retrieval, not a broad suite of long-context reasoning or instruction-following tests. Reddit commenters immediately picked up on that. One of the top replies asked the obvious question: yes, it is 10x faster, but how much dumber is it? Another commenter summarized the tradeoff more bluntly: you do not usually get a 10x gain in a mature area without paying somewhere else. That skepticism is healthy. Prefill compression is a bet that the discarded parts of the prompt were not doing useful work.

There is also a reproducibility gap. The code is open, and that matters. But open code is not the same thing as confirmed numbers. One Reddit commenter said the stack was hitting OOM on a 4090. Hacker News picked up the project, but only lightly, with little independent replication yet. For now, the performance claims should be treated as project-reported results from a real implementation, not as a settled new baseline for everyone running llama.cpp-class workloads.

Still, even with those caveats, PFlash points at a more interesting direction than most local model launches. The local ecosystem has spent the last year arguing about larger context windows as if context capacity and context usability were the same thing. They are not. A 128K prompt that takes four minutes before the first visible token is technically impressive and operationally annoying. Developers do not experience context windows as abstract limits. They experience them as waiting.

The deeper story here is about systems design, not model bragging rights. If local inference teams can turn long-context prefill into a selective, lossy, hardware-aware stage, then the bottleneck shifts. The competition stops being just "who has the bigger context window" and becomes "who can make long context cheap enough to use without regret." That is a much better race.

PFlash does not prove the problem is solved. It does show where the next serious local-inference gains may come from. Not another giant checkpoint, not another benchmark collage, but more aggressive work on what the expensive model actually needs to read.

If you care about local models as real tools instead of lab demos, that is the part worth watching.

Sources

Reddit, r/LocalLLaMA: PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090
Luce GitHub repo: Luce-Org/lucebox-hub
Luce PFlash README: pflash/README.md
arXiv: Speculative Prefill: Turbocharging TTFT with Lightweight and Training-Free Token Importance Estimation
arXiv: Cross-Family Speculative Prefill
Hacker News submission: PFlash: 10x prefill speedup over llama.cpp at 128K on a RTX 3090