DFlash: Block Diffusion for Flash Speculative Decoding

DFlash introduces a speculative decoding framework that uses a lightweight block diffusion model to generate draft tokens in parallel, improving inference efficiency for large language models. By producing an entire draft block in a single forward pass and conditioning on context features from the target model, DFlash achieves over 6x lossless acceleration and up to 2.5x higher speedup than EAGLE-3, while preserving output quality and improving GPU utilization.
DFlash Introduces Breakthrough in Speculative Decoding for Large Language Models
A new framework, DFlash, promises significant improvements in the decoding speed of autoregressive large language models (LLMs) by leveraging a lightweight block diffusion model. This approach reduces inference latency and enhances GPU utilization.
DFlash distinguishes itself by generating an entire block of draft tokens in a single forward pass. It conditions the draft model on context features derived from the target LLM, enabling efficient drafting without sacrificing output quality. This design both speeds up drafting and increases the acceptance rate of drafted tokens during verification.
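The draft-then-verify loop described above can be sketched with a toy example. The sketch below is illustrative only: `target_next` stands in for the target LLM's greedy next-token choice, and `draft_block` is a placeholder for DFlash's block drafter (the actual block diffusion model is not shown). The key property it demonstrates is losslessness: every committed token matches what the target would have produced, while one verification pass can commit several tokens.

```python
import random

def target_next(prefix):
    # Toy deterministic "target LLM": next token is a function of the prefix.
    return (sum(prefix) * 31 + len(prefix)) % 50

def draft_block(prefix, block_size, error_rate=0.2, rng=None):
    # Placeholder for a block drafter: proposes block_size tokens at once.
    # With probability error_rate a drafted token diverges from the target,
    # simulating an imperfect draft model.
    rng = rng or random.Random(0)
    block, ctx = [], list(prefix)
    for _ in range(block_size):
        tok = target_next(ctx)
        if rng.random() < error_rate:
            tok = (tok + 1) % 50  # simulate a drafting mistake
        block.append(tok)
        ctx.append(tok)
    return block

def speculative_decode(prompt, num_tokens, block_size=8, seed=0):
    rng = random.Random(seed)
    out = list(prompt)
    target_calls = 0
    while len(out) - len(prompt) < num_tokens:
        block = draft_block(out, block_size, rng=rng)
        target_calls += 1  # one target pass verifies the whole block
        ctx = list(out)
        for tok in block:
            true_tok = target_next(ctx)
            if tok == true_tok:
                ctx.append(tok)          # draft token accepted
            else:
                ctx.append(true_tok)     # correction token comes "free"
                break
        out = ctx
    return out[len(prompt):][:num_tokens], target_calls
```

Because mismatched draft tokens are replaced by the target's own choice, the output is identical to plain greedy decoding, but the number of target forward passes is far smaller than the number of generated tokens.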
Performance Metrics
Experimental results reveal that DFlash achieves over six times lossless acceleration across various models and tasks. Moreover, it offers up to 2.5 times higher speedup compared to EAGLE-3, the current leading speculative decoding method.
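A rough back-of-envelope model (an approximation, not taken from the paper) relates these figures: if each verification pass commits an average of τ tokens and the single draft pass costs a fraction c of one target pass, end-to-end speedup is roughly τ / (1 + c). The helper below illustrates how a high acceptance length and a cheap drafter combine to produce the reported multi-x acceleration.

```python
def estimated_speedup(avg_accepted, draft_cost_ratio):
    # avg_accepted: mean tokens committed per target forward pass (tau)
    # draft_cost_ratio: cost of one draft pass relative to one target pass (c)
    # Simplified model: speedup ~= tau / (1 + c); ignores verification overhead.
    return avg_accepted / (1.0 + draft_cost_ratio)

# Hypothetical numbers: ~6.3 committed tokens per pass with a drafter
# costing 5% of a target pass gives roughly a 6x speedup.
print(round(estimated_speedup(6.3, 0.05), 2))  # → 6.0
```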
📰 Original Source: https://arxiv.org/abs/2602.06036v1
All rights and credit belong to the original publisher.