Sol-RL

FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling

Unlocking Massive Rollout Scaling for Diffusion RL at a Fraction of the Cost

Yitong Li1,2*, Junsong Chen1,2*, Shuchen Xue1*, Pengcuo Zeren1, Siyuan Fu1, Dinghao Yang1, Yangyang Tang1,
Junjie Bai1, Ping Luo2, Song Han1,3, Enze Xie1

1NVIDIA   2HKU   3MIT

*Equal contribution



About

RL-based post-training is a promising paradigm for aligning diffusion models with human preferences, and scaling rollout group size yields clear gains — but at prohibitive cost on large models like FLUX.1-12B. We propose Sol-RL (Speed-of-light RL), an FP4-empowered two-stage RL framework that decouples exploration from optimization: high-throughput NVFP4 rollouts generate a massive candidate pool, from which only the most contrastive subset is regenerated in BF16 for policy updates. This algorithm-hardware co-design accelerates rollouts while preserving training integrity. Experiments across SANA, FLUX.1, and SD3.5-L show superior alignment with up to 4.64× faster convergence.

Figure: Sol-RL enables efficient and high-fidelity text-to-image alignment.

  • 4.64× convergence speedup
  • ~1% performance gap vs. full-BF16 training
  • 3 foundation models × 4 reward metrics validated

Core Design: Decoupled Two-Stage Pipeline

We decouple high-throughput FP4 exploration from selective BF16 training, achieving up to 2.4× rollout acceleration with merely 2% computational overhead while avoiding quantization-induced corruption.

Figure: Overview of the decoupled two-stage pipeline.

Methodology

Scaling rollouts improves alignment but creates an inference bottleneck, and naive FP4 quantization compromises visual fidelity. Sol-RL introduces a decoupled two-stage architecture that uses FP4 exclusively for exploration while preserving BF16 for optimization.

Rollout Scaling: Promise and Bottleneck

Under selective training, only the best-K and worst-K samples drive gradient updates, so larger rollout groups yield better learning signals at fixed training cost. However, the vast majority of generated candidates are discarded — revealing massive redundancy in computing the full pool at high precision.
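The selection step can be sketched as follows. This is a minimal illustration with synthetic rewards; `select_contrastive` is a hypothetical helper, not the paper's code, and the split of the selection budget into best-k and worst-k halves is an assumption:

```python
import numpy as np

def select_contrastive(rewards: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the best-k and worst-k samples by reward.

    Under selective training only these 2k samples enter the policy
    update, so training cost stays fixed at 2k no matter how large
    the rollout pool N grows.
    """
    order = np.argsort(rewards)                     # ascending by reward
    return np.concatenate([order[:k], order[-k:]])  # worst-k + best-k

rng = np.random.default_rng(0)
pool = rng.normal(size=96)                # proxy rewards for N=96 rollouts
idx = select_contrastive(pool, k=12)      # 2k = 24 samples kept
print(len(idx), "of", len(pool), "samples kept")
```

Growing N only enlarges the pool that `argsort` ranks; the 2k samples that reach the optimizer, and hence the training cost, are unchanged.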

Key Insight: Proxy Reward Ranking via FP4

In ODE-style diffusion sampling, the semantic outcome is dictated by the initial noise seed. NVFP4 preserves this structure despite pixel-level deviations, so FP4 rollouts serve as reliable proxies for reward ranking — enabling us to identify the most informative seeds cheaply, then regenerate only those in BF16.
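A toy simulation makes the proxy-ranking argument concrete. Here FP4's pixel-level deviation is modeled as small additive noise on a synthetic reward; the noise scale and reward distribution are assumptions for illustration, not measurements from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)
N, K = 96, 24

# Hypothetical setup: bf16_reward is the "true" reward per noise seed;
# fp4_reward is the cheap proxy, perturbed by small quantization noise.
bf16_reward = rng.normal(size=N)
fp4_reward = bf16_reward + 0.05 * rng.normal(size=N)

def top_bottom(r: np.ndarray, k: int) -> set:
    """Seeds selected as most contrastive: bottom-k/2 and top-k/2."""
    order = np.argsort(r)
    return set(order[: k // 2]) | set(order[-(k - k // 2):])

sel_true = top_bottom(bf16_reward, K)
sel_proxy = top_bottom(fp4_reward, K)
overlap = len(sel_true & sel_proxy) / K
print(f"selection agreement: {overlap:.0%}")
```

Because selection only needs the extremes of the ranking, small per-sample deviations rarely move a seed across the top/bottom boundary, so the proxy's selected set nearly coincides with the exact one.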

Two-Stage Pipeline

  • Stage 1: FP4 Explore — Sample N=96 noises and generate candidates via the NVFP4-quantized solver (up to 4× TFLOPs vs. BF16) to compute proxy rewards. Filter to isolate the top/bottom-K most contrastive seeds.
  • Stage 2: BF16 Train — Regenerate the selected K=24 seeds in BF16 with the full step budget. Optimize the policy on these high-fidelity samples, then re-quantize weights to NVFP4 for the next iteration.

In summary, Sol-RL harnesses FP4 throughput for massive-scale exploration and reserves BF16 compute strictly for the K samples that dictate the policy update — introducing merely ~2% overhead.
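The full loop can be sketched as below. Everything here is a stand-in: the real system runs NVFP4/BF16 diffusion solvers, a learned reward model, and a GRPO-style update, while this toy uses vectors and a hand-written reward purely to show the two-stage control flow:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K, DIM = 96, 24, 8          # pool size, selected seeds, toy latent dim
policy = np.zeros(DIM)         # stand-in for the policy's parameters

def generate(noises: np.ndarray, precision: str) -> np.ndarray:
    """Toy 'solver': quantization modeled as a small perturbation."""
    out = noises + policy
    if precision == "fp4":
        out = out + 0.01 * rng.normal(size=out.shape)
    return out

def reward(samples: np.ndarray) -> np.ndarray:
    return -np.linalg.norm(samples - 1.0, axis=1)  # toy target: all-ones

for step in range(3):
    # Stage 1: FP4 explore — large pool, cheap proxy rewards.
    noises = rng.normal(size=(N, DIM))
    proxy = reward(generate(noises, "fp4"))
    order = np.argsort(proxy)
    seeds = np.concatenate([order[: K // 2], order[-K // 2:]])

    # Stage 2: BF16 train — regenerate only the K contrastive seeds
    # at full precision, then update the policy on them.
    samples = generate(noises[seeds], "bf16")
    r = reward(samples)
    adv = (r - r.mean()) / (r.std() + 1e-8)         # group-normalized
    policy += 0.1 * (adv[:, None] * (samples - policy)).mean(axis=0)
```

Only K of the N generations ever run at BF16, which is where the overall cost saving comes from; in the real pipeline the updated weights are then re-quantized to NVFP4 before the next exploration round.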

Overall Performance

Main Results on FLUX.1

All rewards: higher is better; Δ is the gain over the FLUX.1 base model without CFG.

| Method | ImageReward | CLIPScore | PickScore | HPSv2 |
|---|---|---|---|---|
| Base (w/o CFG) | 0.455 | 0.2630 | 0.8096 | 0.2566 |
| DanceGRPO | 1.4937 (+1.0387) | 0.2898 (+0.0268) | 0.8807 (+0.0711) | 0.3552 (+0.0986) |
| FlowGRPO | 1.5331 (+1.0781) | 0.2884 (+0.0254) | 0.8743 (+0.0647) | 0.3501 (+0.0935) |
| AWM | 1.6693 (+1.2143) | 0.3039 (+0.0409) | 0.8842 (+0.0746) | 0.3664 (+0.1098) |
| DiffusionNFT | 1.6707 (+1.2157) | 0.2991 (+0.0361) | 0.8852 (+0.0756) | 0.3613 (+0.1047) |
| Sol-RL (Ours) | 1.7636 (+1.3086) | 0.3089 (+0.0459) | 0.8932 (+0.0836) | 0.3688 (+0.1122) |
Figure: Learning curves across metrics and models.

Efficiency Comparison (seconds/iteration)

| Base Model | Rollout Naive | Rollout Ours | Speedup | E2E Naive | E2E Ours | Speedup |
|---|---|---|---|---|---|---|
| FLUX.1 | 184 | 79 | 2.33× | 274 | 169 | 1.62× |
| SD3.5-Large | 451 | 187 | 2.41× | 691 | 427 | 1.61× |
| SANA | 65 | 46 | 1.41× | 95 | 76 | 1.25× |
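As a quick sanity check, the speedup columns can be recomputed from the raw per-iteration timings (the SD3.5-Large E2E ratio rounds to 1.62 from these integers, so the reported 1.61 presumably reflects unrounded timings):

```python
# Timings in seconds/iteration, copied from the efficiency table:
# (naive, ours) for the rollout stage and for the end-to-end iteration.
timings = {
    "FLUX.1":      {"rollout": (184, 79),  "e2e": (274, 169)},
    "SD3.5-Large": {"rollout": (451, 187), "e2e": (691, 427)},
    "SANA":        {"rollout": (65, 46),   "e2e": (95, 76)},
}

speedups = {
    model: {stage: naive / ours for stage, (naive, ours) in t.items()}
    for model, t in timings.items()
}

for model, s in speedups.items():
    print(f"{model:12s} rollout {s['rollout']:.2f}x  e2e {s['e2e']:.2f}x")
```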

BibTeX

@article{solrl2026,
  title={FP4 Explore, BF16 Train: Diffusion Reinforcement Learning via Efficient Rollout Scaling},
  author={Li, Yitong and Chen, Junsong and Xue, Shuchen and Zeren, Pengcuo and Fu, Siyuan and Yang, Dinghao and Tang, Yangyang and Bai, Junjie and Luo, Ping and Han, Song and Xie, Enze},
  year={2026}
}