Criterion bench numbers — FGN CPU vs CUDA, distribution sampling speedups, and the all-backends matrix (CPU / cubecl / Metal / Accelerate).

Benchmarks

The library uses criterion for performance tracking, with named baselines per release for regression detection.

Bench suites

The workspace ships 28 bench files under benches/. The default-feature suites cover stochastic processes, distribution sampling, pricers, risk, microstructure, and lattice methods. The remaining suites are feature-gated to backends or optional dependencies.

Default-feature suites

distributions       fgn_fbm        process_generation   gn_batching
dist_multicore      option         risk                 instruments
microstructure      credit         cashflows            calendar
lattice             market         filtering            realized
poisson_process     mle            econometrics         slv
rl_rough

Feature-gated suites

Bench	Required features
`fgn_gpu`	`gpu`
`fgn_cuda_native`	`cuda-native`
`fgn_metal`	`metal`
`fgn_accelerate`	`accelerate`
`fgn_all_backends`	`gpu-wgpu`, `metal`, `accelerate`
`factors`	`openblas`
`hotpath_profile`	`hotpath`

FGN — CPU vs CUDA native

cuda-native backend: cudarc + cuFFT + fused Philox RNG kernel (no .cu files, no nvcc). Environment: NVIDIA GPU, CUDA 12.x, Rust nightly, --release with LTO.

cargo bench --features cuda-native --bench fgn_cuda_native

Single path (`f32`, H = 0.7)

n	CPU `sample`	CUDA `sample_cuda_native(1)`	Speedup
1,024	8.1 µs	46 µs	0.18×
4,096	35 µs	84 µs	0.42×
16,384	147 µs	110 µs	1.3×
65,536	850 µs	227 µs	3.7×

Batch (`sample_par(m)` vs `sample_cuda_native(m)`, `f32`, H = 0.7)

n, m	CPU `sample_par`	CUDA `sample_cuda_native`	Speedup
4,096, 32	147 µs	117 µs	1.3×
4,096, 512	1.78 ms	2.37 ms	0.75×
65,536, 128	12.6 ms	10.5 ms	1.2×
65,536, 1 k	102 ms	93 ms	1.1×

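For single paths, CUDA wins from n = 16,384 upward (1.3×, rising to 3.7× at n = 65,536) and loses below that. For batches, CUDA stays competitive at n = 65,536 (1.1× to 1.2×), while CPU rayon parallelism wins the large n = 4,096 batch (0.75× at m = 512) because it has no transfer overhead.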
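
The Speedup column is the CPU time divided by the CUDA time, so values below 1× mean the CPU was faster. The sketch below reproduces that arithmetic for the single-path rows above and locates the cross-over size. The names `speedup`, `crossover_n`, and `SINGLE_PATH` are illustrative only and are not part of the library.

```rust
/// Speedup of the GPU over the CPU: CPU time divided by GPU time.
/// Values below 1.0 mean the CPU was faster.
pub fn speedup(cpu_us: f64, gpu_us: f64) -> f64 {
    cpu_us / gpu_us
}

/// Smallest `n` in `rows` at which the GPU is at least as fast as the CPU.
/// Each row is `(n, cpu_us, gpu_us)`; rows are assumed to be sorted by `n`.
pub fn crossover_n(rows: &[(usize, f64, f64)]) -> Option<usize> {
    rows.iter()
        .find(|&&(_, cpu, gpu)| speedup(cpu, gpu) >= 1.0)
        .map(|&(n, _, _)| n)
}

/// Single-path timings in microseconds, copied from the table above.
pub const SINGLE_PATH: [(usize, f64, f64); 4] = [
    (1_024, 8.1, 46.0),
    (4_096, 35.0, 84.0),
    (16_384, 147.0, 110.0),
    (65_536, 850.0, 227.0),
];
```

With the published numbers, `crossover_n(&SINGLE_PATH)` returns `Some(16_384)`. Feeding in timings from your own run gives the cross-over for your hardware.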

All backends head-to-head

fgn_all_backends (requires gpu-wgpu, metal, and accelerate) compares CPU, cubecl-wgpu, Metal, and Accelerate on the same FGN sampler:

g.bench_with_input(BenchmarkId::new("cpu", n),         …);
g.bench_with_input(BenchmarkId::new("gpu_cubecl", n),  …);
g.bench_with_input(BenchmarkId::new("metal", n),       …);
g.bench_with_input(BenchmarkId::new("accelerate", n),  …);

Test grid: n ∈ {1024, 4096, 16384, 65536} for single paths and (n, m) ∈ {(4096, 32), (4096, 128), (4096, 512), …} for batches. Use this bench to find the cross-over point on your machine between scalar / SIMD CPU, CPU-FFT (Accelerate), and GPU.

Distribution sampling — multicore

Measured with cargo bench --bench dist_multicore. Configuration:

sample_matrix benchmark
1-thread vs 14-thread rayon pools
Most distributions: 1024 × 1024; heavy discrete samplers: 512 × 512

Distribution	Shape	1T (ms)	MT (ms)	Speedup
Normal `f64`	1024 × 1024	1.78	0.34	5.28×
Exp `f64`	1024 × 1024	1.73	0.33	5.25×
Uniform `f64`	1024 × 1024	0.65	0.13	5.12×
Cauchy `f64`	1024 × 1024	6.23	0.90	6.96×
LogNormal `f64`	1024 × 1024	5.07	0.81	6.25×
Gamma `f64`	1024 × 1024	5.20	0.72	7.19×
ChiSq `f64`	1024 × 1024	5.06	1.22	4.14×
StudentT `f64`	1024 × 1024	7.89	1.89	4.18×
Beta `f64`	1024 × 1024	11.85	1.68	7.04×
Weibull `f64`	1024 × 1024	13.17	1.73	7.59×
Pareto `f64`	1024 × 1024	5.48	0.80	6.87×
InvGauss `f64`	1024 × 1024	2.52	0.44	5.69×
NIG `f64`	1024 × 1024	5.93	0.90	6.62×
AlphaStable `f64`	1024 × 1024	42.52	5.36	7.94×
Poisson `i64`	1024 × 1024	2.28	0.42	5.40×
Geometric `u64`	1024 × 1024	2.75	0.44	6.30×
Binomial `u32`	512 × 512	4.43	0.70	6.32×
Hypergeo `u32`	512 × 512	20.99	2.76	7.60×
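
Two derived figures help when reading the table: sampling throughput and parallel efficiency (speedup divided by thread count). The sketch below computes both from the printed timings. The function names are illustrative, not library API. Values recomputed from the rounded millisecond columns can differ slightly from the Speedup column, presumably because that column was derived from unrounded timings.

```rust
/// Samples generated per second for a `rows × cols` matrix
/// that took `ms` milliseconds to fill.
pub fn samples_per_sec(rows: usize, cols: usize, ms: f64) -> f64 {
    (rows * cols) as f64 / (ms / 1_000.0)
}

/// Parallel efficiency: measured speedup divided by the thread count.
/// A value of 1.0 would be perfect linear scaling.
pub fn parallel_efficiency(t1_ms: f64, mt_ms: f64, threads: usize) -> f64 {
    (t1_ms / mt_ms) / threads as f64
}
```

For example, `samples_per_sec(1024, 1024, 0.34)` puts the multi-threaded Normal `f64` row at roughly 3.1 × 10⁹ samples per second, and `parallel_efficiency(42.52, 5.36, 14)` puts AlphaStable at about 57% of linear scaling on the 14-thread pool.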

Normal single-thread kernel comparison (fill_slice, same run); each figure is the speedup over the named alternative:

vs rand_distr + SimdRng — ≈ 1.21× to 1.35×
vs rand_distr + rand::rng() — ≈ 4.09× to 4.61×

Establishing a baseline

# Save the current build as the "v2" baseline
cargo bench --bench distributions      -- --save-baseline v2
cargo bench --bench fgn_fbm            -- --save-baseline v2
cargo bench --bench process_generation -- --save-baseline v2
cargo bench --bench option             -- --save-baseline v2
# … and so on for the rest of the suites

OpenBLAS-gated benches go to a separate baseline name to keep the linear-algebra-heavy suite out of the default comparison:

cargo bench --bench factors --features openblas -- --save-baseline v2-openblas

Comparing against a baseline

Before merging any PR with non-trivial perf impact:

cargo bench --bench <bench> -- --baseline v2

criterion prints "Performance has improved" or "Performance has regressed" for each benchmark, together with the effect size and a 95% confidence interval. Block the merge on any regression of more than 5% (the project's de facto threshold).
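
The 5% rule is simple arithmetic on the baseline and the new timing. criterion computes the change itself; the sketch below only illustrates the policy applied to point estimates. The names `relative_change`, `REGRESSION_THRESHOLD`, and `blocks_merge` are hypothetical and exist neither in criterion nor in this library.

```rust
/// Relative change of `new` against `baseline` (same time unit for both).
/// Positive means slower: 0.08 is an 8% regression.
pub fn relative_change(baseline: f64, new: f64) -> f64 {
    (new - baseline) / baseline
}

/// Policy from the text: regressions above 5% block the merge.
pub const REGRESSION_THRESHOLD: f64 = 0.05;

/// True when the new timing regresses past the threshold.
pub fn blocks_merge(baseline: f64, new: f64) -> bool {
    relative_change(baseline, new) > REGRESSION_THRESHOLD
}
```

A benchmark that moves from 147 µs to 160 µs is about 8.8% slower and would block the merge. A move to 150 µs (about 2%) would not.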

Reproducing the numbers

To reproduce the FGN-vs-CUDA table you need:

NVIDIA GPU with CUDA 12.x toolkit
Rust nightly (for inline-asm in some cudarc paths)
RUSTFLAGS="-C target-cpu=native" for the CPU side; without it the CPU column above will look slower, because wide's f32x8 falls back to scalar (see Native CPU optimization)

RUSTFLAGS="-C target-cpu=native" cargo bench \
  --features cuda-native \
  --bench fgn_cuda_native -- --save-baseline local
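
To check whether a build actually picked up the native flags, you can compare what the binary was compiled with against what the host CPU supports. The sketch below uses only the standard library. AVX is used as a representative 256-bit feature; which target features `wide` keys on is an assumption here, and the function names are illustrative.

```rust
/// True when this binary was compiled with AVX enabled, for example via
/// `-C target-cpu=native` on an AVX-capable machine.
pub fn compiled_with_avx() -> bool {
    cfg!(target_feature = "avx")
}

/// True when the CPU running this binary reports AVX support.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
pub fn host_supports_avx() -> bool {
    std::arch::is_x86_feature_detected!("avx")
}

/// Non-x86 targets have no AVX, so report `false` there.
#[cfg(not(any(target_arch = "x86", target_arch = "x86_64")))]
pub fn host_supports_avx() -> bool {
    false
}

/// True when the host could run AVX code but the build did not enable it,
/// which suggests the bench was compiled without the native flags.
pub fn missing_native_flags() -> bool {
    host_supports_avx() && !compiled_with_avx()
}
```

If `missing_native_flags()` returns `true` in a bench binary, the CPU timings are likely pessimistic relative to the table above.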

Adding a benchmark

See the bench-writing SKILL — group naming, parameter sweep, [[bench]] required-features gating, and the "no-println / no-dead-helper" rules.
