# Benchmarks

Criterion bench numbers — FGN CPU vs CUDA, distribution sampling speedups, and the all-backends matrix (CPU / cubecl / Metal / Accelerate).
The library uses `criterion` for performance tracking, with named baselines per release for regression detection.
## Bench suites
The workspace ships 28 bench files under `benches/`. The default-feature suites cover stochastic processes, distribution sampling, pricers, risk, microstructure, and lattice methods. The remaining suites are feature-gated to backends or optional dependencies.
### Default-feature suites

`distributions`, `fgn_fbm`, `process_generation`, `gn_batching`, `dist_multicore`, `option`, `risk`, `instruments`, `microstructure`, `credit`, `cashflows`, `calendar`, `lattice`, `market`, `filtering`, `realized`, `poisson_process`, `mle`, `econometrics`, `slv`, `rl_rough`

### Feature-gated suites
| Bench | Required features |
|---|---|
| `fgn_gpu` | `gpu` |
| `fgn_cuda_native` | `cuda-native` |
| `fgn_metal` | `metal` |
| `fgn_accelerate` | `accelerate` |
| `fgn_all_backends` | `gpu-wgpu`, `metal`, `accelerate` |
| `factors` | `openblas` |
| `hotpath_profile` | `hotpath` |
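Feature gating lives in the Cargo manifest. A `[[bench]]` entry for one of the gated suites might look like this (an illustrative sketch; the crate's actual manifest may differ):

```toml
# Illustrative sketch: gate a bench on its backend feature so that
# `cargo bench` without the feature skips the suite entirely.
[[bench]]
name = "fgn_cuda_native"
harness = false                      # criterion supplies its own harness
required-features = ["cuda-native"]
```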
## FGN — CPU vs CUDA native
`cuda-native` backend: `cudarc` + cuFFT + fused Philox RNG kernel (no `.cu` files, no `nvcc`). Environment: NVIDIA GPU, CUDA 12.x, Rust nightly, `--release` with LTO.

```bash
cargo bench --features cuda-native --bench fgn_cuda_native
```

### Single path (f32, H = 0.7)
| n | CPU sample | CUDA sample_cuda_native(1) | Speedup |
|---|---|---|---|
| 1,024 | 8.1 µs | 46 µs | 0.18× |
| 4,096 | 35 µs | 84 µs | 0.42× |
| 16,384 | 147 µs | 110 µs | 1.3× |
| 65,536 | 850 µs | 227 µs | 3.7× |
### Batch (`sample_par(m)` vs `sample_cuda_native(m)`, f32, H = 0.7)
| n, m | CPU sample_par | CUDA sample_cuda_native | Speedup |
|---|---|---|---|
| 4,096, 32 | 147 µs | 117 µs | 1.3× |
| 4,096, 512 | 1.78 ms | 2.37 ms | 0.75× |
| 65,536, 128 | 12.6 ms | 10.5 ms | 1.2× |
| 65,536, 1 k | 102 ms | 93 ms | 1.1× |
CUDA wins for large n (≥ 16 k) and stays competitive on n = 65 k batches. For medium n, CPU rayon parallelism dominates because it pays no host-device transfer overhead.
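The cross-over can be read off numerically. The snippet below feeds the single-path timings from the table above into a small helper that reports the first n where CUDA beats the CPU (the helper is illustrative, not part of stochastic-rs):

```rust
/// Return the smallest `n` at which the GPU timing drops below the CPU timing.
/// Illustrative helper, not a stochastic-rs API.
fn crossover(cpu: &[(usize, f64)], gpu: &[(usize, f64)]) -> Option<usize> {
    for (&(n, c), &(_, g)) in cpu.iter().zip(gpu.iter()) {
        if g < c {
            return Some(n);
        }
    }
    None
}

fn main() {
    // Single-path timings (µs) from the table above.
    let cpu = [(1024, 8.1), (4096, 35.0), (16384, 147.0), (65536, 850.0)];
    let gpu = [(1024, 46.0), (4096, 84.0), (16384, 110.0), (65536, 227.0)];
    assert_eq!(crossover(&cpu, &gpu), Some(16384));
    println!("CUDA overtakes CPU at n = 16384");
}
```

The same helper works on any backend pair measured by `fgn_all_backends` below; the threshold is machine-specific.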
## All backends head-to-head
`fgn_all_backends` (requires `gpu-wgpu`, `metal`, and `accelerate`) compares CPU, cubecl-wgpu, Metal, and Accelerate on the same FGN sampler:
```rust
g.bench_with_input(BenchmarkId::new("cpu", n), …);
g.bench_with_input(BenchmarkId::new("gpu_cubecl", n), …);
g.bench_with_input(BenchmarkId::new("metal", n), …);
g.bench_with_input(BenchmarkId::new("accelerate", n), …);
```

Test grid: n ∈ {1024, 4096, 16384, 65536} for single paths and (n, m) ∈ {(4 096, 32), (4 096, 128), (4 096, 512), …} for batches.
Use this bench to discover the cross-over point on your machine
between scalar / SIMD CPU, CPU-FFT (Accelerate), and GPU.
## Distribution sampling — multicore
Measured with `cargo bench --bench dist_multicore`. Configuration:

- `sample_matrix` benchmark
- 1-thread vs 14-thread rayon pools
- Most distributions: 1024 × 1024; heavy discrete samplers: 512 × 512
| Distribution | Shape | 1T (ms) | MT (ms) | Speedup |
|---|---|---|---|---|
| Normal f64 | 1024 × 1024 | 1.78 | 0.34 | 5.28× |
| Exp f64 | 1024 × 1024 | 1.73 | 0.33 | 5.25× |
| Uniform f64 | 1024 × 1024 | 0.65 | 0.13 | 5.12× |
| Cauchy f64 | 1024 × 1024 | 6.23 | 0.90 | 6.96× |
| LogNormal f64 | 1024 × 1024 | 5.07 | 0.81 | 6.25× |
| Gamma f64 | 1024 × 1024 | 5.20 | 0.72 | 7.19× |
| ChiSq f64 | 1024 × 1024 | 5.06 | 1.22 | 4.14× |
| StudentT f64 | 1024 × 1024 | 7.89 | 1.89 | 4.18× |
| Beta f64 | 1024 × 1024 | 11.85 | 1.68 | 7.04× |
| Weibull f64 | 1024 × 1024 | 13.17 | 1.73 | 7.59× |
| Pareto f64 | 1024 × 1024 | 5.48 | 0.80 | 6.87× |
| InvGauss f64 | 1024 × 1024 | 2.52 | 0.44 | 5.69× |
| NIG f64 | 1024 × 1024 | 5.93 | 0.90 | 6.62× |
| AlphaStable f64 | 1024 × 1024 | 42.52 | 5.36 | 7.94× |
| Poisson i64 | 1024 × 1024 | 2.28 | 0.42 | 5.40× |
| Geometric u64 | 1024 × 1024 | 2.75 | 0.44 | 6.30× |
| Binomial u32 | 512 × 512 | 4.43 | 0.70 | 6.32× |
| Hypergeo u32 | 512 × 512 | 20.99 | 2.76 | 7.60× |
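The Speedup column is simply single-thread time over multi-thread time; dividing by the 14-thread pool size gives parallel efficiency. A quick illustrative computation on the Normal f64 row (not library code):

```rust
// Speedup = 1-thread time / 14-thread time; efficiency = speedup / threads.
// Inputs are the Normal f64 row of the table above, in ms.
fn main() {
    let (t1, t14, threads) = (1.78_f64, 0.34, 14.0);
    let speedup = t1 / t14;
    let efficiency = speedup / threads;
    // The table reports 5.28×; the rounded ms values here give ≈ 5.24×.
    assert!((speedup - 5.24).abs() < 0.05);
    println!("speedup ≈ {speedup:.2}×, efficiency ≈ {:.0}%", efficiency * 100.0);
}
```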
Normal single-thread kernel comparison (`fill_slice`, same run):

- vs `rand_distr` + `SimdRng` — ≈ 1.21× to 1.35×
- vs `rand_distr` + `rand::rng()` — ≈ 4.09× to 4.61×
## Establishing a baseline
```bash
# Save the current build as the "v2" baseline
cargo bench --bench distributions -- --save-baseline v2
cargo bench --bench fgn_fbm -- --save-baseline v2
cargo bench --bench process_generation -- --save-baseline v2
cargo bench --bench option -- --save-baseline v2
# … and so on for the rest of the suites
```

OpenBLAS-gated benches go to a separate baseline name to keep the linear-algebra-heavy suite out of the default comparison:

```bash
cargo bench --bench factors --features openblas -- --save-baseline v2-openblas
```

## Comparing against a baseline
Before merging any PR with non-trivial perf impact:

```bash
cargo bench --bench <bench> -- --baseline v2
```

criterion prints *Performance has improved / regressed* per benchmark with effect size and 95% confidence intervals. Block the merge on any regression of more than 5% (the project's de-facto threshold).
## Reproducing the numbers
To reproduce the FGN-vs-CUDA table you need:

- NVIDIA GPU with CUDA 12.x toolkit
- Rust nightly (for inline-asm in some `cudarc` paths)
- `RUSTFLAGS="-C target-cpu=native"` for the CPU side (otherwise the CPU column above will look slower because `wide`'s `f32x8` falls back to scalar — see Native CPU optimization)
```bash
RUSTFLAGS="-C target-cpu=native" cargo bench \
  --features cuda-native \
  --bench fgn_cuda_native -- --save-baseline local
```

## Adding a benchmark
See the bench-writing SKILL — group naming, parameter sweeps, `[[bench]]` required-features gating, and the "no-println / no-dead-helper" rules.