stochastic-rs

Benchmarks

Criterion bench numbers — FGN CPU vs CUDA, distribution sampling speedups, and the all-backends matrix (CPU / cubecl / Metal / Accelerate).

The library uses criterion for performance tracking, with named baselines per release for regression detection.

Bench suites

The workspace ships 28 bench files under benches/. The default-feature suites cover stochastic processes, distribution sampling, pricers, risk, microstructure, and lattice methods. The remaining suites are feature-gated to backends or optional dependencies.

Default-feature suites

distributions       fgn_fbm        process_generation   gn_batching
dist_multicore      option         risk                 instruments
microstructure      credit         cashflows            calendar
lattice             market         filtering            realized
poisson_process     mle            econometrics         slv
rl_rough

Feature-gated suites

Bench              Required features
fgn_gpu            gpu
fgn_cuda_native    cuda-native
fgn_metal          metal
fgn_accelerate     accelerate
fgn_all_backends   gpu-wgpu, metal, accelerate
factors            openblas
hotpath_profile    hotpath
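
The gating relies on Cargo's `required-features` on a `[[bench]]` target (mentioned again in the bench-writing notes below), which makes Cargo skip the bench entirely unless the feature is enabled. A minimal sketch of what one such manifest entry looks like (field values are illustrative, not copied from the crate):

```toml
# Hypothetical Cargo.toml entry for the cuda-native suite.
[[bench]]
name = "fgn_cuda_native"
harness = false                      # criterion supplies its own main
required-features = ["cuda-native"]  # bench is skipped unless the feature is on
```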

FGN — CPU vs CUDA native

cuda-native backend: cudarc + cuFFT + fused Philox RNG kernel (no .cu files, no nvcc). Environment: NVIDIA GPU, CUDA 12.x, Rust nightly, --release with LTO.

cargo bench --features cuda-native --bench fgn_cuda_native

Single path (f32, H = 0.7)

n        CPU sample   CUDA sample_cuda_native(1)   Speedup
1,024    8.1 µs       46 µs                        0.18×
4,096    35 µs        84 µs                        0.42×
16,384   147 µs       110 µs                       1.3×
65,536   850 µs       227 µs                       3.7×

Batch (sample_par(m) vs sample_cuda_native(m), f32, H = 0.7)

n, m          CPU sample_par   CUDA sample_cuda_native   Speedup
4,096, 32     147 µs           117 µs                    1.3×
4,096, 512    1.78 ms          2.37 ms                   0.75×
65,536, 128   12.6 ms          10.5 ms                   1.2×
65,536, 1 k   102 ms           93 ms                     1.1×

For single paths, CUDA wins at large n (≥ 16 k) and pulls further ahead as n grows; for batches the two backends sit near parity at n = 65 k. At small and medium n, CPU rayon parallelism dominates because it pays zero host–device transfer overhead.
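
The cross-over can be encoded as a simple dispatch heuristic. The threshold below (~16 384 points for a single path) is taken from the table above and is machine-dependent; `pick_backend` is a hypothetical helper for illustration, not part of the library's API.

```rust
/// Hypothetical backend choice for a single-path FGN sample, based on the
/// cross-over measured above (~16 384 points on this machine). Below the
/// threshold the CPU wins because there is no host-device transfer; above
/// it, the fused cuFFT + Philox pipeline amortizes the copy.
#[derive(Debug, PartialEq)]
enum Backend {
    Cpu,
    CudaNative,
}

fn pick_backend(n: usize) -> Backend {
    if n >= 16_384 {
        Backend::CudaNative
    } else {
        Backend::Cpu
    }
}
```

Re-measure on your own hardware before hard-coding any threshold; it moves with PCIe bandwidth, driver version, and CPU SIMD width.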

All backends head-to-head

fgn_all_backends (requires gpu-wgpu, metal, and accelerate) compares CPU, cubecl-wgpu, Metal, and Accelerate on the same FGN sampler:

g.bench_with_input(BenchmarkId::new("cpu", n),         …);
g.bench_with_input(BenchmarkId::new("gpu_cubecl", n),  …);
g.bench_with_input(BenchmarkId::new("metal", n),       …);
g.bench_with_input(BenchmarkId::new("accelerate", n),  …);

Test grid: n ∈ {1024, 4096, 16384, 65536} for single paths and (n, m) ∈ {(4 096, 32), (4 096, 128), (4 096, 512), …} for batches. Use this bench to discover the cross-over point on your machine between scalar / SIMD CPU, CPU-FFT (Accelerate), and GPU.

Distribution sampling — multicore

Measured with cargo bench --bench dist_multicore. Configuration:

  • sample_matrix benchmark
  • 1-thread vs 14-thread rayon pools
  • Most distributions: 1024 × 1024; heavy discrete samplers: 512 × 512

Distribution      Shape         1T (ms)   MT (ms)   Speedup
Normal f64        1024 × 1024   1.78      0.34      5.28×
Exp f64           1024 × 1024   1.73      0.33      5.25×
Uniform f64       1024 × 1024   0.65      0.13      5.12×
Cauchy f64        1024 × 1024   6.23      0.90      6.96×
LogNormal f64     1024 × 1024   5.07      0.81      6.25×
Gamma f64         1024 × 1024   5.20      0.72      7.19×
ChiSq f64         1024 × 1024   5.06      1.22      4.14×
StudentT f64      1024 × 1024   7.89      1.89      4.18×
Beta f64          1024 × 1024   11.85     1.68      7.04×
Weibull f64       1024 × 1024   13.17     1.73      7.59×
Pareto f64        1024 × 1024   5.48      0.80      6.87×
InvGauss f64      1024 × 1024   2.52      0.44      5.69×
NIG f64           1024 × 1024   5.93      0.90      6.62×
AlphaStable f64   1024 × 1024   42.52     5.36      7.94×
Poisson i64       1024 × 1024   2.28      0.42      5.40×
Geometric u64     1024 × 1024   2.75      0.44      6.30×
Binomial u32      512 × 512     4.43      0.70      6.32×
Hypergeo u32      512 × 512     20.99     2.76      7.60×

Normal single-thread kernel comparison (fill_slice, same run):

  • vs rand_distr + SimdRng — ≈ 1.21× to 1.35×
  • vs rand_distr + rand::rng() — ≈ 4.09× to 4.61×
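
The row-partitioning idea behind the multicore numbers can be sketched without the library: split the output matrix into per-thread chunks and fill each independently. This is NOT the crate's `sample_matrix`; it is a minimal std::thread sketch with a toy SplitMix64 + Box–Muller sampler, purely to illustrate the parallel shape of the work.

```rust
use std::thread;

// Toy SplitMix64 step: cheap, statistically decent, easy to seed per thread.
fn splitmix64(state: &mut u64) -> u64 {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

// Uniform in [0, 1) from the top 53 bits.
fn uniform01(state: &mut u64) -> f64 {
    (splitmix64(state) >> 11) as f64 / (1u64 << 53) as f64
}

// Fill a rows x cols matrix with standard normals, splitting the flat buffer
// across `threads` workers (the role rayon plays in the real benchmark).
fn fill_normal(rows: usize, cols: usize, threads: usize) -> Vec<f64> {
    let mut out = vec![0.0_f64; rows * cols];
    let chunk = (rows * cols + threads - 1) / threads;
    thread::scope(|s| {
        for (i, slice) in out.chunks_mut(chunk).enumerate() {
            s.spawn(move || {
                let mut state = 0xDEAD_BEEF_u64 ^ (i as u64); // per-thread seed
                for pair in slice.chunks_mut(2) {
                    // Box-Muller: two uniforms -> up to two normals.
                    let u1 = uniform01(&mut state).max(1e-300);
                    let u2 = uniform01(&mut state);
                    let r = (-2.0 * u1.ln()).sqrt();
                    pair[0] = r * (std::f64::consts::TAU * u2).cos();
                    if pair.len() == 2 {
                        pair[1] = r * (std::f64::consts::TAU * u2).sin();
                    }
                }
            });
        }
    });
    out
}
```

The real implementation additionally uses SIMD kernels and a work-stealing pool, which is where the 5–8× figures (beyond raw thread count scaling on a 14-thread pool) come from.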

Establishing a baseline

# Save the current build as the "v2" baseline
cargo bench --bench distributions      -- --save-baseline v2
cargo bench --bench fgn_fbm            -- --save-baseline v2
cargo bench --bench process_generation -- --save-baseline v2
cargo bench --bench option             -- --save-baseline v2
# … and so on for the rest of the suites

OpenBLAS-gated benches go to a separate baseline name to keep the linear-algebra-heavy suite out of the default comparison:

cargo bench --bench factors --features openblas -- --save-baseline v2-openblas

Comparing against a baseline

Before merging any PR with non-trivial perf impact:

cargo bench --bench <bench> -- --baseline v2

criterion prints Performance has improved / regressed per benchmark with effect size and 95% confidence intervals. Block the merge on any regression of more than 5% (the project's de-facto threshold).
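
The 5% cutoff is a project convention, not something criterion enforces; if you script the gate yourself, the check reduces to a one-line comparison of mean times (an illustrative helper, not a criterion API):

```rust
/// Illustrative 5% regression gate: true when the new mean time exceeds the
/// baseline mean by more than the project's de-facto threshold. Criterion
/// reports the effect size itself; this only encodes the merge cutoff.
fn is_regression(baseline_ns: f64, new_ns: f64) -> bool {
    (new_ns - baseline_ns) / baseline_ns > 0.05
}
```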

Reproducing the numbers

To reproduce the FGN-vs-CUDA table you need:

  • NVIDIA GPU with CUDA 12.x toolkit
  • Rust nightly (for inline-asm in some cudarc paths)
  • RUSTFLAGS="-C target-cpu=native" for the CPU side (otherwise the CPU column above will look slower because wide's f32x8 falls back to scalar — see Native CPU optimization)

RUSTFLAGS="-C target-cpu=native" cargo bench \
  --features cuda-native \
  --bench fgn_cuda_native -- --save-baseline local

Adding a benchmark

See the bench-writing SKILL — group naming, parameter sweep, [[bench]] required-features gating, and the "no-println / no-dead-helper" rules.
