stochastic-rs

Benchmarks

Criterion bench numbers — FGN CPU vs CUDA, distribution sampling speedups, and the all-backends matrix (CPU / cubecl / Metal / Accelerate).

Benchmarks

The library uses criterion for performance tracking, with named baselines per release for regression detection.

Bench suites

The workspace ships 28 bench files under benches/. The default-feature suites cover stochastic processes, distribution sampling, pricers, risk, microstructure, and lattice methods. The remaining suites are feature-gated to backends or optional dependencies.

Default-feature suites

distributions       fgn_fbm        process_generation   gn_batching
dist_multicore      option         risk                 instruments
microstructure      credit         cashflows            calendar
lattice             market         filtering            realized
poisson_process     mle            econometrics         slv
rl_rough

Feature-gated suites

BenchRequired features
fgn_gpugpu
fgn_cuda_nativecuda-native
fgn_metalmetal
fgn_accelerateaccelerate
fgn_all_backendsgpu-wgpu, metal, accelerate
factorsopenblas
hotpath_profilehotpath

FGN — CPU vs CUDA native

cuda-native backend: cudarc + cuFFT + fused Philox RNG kernel (no .cu files, no nvcc). Environment: Intel i9-285K + NVIDIA RTX 4070 SUPER, CUDA 12.x, Rust nightly, --release with LTO.

cargo bench --features cuda-native --bench fgn_cuda_native

Single path (f32, H = 0.7)

nCPU sampleCUDA sample_cuda_native(1)Speedup
1,0248.1 µs46 µs0.18×
4,09635 µs84 µs0.42×
16,384147 µs110 µs1.3×
65,536850 µs227 µs3.7×

Batch (sample_par(m) vs sample_cuda_native(m), f32, H = 0.7)

n, mCPU sample_parCUDA sample_cuda_nativeSpeedup
4,096, 32147 µs117 µs1.3×
4,096, 5121.78 ms2.37 ms0.75×
65,536, 12812.6 ms10.5 ms1.2×
65,536, 1 k102 ms93 ms1.1×

CUDA wins for large n (≥ 16 k) and stays competitive at n = 65 k batches. CPU rayon parallelism dominates for medium n due to zero transfer overhead.

All backends head-to-head

fgn_all_backends (requires gpu-wgpu, metal, and accelerate) compares CPU, cubecl-wgpu, Metal, and Accelerate on the same FGN sampler:

g.bench_with_input(BenchmarkId::new("cpu", n),         …);
g.bench_with_input(BenchmarkId::new("gpu_cubecl", n),  …);
g.bench_with_input(BenchmarkId::new("metal", n),       …);
g.bench_with_input(BenchmarkId::new("accelerate", n),  …);

Test grid: n ∈ {1024, 4096, 16384, 65536} for single paths and (n, m) ∈ {(4 096, 32), (4 096, 128), (4 096, 512), …} for batches. Use this bench to discover the cross-over point on your machine between scalar / SIMD CPU, CPU-FFT (Accelerate), and GPU.

Single path (f32, H = 0.7)

Single-path latency is the only dimension where the two test machines share a grid, so the backends are directly comparable. The i9-285K / RTX 4070 SUPER figures are converted from the Melem/s values in CUDA_BENCHMARK.md (fgn_cuda_compare).

Apple M4 Max (10 P + 4 E cores, 36 GB unified):

nCPUAccelerateMetalCubeCL-wgpu
1,02411 µs11.6 µs148 µs230 µs
4,09649 µs45 µs175 µs295 µs
16,384202 µs165 µs220 µs543 µs
65,536838 µs677 µs456 µs1.66 ms

Intel i9-285K + RTX 4070 SUPER (cudarc + cuFFT, hand-written cubecl):

nCPU (i9-285K)cuFFT (cuda_native)cubecl (gpu_cuda)
1,0247.9 µs80 µs148 µs
4,09634 µs84 µs212 µs
16,384144 µs110 µs514 µs
65,536847 µs189 µs1.62 ms

Both SIMD CPUs dominate short paths (the i9-285K is the fastest of all up to n = 4 k); cuFFT overtakes at n ≈ 16 k and reaches 4.4× over the CPU at n = 65 k (189 µs vs ~840 µs). On the Apple side the vDSP (Accelerate) FFT edges the ndrustfft CPU for mid-size single paths (n = 4 k16 k), and Metal passes both only for the largest. The hand-written cubecl FFT (gpu_cuda / CubeCL-wgpu) is never competitive.

Batch throughput (sample_par, f32, H = 0.7) — Melem/s, higher is better

Apple M4 Max — vDSP/Accelerate and the rayon CPU run neck-and-neck (both far ahead of the GPU backends):

n × mCPUAccelerateMetalCubeCL-wgpu
4,096 × 3253254833258
4,096 × 51274479056576
16,384 × 12867074948862
16,384 × 51267678743362

Intel i9-285K + RTX 4070 SUPER — small n stays cache-resident on the 24-core i9; batched cuFFT wins once n ≥ 4 k:

n × mCPU (i9-285K)cuFFTcubecl
1,024 × 16,384175155749
4,096 × 16,38426958148
4,096 × 65,53611838342
16,384 × 16,3845918332
65,536 × 4,09650423628

The two machines' batch grids do not overlap (the Mac bench tops out at m = 512; the CUDA batch grid starts at m = 1024), so they are shown separately. Takeaway: for FGN the discrete-GPU win is real but narrow — it needs both a long path (n ≥ 16 k single, or n ≥ 4 k × m ≥ 16 k batch) and an NVIDIA-class FFT. On Apple Silicon the CPU-side FFTs win: ndrustfft + rayon and vDSP/Accelerate are interchangeable (both ~700–800 Melem/s in batch), while Metal only helps for the largest single paths.

Normal sampling vs rand_distr (single-thread, fill_slice)

Measured with cargo bench --bench dist_multicore. Single-thread fill_slice, median of 7 runs. Compares three implementations of the same workload (write n N(0, 1) samples into a pre-allocated Vec<f64>):

  • SimdNormal::fill_slice — our SIMD Ziggurat with the wide-based RNG.
  • rand_distr + SimdRngrand_distr::Normal::sample consuming our SimdRng. Isolates the Normal algorithm — both paths share the same uniform stream.
  • rand_distr + rand::rng() — out-of-box upstream: rand_distr::Normal on rand::rng() (ThreadRng-backed ChaCha).
nSimdNormal (µs)rand_distr + SimdRng (µs)speeduprand_distr + rand::rng() (µs)speedup
40.0080.0131.73×0.0324.22×
80.0140.0261.78×0.0654.52×
160.0290.0511.79×0.1284.47×
640.1090.2081.90×0.5084.64×
2560.4320.8401.94×2.0294.70×
4 0966.97513.1761.89×32.3824.64×
65 536113.458212.4061.87×520.2194.59×

The rand_distr + SimdRng column shows the pure algorithmic win at the distribution level (our SIMD Ziggurat fast-path beats the scalar reference Ziggurat ~1.7–1.9×). The rand_distr + rand::rng() column shows the full pipeline win including the SIMD RNG itself (~4.2–4.7×).

v2.4 distribution sampler upgrades

Algorithm-level fixes measured on Apple Silicon (single thread, --release, ns per sample, 2026-06-11). The first three replace per-draw construction or O(n) loops with constant-cost algorithms, so the speedups grow with the parameters:

SamplerBeforeAfterSpeedup
Noncentral χ² in a per-step loop (SimdNonCentralChiSquared)359 ns8.2 ns~44×
SimdBinomial n = 1000 (BTRS rejection, was O(n) per-trial)542 ns18.0 ns~30×
SimdHypergeometric N = 500, n = 100 (CDF-table inversion)O(n)2.6 ns~40×
SimdGev (internal SIMD RNG, was thread-RNG per draw)~20 ns10.0 ns
SimdTruncatedExp (cached CDF bounds + internal RNG)~15 ns5.0 ns~3×
SimdGed (SIMD bulk fill + buffered sampling)20.2 ns16.3 ns1.2×
SimdWeibull sample_fast (buffered, single-pass fill)10.7 ns8.4 ns1.3×
SimdStudentT fill (64-element sub-fills above the SIMD threshold)8.9 ns7.0 ns1.3×
SimdBeta fill12.9 ns10.6 ns1.2×

Context anchors from the same session: SimdNormal fill ≈ 1.7–2.0 ns, SimdExpZig fill ≈ 1.5 ns, uniform fill ≈ 0.35 ns per sample; SimdGamma (scalar Marsaglia-Tsang squeeze over the buffered SIMD sources) ≈ 5–6.3 ns vs 14–17.7 ns for rand_distr::Gamma on a thread RNG. The binomial BTRS path also beats rand_distr's BTPE (29.5 ns at n = 1000). Two correctness fixes shipped alongside: SimdPoisson no longer hangs for λ ≳ 745 (log-space CDF build), and SimdGev now honours its Deterministic seed.

An 8-lane SIMD-batched Marsaglia-Tsang gamma kernel was prototyped and measured slower than the scalar squeeze loop on 128-bit NEON (9.3 vs 6.3 ns — lane bookkeeping dominates), so the scalar loop stays; the experiment is recorded in the gamma.rs module docs.

v2.1 SIMD RNG / Ziggurat speedups

Criterion deltas vs the wide 1.3.0 baseline (single-sample dist.sample(rng) loop, cargo bench --bench distributions -- --baseline before):

Distributionf32 / largef64 / largef64 / small
Uniform/simd−57% (≈ 2.3×)−77% (≈ 4.4×)−58% (≈ 2.4×)
Normal/simd−51% (≈ 2.0×)−75% (≈ 4.0×)−63% (≈ 2.7×)
Exp/simd N=64−3% (n.s.)−73% (≈ 3.7×)
LogNormal/simd−71% (≈ 3.4×)−70% (≈ 3.4×)−66% (≈ 2.9×)

Drivers (sub-crate paths):

  • stochastic-rs-core::simd_rng uniform core — 8 scalar shift+cast+mul loops replaced by SIMD shift + magic-number bit-cast + subtract (52-bit / 23-bit precision; sub-ULP impact on Monte Carlo). The fill_uniform_f64 / fill_uniform_f32 direct-write APIs store vmovupd directly into the caller's slice with no [f64; 8] / [f32; 8] return-by-value round-trip (the interim next_f64_array / next_f32_array accessors were removed in 2.4). next_f64 / next_f32 refill their internal buffers via the same inlined SIMD store.
  • stochastic-rs-distributions::normal / expfill_ziggurat main loop runs 8 lanes per iteration so copy_from_slice(&scaled_arr) inlines to stp stores instead of compiling to a memcpy call; the final 0–7-element tail is fed through scalar sample_one.
  • stochastic-rs-distributions::exp — fused 1/λ scaling into the SIMD store inside fill_exp_scaled, removing the second-pass scale_in_place.
  • stochastic-rs-distributions::uniformfill_slice_fast short-circuits to rng.fill_uniform_f64(out) for the [0, 1) case (the dominant pattern) and falls back to a single SIMD affine pass otherwise.
  • stochastic-rs-distributions::lib<f64 as SimdFloatExt>::simd_from_i32x8 splits i32x8 into two i32x4 halves and calls f64x4::from_i32x4 per half, routing through vcvtdq2pd on AVX / sshll + scvtf on NEON instead of wide's 8-scalar fallback.
  • simd_rng::fill_bytes batches 32-byte writes per engine call and the Xoshiro256++ / Xoshiro128++ seed paths use u64::from_le_bytes for big-endian portability.

Opt-in: dual-stream RNG (dual-stream-rng feature)

[dependencies]
stochastic-rs = { version = "2.2", features = ["dual-stream-rng"] }

Unlocks stochastic_rs_core::simd_rng_dual::SimdRngDual (two parallel xoshiro engines) and the stochastic_rs::distributions::SimdNormalDual / SimdExpZigDual aliases. The Ziggurat main loop consumes two independent engine batches per iteration (SimdRngExt::next_i32x8_pair, gated on the HAS_PAIR_ILP const — single-stream codegen is unchanged). Measured against the single-stream SimdNormal::fill_slice on Apple Silicon, 2026-06-11 (cargo bench --bench dual_stream_compare --features dual-stream-rng):

nsingle (SimdNormal)dual (SimdNormalDual)Δ
64120.6 ns117.5 ns−2.6%
256494.2 ns467.9 ns−5.3%
4 0967.87 µs7.42 µs−5.7%
65 536126.2 µs122.4 µs−3.0%
1 048 5762.02 ms1.97 ms−2.2%

The win is that a modern out-of-order core hides the 16 scalar kn / wn table-lookup latencies behind the second engine's xoshiro state update; it is modest on 128-bit NEON because the gather chain dominates the total cost. Exponential fills measure at parity. Uniform f64 fills are not engine-bound (direct SIMD stores, no table lookups), so they show no speedup beyond noise (0 to −7% on L1-resident sizes). Trade-off: SimdRngDual::from_seed does not reproduce SimdRng::from_seed's bit-exact sequence (statistical properties are identical and KS-validated, including the engine-B lanes of the pair path).

Establishing a baseline

# Save the current build as the "v2" baseline
cargo bench --bench distributions      -- --save-baseline v2
cargo bench --bench fgn_fbm            -- --save-baseline v2
cargo bench --bench process_generation -- --save-baseline v2
cargo bench --bench option             -- --save-baseline v2
# … and so on for the rest of the suites

OpenBLAS-gated benches go to a separate baseline name to keep the linear-algebra-heavy suite out of the default comparison:

cargo bench --bench factors --features openblas -- --save-baseline v2-openblas

Comparing against a baseline

Before merging any PR with non-trivial perf impact:

cargo bench --bench <bench> -- --baseline v2

criterion prints Performance has improved / regressed per benchmark with effect size and 95% confidence intervals. Block the merge on any regression of more than 5% (the project's de-facto threshold).

Reproducing the numbers

To reproduce the FGN-vs-CUDA table you need:

  • NVIDIA GPU with CUDA 12.x toolkit
  • Rust nightly (for inline-asm in some cudarc paths)
  • RUSTFLAGS="-C target-cpu=native" for the CPU side (otherwise the CPU column above will look slower because wide's f32x8 falls back to scalar — see Native CPU optimization)
RUSTFLAGS="-C target-cpu=native" cargo bench \
  --features cuda-native \
  --bench fgn_cuda_native -- --save-baseline local

Adding a benchmark

See the bench-writing SKILL — group naming, parameter sweep, [[bench]] required-features gating, and the "no-println / no-dead-helper" rules.

On this page