Benchmarks
Criterion bench numbers — FGN CPU vs CUDA, distribution sampling speedups, and the all-backends matrix (CPU / cubecl / Metal / Accelerate).
Benchmarks
The library uses criterion
for performance tracking, with named baselines per release for
regression detection.
Bench suites
The workspace ships 28 bench files under benches/. The
default-feature suites cover stochastic processes, distribution
sampling, pricers, risk, microstructure, and lattice methods. The
remaining suites are feature-gated to backends or optional dependencies.
Default-feature suites
distributions fgn_fbm process_generation gn_batching
dist_multicore option risk instruments
microstructure credit cashflows calendar
lattice market filtering realized
poisson_process mle econometrics slv
rl_roughFeature-gated suites
| Bench | Required features |
|---|---|
fgn_gpu | gpu |
fgn_cuda_native | cuda-native |
fgn_metal | metal |
fgn_accelerate | accelerate |
fgn_all_backends | gpu-wgpu, metal, accelerate |
factors | openblas |
hotpath_profile | hotpath |
FGN — CPU vs CUDA native
cuda-native backend: cudarc + cuFFT + fused Philox RNG kernel
(no .cu files, no nvcc). Environment: Intel i9-285K + NVIDIA RTX 4070
SUPER, CUDA 12.x, Rust nightly, --release with LTO.
cargo bench --features cuda-native --bench fgn_cuda_nativeSingle path (f32, H = 0.7)
| n | CPU sample | CUDA sample_cuda_native(1) | Speedup |
|---|---|---|---|
| 1,024 | 8.1 µs | 46 µs | 0.18× |
| 4,096 | 35 µs | 84 µs | 0.42× |
| 16,384 | 147 µs | 110 µs | 1.3× |
| 65,536 | 850 µs | 227 µs | 3.7× |
Batch (sample_par(m) vs sample_cuda_native(m), f32, H = 0.7)
| n, m | CPU sample_par | CUDA sample_cuda_native | Speedup |
|---|---|---|---|
| 4,096, 32 | 147 µs | 117 µs | 1.3× |
| 4,096, 512 | 1.78 ms | 2.37 ms | 0.75× |
| 65,536, 128 | 12.6 ms | 10.5 ms | 1.2× |
| 65,536, 1 k | 102 ms | 93 ms | 1.1× |
CUDA wins for large n (≥ 16 k) and stays competitive at n = 65 k
batches. CPU rayon parallelism dominates for medium n due to zero
transfer overhead.
All backends head-to-head
fgn_all_backends (requires gpu-wgpu, metal, and accelerate)
compares CPU, cubecl-wgpu, Metal, and Accelerate on the
same FGN sampler:
g.bench_with_input(BenchmarkId::new("cpu", n), …);
g.bench_with_input(BenchmarkId::new("gpu_cubecl", n), …);
g.bench_with_input(BenchmarkId::new("metal", n), …);
g.bench_with_input(BenchmarkId::new("accelerate", n), …);Test grid: n ∈ {1024, 4096, 16384, 65536} for single paths and
(n, m) ∈ {(4 096, 32), (4 096, 128), (4 096, 512), …} for batches.
Use this bench to discover the cross-over point on your machine
between scalar / SIMD CPU, CPU-FFT (Accelerate), and GPU.
Single path (f32, H = 0.7)
Single-path latency is the only dimension where the two test machines share a
grid, so the backends are directly comparable. The i9-285K / RTX 4070 SUPER
figures are converted from the Melem/s values in CUDA_BENCHMARK.md
(fgn_cuda_compare).
Apple M4 Max (10 P + 4 E cores, 36 GB unified):
| n | CPU | Accelerate | Metal | CubeCL-wgpu |
|---|---|---|---|---|
| 1,024 | 11 µs | 11.6 µs | 148 µs | 230 µs |
| 4,096 | 49 µs | 45 µs | 175 µs | 295 µs |
| 16,384 | 202 µs | 165 µs | 220 µs | 543 µs |
| 65,536 | 838 µs | 677 µs | 456 µs | 1.66 ms |
Intel i9-285K + RTX 4070 SUPER (cudarc + cuFFT, hand-written cubecl):
| n | CPU (i9-285K) | cuFFT (cuda_native) | cubecl (gpu_cuda) |
|---|---|---|---|
| 1,024 | 7.9 µs | 80 µs | 148 µs |
| 4,096 | 34 µs | 84 µs | 212 µs |
| 16,384 | 144 µs | 110 µs | 514 µs |
| 65,536 | 847 µs | 189 µs | 1.62 ms |
Both SIMD CPUs dominate short paths (the i9-285K is the fastest of all up to
n = 4 k); cuFFT overtakes at n ≈ 16 k and reaches 4.4× over the CPU at
n = 65 k (189 µs vs ~840 µs). On the Apple side the vDSP (Accelerate) FFT
edges the ndrustfft CPU for mid-size single paths (n = 4 k–16 k), and Metal
passes both only for the largest. The hand-written cubecl FFT
(gpu_cuda / CubeCL-wgpu) is never competitive.
Batch throughput (sample_par, f32, H = 0.7) — Melem/s, higher is better
Apple M4 Max — vDSP/Accelerate and the rayon CPU run neck-and-neck (both far ahead of the GPU backends):
| n × m | CPU | Accelerate | Metal | CubeCL-wgpu |
|---|---|---|---|---|
| 4,096 × 32 | 532 | 548 | 332 | 58 |
| 4,096 × 512 | 744 | 790 | 565 | 76 |
| 16,384 × 128 | 670 | 749 | 488 | 62 |
| 16,384 × 512 | 676 | 787 | 433 | 62 |
Intel i9-285K + RTX 4070 SUPER — small n stays cache-resident on the
24-core i9; batched cuFFT wins once n ≥ 4 k:
| n × m | CPU (i9-285K) | cuFFT | cubecl |
|---|---|---|---|
| 1,024 × 16,384 | 1751 | 557 | 49 |
| 4,096 × 16,384 | 269 | 581 | 48 |
| 4,096 × 65,536 | 118 | 383 | 42 |
| 16,384 × 16,384 | 59 | 183 | 32 |
| 65,536 × 4,096 | 504 | 236 | 28 |
The two machines' batch grids do not overlap (the Mac bench tops out at
m = 512; the CUDA batch grid starts at m = 1024), so they are shown
separately. Takeaway: for FGN the discrete-GPU win is real but narrow — it
needs both a long path (n ≥ 16 k single, or n ≥ 4 k × m ≥ 16 k batch) and
an NVIDIA-class FFT. On Apple Silicon the CPU-side FFTs win: ndrustfft + rayon
and vDSP/Accelerate are interchangeable (both ~700–800 Melem/s in batch), while
Metal only helps for the largest single paths.
Normal sampling vs rand_distr (single-thread, fill_slice)
Measured with cargo bench --bench dist_multicore. Single-thread fill_slice,
median of 7 runs. Compares three implementations of the same workload
(write n N(0, 1) samples into a pre-allocated Vec<f64>):
SimdNormal::fill_slice— our SIMD Ziggurat with thewide-based RNG.rand_distr + SimdRng—rand_distr::Normal::sampleconsuming ourSimdRng. Isolates the Normal algorithm — both paths share the same uniform stream.rand_distr + rand::rng()— out-of-box upstream:rand_distr::Normalonrand::rng()(ThreadRng-backedChaCha).
| n | SimdNormal (µs) | rand_distr + SimdRng (µs) | speedup | rand_distr + rand::rng() (µs) | speedup |
|---|---|---|---|---|---|
| 4 | 0.008 | 0.013 | 1.73× | 0.032 | 4.22× |
| 8 | 0.014 | 0.026 | 1.78× | 0.065 | 4.52× |
| 16 | 0.029 | 0.051 | 1.79× | 0.128 | 4.47× |
| 64 | 0.109 | 0.208 | 1.90× | 0.508 | 4.64× |
| 256 | 0.432 | 0.840 | 1.94× | 2.029 | 4.70× |
| 4 096 | 6.975 | 13.176 | 1.89× | 32.382 | 4.64× |
| 65 536 | 113.458 | 212.406 | 1.87× | 520.219 | 4.59× |
The rand_distr + SimdRng column shows the pure algorithmic win at the
distribution level (our SIMD Ziggurat fast-path beats the scalar reference
Ziggurat ~1.7–1.9×). The rand_distr + rand::rng() column shows the
full pipeline win including the SIMD RNG itself (~4.2–4.7×).
v2.4 distribution sampler upgrades
Algorithm-level fixes measured on Apple Silicon (single thread,
--release, ns per sample, 2026-06-11). The first three replace
per-draw construction or O(n) loops with constant-cost algorithms, so
the speedups grow with the parameters:
| Sampler | Before | After | Speedup |
|---|---|---|---|
Noncentral χ² in a per-step loop (SimdNonCentralChiSquared) | 359 ns | 8.2 ns | ~44× |
SimdBinomial n = 1000 (BTRS rejection, was O(n) per-trial) | 542 ns | 18.0 ns | ~30× |
SimdHypergeometric N = 500, n = 100 (CDF-table inversion) | O(n) | 2.6 ns | ~40× |
SimdGev (internal SIMD RNG, was thread-RNG per draw) | ~20 ns | 10.0 ns | 2× |
SimdTruncatedExp (cached CDF bounds + internal RNG) | ~15 ns | 5.0 ns | ~3× |
SimdGed (SIMD bulk fill + buffered sampling) | 20.2 ns | 16.3 ns | 1.2× |
SimdWeibull sample_fast (buffered, single-pass fill) | 10.7 ns | 8.4 ns | 1.3× |
SimdStudentT fill (64-element sub-fills above the SIMD threshold) | 8.9 ns | 7.0 ns | 1.3× |
SimdBeta fill | 12.9 ns | 10.6 ns | 1.2× |
Context anchors from the same session: SimdNormal fill ≈ 1.7–2.0 ns,
SimdExpZig fill ≈ 1.5 ns, uniform fill ≈ 0.35 ns per sample;
SimdGamma (scalar Marsaglia-Tsang squeeze over the buffered SIMD
sources) ≈ 5–6.3 ns vs 14–17.7 ns for rand_distr::Gamma on a thread
RNG. The binomial BTRS path also beats rand_distr's BTPE (29.5 ns at
n = 1000). Two correctness fixes shipped alongside: SimdPoisson no
longer hangs for λ ≳ 745 (log-space CDF build), and SimdGev now
honours its Deterministic seed.
An 8-lane SIMD-batched Marsaglia-Tsang gamma kernel was prototyped and
measured slower than the scalar squeeze loop on 128-bit NEON
(9.3 vs 6.3 ns — lane bookkeeping dominates), so the scalar loop stays;
the experiment is recorded in the gamma.rs module docs.
v2.1 SIMD RNG / Ziggurat speedups
Criterion deltas vs the wide 1.3.0 baseline (single-sample dist.sample(rng)
loop, cargo bench --bench distributions -- --baseline before):
| Distribution | f32 / large | f64 / large | f64 / small |
|---|---|---|---|
Uniform/simd | −57% (≈ 2.3×) | −77% (≈ 4.4×) | −58% (≈ 2.4×) |
Normal/simd | −51% (≈ 2.0×) | −75% (≈ 4.0×) | −63% (≈ 2.7×) |
Exp/simd N=64 | −3% (n.s.) | −73% (≈ 3.7×) | — |
LogNormal/simd | −71% (≈ 3.4×) | −70% (≈ 3.4×) | −66% (≈ 2.9×) |
Drivers (sub-crate paths):
stochastic-rs-core::simd_rnguniform core — 8 scalar shift+cast+mul loops replaced by SIMD shift + magic-number bit-cast + subtract (52-bit / 23-bit precision; sub-ULP impact on Monte Carlo). Thefill_uniform_f64/fill_uniform_f32direct-write APIs storevmovupddirectly into the caller's slice with no[f64; 8]/[f32; 8]return-by-value round-trip (the interimnext_f64_array/next_f32_arrayaccessors were removed in 2.4).next_f64/next_f32refill their internal buffers via the same inlined SIMD store.stochastic-rs-distributions::normal/exp—fill_zigguratmain loop runs 8 lanes per iteration socopy_from_slice(&scaled_arr)inlines tostpstores instead of compiling to amemcpycall; the final 0–7-element tail is fed through scalarsample_one.stochastic-rs-distributions::exp— fused1/λscaling into the SIMD store insidefill_exp_scaled, removing the second-passscale_in_place.stochastic-rs-distributions::uniform—fill_slice_fastshort-circuits torng.fill_uniform_f64(out)for the[0, 1)case (the dominant pattern) and falls back to a single SIMD affine pass otherwise.stochastic-rs-distributions::lib—<f64 as SimdFloatExt>::simd_from_i32x8splitsi32x8into twoi32x4halves and callsf64x4::from_i32x4per half, routing throughvcvtdq2pdon AVX /sshll + scvtfon NEON instead ofwide's 8-scalar fallback.simd_rng::fill_bytesbatches 32-byte writes per engine call and the Xoshiro256++ / Xoshiro128++ seed paths useu64::from_le_bytesfor big-endian portability.
Opt-in: dual-stream RNG (dual-stream-rng feature)
[dependencies]
stochastic-rs = { version = "2.2", features = ["dual-stream-rng"] }Unlocks stochastic_rs_core::simd_rng_dual::SimdRngDual (two parallel
xoshiro engines) and the stochastic_rs::distributions::SimdNormalDual /
SimdExpZigDual aliases. The Ziggurat main loop consumes two
independent engine batches per iteration (SimdRngExt::next_i32x8_pair,
gated on the HAS_PAIR_ILP const — single-stream codegen is
unchanged). Measured against the single-stream SimdNormal::fill_slice
on Apple Silicon, 2026-06-11
(cargo bench --bench dual_stream_compare --features dual-stream-rng):
| n | single (SimdNormal) | dual (SimdNormalDual) | Δ |
|---|---|---|---|
| 64 | 120.6 ns | 117.5 ns | −2.6% |
| 256 | 494.2 ns | 467.9 ns | −5.3% |
| 4 096 | 7.87 µs | 7.42 µs | −5.7% |
| 65 536 | 126.2 µs | 122.4 µs | −3.0% |
| 1 048 576 | 2.02 ms | 1.97 ms | −2.2% |
The win is that a modern out-of-order core hides the 16 scalar
kn / wn table-lookup latencies behind the second engine's xoshiro
state update; it is modest on 128-bit NEON because the gather chain
dominates the total cost. Exponential fills measure at parity. Uniform
f64 fills are not engine-bound (direct SIMD stores, no table
lookups), so they show no speedup beyond noise (0 to −7% on L1-resident
sizes). Trade-off: SimdRngDual::from_seed does not reproduce
SimdRng::from_seed's bit-exact sequence (statistical properties are
identical and KS-validated, including the engine-B lanes of the pair
path).
Establishing a baseline
# Save the current build as the "v2" baseline
cargo bench --bench distributions -- --save-baseline v2
cargo bench --bench fgn_fbm -- --save-baseline v2
cargo bench --bench process_generation -- --save-baseline v2
cargo bench --bench option -- --save-baseline v2
# … and so on for the rest of the suitesOpenBLAS-gated benches go to a separate baseline name to keep the linear-algebra-heavy suite out of the default comparison:
cargo bench --bench factors --features openblas -- --save-baseline v2-openblasComparing against a baseline
Before merging any PR with non-trivial perf impact:
cargo bench --bench <bench> -- --baseline v2criterion prints Performance has improved / regressed per
benchmark with effect size and 95% confidence intervals. Block the
merge on any regression of more than 5% (the project's de-facto
threshold).
Reproducing the numbers
To reproduce the FGN-vs-CUDA table you need:
- NVIDIA GPU with CUDA 12.x toolkit
- Rust nightly (for inline-asm in some
cudarcpaths) RUSTFLAGS="-C target-cpu=native"for the CPU side (otherwise the CPU column above will look slower becausewide'sf32x8falls back to scalar — see Native CPU optimization)
RUSTFLAGS="-C target-cpu=native" cargo bench \
--features cuda-native \
--bench fgn_cuda_native -- --save-baseline localAdding a benchmark
See the
bench-writing
SKILL — group naming, parameter sweep, [[bench]] required-features
gating, and the "no-println / no-dead-helper" rules.
Tutorials
End-to-end walkthroughs — Heston calibration, fBm Hurst estimation, vol-surface from quotes, AI surrogate training, pairs, risk, execution, interop.
Contributing
How to contribute to stochastic-rs — coding conventions, the SKILL system that automates per-feature recipes, and the per-PR docs/tests/bench rule.