Side-by-Side: scikit-allel vs pg_gpu

Packaged script: examples/scikit_allel_comparison.py

Run it from the repo root:

pixi run python examples/scikit_allel_comparison.py
pixi run python examples/scikit_allel_comparison.py --small
pixi run python examples/scikit_allel_comparison.py --no-plot

Background

A common workflow on phased haplotype data is a windowed scan producing several diversity statistics along a chromosome, plus an LD-decay curve summarising linkage disequilibrium as a function of physical distance. In scikit-allel each diversity stat needs a separate call, and the LD-decay scan requires manually pairwise-r², distance binning, and median-per-bin aggregation. pg_gpu collapses the diversity stats into a single windowed_analysis(...) call and exposes the LD-decay path as HaplotypeMatrix.windowed_r_squared(estimator='rogers_huff'), both running in single GPU kernel passes.

This tutorial puts the two implementations next to each other on real Anopheles gambiae X-chromosome data (n = 100 phased haplotypes, ~1M segregating sites), checks that they agree numerically, and reports the wall-clock speedup.

What the script does

  1. Loads examples/data/gamb.X.phased.n100.zarr via HaplotypeMatrix.from_zarr and builds the views needed downstream: a (n_variants, 2) allele-count array for scikit-allel plus the HaplotypeMatrix itself for pg_gpu.

  2. Pins the windowing by computing the window edges once with allel.position_windows; the resulting window array is passed explicitly to every scikit-allel call, and the per-window start/end columns of windowed_analysis’s result are asserted to match.

  3. Times both implementations with time.perf_counter, after a warmup run that absorbs CuPy/JIT initialization. Three stats from pg_gpu: windowed_analysis(statistics=['pi', 'theta_w', 'tajimas_d']). Three stats from scikit-allel: allel.windowed_diversity, allel.windowed_watterson_theta, allel.windowed_tajima_d.

  4. Verifies strict numerical agreement on the diversity statistics (NaN-aware, rtol=1e-5 / atol=1e-8). When the chromosome end isn’t an exact multiple of window_size, the trailing partial window is silently NaN-masked from the comparison (the two libraries normalize partial windows slightly differently – a single-window cosmetic effect, not a real numerical disagreement).

  5. Selects a contiguous block of --ld-snps SNPs (default 500,000) centered on the chromosome midpoint, drops variants below --ld-mac-min (default 20, i.e. MAF 0.05 in n=100 diploids), and computes three LD-decay curves on the same pairs:

    • scikit-allel Rogers-Huff \(r^2\) (allel.rogers_huff_r, squared, distance-binned, per-bin median)

    • pg_gpu Rogers-Huff \(r^2\) (one call to hm_sub.windowed_r_squared(bp_bins, percentile=50, estimator='rogers_huff'))

    • pg_gpu unbiased \(\sigma_D^2\) (Ragsdale & Gravel 2019; one call to sigma_d2_geno(hm_sub) on the diploid genotype view, distance-binned, per-bin median)

    The first two are the same statistic computed by two different libraries – they agree to floating-point precision (the float32 precision floor scikit-allel sets internally) and serve as the parity check. The third is a different statistic on the same SNPs: rather than per-pair \(r^2 = D^2/\pi^2\) where \(\pi^2\) is computed from the same pair, it’s the unbiased moments-LD polynomial estimator that subtracts a finite-sample bias term derived from the multinomial projection. On panmictic data with many rare variants it tracks below \(r^2\) at short distances and converges to it at long distances.

    A contiguous block (vs. a random subsample) gives dense coverage of short pair distances, which is where the LD-decay signal lives – a random subsample leaves only a handful of pairs in each short-distance bin and the curve washes out into noise. The MAC filter is equally important: pairs of near-singleton variants take a tiny set of tied \(r^2\) values (e.g., two non-overlapping singletons give exactly \(1/(n-1)^2\)) regardless of distance, and those tied values pin the per-bin median to a constant – erasing any decay signal.

  6. Plots a 5-panel figure: pi / theta_W / Tajima’s D traces overlaid between the two implementations (agreement = visual identity), the three LD-decay curves overlaid on a log distance axis, and a bottom panel of horizontal timing bars annotated with speedup ratios for both the windowed scan and the LD-decay scan. The \(\sigma_D^2\) runtime is also reported but isn’t on the speedup bar (it has no scikit-allel counterpart).

Why it’s useful as a template

The recipe – load once, build the right shapes for each library, pass identical windows in, compare – is the right shape for any multi-statistic windowed analysis. To adapt it:

  • Swap the statistics list. windowed_analysis accepts any combination of single-pop summaries; the scikit-allel side needs one named call per statistic.

  • Run on your own data. The script accepts any zarr store loadable by HaplotypeMatrix.from_zarr – point ZARR_FULL at it.

  • Time at different window sizes. The default 10 kb is a typical diversity-scan resolution; --window-size accepts any positive integer.

  • Tune the LD scan for your scale of interest. --ld-snps controls the size of the contiguous block (centered on the chromosome midpoint); larger values widen the block’s bp footprint and give a less-noisy decay curve at the cost of quadratic compute. The bin edges (LD_BP_BINS in the script) span 100 bp to 10 kb in 24 log-spaced steps – appropriate for Anopheles-like populations where LD decays within a few kilobases. For organisms with longer LD scales (humans, livestock) widen the range and use a larger --ld-snps; for tighter LD scales (recombining viruses) shrink it.