Side-by-Side: scikit-allel vs pg_gpu
Packaged script: examples/scikit_allel_comparison.py
Run it from the repo root:
pixi run python examples/scikit_allel_comparison.py
pixi run python examples/scikit_allel_comparison.py --small
pixi run python examples/scikit_allel_comparison.py --no-plot
Background
A common workflow on phased haplotype data is a windowed scan
producing several diversity statistics along a chromosome, plus an
LD-decay curve summarising linkage disequilibrium as a function of
physical distance. In scikit-allel each diversity stat needs a
separate call, and the LD-decay scan requires manually pairwise-r²,
distance binning, and median-per-bin aggregation. pg_gpu
collapses the diversity stats into a single windowed_analysis(...)
call and exposes the LD-decay path as HaplotypeMatrix.windowed_r_squared(estimator='rogers_huff'),
both running in single GPU kernel passes.
This tutorial puts the two implementations next to each other on real Anopheles gambiae X-chromosome data (n = 100 phased haplotypes, ~1M segregating sites), checks that they agree numerically, and reports the wall-clock speedup.
What the script does
Loads
examples/data/gamb.X.phased.n100.zarrviaHaplotypeMatrix.from_zarrand builds the views needed downstream: a(n_variants, 2)allele-count array forscikit-allelplus theHaplotypeMatrixitself forpg_gpu.Pins the windowing by computing the window edges once with
allel.position_windows; the resulting window array is passed explicitly to everyscikit-allelcall, and the per-windowstart/endcolumns ofwindowed_analysis’s result are asserted to match.Times both implementations with
time.perf_counter, after a warmup run that absorbs CuPy/JIT initialization. Three stats from pg_gpu:windowed_analysis(statistics=['pi', 'theta_w', 'tajimas_d']). Three stats from scikit-allel:allel.windowed_diversity,allel.windowed_watterson_theta,allel.windowed_tajima_d.Verifies strict numerical agreement on the diversity statistics (NaN-aware,
rtol=1e-5/atol=1e-8). When the chromosome end isn’t an exact multiple ofwindow_size, the trailing partial window is silently NaN-masked from the comparison (the two libraries normalize partial windows slightly differently – a single-window cosmetic effect, not a real numerical disagreement).Selects a contiguous block of
--ld-snpsSNPs (default 500,000) centered on the chromosome midpoint, drops variants below--ld-mac-min(default 20, i.e. MAF 0.05 in n=100 diploids), and computes three LD-decay curves on the same pairs:scikit-allel Rogers-Huff \(r^2\) (
allel.rogers_huff_r, squared, distance-binned, per-bin median)pg_gpu Rogers-Huff \(r^2\) (one call to
hm_sub.windowed_r_squared(bp_bins, percentile=50, estimator='rogers_huff'))pg_gpu unbiased \(\sigma_D^2\) (Ragsdale & Gravel 2019; one call to
sigma_d2_geno(hm_sub)on the diploid genotype view, distance-binned, per-bin median)
The first two are the same statistic computed by two different libraries – they agree to floating-point precision (the float32 precision floor scikit-allel sets internally) and serve as the parity check. The third is a different statistic on the same SNPs: rather than per-pair \(r^2 = D^2/\pi^2\) where \(\pi^2\) is computed from the same pair, it’s the unbiased moments-LD polynomial estimator that subtracts a finite-sample bias term derived from the multinomial projection. On panmictic data with many rare variants it tracks below \(r^2\) at short distances and converges to it at long distances.
A contiguous block (vs. a random subsample) gives dense coverage of short pair distances, which is where the LD-decay signal lives – a random subsample leaves only a handful of pairs in each short-distance bin and the curve washes out into noise. The MAC filter is equally important: pairs of near-singleton variants take a tiny set of tied \(r^2\) values (e.g., two non-overlapping singletons give exactly \(1/(n-1)^2\)) regardless of distance, and those tied values pin the per-bin median to a constant – erasing any decay signal.
Plots a 5-panel figure: pi / theta_W / Tajima’s D traces overlaid between the two implementations (agreement = visual identity), the three LD-decay curves overlaid on a log distance axis, and a bottom panel of horizontal timing bars annotated with speedup ratios for both the windowed scan and the LD-decay scan. The \(\sigma_D^2\) runtime is also reported but isn’t on the speedup bar (it has no scikit-allel counterpart).
Why it’s useful as a template
The recipe – load once, build the right shapes for each library, pass identical windows in, compare – is the right shape for any multi-statistic windowed analysis. To adapt it:
Swap the statistics list.
windowed_analysisaccepts any combination of single-pop summaries; thescikit-allelside needs one named call per statistic.Run on your own data. The script accepts any zarr store loadable by
HaplotypeMatrix.from_zarr– pointZARR_FULLat it.Time at different window sizes. The default 10 kb is a typical diversity-scan resolution;
--window-sizeaccepts any positive integer.Tune the LD scan for your scale of interest.
--ld-snpscontrols the size of the contiguous block (centered on the chromosome midpoint); larger values widen the block’s bp footprint and give a less-noisy decay curve at the cost of quadratic compute. The bin edges (LD_BP_BINSin the script) span 100 bp to 10 kb in 24 log-spaced steps – appropriate for Anopheles-like populations where LD decays within a few kilobases. For organisms with longer LD scales (humans, livestock) widen the range and use a larger--ld-snps; for tighter LD scales (recombining viruses) shrink it.