Side-by-Side: scikit-allel vs pg_gpu ==================================== Packaged script: ``examples/scikit_allel_comparison.py`` Run it from the repo root: .. code-block:: bash pixi run python examples/scikit_allel_comparison.py pixi run python examples/scikit_allel_comparison.py --small pixi run python examples/scikit_allel_comparison.py --no-plot Background ---------- A common workflow on phased haplotype data is a windowed scan producing several diversity statistics along a chromosome, plus an LD-decay curve summarising linkage disequilibrium as a function of physical distance. In ``scikit-allel`` each diversity stat needs a separate call, and the LD-decay scan requires manually pairwise-r², distance binning, and median-per-bin aggregation. ``pg_gpu`` collapses the diversity stats into a single ``windowed_analysis(...)`` call and exposes the LD-decay path as ``HaplotypeMatrix.windowed_r_squared(estimator='rogers_huff')``, both running in single GPU kernel passes. This tutorial puts the two implementations next to each other on real Anopheles gambiae X-chromosome data (n = 100 phased haplotypes, ~1M segregating sites), checks that they agree numerically, and reports the wall-clock speedup. What the script does -------------------- 1. Loads ``examples/data/gamb.X.phased.n100.zarr`` via ``HaplotypeMatrix.from_zarr`` and builds the views needed downstream: a ``(n_variants, 2)`` allele-count array for ``scikit-allel`` plus the ``HaplotypeMatrix`` itself for ``pg_gpu``. 2. Pins the windowing by computing the window edges once with ``allel.position_windows``; the resulting window array is passed explicitly to every ``scikit-allel`` call, and the per-window ``start``/``end`` columns of ``windowed_analysis``'s result are asserted to match. 3. Times both implementations with ``time.perf_counter``, after a warmup run that absorbs CuPy/JIT initialization. Three stats from pg_gpu: ``windowed_analysis(statistics=['pi', 'theta_w', 'tajimas_d'])``. Three stats from scikit-allel: ``allel.windowed_diversity``, ``allel.windowed_watterson_theta``, ``allel.windowed_tajima_d``. 4. Verifies strict numerical agreement on the diversity statistics (NaN-aware, ``rtol=1e-5`` / ``atol=1e-8``). When the chromosome end isn't an exact multiple of ``window_size``, the trailing partial window is silently NaN-masked from the comparison (the two libraries normalize partial windows slightly differently -- a single-window cosmetic effect, not a real numerical disagreement). 5. Selects a contiguous block of ``--ld-snps`` SNPs (default 500,000) centered on the chromosome midpoint, drops variants below ``--ld-mac-min`` (default 20, i.e. MAF 0.05 in n=100 diploids), and computes *three* LD-decay curves on the same pairs: - scikit-allel Rogers-Huff :math:`r^2` (``allel.rogers_huff_r``, squared, distance-binned, per-bin median) - pg_gpu Rogers-Huff :math:`r^2` (one call to ``hm_sub.windowed_r_squared(bp_bins, percentile=50, estimator='rogers_huff')``) - pg_gpu unbiased :math:`\sigma_D^2` (Ragsdale & Gravel 2019; one call to ``sigma_d2_geno(hm_sub)`` on the diploid genotype view, distance-binned, per-bin median) The first two are the *same* statistic computed by two different libraries -- they agree to floating-point precision (the float32 precision floor scikit-allel sets internally) and serve as the parity check. The third is a *different* statistic on the same SNPs: rather than per-pair :math:`r^2 = D^2/\pi^2` where :math:`\pi^2` is computed from the same pair, it's the unbiased moments-LD polynomial estimator that subtracts a finite-sample bias term derived from the multinomial projection. On panmictic data with many rare variants it tracks below :math:`r^2` at short distances and converges to it at long distances. A contiguous block (vs. a random subsample) gives dense coverage of short pair distances, which is where the LD-decay signal lives -- a random subsample leaves only a handful of pairs in each short-distance bin and the curve washes out into noise. The MAC filter is equally important: pairs of near-singleton variants take a tiny set of tied :math:`r^2` values (e.g., two non-overlapping singletons give exactly :math:`1/(n-1)^2`) regardless of distance, and those tied values pin the per-bin median to a constant -- erasing any decay signal. 6. Plots a 5-panel figure: pi / theta_W / Tajima's D traces overlaid between the two implementations (agreement = visual identity), the three LD-decay curves overlaid on a log distance axis, and a bottom panel of horizontal timing bars annotated with speedup ratios for both the windowed scan and the LD-decay scan. The :math:`\sigma_D^2` runtime is also reported but isn't on the speedup bar (it has no scikit-allel counterpart). Why it's useful as a template ------------------------------ The recipe -- *load once, build the right shapes for each library, pass identical windows in, compare* -- is the right shape for any multi-statistic windowed analysis. To adapt it: * Swap the statistics list. ``windowed_analysis`` accepts any combination of single-pop summaries; the ``scikit-allel`` side needs one named call per statistic. * Run on your own data. The script accepts any zarr store loadable by ``HaplotypeMatrix.from_zarr`` -- point ``ZARR_FULL`` at it. * Time at different window sizes. The default 10 kb is a typical diversity-scan resolution; ``--window-size`` accepts any positive integer. * Tune the LD scan for your scale of interest. ``--ld-snps`` controls the size of the contiguous block (centered on the chromosome midpoint); larger values widen the block's bp footprint and give a less-noisy decay curve at the cost of quadratic compute. The bin edges (``LD_BP_BINS`` in the script) span 100 bp to 10 kb in 24 log-spaced steps -- appropriate for *Anopheles*-like populations where LD decays within a few kilobases. For organisms with longer LD scales (humans, livestock) widen the range and use a larger ``--ld-snps``; for tighter LD scales (recombining viruses) shrink it.