Changelog

v0.1.0 (Current)

First public release of pg_gpu.

Core Data Structures

HaplotypeMatrix – phased haplotype data (0/1 with -1 for missing). Loaders: from_vcf (with region= and samples= subsetting), from_zarr (auto-detects VCZ / scikit-allel / chromosome-grouped layouts), from_ts, and direct NumPy construction. to_zarr writes VCZ by default. vcf_to_zarr provides multicore VCF-to-zarr conversion. Sample names from VCFs are preserved; load_pop_file('pops.txt') assigns populations using stored sample names.
GenotypeMatrix – diploid genotypes (0/1/2). Same loaders and zarr round-trip as HaplotypeMatrix. Many public functions auto-dispatch on input type (haplotype vs genotype).

Linkage Disequilibrium

Core statistics: r, r_squared, dd (D-squared), dz, pi2 (Ragsdale & Gravel 2019), zns (Kelly), omega (Kim & Nielsen), mu_ld (RAiSD).
LD pruning: locate_unlinked; windowed \(r^2\) decay: windowed_r_squared.
Two-population LD via compute_ld_statistics_gpu_two_pops with chunked GPU execution.

Diversity Statistics

Theta estimators: pi, theta_w, theta_h, theta_l, eta1, eta1_star, minus_eta1, minus_eta1_star.
Neutrality tests: tajimas_d, fay_wus_h, normalized_fay_wus_h, zeng_e, zeng_dh.
Heterozygosity / inbreeding: heterozygosity_expected, heterozygosity_observed, inbreeding_coefficient.
Haplotype-level: haplotype_diversity, haplotype_count, daf_histogram, diplotype_frequency_spectrum.
FrequencySpectrum class for custom weight functions, SFS projection, and the Achaz (2009) variance framework.
All statistics accept missing_data='include' | 'exclude' and a unified span_normalize parameter that auto-detects the best denominator (accessible bases if mask set, else genomic span).

Divergence Statistics

FST estimators: fst_hudson (ratio of averages), fst_weir_cockerham, fst_nei; pairwise_fst for multiple populations.
Absolute / net divergence: dxy, da.
Population Branch Statistic: pbs (normalized PBSn1).
Distance-based two-population statistics (Schrider et al. 2018 and related): snn, dxy_min, gmin, dd, dd_rank, zx. Callers can pre-compute pairwise_distance_matrix once and pass it to multiple stats, or use the combined distance_based_stats.

Selection Scans

Haplotype-based: ihs (fused CUDA kernel, bitmask pair tracking, block-level EHH reductions), nsl, xpehh, xpnsl, ehh_decay.
Garud’s H: garud_h (H1, H12, H123, H2/H1) via GPU dot-product hashing of haplotypes; moving_garud_h uses cumulative prefix sums for O(1) per-window hash computation.
Standardization: standardize, standardize_by_allele_count.
Diploid variants: zns_diploid, omega_diploid, garud_h_diploid, daf_histogram_diploid.

Site Frequency Spectrum

Unfolded and folded SFS: sfs, sfs_folded, sfs_scaled, sfs_folded_scaled.
Two-population joint SFS: joint_sfs, joint_sfs_folded, joint_sfs_scaled, joint_sfs_folded_scaled.
Folding utilities: fold_sfs, fold_joint_sfs.

Admixture / F-Statistics

Per-variant: patterson_f2 (F2 branch length), patterson_f3 (admixture test), patterson_d (ABBA-BABA).
Windowed: moving_patterson_f3, moving_patterson_d.
Block-jackknife with standard error: average_patterson_f3, average_patterson_d.

Resampling

Public pg_gpu.resampling module with block_jackknife and block_bootstrap for block-resampled standard errors / CIs on any scalar genome-wide statistic (genome-wide mean Tajima’s D, ratio-of-sums estimators, etc.). Promotes the previously private _jackknife helper from admixture. The weighted jackknife follows the Busing et al. (1999) delete-\(m_j\) formulation for unequal block sizes.
examples/sweep_tajimas_d_bootstrap.py – 95% bootstrap CI on Tajima’s D under a completed sweep, showing sweep-local vs distal mean difference CIs that exclude zero.

Dimensionality Reduction and Distance

pca – GPU-accelerated SVD PCA.
randomized_pca – truncated-SVD approximation for large datasets.
pairwise_distance – GPU-accelerated with memory-safe batching.
pcoa – classical MDS from a distance matrix.
local_pca – GPU port of Li & Ralph (2019) lostruct for detecting regions where population structure differs from the chromosome-wide pattern. Per-window top-k eigendecomposition via a single batched cp.linalg.eigh over a stacked (n_windows, n_samples, n_samples) tensor.
pc_dist – Frobenius distance between per-window low-rank covariance reps via the trace identity (no cov-matrix re-materialization). L1, L2, or no normalization.
corners – extreme-cluster selection in a 2D MDS embedding via Welzl’s minimum enclosing circle.
local_pca_jackknife – delete-1 block jackknife standard error of local PCs, also GPU-batched with sign-aligned replicates.
LocalPCAResult dataclass with .windows / .eigvals / .eigvecs / .sumsq plus .to_lostruct_matrix() for compatibility with the R lostruct::eigen_windows layout.

Relatedness and Kinship

grm – Genetic Relationship Matrix (Yang et al. 2011).
ibs – pairwise Identity-By-State proportions.

Fused Windowed Analysis

The windowed_analysis() convenience function routes through fused CUDA kernels (one kernel launch for all windows) when using non-overlapping windows with missing_data='include':

Single-population: pi, theta_w, tajimas_d, segregating_sites, singletons.
Two-population: fst, fst_hudson, fst_wc, dxy, da.
Selection: garud_h1, garud_h12, garud_h123, garud_h2h1, mean_nsl.
Structure: local_pca (returns a LocalPCAResult; scalar stats requested alongside are merged onto result.windows).
Structure: local_pca_jackknife computes delete-1 block jackknife standard error and populates LocalPCAResult.jackknife_se. When both are requested together, per-window matrix preparation is shared.

Lower-level windowed entry points: windowed_statistics (scatter-add aggregation) and windowed_statistics_fused (custom bin edges, one thread block per window).

Distance Distribution Statistics

pairwise_diffs – Hamming distance distributions (haploid or diploid).
dist_var, dist_skew, dist_kurt – moments of the pairwise-distance distribution (Schrider et al. 2018).
dist_moments – all three in one call.

diploSHIC / RAiSD Statistics

mu_var, mu_sfs – SNP density and SFS edge fraction (RAiSD).
max_daf – maximum derived allele frequency.

Visualization (`pg_gpu.plotting`)

SFS: plot_sfs, plot_joint_sfs.
LD: plot_pairwise_ld, plot_ld_decay.
PCA / structure: plot_pca, plot_pairwise_distance.
Windowed statistics: plot_windowed, plot_windowed_panel.
Haplotypes: plot_haplotype_frequencies, plot_variant_locator.

Missing Data Handling

Missing values are encoded as -1 (haplotype) or -1 sentinel (genotype).
Every statistic accepts missing_data='include' (per-site valid data, default) or missing_data='exclude' (only fully genotyped sites). Simulation testing confirms include is unbiased under MCAR.
LD projection estimator available via estimator='sigma_d2' on zns / omega. The default is estimator='auto', which resolves to 'sigma_d2' on HaplotypeMatrix inputs (the recommended path; uses the unbiased Ragsdale & Gravel 2019 estimators) and falls back to 'r2' for pre-computed r² arrays or GenotypeMatrix inputs. windowed_analysis follows the same default for windowed zns and omega.

Moments Integration

pg_gpu.moments_ld.compute_ld_statistics is a GPU drop-in for moments.LD.Parsing.compute_ld_statistics. Returns the 15 two-population LD statistics and 3 heterozygosity statistics in the exact layout moments expects for demographic inference. Requires the moments pixi environment: pixi install -e moments.

Validation

Cross-validation script (tests/validate_against_allel.py) comparing 31 statistics against scikit-allel on real Ag1000G data (1M variants, 200 haplotypes). Divergence, diversity, and selection statistics match scikit-allel at machine precision; a timing table is included.
Local PCA (lostruct) outputs validated against the R lostruct package via frozen JSON references committed under tests/data/. R is not a dependency of the pixi env or CI – the comparison runs against the committed JSON. An optional requires_r test regenerates the references via rpy2 when R + lostruct are available locally.

Performance

All statistics run on CuPy with custom CUDA kernels for compute-bound paths.

Scalar statistics at 1M variants, 200 haplotypes:

Statistic	allel (s)	pg_gpu (s)	Speedup
Weir-Cockerham FST	9.85	0.02	468x
Patterson F2	0.15	0.009	18x
nSL (255k variants)	8.1	0.56	15x
Patterson F3	0.14	0.016	9x
EHH decay (255k)	0.06	0.008	8x
Hudson FST	0.12	0.017	7x
iHS (255k variants)	9.9	1.5	7x
Dxy	0.07	0.016	4x

Windowed statistics at 5.3M variants, 100kb windows, 200 haplotypes:

Statistic	allel (s)	pg_gpu (s)	Speedup
pi + theta_w + tajimas_d	0.81	0.013	60x
All 5 single-pop stats	0.81	0.013	60x
FST (Hudson)	0.59	0.18	3x
All 12 stats together	n/a	0.66	single call

Examples

End-to-end demo scripts in examples/:

pg_gpu_tour.ipynb – interactive tour using Anopheles gambiae X chromosome data.
admixture_detection.py – block-jackknife ABBA-BABA on simulated null and admixed msprime scenarios.
accessibility_mask.py – windowed \(\pi\) with and without an accessibility mask over a low-\(\mu\) “exon” region.
ld_blocks.py – LD-block partitioning via \(r^2\) bridging scores.
local_pca.py – lostruct pipeline on a simulated partial selective sweep (SweepGenicSelection with end frequency 0.5).

Infrastructure

pixi-based environment management; moments integration lives in a separate pixi feature.
Shared _utils.py module for population extraction.
Public API returns NumPy arrays (not CuPy) – no need to call .get() on results.