Features
pg_gpu provides GPU-accelerated computation of population genetics statistics using CuPy. All statistics return NumPy arrays and handle missing data automatically. Below is a comprehensive catalog of every implemented statistic.
Diversity Statistics
Function |
Description |
Reference |
|---|---|---|
|
Nucleotide diversity |
Nei & Li (1979) |
|
Watterson’s theta |
Watterson (1975) |
|
Tajima’s D neutrality test |
Tajima (1989) |
|
Fay & Wu’s H (excess high-frequency derived alleles) |
Fay & Wu (2000) |
|
Normalized H (H*) |
Zeng et al. (2006) |
|
Fay & Wu’s theta_H |
Fay & Wu (2000) |
|
Theta_L |
Zeng et al. (2006) |
|
Zeng’s E neutrality test |
Zeng et al. (2006) |
|
Zeng’s DH joint test |
Zeng et al. (2006) |
|
Count of segregating sites |
|
|
Count of singletons |
|
|
Haplotype diversity (1 - sum of squared frequencies) |
|
|
Number of distinct haplotypes |
|
|
Expected heterozygosity (gene diversity) per variant |
|
|
Observed heterozygosity per variant |
|
|
Wright’s F per variant |
Wright (1951) |
|
Allele frequency spectrum |
|
|
Maximum derived allele frequency |
|
|
Derived allele frequency histogram |
|
|
Diplotype (multi-locus genotype) frequency spectrum |
|
|
All core diversity statistics in one call |
Divergence Statistics
Function |
Description |
Reference |
|---|---|---|
|
Hudson’s FST |
Hudson et al. (1992) |
|
Weir & Cockerham’s FST (method of moments) |
Weir & Cockerham (1984) |
|
Nei’s GST |
Nei (1973) |
|
Absolute divergence (mean pairwise differences between pops) |
Nei (1987) |
|
Net divergence (Dxy minus mean within-pop pi) |
Nei & Li (1979) |
|
Population Branch Statistic |
Yi et al. (2010) |
|
Pairwise FST matrix for multiple populations |
Distance-Based Two-Population Statistics
Function |
Description |
Reference |
|---|---|---|
|
Nearest-neighbor statistic |
Hudson (2000) |
|
Minimum pairwise distance between populations |
Geneva et al. (2015) |
|
Gmin ratio (Dxy_min / Dxy_mean) |
Geneva et al. (2015) |
|
Relative minimum divergence (dd1, dd2) |
Schrider et al. (2018) |
|
Rank of minimum between-pop distance in within-pop distribution |
Schrider et al. (2018) |
|
ZnS ratio (within-pop LD / total LD) |
Schrider et al. (2018) |
Linkage Disequilibrium
Function |
Description |
Reference |
|---|---|---|
|
Pearson correlation between variant pairs |
|
|
Squared correlation (r-squared) between variant pairs |
|
|
D-squared (two-locus LD statistic) |
Ragsdale & Gravel (2019) |
|
Dz statistic (multi-population LD) |
Ragsdale & Gravel (2019) |
|
Two-locus nucleotide diversity |
Ragsdale & Gravel (2019) |
|
Kelly’s ZnS (mean pairwise LD); defaults to the unbiased
\(\sigma_D^2\) estimator on |
Kelly (1997); Ragsdale & Gravel (2019) |
|
Kim & Nielsen’s Omega (partitioned LD); same default policy
as |
Kim & Nielsen (2004); Ragsdale & Gravel (2019) |
|
Haplotype pattern exclusivity (RAiSD LD component) |
Alachiotis & Pavlidis (2018) |
Selection Scans
Function |
Description |
Reference |
|---|---|---|
|
Integrated Haplotype Score |
Voight et al. (2006) |
|
Number of Segregating Sites by Length |
Ferrer-Admetlla et al. (2014) |
|
Cross-population Extended Haplotype Homozygosity |
Sabeti et al. (2007) |
|
Cross-population nSL |
Szpiech et al. (2021) |
|
Garud’s H1, H12, H123, H2/H1 |
Garud et al. (2015) |
|
Garud’s H in moving windows |
Garud et al. (2015) |
|
Extended Haplotype Homozygosity decay |
Sabeti et al. (2002) |
Site Frequency Spectrum
Function |
Description |
Reference |
|---|---|---|
|
Unfolded SFS |
|
|
Folded SFS (minor allele counts) |
|
|
Scaled unfolded SFS |
|
|
Scaled folded SFS |
|
|
Joint SFS (two populations) |
|
|
Folded joint SFS |
|
|
Scaled joint SFS |
|
|
Scaled folded joint SFS |
|
|
Fold an unfolded SFS |
|
|
Fold a joint SFS |
Admixture and F-Statistics
Function |
Description |
Reference |
|---|---|---|
|
F2 branch length between two populations |
Patterson et al. (2012) |
|
F3 admixture test |
Patterson et al. (2012) |
|
Patterson’s D (ABBA-BABA) |
Patterson et al. (2012) |
|
Windowed F3 |
Patterson et al. (2012) |
|
Windowed D |
Patterson et al. (2012) |
|
F3 with block-jackknife standard error |
Patterson et al. (2012) |
|
D with block-jackknife standard error |
Patterson et al. (2012) |
Resampling (Block Jackknife and Bootstrap)
Function |
Description |
Reference |
|---|---|---|
|
Delete-1 block jackknife standard error; supports unequal block sizes |
Busing et al. (1999) |
|
Block bootstrap standard error and replicate distribution |
Efron & Tibshirani (1993) |
Both operate on pre-binned per-block values and a user-supplied statistic, so any scalar aggregate (genome-wide mean Tajima’s D, per-population \(\pi\), ratio-of-sums estimators like normed F3 / D) can get a calibrated standard error / CI with a single call.
FrequencySpectrum (Power-User SFS Interface)
The FrequencySpectrum class provides direct access to SFS-based estimation
for custom weight functions, SFS projection, and the general Achaz (2009)
variance framework.
Method |
Description |
Reference |
|---|---|---|
|
Any theta estimator as weighted SFS dot product |
Achaz (2009) |
|
Generalized neutrality test from any two theta estimators |
Achaz (2009) |
|
SFS projection via hypergeometric sampling |
Gutenkunst et al. (2009) |
Built-in estimators: pi, watterson, theta_h, theta_l,
eta1, eta1_star, minus_eta1, minus_eta1_star. Custom weight
functions are also supported.
Dimensionality Reduction and Distance
Function |
Description |
Reference |
|---|---|---|
|
Principal Component Analysis (GPU-accelerated SVD) |
Patterson et al. (2006) |
|
Randomized PCA (truncated SVD approximation) |
Halko et al. (2011) |
|
Pairwise genetic distance (Euclidean, cityblock, etc.) |
|
|
Principal Coordinate Analysis (classical MDS) |
|
|
Per-window PCA (lostruct); GPU-batched |
Li & Ralph (2019) |
|
Delete-1 block jackknife standard error of local PCs (batched) |
Li & Ralph (2019) |
|
Frobenius distance between per-window low-rank covariance reps |
Li & Ralph (2019) |
|
Extreme-cluster selection in a 2D MDS embedding (Welzl MEC) |
Li & Ralph (2019) |
Distance Distribution Statistics
Function |
Description |
Reference |
|---|---|---|
|
Pairwise Hamming distances (haploid or diploid) |
|
|
Variance of pairwise distance distribution |
Schrider et al. (2018) |
|
Skewness of pairwise distance distribution |
Schrider et al. (2018) |
|
Excess kurtosis of pairwise distance distribution |
Schrider et al. (2018) |
|
Variance, skewness, and kurtosis in one call |
Schrider et al. (2018) |
Biobank-Scale Streaming
A VCZ store is a Zarr encoding of a VCF stored on disk: the genotype
matrix is split into compressed chunks, each chunk a small array of
samples by variants. pg_gpu reads VCZ stores; if your data
is in VCF you can convert it with the bio2zarr tools
(vcf2zarr explode then vcf2zarr encode). The streaming
codepath needs that VCZ layout because it relies on a fast
per-chunk decode, an operation that can’t be done on a VCF. See
Biobank-Scale Streaming from VCZ for the VCF→VCZ conversion
and a worked end-to-end example.
A VCZ store too large for the GPU (tens to hundreds of thousands
of haplotypes) opens via HaplotypeMatrix.from_zarr /
GenotypeMatrix.from_zarr as a streaming view that walks the
chromosome chunk by chunk; every kernel listed below dispatches
on the streaming object the same way it would on a fully loaded
matrix – the calling code is identical to the in-memory path.
What runs on a streaming matrix:
Entry point |
How it streams |
|---|---|
|
Each chunk’s windows are computed on the GPU as the chunk arrives, then rows are concatenated; window grids are aligned to the chunk grid so no window straddles a chunk boundary. |
|
Each chunk contributes its own per-site frequency-spectrum counts; the chromosome-wide answer is the sum across chunks. |
|
Same per-chunk accumulation, but the joint SFS is projected (via hypergeometric sampling) to a small target grid as it is built, so the full |
|
Variant-pair statistics within |
|
Streamed along the variant axis; the individual axis is tiled into row blocks. |
|
Pulls one sub-region (and optionally a subset of haplotypes) of the chromosome into GPU memory for kernels that need every variant simultaneously – |
|
Streaming converter from scikit-allel layout to VCZ for stores that pre-date bio2zarr. |
Fused Windowed Statistics
The windowed_analysis() function computes statistics across all genomic
windows in a single GPU pass via fused CUDA kernels.
Statistic |
Description |
|---|---|
|
Nucleotide diversity per window |
|
Watterson’s theta per window |
|
Tajima’s D per window |
|
Segregating site count per window |
|
Singleton count per window |
|
Hudson’s FST per window |
|
Weir-Cockerham FST per window |
|
Absolute divergence per window |
|
Net divergence per window |
|
Garud’s H statistics per window |
|
Mean nSL per window |
|
Per-window local PCA (lostruct); returns a |