pg_gpu Documentation
GPU-accelerated population genetics statistics for Python.
Contents:
- Introduction
- Installation
- Features
- Diversity Statistics
- Divergence Statistics
- Linkage Disequilibrium
- Selection Scans
- Site Frequency Spectrum
- Admixture and F-Statistics
- Resampling (Block Jackknife and Bootstrap)
- FrequencySpectrum (Power-User SFS Interface)
- Dimensionality Reduction and Distance
- Relatedness and Kinship
- Distance Distribution Statistics
- Biobank-Scale Streaming
- Fused Windowed Statistics
- Quick Start Guide
- API Reference
- HaplotypeMatrix
- GenotypeMatrix
- Biobank-Scale Streaming
- LD Statistics
- Diversity Statistics
- Divergence Statistics
- Selection Scan Statistics
- Site Frequency Spectrum
- Resampling (Block Jackknife and Bootstrap)
- Admixture and F-Statistics
- Dimensionality Reduction and Distance
- Relatedness and Kinship
- Windowed Statistics (GPU-Native)
- Moments Integration (LD Inference)
- Distance Distribution Statistics
- Visualization
- FrequencySpectrum (SFS-Based Estimation)
- Missing Data Handling
- Missing Data Modes
- Basic Usage
- How It Works
- Supported Statistics
- Multiallelic Sites
- LD Estimator Choice
- Haplotype Identity and Missing Data
- HaplotypeMatrix and GenotypeMatrix Utilities
- Accessible Site Masks
- Site Count Properties
- Span Normalization
- SFS Projection
- Component-Level Access
- Best Practices
- Example Workflow
- Examples
- Tutorials
- Bootstrap CI on Tajima’s D under a Sweep
- Admixture Detection End-to-End
- Accessibility Masks and Windowed Diversity
- Quality-Aware Filtering: GQ, DP, MQ from VCF/VCZ
- Local PCA / lostruct
- LD Block Partitioning
- Side-by-Side: scikit-allel vs pg_gpu
- Demographic Inference with moments.LD
- Biobank-Scale Streaming from VCZ
- pg_gpu Skills Demo
- Workflows
- Changelog
Overview
pg_gpu provides GPU-accelerated computation of population genetics statistics using CuPy. It covers linkage disequilibrium, diversity, divergence, selection scans, site frequency spectra, admixture statistics, and dimensionality reduction (PCA, PCoA, local PCA / lostruct).
Key Features
Fast GPU computation using CuPy with fused CUDA kernels for compute-intensive operations
Comprehensive statistics: LD (D, D-squared, Dz, pi2, r/r-squared), diversity (pi, theta, Tajima’s D, heterozygosity, Fay & Wu’s H), divergence (FST Hudson/Weir-Cockerham/Nei, Dxy, Da, Snn, Gmin, dd, dd_rank, Zx), selection scans (iHS, XP-EHH, nSL, XP-nSL, Garud’s H, EHH decay), SFS (unfolded, folded, joint, scaled), admixture (Patterson’s F2, F3, D)
Fused windowed analysis: compute all statistics across all genomic windows in a single GPU pass – up to 60x faster than scikit-allel
Automatic missing data handling across all modules
Quality-aware filtering – load VCF FORMAT / INFO arrays (GQ, DP, MQ, …) with fields=, mask variants and genotypes from them, and round-trip the survivors into a clean VCZ. See Quality-Aware Filtering: GQ, DP, MQ from VCF/VCZ.
Multi-population analyses with flexible population specification
8 theta estimators (pi, theta_w, theta_h, theta_l, eta1, eta1_star, minus_eta1, minus_eta1_star) and 4 neutrality tests (Tajima’s D, Fay & Wu’s H, Zeng’s E, DH)
Validated against scikit-allel – 29 statistics verified at machine precision using real Ag1000G data
Biobank-scale streaming – VCZ stores too large to fit on the GPU open as a streaming view that walks the chromosome chunk by chunk; every per-window / SFS / moments-LD / pairwise relatedness kernel dispatches transparently. See Biobank-Scale Streaming from VCZ.
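To give a feel for what "all genomic windows in a single pass" means in the feature list above, here is a minimal CPU sketch in plain NumPy. It is not pg_gpu's kernel or API: the function name windowed_pi, the toy data, and the assumptions of 0-based positions, biallelic sites, no missing data, and normalization by window size are all illustrative choices. The point is only that per-site values can be computed once and reduced into every window with one vectorized call, rather than looping window by window.

```python
import numpy as np

def windowed_pi(haplotypes, positions, window_size):
    """Hypothetical helper: per-window nucleotide diversity per base.

    haplotypes: (n_haplotypes, n_variants) array of 0/1 alleles, no missing data.
    positions:  (n_variants,) non-negative integer coordinates (0-based here).
    """
    n = haplotypes.shape[0]
    derived = haplotypes.sum(axis=0)
    # Expected pairwise differences at each site: 2k(n - k) / (n(n - 1)).
    site_pi = 2.0 * derived * (n - derived) / (n * (n - 1))
    # Map every variant to its window, then reduce all windows in one call.
    window_idx = positions // window_size
    n_windows = int(window_idx.max()) + 1
    sums = np.bincount(window_idx, weights=site_pi, minlength=n_windows)
    return sums / window_size

# Toy data: 4 haplotypes, 4 variants spread over two 100 bp windows.
hap = np.array([
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
])
pos = np.array([10, 40, 120, 180])
pi_windows = windowed_pi(hap, pos, window_size=100)
```

A GPU implementation could follow the same shape (compute per-site terms, then a segmented reduction), but how pg_gpu actually fuses its kernels is not described here.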
Installation
pixi install
pixi shell
Quick Example
from pg_gpu import HaplotypeMatrix, diversity, selection
# Load data
h = HaplotypeMatrix.from_vcf("data.vcf")
# Diversity
pi_val = diversity.pi(h)
tajd = diversity.tajimas_d(h)
# Selection scans
ihs_scores = selection.ihs(h)
# LD r-squared
r2 = h.pairwise_r2()
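For readers who want to see what two of the quantities in the example above measure, the following is a small NumPy reference on toy data. It is a sketch under stated assumptions, not pg_gpu code: the array hap is made up, sites are biallelic with no missing data, pi is reported as the raw mean number of pairwise differences (pg_gpu may normalize differently, for example by span), and r-squared uses the textbook haplotype-based definition.

```python
from itertools import combinations

import numpy as np

# Toy data (hypothetical): 4 haplotypes x 4 biallelic sites, no missing calls.
hap = np.array([
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
])
n = hap.shape[0]

# Nucleotide diversity as the mean number of differences over all haplotype pairs.
pair_diffs = [int(np.sum(a != b)) for a, b in combinations(hap, 2)]
pi_total = float(np.mean(pair_diffs))

# Pairwise LD: D = p_AB - p_A * p_B, r^2 = D^2 / (p_A(1-p_A) * p_B(1-p_B)).
p = hap.mean(axis=0)                  # derived-allele frequency per site
p_ab = hap.T @ hap / n                # joint frequency of derived alleles
d = p_ab - np.outer(p, p)             # LD coefficient D for every site pair
var = p * (1.0 - p)                   # per-site allele variance
r2 = d ** 2 / np.outer(var, var)      # assumes every site is polymorphic
```

The brute-force pair loop is only practical for tiny inputs; it is used here because it matches the definition of pi directly.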