Examples
Short, copy-paste-ready snippets that each demonstrate one feature.
For longer walk-throughs that pair simulated data with narrative
explanation – one page per packaged script under examples/ – see
Tutorials.
Minimal Workflow
A small end-to-end example: load a VCF, assign populations, and compute one statistic from each major module.
from pg_gpu import HaplotypeMatrix, ld_statistics, diversity, selection
import numpy as np
# Load VCF data
h = HaplotypeMatrix.from_vcf("example.vcf")
# Define populations
h.sample_sets = {
"CEU": [0, 1, 2, 3, 4],
"YRI": [5, 6, 7, 8, 9]
}
# Diversity
pi_ceu = diversity.pi(h, population="CEU")
pi_yri = diversity.pi(h, population="YRI")
# Divergence
from pg_gpu import divergence
fst = divergence.fst(h, "CEU", "YRI")
# LD
counts, n_valid = h.tally_gpu_haplotypes()
r2 = ld_statistics.r_squared(counts, n_valid=n_valid)
# Selection scans
ihs_scores = selection.ihs(h, population="CEU")
Two-Population LD
Computes the Hill-Robertson moments-LD statistics
(\(D^2\), \(Dz\), \(\pi_2\), and the related two-locus
moments) binned by physical distance, using all
within-population and between-population sample pairs from pop1 and
pop2. Returns a per-bin DataFrame with one column per statistic.
# Memory-efficient chunked computation
stats = h.compute_ld_statistics_gpu_two_pops(
bp_bins=np.array([0, 1000, 5000, 10000, 50000]),
pop1="CEU",
pop2="YRI",
chunk_size='auto'
)
Batch Statistics
Fused operations may be used to compute multiple summary statistics in a single GPU pass, more efficiently than if they were computed independently. Pass an iterable of statistic names; the kernel reuses the intermediate haplotype-tally counts across all of them.
# Multiple LD statistics in one call (single GPU launch)
results = ld_statistics.compute_ld_statistics(
counts,
statistics=['dd', 'dz', 'pi2', 'r_squared'],
)
Integration with moments
The two-locus moments-LD statistics computed below are exactly the
quantities used by moments.LD for
demographic-model fitting (Ragsdale & Gravel 2019). Because they are
pairwise across loci they are expensive to compute on a CPU – which
is exactly where GPU acceleration shines: pg_gpu can replace
moments.LD.Parsing.compute_ld_statistics() as a drop-in backend
that returns the same dictionary structure.
from pg_gpu import HaplotypeMatrix
h = HaplotypeMatrix.from_vcf("data.vcf")
h.sample_sets = {"pop1": list(range(10))}
# Compute LD statistics with GPU acceleration
# chunk_size='auto' adapts to available GPU memory
ld_stats = h.compute_ld_statistics_gpu_single_pop(
bp_bins=[0, 1000, 5000],
chunk_size='auto'
)
For a complete end-to-end workflow using pg_gpu as a drop-in replacement
for moments.LD.Parsing.compute_ld_statistics(), see the
Demographic Inference with moments.LD tutorial.
LD Pruning
LD pruning thins a SNP set down to (approximately) unlinked variants –
useful before PCA, GRM construction, or any analysis whose theory
assumes independent sites. locate_unlinked slides a window of
size variants across the matrix and, within each window, drops
variants whose pairwise \(r^2\) exceeds threshold; step
controls the window stride. The companion windowed_r_squared
returns the empirical LD-decay curve as a function of physical distance,
which is useful for choosing a sensible threshold.
# Find variants in approximate linkage equilibrium
unlinked = h.locate_unlinked(size=100, step=20, threshold=0.1)
h_pruned = h.get_subset(np.where(unlinked)[0])
print(f"Kept {np.sum(unlinked)} of {h.num_variants} variants")
# Windowed r-squared decay (median r^2 in each bp bin)
bins = np.arange(0, 100001, 1000)
r2_decay, counts = h.windowed_r_squared(bins, percentile=50)
See examples/ld_blocks.py for an end-to-end demo that uses pairwise
\(r^2\) to detect LD-block boundaries.
Selection Scan Pipeline
from pg_gpu import selection
# iHS with standardization by allele count
ihs_raw = selection.ihs(h)
# Get allele counts for binned standardization
dac = np.sum(h.haplotypes.get(), axis=0)
ihs_std, bins = selection.standardize_by_allele_count(ihs_raw, dac)
# Cross-population scans
xpehh_scores = selection.xpehh(h, "CEU", "YRI")
xpehh_std = selection.standardize(xpehh_scores)
# Garud's H in sliding windows
h1, h12, h123, h2_h1 = selection.moving_garud_h(h, size=200, step=50)
SFS and Admixture
from pg_gpu import sfs, admixture
# Joint SFS
jsfs = sfs.joint_sfs(h, "CEU", "YRI")
# Patterson's D with block-jackknife
d, se, z, vb, vj = admixture.average_patterson_d(
h, "popA", "popB", "popC", "popD", blen=100
)
print(f"D = {d:.4f}, standard error = {se:.4f}, Z = {z:.2f}")
PBS (Population Branch Statistic)
from pg_gpu import divergence
# Detect selection in pop1 relative to pop2 and pop3
pbs_vals = divergence.pbs(h, "pop1", "pop2", "pop3", window_size=50)
# Un-normalized PBS
pbs_raw = divergence.pbs(h, "pop1", "pop2", "pop3",
window_size=50, normed=False)
PCA and Dimensionality Reduction
from pg_gpu import decomposition
# GPU-accelerated PCA (up to 56x faster than allel)
coords, var_ratio = decomposition.pca(h, n_components=10,
scaler='patterson')
# Randomized PCA for very large datasets
coords, var_ratio = decomposition.randomized_pca(
h, n_components=10, random_state=42)
# Pairwise genetic distance (batched for memory safety)
dist = decomposition.pairwise_distance(h, metric='euclidean')
# PCoA from distance matrix
coords, var_ratio = decomposition.pcoa(dist, n_components=5)
# Population-specific PCA
coords_ceu, _ = decomposition.pca(h, n_components=5, population="CEU")
GPU-Native Windowed Statistics
Compute multiple statistics across thousands of windows without Python loops:
from pg_gpu.windowed_analysis import windowed_statistics
import numpy as np
# Define windows across 1Mb region
bp_bins = np.arange(0, 1_000_001, 10_000) # 100 windows of 10kb
# Compute 4 diversity stats in one GPU pass
result = windowed_statistics(
h, bp_bins,
statistics=('pi', 'theta_w', 'tajimas_d', 'segregating_sites')
)
# Results are numpy arrays, one value per window
print(f"Mean pi: {np.nanmean(result['pi']):.6f}")
print(f"Mean Tajima's D: {np.nanmean(result['tajimas_d']):.3f}")
# Windowed FST between populations
result = windowed_statistics(
h, bp_bins,
statistics=('pi', 'fst', 'dxy'),
pop1='CEU', pop2='YRI'
)
For end-to-end command-line workflows (vcf_to_zarr.py,
genome_scan.py), see Workflows.
Quality-Aware Filtering
Load VCF / VCZ FORMAT and INFO arrays alongside the genotype matrix, build per-variant and per-genotype masks from them, and write the filtered survivors back as a clean VCZ that round-trips the same arrays. The full walk-through is in Quality-Aware Filtering: GQ, DP, MQ from VCF/VCZ; the recipe in one block:
from pg_gpu import HaplotypeMatrix
# Bare VCF tags; pg_gpu auto-resolves INFO vs FORMAT from the source.
hm = HaplotypeMatrix.from_vcf("cohort.vcf.gz",
fields=["MQ", "QD", "GQ", "DP"])
# or, equivalently, from a bio2zarr-encoded VCZ:
# hm = HaplotypeMatrix.from_zarr("cohort.vcz",
# fields=["MQ", "QD", "GQ", "DP"])
# Shape disambiguates per-variant from per-genotype.
hm.fields["MQ"] # (n_var,) — INFO
hm.fields["GQ"] # (n_var, n_samples) — FORMAT
# Plot directly: plt.hist(hm.fields["GQ"].ravel(), bins=50)
# Mask-based filter. ``variants`` drops rows; ``genotypes`` sets
# per-cell calls to -1 (both allele rows on the haplotype side).
hm_clean = hm.filter(
variants=(hm.fields["MQ"] >= 40.0) & (hm.fields["QD"] >= 2.0),
genotypes=(hm.fields["GQ"] >= 20) & (hm.fields["DP"] >= 10),
drop_all_missing=True,
)
# Persist. The surviving QC arrays round-trip into the new store.
hm_clean.to_zarr("cohort.clean.vcz", format="vcz", contig_name="chr1")
Missing Data
from pg_gpu import diversity
# Check missing data summary
summary = h.summarize_missing_data()
print(f"Missing: {summary['missing_freq_overall']:.1%}")
# Different strategies
pi_include = diversity.pi(h, missing_data='include')
pi_exclude = diversity.pi(h, missing_data='exclude')
# LD statistics handle missing data automatically
counts, n_valid = h.tally_gpu_haplotypes()
dd_vals = ld_statistics.dd(counts, n_valid=n_valid)