API Reference
HaplotypeMatrix
- class pg_gpu.HaplotypeMatrix(genotypes, positions, chrom_start=None, chrom_end=None, sample_sets=None, n_total_sites=None, samples=None, accessible_mask=None, fields=None)[source]
Bases:
objectHaplotype matrix for population genetics analysis.
Stores phased haplotype data with variant positions and population labels. Supports GPU-accelerated computation of diversity, divergence, selection, and LD statistics.
- Parameters:
genotypes (ndarray, shape (n_haplotypes, n_variants)) – Haplotype data (0/1 with -1 for missing).
positions (ndarray, shape (n_variants,)) – Variant positions.
chrom_start (int, optional) – Chromosome boundaries for span normalization.
chrom_end (int, optional) – Chromosome boundaries for span normalization.
sample_sets (dict, optional) – Maps population names to lists of haplotype indices.
n_total_sites (int, optional) – Total callable sites for span normalization.
samples (list, optional) – Diploid sample names (from VCF).
accessible_mask (AccessibleMask, optional) – Genome accessibility mask.
fields (dict)
- property haplotypes
- property positions
- property accessible_mask
- property device
Returns the current device (CPU or GPU).
- property sample_sets
Defines groups of haplotypes that belong to populations.
- Returns:
- A dictionary mapping population names to lists of haplotype indices.
If _sample_sets was not specified at construction, returns a default dictionary with a single key ‘all’ containing all haplotype indices.
- Return type:
- property has_accessible_mask
Whether an accessible site mask is attached.
- set_accessible_mask(mask_or_path, chrom=None)[source]
Attach an accessible site mask (non-destructive).
The mask filters which variants are visible through the haplotypes and positions properties. Original data is preserved and a different mask can be applied later. Returns self for chaining.
- Parameters:
mask_or_path (str, path-like, numpy.ndarray, or AccessibleMask) – If a string/path, treated as a BED file path defining accessible regions. If an ndarray, wrapped as an AccessibleMask with offset equal to chrom_start. If already an AccessibleMask, assigned directly.
chrom (str, optional) – Chromosome name to extract from the BED file. Required when mask_or_path is a BED file path.
- Returns:
For method chaining.
- Return type:
self
- remove_accessible_mask()[source]
Remove the accessible mask, restoring all original variants.
Returns self for chaining.
- property has_invariant_info
Whether invariant site information is available for span normalization.
- property n_callable_sites
Total callable sites in the analysis universe.
Alias for
n_total_sites. When an accessible mask is set, this is the BED span (mask.total_accessible). When no mask is set but the matrix was loaded withinclude_invariant=True, this is the matrix length at construction. Otherwise None.This is the denominator for per-base span normalization, and
n_callable_sites = n_segregating_sites + n_invariant_sites.
- property n_segregating_sites
Number of polymorphic sites in the matrix.
Counts sites where 0 < derived_count < n_valid (i.e. polymorphic among observed haplotypes), with at least 2 valid samples.
- property n_invariant_sites
Number of invariant sites in the callable span, or None if unknown.
Computed as
n_callable_sites - n_segregating_sites. The matrix may physically contain fewer thann_callable_sitesrows (e.g. for a variants-only VCF the matrix has only the polymorphic rows and the rest are implied invariants from the accessible-mask span); in that case the returned count includes those implied invariants.Note that
num_variants(matrix row count) is generally not equal ton_segregating_sitesand not equal ton_callable_sites– it is just the physical matrix length.
- classmethod from_vcf(path, region=None, samples=None, include_invariant=False, accessible_bed=None, fields=None)[source]
Construct a HaplotypeMatrix from a VCF file.
- Parameters:
path (str) – Path to VCF/BCF file (optionally gzipped + tabix-indexed).
region (str, optional) – Genomic region to load, e.g. ‘chr1:1000000-2000000’. Requires the VCF to be bgzipped and tabix-indexed.
samples (list of str, optional) – Subset of samples to load. If None, loads all samples.
include_invariant (bool) – If True, set n_total_sites from the loaded variant count.
accessible_bed (str, optional) – Path to a BED file defining accessible/callable regions.
fields (list of str, optional) – VCF FORMAT/INFO tags to load alongside GT (e.g.
['GQ', 'DP', 'MQ']). Each tag is auto-resolved: INFO matches land inhm.fields[tag]as a(n_var,)array, FORMAT matches as a(n_var, n_samples)array. Tags missing from the VCF are warned and dropped silently.
- Returns:
Phased haplotype data with sample names stored.
- Return type:
- classmethod from_zarr(path, region=None, accessible_bed=None, pop_assignment=None, streaming='auto', chunk_bp=1500000, prefetch=1, backend='auto', fields=None)[source]
Construct a HaplotypeMatrix from a Zarr store.
Supports both VCZ (bio2zarr) and scikit-allel zarr layouts. Layout is auto-detected. For multi-chromosome stores, region must be specified.
- Parameters:
path (str) – Path to Zarr store directory.
region (str, optional) – Genomic region ‘chrom:start-end’ to load a subset.
accessible_bed (str, optional) – Path to a BED file defining accessible/callable regions.
pop_assignment (str, numpy.ndarray, list, dict, or False, optional) –
Sample-to-population assignments. Accepted forms:
str– path to a tab-delimited file with a header row andsample popcolumns, in the format thatHaplotypeMatrix.load_pop_fileexpects.numpy.ndarrayorlistof lengthn_samples– one population label per sample, in the same order as the store’s sample axis.dictmapping sample id to population label. Any sample not in the dict is left unassigned.False– skip the auto-load.
Default (
None) looks for<path>.pops.tsvnext to the store and loads it if present, announcing the auto-load to stderr.streaming ({'auto', 'always', 'never'}, optional) – Whether to return a
StreamingHaplotypeMatrixthat iterates the store chunk-by-chunk through the GPU (suitable for large-scale stores that do not fit entirely in GPU memory).'always'forces streaming;'never'forces a single-shot load (and raisesMemoryErrorif the matrix would not fit in free GPU memory).'auto'(default) checks the projected footprint of the haplotype matrix against free GPU memory and picks streaming when the full matrix would consume more than half the device. Scikit-allel-formatted zarr always routes to the single-shot path because the only allowed format for'streaming'is VCZ.chunk_bp (int, optional) – Genomic chunk size in bp for the streaming path. Ignored when the data are loaded into GPU memory in one shot.
prefetch (int, optional) – Number of chunks to read ahead on a background thread while the current chunk is being processed – 0 disables read-ahead; 1 (default) overlaps one chunk’s disk read with the previous chunk’s GPU work. Ignored when the data are loaded into GPU memory in one shot.
backend ({'auto', 'host', 'kvikio'}, optional) – Streaming chunk-fetch backend.
'kvikio'decodes the store’s codec on the GPU viakvikio + nvCOMP; only works whencall_genotype’s codec is in the nvCOMP-supported list (zstd, blosc, lz4, deflate) and the sample-axis chunking is bio2zarr-shaped (sample chunk smaller than the full sample axis).'host'is the host-buffer fallback.'auto'(default) picks'kvikio'when both conditions hold and warns +'host'when chunks are whole-sample-axis (the kvikio path gives no speedup at that chunking).fields (list of str, optional) – VCF FORMAT/INFO tags to load alongside genotypes (e.g.
['GQ', 'DP', 'MQ']). Each tag is auto-resolved from the store:variant_<tag>lands inhm.fields[tag]as a(n_var,)array,call_<tag>as(n_var, n_samples). Tags missing from the store are warned and dropped silently. Not supported on the streaming path.
- Return type:
- static vcf_to_zarr(vcf_paths, zarr_path, worker_processes=None, icf_path=None, max_memory='4GB', show_progress=True)[source]
Convert VCF file(s) to VCZ-format Zarr store using bio2zarr.
- Parameters:
vcf_paths (str or list of str) – Path(s) to VCF/BCF files (bgzipped + indexed).
zarr_path (str) – Output Zarr store path.
worker_processes (int, optional) – Number of worker processes. Defaults to os.cpu_count().
icf_path (str, optional) – Directory for intermediate columnar format files. Defaults to
<zarr_path>.icf_tmp(same filesystem as output). ICF files can be 1-3x the VCF size.max_memory (str) – Maximum memory for encoding step. Default ‘4GB’.
show_progress (bool) – Show progress bars. Default True.
- load_pop_file(pop_assignment, pops=None)[source]
Load population assignments from a tab-delimited file or an already-resolved sample->population mapping.
Sets sample_sets from a file (or dict) mapping sample names to populations. Requires that sample names were stored during
from_vcf/from_zarr.- Parameters:
- classmethod from_ts(ts, device='CPU', include_invariant=False, accessible_bed=None, chrom=None)[source]
Create a HaplotypeMatrix from a tskit.TreeSequence.
- Parameters:
ts (TreeSequence) – A tskit.TreeSequence object
device (str) – ‘CPU’ or ‘GPU’
include_invariant (bool) – If True, set n_total_sites from the sequence length so that calculations can account for invariant sites analytically (no extra rows stored).
accessible_bed (str) – Path to a BED file defining accessible regions.
chrom (str) – Chromosome name for BED file filtering.
- Returns:
A new HaplotypeMatrix instance
- Return type:
Notes
Populations declared in the tree sequence (with a name in metadata) are automatically registered in
sample_sets: each pop name maps to the haplotype row indices of its samples. The no-demography case (sim_ancestry(samples=N)with a single unnamed population) leavessample_setsunset. Users who need custom groupings can overwritehm.sample_setsafter construction.
- get_matrix()[source]
Returns the haplotype matrix.
- Returns:
The array representing the haplotype/genotype matrix.
- Return type:
cp.ndarray
- get_positions()[source]
Returns the variant positions.
- Returns:
The array of positions.
- Return type:
cp.ndarray
- property shape
Returns the shape of the haplotype matrix.
- Returns:
- A tuple representing the dimensions (variants, samples)
of the haplotype matrix.
- Return type:
- property num_variants
Returns the number of variants in the haplotype matrix.
- property num_haplotypes
Returns the number of haplotypes in the haplotype matrix.
- get_subset(positions)[source]
Get a subset of the haplotype matrix based on the provided positions.
- Parameters:
positions – A one-dimensional array of indices to select from the haplotype matrix. This can be either a NumPy array or a CuPy array.
- Returns:
A new instance containing the subset of the haplotype matrix.
- Return type:
- get_subset_from_range(low, high)[source]
Get a subset of the haplotype matrix based on a range of positions.
- Parameters:
- Returns:
A new instance containing the subset of the haplotype matrix.
- Return type:
- apply_biallelic_filter()[source]
Apply biallelic filter to remove variants that are not strictly biallelic.
This filter matches the behavior of moments’ get_genotypes function, which uses is_biallelic_01() to remove variants that: 1. Have more than 2 alleles present in the data 2. Don’t have both reference (0) and alternate (1) alleles present
This is the actual filtering that moments does by default, not an AC filter.
- Returns:
A new HaplotypeMatrix instance with filtered variants.
- Return type:
Note
This replicates moments’ is_biallelic_01() filtering behavior.
- is_missing(axis=None)[source]
Detect missing calls (-1 values).
- Parameters:
axis (int, optional) – Axis along which to compute. If None, returns boolean array same shape as haplotypes. If 0, returns missing per variant. If 1, returns missing per sample.
- Returns:
missing – Boolean array indicating missing data
- Return type:
array
- is_called(axis=None)[source]
Detect valid (non-missing) calls.
- Parameters:
axis (int, optional) – Axis along which to compute. If None, returns boolean array same shape as haplotypes. If 0, returns called per variant. If 1, returns called per sample.
- Returns:
called – Boolean array indicating valid data
- Return type:
array
- get_span(mode='auto')[source]
Get the genomic span for normalization calculations.
- Parameters:
mode (str) – ‘auto’ - Use best available: accessible mask > n_total_sites > total genomic span > callable span ‘accessible’ - Use accessible base count from mask (error if no mask set) ‘per_base’ - Use total genomic span (chrom_end - chrom_start) ‘per_variant’ - Use number of variant sites ‘callable’ - Use span from first to last variant position ‘total’ - Alias for ‘per_base’ (backward compatibility) ‘sites’ - Alias for ‘per_variant’ (backward compatibility)
- Returns:
span – The span to use for normalization
- Return type:
- exclude_missing_sites(populations=None)[source]
Return a new matrix with only sites that have no missing data.
- Parameters:
populations (list of (str or list), optional) – If given, only require completeness within these populations. Each element is a population name (looked up in sample_sets) or a list of sample indices.
- Returns:
Filtered matrix. Returns self if no missing data.
- Return type:
- Raises:
ValueError – If no sites remain after filtering.
- filter_variants_by_missing(max_missing_freq=0.1)[source]
Return a new HaplotypeMatrix with variants filtered by missing data frequency.
- Parameters:
max_missing_freq (float) – Maximum allowed frequency of missing data per variant
- Returns:
filtered – New HaplotypeMatrix with filtered variants
- Return type:
- filter(variants=None, genotypes=None, drop_all_missing=True)[source]
Return a new HaplotypeMatrix with quality filters applied.
- Parameters:
variants (array-like of bool, shape (n_variants,), optional) – Per-variant keep mask. Variants where
Falseare dropped from every array on the returned matrix (haplotypes, positions, all per-variant and per-genotypefields).genotypes (array-like of bool, shape (n_variants, n_samples), optional) – Per-genotype keep mask. Genotypes where
Falseare set to-1(missing) on both haplotype rows of the affected sample.fieldsarrays are not zeroed out – the per-genotype QC values remain available for downstream inspection.drop_all_missing (bool, default True) – After applying both masks, drop variants whose every haplotype call is
-1. Default keeps the output free of rows where the genotype mask blanked every call.
- Returns:
Fresh matrix; the underlying haplotype + position arrays are new allocations (not views of
self). Sample names,sample_sets, andchrom_start/chrom_endare propagated.n_total_sitesis dropped because the variant axis has changed; if the caller wants span normalization on the result they should re-attach an accessibility mask explicitly. Any accessibility mask previously attached toselfis not inherited for the same reason.- Return type:
Notes
filteroperates on the underlying haplotype matrix (self._haplotypes), not the view exposed when an accessibility mask is attached. The QCfieldsarrays this method reads from were loaded against the underlying axis, so masks built from them stay aligned automatically.
- summarize_missing_data()[source]
Get summary statistics about missing data patterns.
- Returns:
summary – Dictionary with missing data statistics
- Return type:
- allele_frequency_spectrum()[source]
Calculate the allele frequency spectrum for a haplotype matrix.
Note: This method is deprecated. Use diversity.allele_frequency_spectrum() instead.
- Return type:
- diversity(span_normalize=True)[source]
Calculate the nucleotide diversity (π) for the haplotype matrix.
Note: This method is deprecated. Use diversity.pi() instead.
- watersons_theta(span_normalize=True)[source]
Calculate Waterson’s theta for the haplotype matrix.
Note: This method is deprecated. Use diversity.theta_w() instead.
- Return type:
- Tajimas_D()[source]
Calculate Tajima’s D for the haplotype matrix.
Note: This method is deprecated. Use diversity.tajimas_d() instead.
- Return type:
- pairwise_LD_v()[source]
Pairwise linkage disequilibrium (D statistic) via matrix multiply.
- Return type:
- pairwise_r2(estimator='r2')[source]
Pairwise r-squared for all variant pairs.
- Parameters:
estimator (str) –
'r2'(default) – naive haplotype r² from frequency-based formula.'rogers_huff'– Rogers-Huff r² computed on diploid 0/1/2 dosages obtained by pairing adjacent haplotypes (sample 0 = haplotypes 0,1; sample 1 = haplotypes 2,3; …). Matchesscikit-allel.rogers_huff_r()** 2.- Return type:
cupy.ndarray, float64, shape (n_variants, n_variants)
- locate_unlinked(size=100, step=20, threshold=0.1)[source]
Locate variants in approximate linkage equilibrium.
Uses a sliding window approach to identify variants whose r-squared with all other variants in the window is below the threshold.
- Parameters:
- Returns:
True for variants in approximate linkage equilibrium.
- Return type:
- windowed_r_squared(bp_bins, percentile=50, pop=None, estimator='r2')[source]
Compute percentiles of r-squared in genomic distance bins.
- Parameters:
bp_bins (array_like) – Bin edges for genomic distances in base pairs.
percentile (float or array_like) – Percentile(s) to compute within each bin.
pop (str, optional) – Population key to use.
estimator (str) –
'r2'(default) – naive haplotype r² from frequency counts.'rogers_huff'– Rogers-Huff r² on diploid 0/1/2 dosages obtained by pairing adjacent haplotypes; matchesscikit-allel.rogers_huff_r()** 2.
- Returns:
result (ndarray, shape (n_bins,) or (n_bins, n_percentiles)) – Percentile(s) of r-squared per bin.
counts (ndarray, int, shape (n_bins,)) – Number of variant pairs per bin.
- tally_gpu_haplotypes(pop=None)[source]
GPU implementation of computing pairwise haplotype tallies. Automatically detects and handles missing data if present.
- Parameters:
pop (str, optional) – Population key from sample_sets to use. If None, uses all samples.
- Returns:
- (counts, n_valid) where:
counts: Array of shape (#pairs, 4) containing [n11, n10, n01, n00] for each variant pair
- n_valid: Array of shape (#pairs,) containing the number of valid haplotypes for each pair
or None if no missing data is present
- Return type:
- tally_gpu_haplotypes_two_pops_with_missing(pop1, pop2)[source]
GPU implementation of computing pairwise haplotype tallies for two populations with missing data support.
For each variant pair, only counts haplotypes where both variants are non-missing in both populations. Missing data is encoded as -1 in the haplotype matrix.
- Parameters:
- Returns:
- (counts, n_valid1, n_valid2) where:
counts: Array of shape (#pairs, 8) containing counts for both populations [n11_1, n10_1, n01_1, n00_1, n11_2, n10_2, n01_2, n00_2]
n_valid1: Array of shape (#pairs,) with valid haplotypes for pop1
n_valid2: Array of shape (#pairs,) with valid haplotypes for pop2
- Return type:
- tally_gpu_haplotypes_two_pops(pop1, pop2)[source]
GPU version of tallying haplotype counts between all pairs of variants for two populations. Automatically detects and handles missing data if present.
- Returns:
- (counts, n_valid1, n_valid2) where:
counts: Array of shape (#pairs, 8) containing counts for both populations
n_valid1: Array of shape (#pairs,) with valid haplotypes for pop1 (or None if no missing data)
n_valid2: Array of shape (#pairs,) with valid haplotypes for pop2 (or None if no missing data)
- Return type:
- Parameters:
- compute_ld_statistics_gpu_single_pop(bp_bins, raw=False, ac_filter=True, chunk_size='auto')[source]
GPU-based LD statistics computation for a single population.
Computes DD, Dz, and pi2 statistics for variant pairs binned by distance. Only processes pairs within max(bp_bins) distance for memory efficiency.
- Parameters:
bp_bins (array-like) – Array of bin boundaries in base pairs. Pairs are binned by distance into intervals [bp_bins[i], bp_bins[i+1]).
raw (bool, optional) – If True, return raw sums of statistics across pairs in each bin. If False (default), return means.
ac_filter (bool, optional) – If True (default), apply biallelic filtering before computation.
chunk_size (int or 'auto', optional) – Number of pairs to process per chunk. If ‘auto’ (default), automatically estimates optimal size based on available GPU memory. Can specify an integer for manual control.
- Returns:
Dictionary mapping (bin_start, bin_end) tuples to tuples of statistics. Each bin contains (DD, Dz, pi2) values.
- Return type:
Examples
>>> bp_bins = [0, 10000, 50000, 100000] >>> stats = hm.compute_ld_statistics_gpu_single_pop(bp_bins) >>> stats[(0.0, 10000.0)] # (DD, Dz, pi2) for first bin
- compute_ld_statistics_gpu_two_pops(bp_bins, pop1, pop2, raw=False, ac_filter=True, chunk_size='auto')[source]
GPU-based LD statistics computation for two populations.
Computes DD, Dz, and pi2 statistics for variant pairs binned by distance. Only processes pairs within max(bp_bins) distance for memory efficiency.
- Parameters:
bp_bins (array-like) – Array of bin boundaries in base pairs. Pairs are binned by distance into intervals [bp_bins[i], bp_bins[i+1]).
pop1 (str) – Name of first population (must exist in sample_sets)
pop2 (str) – Name of second population (must exist in sample_sets)
raw (bool, optional) – If True, return raw sums of statistics across pairs in each bin. If False (default), return means.
ac_filter (bool, optional) – If True (default), apply biallelic filtering before computation.
chunk_size (int or 'auto', optional) – Number of pairs to process per chunk. If ‘auto’ (default), automatically estimates optimal size based on available GPU memory. Can specify an integer for manual control.
- Returns:
Dictionary mapping (bin_start, bin_end) tuples to OrderedDict of statistics. Each bin contains 15 statistics: - DD_0_0, DD_0_1, DD_1_1 (D squared) - Dz_0_0_0, Dz_0_0_1, Dz_0_1_1, Dz_1_0_0, Dz_1_0_1, Dz_1_1_1 - pi2_0_0_0_0, pi2_0_0_0_1, pi2_0_0_1_1, pi2_0_1_0_1, pi2_0_1_1_1, pi2_1_1_1_1
- Return type:
Examples
>>> bp_bins = [0, 10000, 50000, 100000] >>> stats = hm.compute_ld_statistics_gpu_two_pops(bp_bins, 'pop1', 'pop2') >>> stats[(0.0, 10000.0)]['DD_0_0'] # D^2 for pop1 in first bin
GenotypeMatrix
- class pg_gpu.GenotypeMatrix(genotypes, positions, chrom_start=None, chrom_end=None, sample_sets=None, n_total_sites=None, samples=None, accessible_mask=None, fields=None)[source]
Bases:
objectDiploid genotype matrix with values 0 (hom ref), 1 (het), 2 (hom alt).
Shape: (n_individuals, n_variants). Missing data encoded as -1.
- Parameters:
genotypes (ndarray, int8, shape (n_individuals, n_variants)) – Alt allele count per individual per variant.
positions (ndarray, shape (n_variants,)) – Variant positions.
chrom_start (int, optional) – Chromosome boundaries.
chrom_end (int, optional) – Chromosome boundaries.
sample_sets (dict, optional) – Maps population names to lists of individual indices.
- property genotypes
- property positions
- property accessible_mask
- property device
- property sample_sets
- property shape
- property num_variants
- property num_individuals
- property has_accessible_mask
Whether an accessible site mask is attached.
- set_accessible_mask(mask_or_path, chrom=None)[source]
Attach an accessible site mask (non-destructive).
Returns self for chaining.
- Parameters:
mask_or_path (str, path-like, numpy.ndarray, or AccessibleMask) – BED file path, boolean array, or AccessibleMask instance.
chrom (str, optional) – Chromosome name (required for BED file input).
- remove_accessible_mask()[source]
Remove the accessible mask, restoring all original variants.
Returns self for chaining.
- property has_invariant_info
Whether invariant site information is available for span normalization.
- property n_callable_sites
Total callable sites in the analysis universe.
Alias for
n_total_sites. SeeHaplotypeMatrix.n_callable_sitesfor full semantics.
- property n_segregating_sites
Number of polymorphic sites in the matrix.
Counts sites where 0 < alt_count < 2 * n_valid (polymorphic among observed diploid genotypes), with at least 2 valid samples.
- property n_invariant_sites
Number of invariant sites in the callable span, or None if unknown.
Computed as
n_callable_sites - n_segregating_sites. SeeHaplotypeMatrix.n_invariant_sitesfor full semantics.
- classmethod from_haplotype_matrix(hap_matrix)[source]
Convert a HaplotypeMatrix to a GenotypeMatrix.
Pairs consecutive haplotypes (0,1), (2,3), … as diploid individuals. Genotype = sum of paired haplotypes (0, 1, or 2).
- Parameters:
hap_matrix (HaplotypeMatrix) – Haploid data. Must have even number of haplotypes.
- Return type:
- to_haplotype_matrix()[source]
Convert back to HaplotypeMatrix (expand diploid to haploid).
Each individual becomes two consecutive haplotypes. Het sites (genotype=1) are assigned as (0,1).
- Return type:
- classmethod from_vcf(path, include_invariant=False, accessible_bed=None, fields=None)[source]
Construct from a VCF file.
- Parameters:
path (str) – Path to VCF file.
include_invariant (bool) – If True, include invariant sites and set n_total_sites.
accessible_bed (str, optional) – Path to a BED file defining accessible/callable regions.
fields (list of str, optional) – VCF FORMAT/INFO tags to load (e.g.
['GQ', 'DP', 'MQ']); seeHaplotypeMatrix.from_vcffor the shape contract. Arrays are sliced down to the biallelic-only variant set this constructor keeps, so per-variant fields end up shape(n_kept,)and per-genotype fields(n_kept, n_samples).
- Return type:
- classmethod from_zarr(path, region=None, accessible_bed=None, include_invariant=False, pop_assignment=None, streaming='auto', chunk_bp=1500000, prefetch=1, backend='auto', fields=None)[source]
Construct a GenotypeMatrix from a Zarr store.
Identical interface to
HaplotypeMatrix.from_zarr; see that method’s docstring for the meaning of every kwarg (streaming,pop_assignmentflexibility,chunk_bp,prefetch,backend,fields). The only difference is the returned type: this returnsGenotypeMatrix(orStreamingGenotypeMatrixon the streaming path) where each row is a (n_indiv, ploidy) genotype call, versusHaplotypeMatrixwhich presents the same data as a (n_haplotypes, n_variants) phased matrix.- Parameters:
include_invariant (bool) – If True, set
n_total_sitesfrom the loaded variant count. Unique to this entry point (the haplotype matrix does not need it).path – See
HaplotypeMatrix.from_zarr.region – See
HaplotypeMatrix.from_zarr.accessible_bed – See
HaplotypeMatrix.from_zarr.pop_assignment – See
HaplotypeMatrix.from_zarr.streaming (str) – See
HaplotypeMatrix.from_zarr.chunk_bp (int) – See
HaplotypeMatrix.from_zarr.prefetch (int) – See
HaplotypeMatrix.from_zarr.backend (str) – See
HaplotypeMatrix.from_zarr.fields (list) – See
HaplotypeMatrix.from_zarr.
- Return type:
- filter(variants=None, genotypes=None, drop_all_missing=True)[source]
Return a new GenotypeMatrix with quality filters applied.
- Parameters:
variants (array-like of bool, shape (n_variants,), optional) – Per-variant keep mask; variants where
Falseare dropped from every array on the returned matrix.genotypes (array-like of bool, shape (n_variants, n_samples), optional) – Per-genotype keep mask; genotypes where
Falseare set to-1(missing).fieldsentries keep their original per-genotype QC values.drop_all_missing (bool, default True) – After applying both masks, drop variants where every individual’s call is
-1.
- Returns:
Fresh matrix; underlying arrays are new allocations.
samples,sample_sets,chrom_start/chrom_endare preserved. SeeHaplotypeMatrix.filterfor the accessibility-mask interaction (this method behaves the same way).- Return type:
- load_pop_file(pop_assignment, pops=None)[source]
Load population assignments from a tab-delimited file or an already-resolved sample->population mapping.
Biobank-Scale Streaming
A VCZ store is bio2zarr’s Zarr-on-disk encoding of a VCF: the
genotype matrix is split into compressed chunks, each chunk a small
array of samples by variants. For VCZ stores that do not fit
entirely in GPU memory, HaplotypeMatrix.from_zarr /
GenotypeMatrix.from_zarr can return a streaming view
(streaming='always', or streaming='auto' with a too-large
projected footprint). The streaming object iterates the chromosome
chunk by chunk through every kernel that accepts a fully loaded
matrix; see Biobank-Scale Streaming from VCZ for the end-to-end
pattern and the VCF-to-VCZ conversion step. The reader requires
diploid genotypes; haploid and polyploid stores are rejected.
- class pg_gpu.streaming_matrix.StreamingHaplotypeMatrix(source, fetcher, chunk_bp, prefetch, *, align_bp=None)[source]
Bases:
_StreamingMatrixBaseChunked view over a
ZarrGenotypeSource, yielding per-chunkHaplotypeMatrixinstances.Returned by
HaplotypeMatrix.from_zarrwhen the requested matrix does not fit eagerly on the GPU. See_StreamingMatrixBasefor the iteration / materialize / sample_sets contract.- property align_bp
Chunk-boundary alignment in bp.
Every chunk has a width that is a multiple of this, so a window whose size also divides
align_bpis guaranteed to fit inside a single chunk. Streaming-aware kernels read this property to validate that the caller’swindow_sizedivides it.
- property chrom_end
Chunk-grid right edge (exclusive). Not the last variant position; per-chunk windows are uniform width within their chunk, so the right edge here is the chunk grid’s exclusive upper bound rather than HaplotypeMatrix’s last-inclusive convention.
- property chrom_start
Chunk-grid origin. Not the first variant position – per-chunk windows are anchored to the chunk grid, so reporting the variant-based origin would be misleading for callers building a comparable eager matrix.
- iter_gpu_chunks()
Yield
(left, right, eager_matrix)tuples covering the source.Each yielded eager matrix lives on the GPU and represents one genomic chunk. Empty chunks (regions with no variants, e.g. an acrocentric arm) are skipped – callers see only chunks with at least one variant.
- materialize(*, region=None, sample_subset=None)
Build an eager matrix over a sub-region.
Pairwise / cross-window kernels (
pairwise_r2,grm,ibs,locate_unlinked, the r^2 heatmap path) can’t be evaluated chunk-by-chunk because they need every (variant, variant) pair simultaneously. Pull the slice you want into one device-resident eager matrix and run those kernels on it.
- property sample_sets
Population -> sample-axis indices. Falls back to a single ‘all’ set when no pop file was resolved.
- class pg_gpu.streaming_matrix.StreamingGenotypeMatrix(source, fetcher, chunk_bp, prefetch, *, align_bp=None)[source]
Bases:
_StreamingMatrixBaseChunked view over a
ZarrGenotypeSource, yielding per-chunkGenotypeMatrix(dosage-coded(n_indiv, n_var)) instances.Returned by
GenotypeMatrix.from_zarrwhen the requested matrix does not fit eagerly on the GPU. Sample sets index the diploid axis (0..num_individuals-1), not the haplotype axis.ibsaccepts this class directly;grmstill needs an eager sub-region via.materialize(region=...).- property align_bp
Chunk-boundary alignment in bp.
Every chunk has a width that is a multiple of this, so a window whose size also divides
align_bpis guaranteed to fit inside a single chunk. Streaming-aware kernels read this property to validate that the caller’swindow_sizedivides it.
- property chrom_end
Chunk-grid right edge (exclusive). Not the last variant position; per-chunk windows are uniform width within their chunk, so the right edge here is the chunk grid’s exclusive upper bound rather than HaplotypeMatrix’s last-inclusive convention.
- property chrom_start
Chunk-grid origin. Not the first variant position – per-chunk windows are anchored to the chunk grid, so reporting the variant-based origin would be misleading for callers building a comparable eager matrix.
- iter_gpu_chunks()
Yield
(left, right, eager_matrix)tuples covering the source.Each yielded eager matrix lives on the GPU and represents one genomic chunk. Empty chunks (regions with no variants, e.g. an acrocentric arm) are skipped – callers see only chunks with at least one variant.
- materialize(*, region=None, sample_subset=None)
Build an eager matrix over a sub-region.
Pairwise / cross-window kernels (
pairwise_r2,grm,ibs,locate_unlinked, the r^2 heatmap path) can’t be evaluated chunk-by-chunk because they need every (variant, variant) pair simultaneously. Pull the slice you want into one device-resident eager matrix and run those kernels on it.
- property sample_sets
Population -> sample-axis indices. Falls back to a single ‘all’ set when no pop file was resolved.
- class pg_gpu.zarr_source.ZarrGenotypeSource(path, *, region=None, pop_assignment=None, contig_id=None)[source]
Bases:
objectOpen a VCZ store and expose chunked + subset read primitives.
- Parameters:
path (str) – Filesystem path to a VCZ zarr store.
region (str, optional) –
'chrom:start-end'to restrict the source to a single sub-region. The source is single-contig regardless: a multi-contig store withoutregionraises.pop_assignment (str, optional) – Tab-delimited file mapping
sample->pop(same format asHaplotypeMatrix.load_pop_file). DefaultNonelooks for<path>.pops.tsvnext to the store and uses it if present; emits a one-line stderr note so the auto-load isn’t invisible. Passpop_assignment=Falseto disable the auto-load entirely.contig_id (str, optional) – Pick a contig by name when
regionis not given and the store has multiple contigs. Ignored otherwise.
- site_pos
Variant positions (host), length
num_variants.- Type:
ndarray
- pop_cols
{pop_name: ndarray[hap_axis_indices]}when a pop file was resolved, else None.- Type:
dict or None
- slice_region(left, right)[source]
Read every haplotype for variants in
[left, right).- Returns:
gt (ndarray, shape (n_var, num_diploids, 2), dtype int8) – Raw call_genotype block,
-1for multiallelic rows.pos (ndarray, shape (n_var,), dtype int64) – Variant positions.
- slice_region_gpu(left, right, *, cg)[source]
Like
slice_regionbut reads through a caller-providedcg– thecall_genotypezarr array on a group opened from akvikio.zarr.GDSStorewith the GPU buffer prototype active. The chunk lands on the GPU via the active buffer prototype.cgis passed in (not opened here) because zarr caches the codec pipeline on the group at open time; the caller opens the group once after togglingzarr.config.enable_gpu()and reuses the resulting array across every chunk in the iteration. Per-call opens would cost ~ms per chunk and add nothing.
- slice_subsample_gpu(left, right, hap_cols, *, cg)[source]
Like
slice_subsample(to_gpu=True)but reads chunks through the caller-providedcg– acall_genotypezarr array on a group opened from akvikio.zarr.GDSStorewith the GPU buffer prototype active. Decompression happens on the GPU (nvCOMP) rather than via the synchronous host-sideoindexcodec pipeline, which at biobank-scale sample subsets is minutes per call.Sample-axis subsetting works by grouping the requested diploids into contiguous runs (typically one or two for a population-subset read), reading each run as a contiguous
cg[zlo:zhi, rlo:rhi, :]slab that kvikio decompresses directly to GPU memory, then doing one GPU fancy-index to assemble the(n_var, len(hap_cols))result. Pathologically scattered subsets (every dip its own run) fall back toslice_subsample(to_gpu=True).- Returns:
gm (cupy.ndarray, shape (n_var, len(hap_cols)), dtype int8)
pos (ndarray, shape (n_var,), dtype int64)
- slice_subsample(left, right, hap_cols, *, to_gpu=False)[source]
Read variants in
[left, right)restricted tohap_cols.hap_colsis an iterable of haplotype-axis indices in[0, 2 * num_diploids). Uses zarr’soindexfor the sample-axis subset; with bio2zarr-style sample chunking only the chunks containing the requested diploids are decompressed.- Parameters:
to_gpu (bool) – When
Truethe returnedgmis a cupy array on the GPU. The ploidy gather is done on the GPU regardless – this kwarg only controls whethergmgets downloaded back to host before returning. Callers that immediately upload the result (e.g.materialize) should passto_gpu=Trueto avoid the round-trip.- Returns:
gm (ndarray (host) or cupy.ndarray (device), shape) –
(n_var, len(hap_cols)), dtypeint8-1for multiallelic rows.pos (ndarray, shape (n_var,), dtype int64) – Variant positions.
- iter_chunks(chunk_bp, align_bp=None, start=None)[source]
Yield
(left, right)genomic intervals tiling the source.Intervals are sized as multiples of
align_bp(defaults tochunk_bp) so a caller running window-based stats can guarantee windows never straddle a chunk boundary. Empty regions at the start of the contig (e.g. acrocentric arm) are yielded as well – the caller can detect and skip them via the cheapgt.shape[0] == 0check.
- class pg_gpu.streaming_matrix.KvikioChunkFetcher(source, *, compat_mode='ON', num_threads=8)[source]
Bases:
ChunkFetcherRead chunks via
kvikio.zarr.GDSStorewith on-GPU codec decode.Hardware / software requirements (the kvikio path falls back to
HostChunkFetcherif these are not met):NVIDIA driver and a CUDA-capable GPU; pg_gpu already requires both.
kvikioandnvidia-nvcompPython packages in the same environment aspg_gpu(the project’s pixi env ships these by default).The VCZ store’s
call_genotypechunks compressed with a codec in nvCOMP’s GPU-decode list – currently zstd, blosc, lz4, or deflate. bio2zarr defaults to zstd. Other codecs surface asValueErroron construction so the caller can re-encode.Sample-axis chunking smaller than the full sample axis. A store with one whole-sample-axis chunk per variant chunk is still readable but gets no kvikio speedup, so
backend='auto'warns and falls back to the host path.
No special filesystem support is needed. The fetcher opens the store through
kvikio.zarr.GDSStoreand switches zarr 3 to its GPU buffer prototype so nvCOMP decodes the codec directly on the device –build_haplotype_matrixthen skips the host-to-device copy incp.asarray(gt)becausegtis already a cupy array. The win is in that on-GPU decode, not in GPU Direct Storage.By default this fetcher sets
kvikio.defaults.compat_mode=ONso kvikio reads bytes through posix into bounce buffers rather than trying to use GPU Direct Storage (which would be slower on hosts without/etc/cufile.jsonconfigured because of repeated failed cufile probes). Passcompat_mode=AUTOif the storage path is known to be GDS-capable.Construction probes the store’s
call_genotypecodec and raisesValueErrorif it’s not in the nvCOMP-supported list, rather than silently returning incorrect data through a CPU fallback. On close the zarr buffer prototype is reset so subsequent eager calls in the same process get numpy buffers back.- close()[source]
Reset the zarr buffer prototype.
Called automatically at the end of
iter_chunks; exposed for tests and for callers that build a fetcher without immediately iterating it. There is no__del__– Python’s GC timing is nondeterministic, so cleanup is via try/finally insideiter_chunksand explicitclose()for callers that construct without iterating.
- class pg_gpu.streaming_matrix.HostChunkFetcher(source)[source]
Bases:
ChunkFetcherRead chunks through the source’s host-buffer
slice_region.prefetch >= 1spawns a daemon producer thread that fills a bounded queue with the next chunk while the consumer is computing on the current chunk. A GPU-codec-decode fetcher (kvikio + nvidia-nvcomp) is a drop-in replacement for the same ABC when the store’s codec is GPU-decodable.
- class pg_gpu.MemoryLimitedWarning[source]
Bases:
UserWarningA load that is at risk of exhausting memory or taking far longer than the VCZ path would.
Emitted before parsing by
HaplotypeMatrix.from_vcfandGenotypeMatrix.from_vcfwhen the VCF on disk is large enough, or has enough samples, that VCF text parsing will be slow and the full-matrix host load may not even fit. The warning text includes a copy-pastable recipe for converting the VCF to VCZ viaHaplotypeMatrix.vcf_to_zarr. Silence with:import warnings from pg_gpu import MemoryLimitedWarning warnings.filterwarnings("ignore", category=MemoryLimitedWarning)
- pg_gpu.zarr_io.allel_zarr_to_vcz(allel_path, vcz_path, *, contig=None, region=None, variant_chunk=10000, sample_chunk=1000, progress=False)[source]
Convert a scikit-allel zarr store to vcz (bio2zarr) layout.
Streams
calldata/GTin variant blocks so chromosome-scale allel stores – which the eagerread_genotypes_allelwould OOM on – can be converted without materializing the full genotype matrix. Writescall_genotypewith bio2zarr-style sample-axis chunking so the pg_gpu streaming reader’s kvikio backend can decode chunks on the GPU.- Parameters:
allel_path (str) – Source allel store. Either a flat layout with
calldata/GTat the root, or a chromosome-grouped layout where each<contig>/calldata/GTis one chromosome. Both are accepted.vcz_path (str) – Destination vcz store path. Overwritten if it exists.
contig (str, optional) – For grouped stores, the chromosome key to read. Required when the source is grouped and
regiondoes not name a contig. For flat stores, used as thecontig_idlabel in the output.region (str, optional) –
"chrom:start-end"(or"start-end"on a flat store) restricting the conversion to a position range.variant_chunk (int) – Source variants read per streaming pass. Larger amortizes zarr read overhead; smaller bounds host RAM at biobank sample counts.
sample_chunk (int) – Sample-axis chunk size in the output
call_genotype. Smaller chunks let the streaming reader’s kvikio + nvCOMP decode overlap with compute on biobank-scale stores.progress (bool) – Print a one-line status per variant chunk to stderr.
Notes
Only the fields pg_gpu’s streaming reader requires are written:
call_genotype,call_genotype_mask,variant_position,sample_id,contig_id,variant_contig. Other allel columns (REF, ALT, FILTER, etc.) are not preserved.
LD Statistics
Core Functions
- pg_gpu.ld_statistics.dd(counts, populations=None, n_valid=None)[source]
Compute D² statistic for any population configuration.
- Parameters:
counts (cp.ndarray) – Haplotype counts array: - Single population: shape (N, 4) - Two populations: shape (N, 8) - Multi-population: shape (N, 4*P)
populations (tuple of int, optional) – Population indices. None for single population, (i, j) for between populations i and j
n_valid (cp.ndarray, optional) – Valid sample counts per population. Shape depends on configuration: - Single pop: shape (N,) - Two pops: shape (N, 2) or tuple of (N,) arrays
- Returns:
D² values for each locus
- Return type:
cp.ndarray
- pg_gpu.ld_statistics.dz(counts, populations=None, n_valid=None)[source]
Compute Dz statistic for any population configuration.
- Parameters:
- Returns:
Dz values for each locus
- Return type:
cp.ndarray
- pg_gpu.ld_statistics.pi2(counts, populations=None, n_valid=None)[source]
Compute π₂ statistic for any population configuration.
- Parameters:
- Returns:
π₂ values for each locus
- Return type:
cp.ndarray
- pg_gpu.ld_statistics.r(counts, n_valid=None)[source]
Compute Pearson correlation coefficient r between variant pairs from haplotype counts.
- Parameters:
counts (cp.ndarray, shape (N, 4)) – Haplotype counts [n11, n10, n01, n00] for each variant pair.
n_valid (cp.ndarray, optional) – Valid sample counts per pair. Shape (N,).
- Returns:
Pearson r values. NaN where computation is undefined (monomorphic at either locus).
- Return type:
cp.ndarray, float64, shape (N,)
Convenience Functions
- pg_gpu.ld_statistics.dd_within(counts, n_valid=None)[source]
Compute D² within a single population.
Convenience function equivalent to dd(counts, populations=None)
- pg_gpu.ld_statistics.dd_between(counts, pop1_idx, pop2_idx, n_valid=None)[source]
Compute D² between two populations.
Convenience function equivalent to dd(counts, populations=(pop1_idx, pop2_idx))
- pg_gpu.ld_statistics.compute_ld_statistics(counts, statistics=['dd', 'dz', 'pi2'], populations=None, n_valid=None)[source]
Compute multiple LD statistics in one pass.
- Parameters:
counts (cp.ndarray) – Haplotype counts array
statistics (list of str) – Statistics to compute (‘dd’, ‘dz’, ‘pi2’)
populations (dict, optional) – Population configurations for each statistic. E.g., {‘dd’: (0, 1), ‘dz’: (0, 0, 1), ‘pi2’: (0, 0, 1, 1)}
n_valid (cp.ndarray, optional) – Valid sample counts per population
- Returns:
Dictionary mapping statistic names to computed values
- Return type:
- pg_gpu.ld_statistics.zns(r2_matrix_or_matrix, missing_data='include', estimator='auto')[source]
Kelly’s ZnS: mean pairwise r-squared across all SNP pairs.
- Parameters:
r2_matrix_or_matrix (ndarray, HaplotypeMatrix, or GenotypeMatrix) – Square r-squared matrix, or a matrix object (dispatches to haploid or diploid r-squared computation automatically). When a HaplotypeMatrix is passed, uses tiled computation to avoid materializing the full m×m r² matrix.
missing_data (str) –
'include'(default) uses per-site valid data for frequency computation.'exclude'filters to sites with no missing data.estimator (str) –
'auto'(default) uses the unbiasedsigma_d2estimator when the input is aHaplotypeMatrix, and falls back to naiver2for pre-computed r² arrays orGenotypeMatrixinputs (wheresigma_d2is not available).'r2'always computes naive r-squared.'sigma_d2'always uses the unbiased multinomial projection estimators (Ragsdale & Gravel 2019), computing mean \(\sigma_D^2 = D^2/\pi_2\) per pair with falling-factorial corrections. RequiresHaplotypeMatrixinput.
- Returns:
Mean r-squared (or mean sigma_D^2 when sigma_d2 is selected).
- Return type:
- pg_gpu.ld_statistics.omega(r2_matrix_or_matrix, missing_data='include', estimator='auto')[source]
Kim and Nielsen’s Omega: max ratio of within-partition to cross-partition mean LD.
For each possible SNP partition point l, splits variants into [0:l) and [l:m), computes mean r-squared within each block and between blocks. Returns max(mean_within / mean_cross).
Uses GPU prefix sums on the upper triangle to evaluate all partition points without a Python loop. Matches diploSHIC’s convention of using upper-triangle pairs only.
- Parameters:
r2_matrix_or_matrix (ndarray, HaplotypeMatrix, or GenotypeMatrix) – Square r-squared matrix, or a matrix object (dispatches to haploid or diploid r-squared computation automatically).
missing_data (str) –
'include'(default) uses per-site valid data for frequency computation.'exclude'filters to sites with no missing data.estimator (str) –
'auto'(default) uses the unbiasedsigma_d2estimator when the input is aHaplotypeMatrix, and falls back to naiver2for pre-computed r² arrays orGenotypeMatrixinputs (wheresigma_d2is not available).'r2'always computes naive r-squared.'sigma_d2'always uses unbiased \(\sigma_D^2 = D^2/\pi_2\) (Ragsdale & Gravel 2019). RequiresHaplotypeMatrixinput.
- Returns:
Maximum omega value. Returns 0 if fewer than 5 SNPs.
- Return type:
- pg_gpu.ld_statistics.mu_ld(haplotype_matrix, missing_data='include')[source]
mu_LD: haplotype pattern exclusivity between left/right halves (RAiSD).
Splits variants at midpoint and measures how exclusively haplotype patterns associate across halves. Elevated at sweep boundaries where LD structure changes abruptly.
- Parameters:
haplotype_matrix (HaplotypeMatrix)
missing_data (str) – ‘include’ - treat missing as wildcard in pattern matching ‘exclude’ - filter to sites with no missing data
- Return type:
Population Specifications
Single population:
populations=Noneor omitTwo populations:
populations=(0, 1)for DDThree indices:
populations=(0, 0, 1)for DzFour indices:
populations=(0, 0, 1, 1)for pi2
Diversity Statistics
Core Functions
- pg_gpu.diversity.pi(haplotype_matrix, population=None, span_normalize=True, missing_data='include')[source]
Calculate nucleotide diversity (pi) for a population.
Nucleotide diversity is the average number of nucleotide differences per site between two randomly chosen sequences from the population.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
span_normalize (bool) –
True(default): auto-detect best denominator (accessible bases if mask set, else genomic span).False: return raw sum.missing_data (str) –
'include'(default) uses per-site valid data.'exclude'filters to sites with no missing data.
- Returns:
Nucleotide diversity value
- Return type:
- pg_gpu.diversity.theta_w(haplotype_matrix, population=None, span_normalize=True, missing_data='include')[source]
Calculate Watterson’s theta for a population.
Watterson’s theta is an estimator of the population mutation rate based on the number of segregating sites.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
span_normalize (bool) –
True(default): auto-detect best denominator.False: return raw sum.missing_data (str) –
'include'(default) uses per-site valid data.'exclude'filters to sites with no missing data.
- Returns:
Watterson’s theta value
- Return type:
- pg_gpu.diversity.tajimas_d(haplotype_matrix, population=None, missing_data='include')[source]
Calculate Tajima’s D statistic.
Tajima’s D tests the neutral mutation hypothesis by comparing two estimates of the population mutation rate.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
missing_data (str) –
'include'(default) uses per-site valid data with harmonic mean of per-site sample sizes for variance terms.'exclude'filters to sites with no missing data.
- Returns:
Tajima’s D value
- Return type:
- pg_gpu.diversity.haplotype_diversity(haplotype_matrix, population=None, missing_data='include')[source]
Calculate haplotype diversity for a population.
Haplotype diversity is defined as 1 - sum(p_i^2) where p_i is the frequency of the i-th unique haplotype in the population.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
missing_data (str) – ‘include’ - exclude haplotypes with any missing data ‘exclude’ - filter to sites with no missing data
- Returns:
Haplotype diversity value
- Return type:
- pg_gpu.diversity.allele_frequency_spectrum(haplotype_matrix, population=None, missing_data='include')[source]
Calculate the allele frequency spectrum (AFS).
The AFS is a histogram of allele frequencies across all sites.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
missing_data (str) – ‘include’ - Calculate AFS using available data per site. ‘exclude’ - Only use sites with no missing data.
- Returns:
Array where element i contains the number of sites with i derived alleles
- Return type:
ndarray
- pg_gpu.diversity.segregating_sites(haplotype_matrix, population=None, missing_data='include')[source]
Count the number of segregating sites.
A site is segregating if it has more than one allele present.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
missing_data (str) – ‘include’ - Count sites as segregating based on non-missing data only. ‘exclude’ - Only count sites with no missing data.
- Returns:
Number of segregating sites
- Return type:
- pg_gpu.diversity.singleton_count(haplotype_matrix, population=None, missing_data='include')[source]
Count the number of singleton variants.
A singleton is a variant present in exactly one haplotype.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
missing_data (str) – ‘include’ - Count singletons based on non-missing data only. ‘exclude’ - Only count singletons at sites with no missing data.
- Returns:
Number of singleton variants
- Return type:
- pg_gpu.diversity.fay_wus_h(haplotype_matrix, population=None, missing_data='include')[source]
Calculate Fay and Wu’s H statistic.
Tests for an excess of high-frequency derived alleles, which can indicate positive selection.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data
- Returns:
Fay and Wu’s H value
- Return type:
- pg_gpu.diversity.heterozygosity_expected(haplotype_matrix, population=None, missing_data='include')[source]
Compute expected heterozygosity (gene diversity) per variant.
He = 1 - sum(p_i^2) for each variant, where p_i are allele frequencies. For biallelic sites this simplifies to He = 2*p*(1-p).
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data.
population (str or list, optional) – Population name or sample indices.
missing_data (str) – ‘exclude’ - NaN at sites with any missing data ‘include’ - per-site n_valid for frequency calculation
- Returns:
Expected heterozygosity per variant.
- Return type:
ndarray, float64, shape (n_variants,)
- pg_gpu.diversity.heterozygosity_observed(haplotype_matrix, population=None, ploidy=2, missing_data='include')[source]
Compute observed heterozygosity per variant.
Assumes consecutive haplotypes belong to the same individual (standard for diploid VCF data). A site is heterozygous in an individual if the two haplotypes differ.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data.
population (str or list, optional) – Population name or sample indices.
ploidy (int) – Ploidy level. Default 2 (diploid).
missing_data (str) – ‘include’ - skip missing individuals per site (default) ‘exclude’ - NaN at sites with any missing data
- Returns:
Observed heterozygosity per variant.
- Return type:
ndarray, float64, shape (n_variants,)
- pg_gpu.diversity.inbreeding_coefficient(haplotype_matrix, population=None, ploidy=2, missing_data='include')[source]
Compute Wright’s inbreeding coefficient F per variant.
F = 1 - Ho/He, where Ho is observed heterozygosity and He is expected heterozygosity.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data.
population (str or list, optional) – Population name or sample indices.
ploidy (int) – Ploidy level for observed heterozygosity computation.
missing_data (str) – Passed to heterozygosity_expected and heterozygosity_observed.
- Returns:
Inbreeding coefficient per variant. NaN where He = 0.
- Return type:
ndarray, float64, shape (n_variants,)
Convenience Functions
- pg_gpu.diversity.theta_l(haplotype_matrix, population=None, span_normalize=True, missing_data='include')[source]
Compute theta_L diversity estimator.
theta_L = sum_i(i * xi_i) / (n - 1), where xi_i is the count of sites with derived allele count i. Weights variants linearly by derived allele frequency, bridging theta_pi and theta_H.
Reference: Zeng et al. (2006), Genetics 174: 1431-1439, Equation (8).
- pg_gpu.diversity.normalized_fay_wus_h(haplotype_matrix, population=None, missing_data='include')[source]
Compute normalized Fay and Wu’s H (H*).
H = theta_pi - theta_H, normalized by its standard deviation under the standard neutral model. The normalization allows comparison across samples with different numbers of segregating sites.
Reference: Zeng et al. (2006), “Statistical Tests for Detecting Positive Selection by Utilizing High-Frequency Variants”, Genetics 174: 1431-1439, Equation (11).
- Parameters:
haplotype_matrix (HaplotypeMatrix)
missing_data (str) –
'include'(default) uses per-site sample sizes with harmonic mean n for variance terms.'exclude'filters to sites with no missing data.
- Returns:
Normalized H*. Negative values indicate excess high-frequency derived alleles (directional selection signal).
- Return type:
- pg_gpu.diversity.zeng_e(haplotype_matrix, population=None, missing_data='include')[source]
Compute Zeng’s E test statistic.
E = theta_L - theta_W, normalized by its standard deviation.
Reference: Zeng et al. (2006), Genetics 174: 1431-1439, Equation (13).
- Parameters:
haplotype_matrix (HaplotypeMatrix)
missing_data (str) –
'include'(default) or'exclude'.
- Return type:
- pg_gpu.diversity.zeng_dh(haplotype_matrix, population=None, missing_data='include')[source]
Compute Zeng’s DH joint test statistic.
Combines Tajima’s D and Fay & Wu’s H into a single test with improved power to detect directional selection. Defined as the product D * H when both are negative, zero otherwise.
Reference: Zeng et al. (2006), “Statistical Tests for Detecting Positive Selection by Utilizing High-Frequency Variants”, Genetics 174: 1431-1439, Equation (15).
- Parameters:
haplotype_matrix (HaplotypeMatrix)
missing_data (str) – Passed through to tajimas_d and fay_wus_h.
- Returns:
DH statistic. Positive when both D and H are negative (consistent with a selective sweep).
- Return type:
- pg_gpu.diversity.diversity_stats(haplotype_matrix, population=None, statistics=['pi', 'theta_w', 'tajimas_d'], span_normalize=True, missing_data='include')[source]
Compute multiple diversity statistics at once.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
statistics (list) – List of statistics to compute
span_normalize (bool) –
True(default): auto-detect best denominator.False: return raw sums.missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data
- Returns:
Dictionary mapping statistic names to values
- Return type:
- pg_gpu.diversity.neutrality_tests(haplotype_matrix, population=None, missing_data='include')[source]
Compute common neutrality test statistics.
Returns Tajima’s D, Fay and Wu’s H, and related values.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data
- Returns:
Dictionary with neutrality test results
- Return type:
- pg_gpu.diversity.theta_h(haplotype_matrix, population=None, span_normalize=True, missing_data='include')[source]
Compute theta_H (homozygosity-based diversity estimator).
theta_H = sum_i [ i^2 * S_i ] * 2 / (n*(n-1)) where S_i is the count of variants with derived allele count i. Used to compute Fay and Wu’s H.
- pg_gpu.diversity.max_daf(haplotype_matrix, population=None, missing_data='include')[source]
Maximum derived allele frequency across all variants.
- Parameters:
haplotype_matrix (HaplotypeMatrix)
missing_data (str) – ‘include’ - per-site n_valid for frequency ‘exclude’ - only sites with no missing data
- Returns:
Maximum DAF in [0, 1].
- Return type:
- pg_gpu.diversity.haplotype_count(haplotype_matrix, population=None, missing_data='include')[source]
Count distinct haplotypes.
- Parameters:
haplotype_matrix (HaplotypeMatrix)
missing_data (str) – ‘include’ - exclude haplotypes with any missing ‘exclude’ - filter to sites with no missing data
- Return type:
- pg_gpu.diversity.daf_histogram(matrix, n_bins=20, population=None, missing_data='include')[source]
Normalized histogram of derived allele frequencies.
Accepts HaplotypeMatrix or GenotypeMatrix. For diploid data, DAF = sum(genotypes) / (2 * n_individuals).
- Parameters:
matrix (HaplotypeMatrix or GenotypeMatrix)
n_bins (int) – Number of frequency bins spanning [0, 1].
missing_data (str) – ‘include’ - per-site n_valid for frequency ‘exclude’ - only sites with no missing data
- Returns:
hist (ndarray, float64, shape (n_bins,)) – Normalized counts (sum to 1).
bin_edges (ndarray, float64, shape (n_bins + 1,))
- pg_gpu.diversity.diplotype_frequency_spectrum(genotype_matrix, population=None)[source]
Count distinct multi-locus genotype patterns (diplotypes).
- Parameters:
genotype_matrix (GenotypeMatrix)
- Returns:
freqs (ndarray, float64, sorted descending) – Diplotype frequencies.
n_diplotypes (int) – Number of distinct diplotypes.
- pg_gpu.diversity.mu_var(haplotype_matrix, window_length=None, population=None)[source]
mu_VAR: SNP density statistic (RAiSD).
Number of SNPs per base pair. Elevated near sweeps due to hitchhiking effects on local variant density.
- Parameters:
haplotype_matrix (HaplotypeMatrix)
window_length (float, optional) – Window length in bp. If None, uses chrom_end - chrom_start.
- Return type:
- pg_gpu.diversity.mu_sfs(haplotype_matrix, population=None, missing_data='include')[source]
mu_SFS: fraction of SNPs at SFS edges (RAiSD).
Counts singletons (DAC=1) and near-fixed variants (DAC=n-1), divided by total segregating sites. Elevated near selective sweeps.
- Parameters:
haplotype_matrix (HaplotypeMatrix)
missing_data (str) – ‘include’ - per-site n_valid for edge classification ‘exclude’ - only sites with no missing data
- Return type:
Divergence Statistics
Core Functions
- pg_gpu.divergence.fst(haplotype_matrix, pop1, pop2, method='hudson', missing_data='include')[source]
Compute FST between two populations.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
pop1 (str or list) – First population (name from sample_sets or list of indices)
pop2 (str or list) – Second population (name from sample_sets or list of indices)
method (str) – FST estimation method (‘hudson’, ‘weir_cockerham’, ‘nei’)
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data
- Returns:
FST value between populations
- Return type:
- pg_gpu.divergence.fst_hudson(haplotype_matrix, pop1, pop2, missing_data='include')[source]
Compute Hudson’s FST estimator between two populations.
Hudson (1992) estimator: FST = 1 - (Hw / Hb) where Hw is within-population diversity and Hb is between-population diversity
- Parameters:
- Returns:
Hudson’s FST estimate
- Return type:
- pg_gpu.divergence.fst_weir_cockerham(haplotype_matrix, pop1, pop2, missing_data='include')[source]
Compute Weir & Cockerham’s (1984) FST estimator.
This is the method of moments estimator that accounts for sampling variance. Computes all three variance components (a, b, c) including within-individual heterozygosity from paired haplotypes.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data. Consecutive haplotype pairs are treated as diploid individuals for the heterozygosity component.
missing_data (str) – ‘include’ - per-site sample sizes, ratio-of-sums ‘exclude’ - Only use sites with no missing data
- Returns:
Weir & Cockerham’s FST estimate
- Return type:
- pg_gpu.divergence.fst_nei(haplotype_matrix, pop1, pop2, missing_data='include')[source]
Compute Nei’s GST (1973).
GST = (HT - HS) / HT where HT is total heterozygosity and HS is within-subpopulation heterozygosity
- Parameters:
- Returns:
Nei’s GST estimate
- Return type:
- pg_gpu.divergence.dxy(haplotype_matrix, pop1, pop2, per_site=False, missing_data='include', span_normalize=True)[source]
Compute absolute divergence (Dxy) between two populations.
Dxy measures the average number of nucleotide differences between sequences from two populations.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
per_site (bool) – If True, return per-site values; if False, return mean
missing_data (str) –
'include'(default) uses per-site valid data.'exclude'filters to sites with no missing data.span_normalize (bool) –
True(default): auto-detect best denominator.False: return raw sum / sites-with-data average.
- Returns:
Mean Dxy or per-site Dxy values
- Return type:
float or cp.ndarray
- pg_gpu.divergence.da(haplotype_matrix, pop1, pop2, missing_data='include', span_normalize=True)[source]
Compute net divergence (Da) between two populations.
Da = Dxy - (pi1 + pi2) / 2 where pi1 and pi2 are within-population diversities
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data
span_normalize (bool) –
True(default): auto-detect best denominator.False: return raw sum / sites-with-data average.
- Returns:
Net divergence (Da)
- Return type:
Convenience Functions
- pg_gpu.divergence.divergence_stats(haplotype_matrix, pop1, pop2, statistics=['fst', 'dxy', 'da'], missing_data='include', span_normalize=True)[source]
Compute multiple divergence statistics between two populations.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
statistics (list) – List of statistics to compute
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data
span_normalize (bool) –
True(default): auto-detect.False: raw sum / per-site average.
- Returns:
Dictionary of statistic names to values
- Return type:
- pg_gpu.divergence.pairwise_fst(haplotype_matrix, populations=None, method='hudson', missing_data='include')[source]
Compute pairwise FST matrix for multiple populations.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – The haplotype data
populations (list, optional) – List of population names. If None, uses all populations in sample_sets
method (str) – FST method to use
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data
- Returns:
fst_matrix (cp.ndarray) – Pairwise FST matrix
pop_names (list) – Population names in matrix order
- Return type:
- pg_gpu.divergence.pbs(haplotype_matrix, pop1, pop2, pop3, window_size, window_start=0, window_stop=None, window_step=None, normed=True, missing_data='include')[source]
Compute the Population Branch Statistic (PBS).
PBS detects genomic regions unusually differentiated in pop1 relative to pop2 and pop3, using pairwise Hudson FST estimates.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data containing all three populations.
window_size (int) – Number of variants per window.
window_start (int, optional) – Starting variant index.
window_stop (int, optional) – Stopping variant index.
window_step (int, optional) – Stride between windows. Defaults to window_size (non-overlapping).
normed (bool, optional) – If True (default), return normalized PBS (PBSn1).
missing_data (str) – ‘include’ - per-site sample sizes ‘exclude’ - only use sites with no missing data
- Returns:
PBS values per window.
- Return type:
ndarray, float64, shape (n_windows,)
Distance-Based Two-Population Statistics
- pg_gpu.divergence.snn(haplotype_matrix, pop1, pop2, missing_data='include', distance_matrices=None)[source]
Hudson’s nearest-neighbor statistic (Snn).
For each haplotype, determines whether its nearest neighbor is from the same population. Snn is the fraction of haplotypes whose nearest neighbor is conspecific. Under panmixia Snn ~ 0.5; under population structure Snn -> 1.
- Parameters:
- Returns:
Snn statistic in [0, 1].
- Return type:
References
Hudson, R.R. (2000). A New Statistic for Detecting Genetic Differentiation. Genetics, 155(4), 2011-2014.
- pg_gpu.divergence.dxy_min(haplotype_matrix, pop1, pop2, missing_data='include', distance_matrices=None)[source]
Minimum pairwise distance between two populations.
The Hamming distance of the closest pair of haplotypes across the two populations. Used in Gmin (Geneva et al.) and dd (Schrider et al.) statistics.
- Parameters:
- Returns:
Minimum pairwise distance.
- Return type:
References
Geneva, A.J. et al. (2015). A new method to scan genomes for introgression in a secondary contact model. PLoS ONE, 10(4).
- pg_gpu.divergence.gmin(haplotype_matrix, pop1, pop2, missing_data='include', distance_matrices=None)[source]
Geneva’s Gmin: ratio of minimum to mean between-population distance.
Gmin = Dxy_min / Dxy_mean. Low values indicate unusually similar haplotypes across populations relative to average divergence, suggesting recent gene flow or shared ancestral variation.
- Parameters:
- Returns:
Gmin ratio.
- Return type:
References
Geneva, A.J. et al. (2015). A new method to scan genomes for introgression in a secondary contact model. PLoS ONE, 10(4).
- pg_gpu.divergence.dd(haplotype_matrix, pop1, pop2, missing_data='include', distance_matrices=None)[source]
Relative minimum divergence (dd1, dd2).
dd1 = Dxy_min / pi1, dd2 = Dxy_min / pi2. Low values indicate that the closest between-population pair is unusually similar relative to within-population diversity.
- Parameters:
- Returns:
dd1 (float) – Dxy_min / pi(pop1).
dd2 (float) – Dxy_min / pi(pop2).
- Return type:
References
Schrider, D.R., Ayroles, J., Matute, D.R. & Kern, A.D. (2018). Supervised machine learning reveals introgressed loci in the genomes of Drosophila simulans and D. sechellia. PLoS Genetics, 14(4), e1007341. https://doi.org/10.1371/journal.pgen.1007341
- pg_gpu.divergence.dd_rank(haplotype_matrix, pop1, pop2, missing_data='include', distance_matrices=None)[source]
Rank of Dxy_min in within-population pairwise distance distributions.
For each population, computes the fraction of within-population pairwise distances that are <= Dxy_min. Low values indicate the closest between-population pair is more similar than most within-population pairs, suggesting introgression.
- Parameters:
- Returns:
rank1 (float) – Fraction of pop1 within-pop distances <= Dxy_min.
rank2 (float) – Fraction of pop2 within-pop distances <= Dxy_min.
- Return type:
References
Schrider, D.R., Ayroles, J., Matute, D.R. & Kern, A.D. (2018). Supervised machine learning reveals introgressed loci in the genomes of Drosophila simulans and D. sechellia. PLoS Genetics, 14(4), e1007341. https://doi.org/10.1371/journal.pgen.1007341
- pg_gpu.divergence.zx(haplotype_matrix, pop1, pop2, missing_data='include')[source]
ZnS ratio: within-population LD relative to total LD.
Zx = (ZnS_pop1 + ZnS_pop2) / (2 * ZnS_total). Values > 1 indicate stronger LD within populations than across the combined sample, consistent with population structure or recent admixture.
- Parameters:
- Returns:
Zx ratio.
- Return type:
References
Schrider, D.R., Ayroles, J., Matute, D.R. & Kern, A.D. (2018). Supervised machine learning reveals introgressed loci in the genomes of Drosophila simulans and D. sechellia. PLoS Genetics, 14(4), e1007341. https://doi.org/10.1371/journal.pgen.1007341
- pg_gpu.divergence.pairwise_distance_matrix(haplotype_matrix, pop1, pop2, missing_data='include')[source]
Compute pairwise distance matrices between and within two populations.
Returns three cupy distance matrices: between-population, within-pop1, and within-pop2. Pre-compute once and pass to individual stats via the
distance_matricesparameter to avoid redundant GPU work.
Selection Scan Statistics
Haplotype-Based Selection Scans
- pg_gpu.selection.ihs(haplotype_matrix, pos=None, map_pos=None, min_ehh=0.05, min_maf=0.05, include_edges=False, gap_scale=20000, max_gap=200000, is_accessible=None, population=None, missing_data='include')[source]
Compute the unstandardized integrated haplotype score (iHS).
Compares the integrated EHH for the reference vs alternate allele at each variant.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data.
pos (array_like, optional) – Variant positions. If None, uses haplotype_matrix.positions.
map_pos (array_like, optional) – Genetic map positions. If None, uses physical positions.
min_ehh (float, optional) – Minimum EHH below which to truncate IHH computation.
min_maf (float, optional) – Minimum minor allele frequency to compute score.
include_edges (bool, optional) – If True, report scores even when EHH does not decay below min_ehh.
gap_scale (int, optional) – Rescale gaps larger than this value.
max_gap (int, optional) – Mark gaps larger than this as invalid.
is_accessible (array_like, optional) – Genome accessibility mask.
population (str or list, optional) – Population name or sample indices.
missing_data (str) – ‘include’ - missing extends shared suffix length (default) ‘exclude’ - filter to sites with no missing data before scan
- Returns:
score – Unstandardized iHS scores: log(IHH1 / IHH0).
- Return type:
- pg_gpu.selection.xpehh(haplotype_matrix, pop1, pop2, pos=None, map_pos=None, min_ehh=0.05, include_edges=False, gap_scale=20000, max_gap=200000, is_accessible=None, missing_data='include')[source]
Compute the unstandardized cross-population EHH (XP-EHH) statistic.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data containing both populations.
pop1 (str or list) – First population name or sample indices.
pop2 (str or list) – Second population name or sample indices.
pos (array_like, optional) – Variant positions. If None, uses haplotype_matrix.positions.
map_pos (array_like, optional) – Genetic map positions.
min_ehh (float, optional) – Minimum EHH threshold.
include_edges (bool, optional) – Report scores even when EHH does not decay below min_ehh.
gap_scale (int, optional) – Rescale gaps larger than this value.
max_gap (int, optional) – Mark gaps larger than this as invalid.
is_accessible (array_like, optional) – Genome accessibility mask.
missing_data (str) – ‘include’ - missing extends shared suffix length (default) ‘exclude’ - filter to sites with no missing data before scan
- Returns:
score – Unstandardized XP-EHH scores: log(IHH_pop1 / IHH_pop2).
- Return type:
- pg_gpu.selection.nsl(haplotype_matrix, population=None, missing_data='include')[source]
Compute the unstandardized nSL statistic for each variant.
Compares the mean shared haplotype length around the reference (0) vs alternate (1) allele at each site.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data.
population (str or list, optional) – Population name or list of sample indices.
missing_data (str) – ‘include’ - missing extends shared suffix length (default) ‘exclude’ - filter to sites with no missing data before scan
- Returns:
score – Unstandardized nSL scores: log(nSL1 / nSL0).
- Return type:
- pg_gpu.selection.xpnsl(haplotype_matrix, pop1, pop2, missing_data='include')[source]
Compute the unstandardized cross-population nSL statistic.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data containing both populations.
pop1 (str or list) – First population name or sample indices.
pop2 (str or list) – Second population name or sample indices.
missing_data (str) – ‘include’ - missing extends shared suffix length (default) ‘exclude’ - filter to sites with no missing data before scan
- Returns:
score – Unstandardized XP-nSL scores: log(nSL_pop1 / nSL_pop2).
- Return type:
EHH and Haplotype Homozygosity
- pg_gpu.selection.ehh_decay(haplotype_matrix, truncate=False, population=None, missing_data='include')[source]
Compute EHH decay from the first variant.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data.
truncate (bool, optional) – If True, exclude trailing zeros from the result.
population (str or list, optional) – Population name or sample indices.
missing_data (str) – ‘include’ - missing sites skipped in prefix comparison (default) ‘exclude’ - filter to sites with no missing data
- Returns:
ehh – EHH values at each variant position from the first.
- Return type:
- pg_gpu.selection.garud_h(matrix, population=None, missing_data='include')[source]
Compute Garud’s H1, H12, H123, and H2/H1 statistics.
Accepts either HaplotypeMatrix (uses haplotype frequencies) or GenotypeMatrix (uses diplotype frequencies) and dispatches automatically.
- Parameters:
matrix (HaplotypeMatrix or GenotypeMatrix) – Haplotype or diploid genotype data.
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples.
missing_data (str) – ‘include’ - treat missing as wildcard (compatible with any allele). ‘exclude’ - filter to sites with no missing data.
- Returns:
h1 (float) – Sum of squared haplotype/diplotype frequencies.
h12 (float) – H12 statistic (top two combined).
h123 (float) – H123 statistic (top three combined).
h2_h1 (float) – H2/H1 ratio indicating sweep softness.
- pg_gpu.selection.moving_garud_h(haplotype_matrix, size, start=0, stop=None, step=None, population=None, missing_data='include')[source]
Compute Garud’s H statistics in moving windows of variants.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data.
size (int) – Window size (number of variants).
start (int, optional) – Starting variant index.
stop (int, optional) – Stopping variant index.
step (int, optional) – Step between windows. Defaults to size (non-overlapping).
population (str or list, optional) – Population name or list of sample indices.
missing_data (str) – ‘include’ - treat missing as wildcard in pattern matching ‘exclude’ - filter to sites with no missing data
- Returns:
h1 (ndarray, float)
h12 (ndarray, float)
h123 (ndarray, float)
h2_h1 (ndarray, float)
Standardization
- pg_gpu.selection.standardize_by_allele_count(score, aac, bins=None, n_bins=None)[source]
Standardize scores within allele frequency bins.
- Parameters:
score (array_like, float, shape (n_variants,)) – Raw scores (e.g., unstandardized IHS or nSL).
aac (array_like, int, shape (n_variants,)) – Alternate allele counts for each variant.
bins (array_like, int, optional) – Allele count bin edges. Overrides n_bins.
n_bins (int, optional) – Number of bins to create. Default: max(aac) // 2.
- Returns:
score_standardized (ndarray, float, shape (n_variants,)) – Standardized scores.
bins (ndarray) – Bin edges used.
Site Frequency Spectrum
Single Population
- pg_gpu.sfs.sfs(haplotype_matrix, population=None, missing_data='include')[source]
Compute the unfolded site frequency spectrum.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data.
population (str or list, optional) – Population name or sample indices.
missing_data (str) – ‘include’ - per-site n_valid; bins by actual DAC ‘exclude’ - only sites with no missing data
- Returns:
Element k = number of variants with k derived alleles.
- Return type:
ndarray, int64, shape (n_chromosomes + 1,)
- pg_gpu.sfs.sfs_folded(haplotype_matrix, population=None, missing_data='include')[source]
Compute the folded site frequency spectrum (minor allele counts).
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data.
population (str or list, optional) – Population name or sample indices.
missing_data (str)
- Returns:
Element k = number of variants with minor allele count k.
- Return type:
ndarray, int64, shape (n_chromosomes // 2 + 1,)
- pg_gpu.sfs.sfs_scaled(haplotype_matrix, population=None, missing_data='include')[source]
Compute the scaled unfolded site frequency spectrum.
Scaling: element k is multiplied by k, yielding a constant expectation under neutrality and constant population size.
- Parameters:
haplotype_matrix (HaplotypeMatrix)
missing_data (str)
- Return type:
ndarray, float64, shape (n_chromosomes + 1,)
- pg_gpu.sfs.sfs_folded_scaled(haplotype_matrix, population=None, missing_data='include')[source]
Compute the scaled folded site frequency spectrum.
Scaling: element k is multiplied by k * (n - k) / n.
- Parameters:
haplotype_matrix (HaplotypeMatrix)
missing_data (str)
- Return type:
ndarray, float64, shape (n_chromosomes // 2 + 1,)
Joint (Two-Population)
- pg_gpu.sfs.joint_sfs(haplotype_matrix, pop1, pop2, missing_data='include')[source]
Compute the joint site frequency spectrum between two populations.
- Parameters:
- Returns:
Element [i, j] = number of variants with i derived alleles in pop1 and j derived alleles in pop2.
- Return type:
ndarray, int64, shape (n1 + 1, n2 + 1)
- pg_gpu.sfs.joint_sfs_folded(haplotype_matrix, pop1, pop2, missing_data='include')[source]
Compute the folded joint site frequency spectrum.
- pg_gpu.sfs.joint_sfs_scaled(haplotype_matrix, pop1, pop2, missing_data='include')[source]
Compute the scaled joint site frequency spectrum.
Scaling: element [i, j] is multiplied by i * j.
- pg_gpu.sfs.joint_sfs_folded_scaled(haplotype_matrix, pop1, pop2, missing_data='include')[source]
Compute the scaled folded joint site frequency spectrum.
Scaling: element [i, j] is multiplied by i * j * (n1 - i) * (n2 - j).
- pg_gpu.sfs.project_joint_sfs(haplotype_matrix, pop1, pop2, target_n1, target_n2, missing_data='include')[source]
Joint SFS projected to
(target_n1+1, target_n2+1)via hypergeometric sampling.Mathematically identical to
P1 @ joint_sfs(...) @ P2.Twith hypergeometric projection matricesP1, P2, but applied per-variant so the(n1+1, n2+1)full histogram is never materialized. That intermediate would be 80 GB at 100k haps per population; the projected output stays small regardless of source size. Use this whenever the source size is too large forjoint_sfsto allocate its bincount.- Parameters:
haplotype_matrix (HaplotypeMatrix or StreamingHaplotypeMatrix)
pop1 (str or list) – Population names or explicit sample-index lists.
pop2 (str or list) – Population names or explicit sample-index lists.
target_n1 (int) – Projection targets; each must be <= the corresponding source population size.
target_n2 (int) – Projection targets; each must be <= the corresponding source population size.
missing_data (str)
- Return type:
ndarray, float64, shape
(target_n1 + 1, target_n2 + 1)
Scaling and Folding Utilities
- pg_gpu.sfs.scale_sfs_folded(s, n)[source]
Scale a folded SFS: element k multiplied by k * (n - k) / n.
- pg_gpu.sfs.scale_joint_sfs_folded(s, n1, n2)[source]
Scale a folded joint SFS: element [i,j] * i * j * (n1-i) * (n2-j).
Resampling (Block Jackknife and Bootstrap)
General-purpose block-resampling estimators for calibrated standard error / CI on any
scalar genome-wide statistic. Both accept pre-binned per-block values
(single 1D array or tuple of arrays for ratio-of-sums statistics) and a
callable statistic. See examples/sweep_tajimas_d_bootstrap.py for
an end-to-end CI on Tajima’s D under a completed sweep.
- pg_gpu.resampling.block_jackknife(values, statistic, *, weights=None)[source]
Block-jackknife standard error for a statistic over pre-binned blocks.
Implements the delete-1 block jackknife: for each block
j, the statistic is recomputed on the remaining blocks; the spread of those leave-one-out estimates gives a calibrated SE. Whenweightsis provided, the Busing et al. (1999) weighted delete-\(m_j\) formulation is used so unequal block sizes do not skew the SE.- Parameters:
values (ndarray or tuple of ndarrays) – Per-block values. Use a single 1D array for statistics of the form
statistic(vb); use a tuple(num, den, ...)for ratio-of-sums statistics where multiple per-block arrays are consumed bystatistic. All arrays in the tuple must have the same length.statistic (callable) –
statistic(*values) -> float. Called once on the full data and once per block with that block masked out.weights (ndarray, optional) – Length-
narray of block sizes (e.g., SNPs per block). WhenNone(default), blocks are treated as equally weighted and the formula reduces to the unweighted block jackknife. See Notes.
- Returns:
estimate (float) – Point estimate. Unweighted case: mean of leave-one-out values (matches the current admixture jackknife). Weighted case: mean of the Busing pseudo-values.
se (float) – Jackknife standard error.
per_iter (ndarray, shape (n,)) – Per-block leave-one-out statistic values.
Notes
For uniform block sizes the weighted formulation reduces exactly to
sqrt(((n-1)/n) * sum((vj - mean(vj))**2)).Weighted formulation (Busing 1999 eq. 4-5):
\[ \begin{align}\begin{aligned}h_j = N / m_j, \qquad \phi_j = h_j \hat\theta - (h_j - 1) \hat\theta_{-j}\\\tilde\theta = \frac{1}{g} \sum_j \phi_j, \qquad \mathrm{Var}(\tilde\theta) = \frac{1}{g} \sum_j \frac{(\phi_j - \tilde\theta)^2}{h_j - 1}\end{aligned}\end{align} \]where
gis the number of blocks,m_jis the size of blockj, andN = sum(m_j).References
Busing, F. M. T. A., Meijer, E., & Van der Leeden, R. (1999). Delete-m jackknife for unequal m. Statistics and Computing, 9, 3-8.
- pg_gpu.resampling.block_bootstrap(values, statistic, *, n_replicates=1000, rng=None)[source]
Block-bootstrap distribution for a statistic over pre-binned blocks.
Resamples block indices with replacement
n_replicatestimes and evaluatesstatisticon each resample. Whenvaluesis a tuple, the same sampled indices are applied to every array — required for ratio-of-sums statistics where numerator and denominator share block indexing (e.g. normed Patterson F3, Patterson D). To bootstrap two independent arrays, callblock_bootstrapon each separately.- Parameters:
values (ndarray or tuple of ndarrays) – Per-block values. Tuple entries must have the same length.
statistic (callable) –
statistic(*values) -> float.n_replicates (int, default 1000) – Number of bootstrap replicates.
rng (int, numpy.random.Generator, or None) – Seed or Generator for reproducibility.
Noneuses fresh entropy.
- Returns:
estimate (float) – Plug-in point estimate
statistic(*values)on the full data. This follows Efron & Tibshirani (1993) §6.2: the bootstrap mean is a noisy estimate of the plug-in estimate, not a replacement for it.se (float) – Bootstrap standard error,
std(replicates, ddof=1).replicates (ndarray, shape (n_replicates,)) – Bootstrap replicate values. Use e.g.
np.quantile(replicates, [0.025, 0.975])for a percentile 95% CI.
References
Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.
Admixture and F-Statistics
Per-Variant Statistics
- pg_gpu.admixture.patterson_f2(haplotype_matrix, pop_a, pop_b, missing_data='include')[source]
Unbiased estimator for F2(A, B), the branch length between populations.
- Parameters:
- Returns:
f2 – Per-variant F2 estimates.
- Return type:
ndarray, float64, shape (n_variants,)
- pg_gpu.admixture.patterson_f3(haplotype_matrix, pop_c, pop_a, pop_b, missing_data='include')[source]
Unbiased estimator for F3(C; A, B), the three-population admixture test.
A significantly negative F3 indicates that population C is admixed between populations A and B.
- Parameters:
- Returns:
T (ndarray, float64, shape (n_variants,)) – Un-normalized F3 estimates per variant.
B (ndarray, float64, shape (n_variants,)) – Heterozygosity estimates (2 * h_hat) for population C.
- pg_gpu.admixture.patterson_d(haplotype_matrix, pop_a, pop_b, pop_c, pop_d, missing_data='include')[source]
Unbiased estimator for D(A, B; C, D), the ABBA-BABA test.
Tests for admixture between (A or B) and (C or D).
- Parameters:
haplotype_matrix (HaplotypeMatrix)
missing_data (str) – ‘include’ - per-site n_valid ‘exclude’ - filter to sites with no missing data
- Returns:
num (ndarray, float64, shape (n_variants,)) – Numerator (un-normalized F4 estimates).
den (ndarray, float64, shape (n_variants,)) – Denominator.
Windowed Statistics
- pg_gpu.admixture.moving_patterson_f3(haplotype_matrix, pop_c, pop_a, pop_b, size, start=0, stop=None, step=None, normed=True, missing_data='include')[source]
Estimate F3(C; A, B) in moving windows.
- Parameters:
- Returns:
f3
- Return type:
ndarray, float64, shape (n_windows,)
Block-Jackknife Averaged
- pg_gpu.admixture.average_patterson_f3(haplotype_matrix, pop_c, pop_a, pop_b, blen, normed=True, missing_data='include')[source]
Estimate F3(C; A, B) with standard error via block-jackknife.
- Parameters:
- Returns:
f3 (float) – Overall estimate.
se (float) – Standard error.
z (float) – Z-score.
vb (ndarray) – Per-block values.
vj (ndarray) – Jackknife resampled values.
- pg_gpu.admixture.average_patterson_d(haplotype_matrix, pop_a, pop_b, pop_c, pop_d, blen, missing_data='include')[source]
Estimate D(A, B; C, D) with standard error via block-jackknife.
- Parameters:
- Returns:
d (float) – Overall estimate.
se (float) – Standard error.
z (float) – Z-score.
vb (ndarray) – Per-block values.
vj (ndarray) – Jackknife resampled values.
Dimensionality Reduction and Distance
PCA
- pg_gpu.decomposition.pca(haplotype_matrix, n_components=10, scaler='patterson', population=None, missing_data='include')[source]
Principal Component Analysis on haplotype data.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data. Rows are haplotypes, columns are variants.
n_components (int) – Number of principal components to compute.
scaler (str or None) – Scaling method before PCA: ‘patterson’ - scale by sqrt(p*(1-p)), Patterson et al. (2006) ‘standard’ - standardize to unit variance per variant None - center only (subtract mean)
population (str or list, optional) – Population subset to use.
missing_data (str) – ‘include’ - impute missing to per-site mean ‘exclude’ - filter sites with any missing
- Returns:
coords (ndarray, float64, shape (n_samples, n_components)) – Sample coordinates in PC space.
explained_variance_ratio (ndarray, float64, shape (n_components,)) – Proportion of variance explained by each component.
- pg_gpu.decomposition.randomized_pca(haplotype_matrix, n_components=10, scaler='patterson', population=None, n_iter=3, random_state=None, missing_data='include')[source]
Randomized PCA using truncated SVD approximation.
Faster than full PCA for large datasets where only a few components are needed.
- Parameters:
- Returns:
coords (ndarray, float64, shape (n_samples, n_components))
explained_variance_ratio (ndarray, float64, shape (n_components,))
Distance
- pg_gpu.decomposition.pairwise_distance(haplotype_matrix, metric='euclidean', population=None, missing_data='include')[source]
Compute pairwise distances between samples.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data.
metric (str) – Distance metric. Supported on GPU: ‘euclidean’, ‘cityblock’, ‘sqeuclidean’. Falls back to scipy for other metrics.
missing_data (str) – ‘include’ - mask missing, normalize per pair ‘exclude’ - filter sites with any missing
- Returns:
dist – Condensed distance matrix.
- Return type:
ndarray, float64, shape (n_samples * (n_samples - 1) // 2,)
- pg_gpu.decomposition.pcoa(dist, n_components=None)[source]
Principal Coordinate Analysis (classical MDS).
- Parameters:
dist (array_like) – Pairwise distances in condensed form (from pairwise_distance) or as a square distance matrix.
n_components (int, optional) – Number of dimensions to return. If None, returns all dimensions with positive eigenvalues.
- Returns:
coords (ndarray, float64, shape (n_samples, n_components)) – Sample coordinates.
explained_variance_ratio (ndarray, float64, shape (n_components,)) – Proportion of variance explained by each axis.
Local PCA (lostruct)
GPU port of the lostruct method (Li & Ralph 2019) for detecting genomic
regions where population structure differs from the chromosome-wide
pattern (inversions, introgression, low-recombination regions).
.. Per-window eigendecomposition is batched with a single cp.linalg.eigh over a
.. stacked (n_windows, n_samples, n_samples) tensor.
For most users, pg_gpu.decomposition.lostruct() is the
one-and-done entry point: it bundles per-window local PCA, the
between-window Frobenius distance, MDS on that distance, and corner /
outlier detection into a single call and returns a
LostructResult. The lower-level
functions below are exposed for callers who want intermediate access
or want to recompute one stage with different parameters.
- pg_gpu.decomposition.lostruct(haplotype_matrix, *, window_params=None, k=2, scaler=None, missing_data='include', population=None, batch_size=None, window_size=None, step_size=None, window_type='snp', regions=None, jackknife=False, n_blocks=10, jackknife_aggregate='mean', npc=None, normalize='L1', pc_dist_weights=1, n_components=2, corner_prop=0.05, n_corners=3, random_state=None, engine='dense-eigh', tile_size=None, oversample=8, n_subspace_iter=1)[source]
End-to-end lostruct pipeline (Li & Ralph 2019).
Runs the four-step recipe in one call: per-window local PCA, pairwise Frobenius distance between window covariance reps, classical MDS on that distance, and corner / outlier detection in the MDS embedding.
Optionally also computes the block-within-window jackknife standard error of the per-window eigenvectors (Li & Ralph 2019, Appendix); when requested the SE is computed in the same window pass as the base eigendecomposition (shared matrix prep), and is exposed via
LostructResult.jackknife_se.- Parameters:
haplotype_matrix (HaplotypeMatrix)
window_params (WindowParams, optional) – Pre-built window spec. Required when
window_sizeis not given.k (int) – Number of PCs retained per window in the local PCA step.
scaler (see
local_pca().)missing_data (see
local_pca().)population (see
local_pca().)batch_size (see
local_pca().)window_size (int | None) – Short-hand for constructing a
WindowParams.step_size (int | None) – Short-hand for constructing a
WindowParams.window_type (str) – Short-hand for constructing a
WindowParams.regions – Short-hand for constructing a
WindowParams.jackknife (bool) – If True, also compute the delete-1 block jackknife standard error of the per-window top-k eigenvectors and expose it via
LostructResult.jackknife_se. Per-window matrix preparation is shared with the base eigendecomposition, so enabling this is much cheaper than callinglocal_pca_jackknifeseparately.n_blocks (int) – Number of jackknife blocks per window when
jackknife=True. Ignored otherwise.jackknife_aggregate ({'mean', None}) –
'mean'returns shape(n_windows, k)– per-PC standard error averaged across samples (matches the R reference).Nonereturns the full(n_windows, k, n_samples)tensor. Ignored whenjackknife=False.npc (int, optional) – Number of PCs used by
pc_dist. Defaults tok.normalize ({'L1', 'L2', None}) – Per-window eigenvalue normalization before distance computation.
pc_dist_weights (array_like or scalar) – Sample weights forwarded to
pc_distasw.n_components (int) – Number of MDS dimensions to compute (default 2).
corner_prop (float) – Fraction of points per corner cluster (forwarded to
corners).n_corners (int) – Number of corner clusters to detect. Set to 0 (or any non-positive value) to skip corner detection and leave
LostructResult.corner_indicesas None.random_state (int, optional) – RNG seed for the MEC tie-breaks inside
corners.engine (str)
tile_size (int | None)
oversample (int)
n_subspace_iter (int)
- Returns:
Bundle of the four pipeline outputs. Each piece is also reachable via the underlying functions for users who want to recompute one stage with different parameters.
- Return type:
See also
References
Li & Ralph (2019), Genetics. https://doi.org/10.1534/genetics.118.301747
- pg_gpu.decomposition.local_pca(haplotype_matrix, window_params=None, k=2, scaler=None, missing_data='include', population=None, batch_size=None, window_size=None, step_size=None, window_type='snp', regions=None, engine='dense-eigh', tile_size=None, oversample=8, n_subspace_iter=1, random_state=None)[source]
Local PCA along the genome (lostruct; Li & Ralph 2019).
For each genomic window, computes the top-k eigenvalues/eigenvectors of the sample-sample covariance matrix. The covariance matches R’s cov(sweep(x, 1, rowMeans(x), “-“)) so outputs are directly comparable to lostruct::eigen_windows().
- Parameters:
haplotype_matrix (HaplotypeMatrix)
window_params (WindowParams, optional) – Pre-built window spec. Required when
window_sizeis not given.k (int) – Number of PCs to retain per window.
scaler (str or None) – None (lostruct default), ‘patterson’, or ‘standard’. See pca().
missing_data (str) – ‘include’ (impute to per-site mean) or ‘exclude’ (drop missing sites). lostruct uses pairwise-complete; we approximate with mean imputation.
batch_size (int, optional) – Windows per batched eigh call. Auto-sized if None.
window_size (int | None) – Short-hand for constructing a WindowParams without importing it.
step_size (int | None) – Short-hand for constructing a WindowParams without importing it.
window_type (str) – Short-hand for constructing a WindowParams without importing it.
regions – Short-hand for constructing a WindowParams without importing it.
engine (str) –
'dense-eigh'(default) holds the full(n_windows, n_hap, n_hap)Gram stack on the GPU and runs a single batchedcp.linalg.eigh; ~2x faster per window than streaming when the stack fits.'streaming-dense'processes each window independently with bounded peak memory; required when the Gram stack would exceed GPU memory (e.g. thousands of haplotypes x thousands of windows).'streaming-rsvd'is the streaming path with a randomized top-k per window; chosen explicitly for very thin windows.'auto'estimates the Gram-stack peak and picksdense-eighwhen it fits in 70% of free GPU memory,streaming-denseotherwise.tile_size (int | None)
oversample (int)
n_subspace_iter (int)
random_state (int | None)
- Return type:
- pg_gpu.decomposition.local_pca_jackknife(haplotype_matrix, window_params=None, k=2, n_blocks=10, scaler=None, missing_data='include', population=None, aggregate='mean', batch_size=None, window_size=None, step_size=None, window_type='snp', regions=None)[source]
Delete-1 block jackknife SE of local PCs (DPGP_jackknife_var.R port).
For each window, partitions variants into n_blocks contiguous blocks; for each block, recomputes the sample-sample covariance with that block removed, eigendecomposes it, takes the top-k eigenvectors, sign-aligns across replicates, and computes the jackknife SE per sample.
- Parameters:
n_blocks (int) – Number of jackknife blocks per window.
aggregate ({'mean', None}) – ‘mean’ returns shape (n_windows, k) — per-PC SE averaged across samples (matches R’s mean(b)). None returns the full (n_windows, k, n_samples) tensor.
haplotype_matrix (HaplotypeMatrix)
k (int)
scaler (str | None)
missing_data (str)
batch_size (int | None)
window_size (int | None)
step_size (int | None)
window_type (str)
- Returns:
NaN rows for windows with fewer than max(k, n_blocks+1) variants.
- Return type:
- pg_gpu.decomposition.pc_dist(source, npc=None, normalize='L1', w=1)[source]
Pairwise Frobenius distance between windows’ low-rank covariance reps.
Implements the trace identity from
lostruct::pc_dist:||A - B||^2 = sum(a^2) + sum(b^2) - 2 * tr(diag(a) X diag(b) X^T)
where
X = U^T Vfor eigenvector blocks U, V.- Parameters:
source (LocalPCAResult or numpy.ndarray) – Either a LocalPCAResult or a flat lostruct-style matrix of shape (n_windows, 1 + k + k*n_samples).
npc (int, optional) – Number of PCs to use. Defaults to
source.kwhensourceis a LocalPCAResult; required whensourceis a flat matrix.normalize ({'L1', 'L2', None}) – Per-window eigenvalue normalization before distance computation.
w (array_like or scalar) – Sample weights (default 1).
- Returns:
Symmetric, non-negative distance matrix. NaN for invalid windows.
- Return type:
numpy.ndarray, shape (n_windows, n_windows)
- pg_gpu.decomposition.corners(xy, prop, k=3, random_state=None)[source]
Find k “corner” clusters in a 2D embedding (lostruct::corners).
Computes the minimum enclosing circle of xy, then for each of the k defining points returns the int(prop * n) nearest xy points. If fewer than k defining points exist on the MEC boundary, extra corners are added greedily from points farthest from the MEC center and not already in an existing corner’s neighborhood (mirrors corners.R:42-53).
- Parameters:
xy (numpy.ndarray, shape (n, 2)) – Coordinates (e.g. MDS output). NaN rows are dropped before the MEC.
prop (float) – Fraction of points per corner.
k (int) – Number of corners (effective minimum 3, matching R).
random_state (int, optional)
- Returns:
Column i = indices (into the original, NaN-inclusive xy) of points closest to the i-th corner.
- Return type:
numpy.ndarray, int64, shape (n_per_corner, k)
- class pg_gpu.decomposition.LocalPCAResult(windows, eigvals, eigvecs, sumsq, k, scaler, missing_data, jackknife_se=None)[source]
Bases:
objectPer-window local PCA output.
- Parameters:
- windows
One row per window (chrom, start, end, center, n_variants, window_id).
- Type:
pandas.DataFrame
- eigvals
Top-k eigenvalues of each window’s sample-sample covariance matrix, descending. NaN for windows with fewer than k variants.
- Type:
numpy.ndarray, float64, shape (n_windows, k)
- eigvecs
Top-k eigenvectors. NaN for invalid windows.
- Type:
numpy.ndarray, float64, shape (n_windows, k, n_samples)
- sumsq
Sum of squared entries of each window’s covariance matrix; used by pc_dist for the proportion-of-variance denominator.
- Type:
numpy.ndarray, float64, shape (n_windows,)
- jackknife_se
Delete-1 block jackknife SE of the top-k eigenvectors. Shape
(n_windows, k)withaggregate='mean', or(n_windows, k, n_samples)withaggregate=None.Nonewhen jackknife was not computed.- Type:
numpy.ndarray or None
- to_lostruct_matrix()[source]
Flat (n_windows, 1 + k + k*n_samples) matrix matching lostruct::eigen_windows() output layout.
Eigenvector block is column-major per R: PC_1 samples 1..N, then PC_2 samples 1..N, etc. That coincides with the C-order flatten of our (k, n_samples) per-window array.
- Return type:
- class pg_gpu.decomposition.LostructResult(local_pca, distance, mds, explained_variance_ratio, corner_indices)[source]
Bases:
objectEnd-to-end lostruct pipeline output.
Bundles the four pipeline stages into one object so a caller can run
lostruct(hm, ...)and pick off whichever intermediates they need.- Parameters:
local_pca (LocalPCAResult)
distance (ndarray)
mds (ndarray)
explained_variance_ratio (ndarray)
corner_indices (ndarray | None)
- local_pca
Per-window local PCA output (eigvals, eigvecs, sumsq, windows).
- Type:
- distance
Pairwise Frobenius distance between windows from
pc_dist.- Type:
numpy.ndarray, shape (n_windows, n_windows)
- mds
MDS coordinates from
pcoa.- Type:
numpy.ndarray, shape (n_windows, n_components)
- explained_variance_ratio
Variance fraction explained by each MDS axis.
- Type:
numpy.ndarray, shape (n_components,)
- corner_indices
Window indices of the
n_cornersoutlier clusters fromcorners.Nonewhen corner detection was skipped.- Type:
numpy.ndarray or None, shape (n_per_corner, n_corners)
- property windows: DataFrame
Convenience pointer at
local_pca.windows.
Windowed Statistics (GPU-Native)
The windowed_analysis() convenience function automatically routes
through fused CUDA kernels for maximum performance. A single kernel
launch processes all windows in parallel.
- pg_gpu.windowed_analysis.windowed_analysis(haplotype_matrix, window_size=50000, step_size=None, statistics=['pi'], populations=None, missing_data='include', span_normalize=True, accessible_bed=None, chrom=None, **kwargs)[source]
Convenience function for windowed analysis.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Input haplotype data
window_size (int) – Window size in base pairs
step_size (int, optional) – Step size. If None, uses window_size
statistics (list) – Statistics to compute
populations (list, optional) – Population names
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data
span_normalize (bool) –
True(default): auto-detect best denominator per window.False: return raw sums.accessible_bed (str, optional) – Path to a BED file defining accessible/callable regions. If provided and the matrix has no mask, loads the mask.
**kwargs – Additional arguments passed to WindowedAnalyzer
chrom (str)
- Returns:
Windowed statistics results
- Return type:
pd.DataFrame
- pg_gpu.windowed_analysis.windowed_statistics_fused(haplotype_matrix, bp_bins, statistics=('pi', 'theta_w', 'tajimas_d', 'segregating_sites', 'singletons'), population=None, pop1=None, pop2=None, per_base=True, is_accessible=None, _win_starts=None, _win_stops=None, missing_data='include', chrom=None)[source]
GPU-native windowed statistics using fused CUDA kernels.
One kernel launch processes ALL windows in parallel. Each thread block handles one window, with threads cooperatively reducing over variants. Reads the haplotype matrix once and computes all statistics simultaneously.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data.
bp_bins (array_like) – Window edges in base pairs. N+1 edges define N windows.
statistics (tuple of str) – Statistics to compute. Single-pop: ‘pi’, ‘theta_w’, ‘tajimas_d’, ‘segregating_sites’, ‘singletons’. Two-pop: ‘fst’, ‘dxy’.
population (str or list, optional) – Population for single-pop statistics.
pop1 (str or list, optional) – Populations for two-pop statistics.
pop2 (str or list, optional) – Populations for two-pop statistics.
per_base (bool) – Normalize by window size in base pairs.
is_accessible (array_like, optional) – Accessibility mask for per-base normalization.
chrom (str or int, optional) – Chromosome label emitted in the
chromoutput column.
- Returns:
Maps column names to numpy arrays of shape (n_windows,). The first six columns are always
chrom, start, end, center, n_variants, window_id, followed by one column per requested statistic.- Return type:
Supported Fused Windowed Statistics
Single-population (one kernel launch for all):
pi– nucleotide diversitytheta_w– Watterson’s thetatajimas_d– Tajima’s Dsegregating_sites– count of segregating sitessingletons– count of singletons
Two-population (one kernel launch for all):
fst– Hudson’s FST (ratio of averages)fst_hudson– alias forfstfst_wc– Weir-Cockerham FST (haploid)dxy– absolute divergenceda– net divergence (Dxy - mean within-pop pi)
Selection scan statistics:
garud_h1,garud_h12,garud_h123,garud_h2h1– Garud’s H statistics per window (prefix-sum hashing + shared-memory sort)mean_nsl– mean nSL per window (per-site nSL + scatter binning)
Structure / dimensionality reduction:
local_pca– Li & Ralph (2019) local PCA. Vector-valued per window; dispatches topg_gpu.decomposition.local_pca()and returns aLocalPCAResultinstead of the scalar-stat DataFrame. Can be combined with scalar statistics in the same call; the scalar columns are merged ontoresult.windows.local_pca_jackknife– delete-1 block jackknife standard error for local PCA eigenvectors. The jackknife is block-within-window: each window is partitioned inton_blocksSNP-blocks, the local-PCA eigendecomposition is recomputed with one block left out, and the per-window standard error is taken across the leave-one-out replicates. Sign alignment between replicates and per-window aggregation follow the procedure in Li & Ralph (2019), Appendix, which also discusses interpreting the resulting standard error as a signal-to-noise diagnostic for outlier windows. Returns aLocalPCAResultwith thejackknife_sefield populated. When requested alongsidelocal_pca, per-window matrix preparation is shared. Acceptsn_blocksandaggregatekwargs.
Legacy Functions
- pg_gpu.windowed_analysis.windowed_statistics(haplotype_matrix, bp_bins, statistics=('pi', 'theta_w', 'tajimas_d'), population=None, pop1=None, pop2=None, per_base=True, is_accessible=None, chrom=None)[source]
GPU-native windowed statistics with no Python loop over windows.
Computes per-variant values once, then aggregates into windows using GPU scatter_add operations. Dramatically faster than per-window computation for large numbers of windows.
- Parameters:
haplotype_matrix (HaplotypeMatrix) – Haplotype data.
bp_bins (array_like) – Window edges in base pairs. N+1 edges define N windows.
statistics (tuple of str) – Statistics to compute. Supported: Single-population: ‘pi’, ‘theta_w’, ‘tajimas_d’, ‘segregating_sites’, ‘singletons’, ‘het_expected’ Two-population: ‘fst’, ‘dxy’
population (str or list, optional) – Population for single-pop statistics.
pop1 (str or list, optional) – Populations for two-pop statistics (fst, dxy).
pop2 (str or list, optional) – Populations for two-pop statistics (fst, dxy).
per_base (bool) – If True, normalize by window size in base pairs.
is_accessible (array_like, optional) – Boolean accessibility mask for per-base normalization.
chrom (str or int, optional) – Chromosome label emitted in the
chromoutput column.
- Returns:
Maps column names to numpy arrays of shape (n_windows,). The first six columns are
chrom, start, end, center, n_variants, window_id, followed by one column per requested statistic.- Return type:
Moments Integration (LD Inference)
GPU-accelerated drop-in replacement for moments.LD.Parsing.compute_ld_statistics().
Computes the 15 two-population LD statistics (DD, Dz, pi2) and 3 heterozygosity
statistics on GPU, returning output in the exact format moments expects for
demographic inference.
Requires the moments pixi environment: pixi install -e moments.
- pg_gpu.moments_ld.compute_ld_statistics(vcf_file=None, rec_map_file=None, pop_file=None, pops=None, r_bins=None, bp_bins=None, use_genotypes=False, report=True, ac_filter=True, haplotype_matrix=None, genotype_matrix=None, accessible_bed=None, pop_assignment=None)[source]
GPU-accelerated drop-in replacement for moments.LD.Parsing.compute_ld_statistics.
Accepts the same arguments as the moments version so existing pipelines can switch by changing only the import:
# moments (CPU): import moments.LD ld_stats = moments.LD.Parsing.compute_ld_statistics( vcf_file="data.vcf.gz", rec_map_file="rec_map.txt", pop_file="pops.txt", pops=["popA", "popB"], r_bins=[0, 1e-6, 2e-6, 5e-6], ) # pg_gpu (GPU, same call signature): from pg_gpu.moments_ld import compute_ld_statistics ld_stats = compute_ld_statistics( vcf_file="data.vcf.gz", rec_map_file="rec_map.txt", pop_file="pops.txt", pops=["popA", "popB"], r_bins=[0, 1e-6, 2e-6, 5e-6], )
The returned dict has the same structure (keys ‘bins’, ‘sums’, ‘stats’, ‘pops’) and can be passed directly to moments inference functions.
- Parameters:
vcf_file (str, optional) – Path to VCF file. Not needed if haplotype_matrix/genotype_matrix provided.
rec_map_file (str, optional) – Recombination map (tab-delimited: pos, Map(cM)). Required with r_bins.
pop_file (str, dict, numpy.ndarray, list, or False, optional) – Sample-to-population assignment in any form
normalize_pop_inputaccepts (path / sample-to-pop dict / labels array / zarr key name /Falseto disable). The two kwargs are aliases;pop_fileis the moments-compatible spelling,pop_assignmentis the spelling the rest of pg_gpu uses. Passing both at once is an error.pop_assignment (str, dict, numpy.ndarray, list, or False, optional) – Sample-to-population assignment in any form
normalize_pop_inputaccepts (path / sample-to-pop dict / labels array / zarr key name /Falseto disable). The two kwargs are aliases;pop_fileis the moments-compatible spelling,pop_assignmentis the spelling the rest of pg_gpu uses. Passing both at once is an error.pops (list of str) – Population names (1-4). Defaults to [‘pop0’, ‘pop1’].
r_bins (array-like, optional) – Recombination rate bin edges (Morgans).
bp_bins (array-like, optional) – Base-pair distance bin edges (alternative to r_bins).
use_genotypes (bool) – If True, use diploid genotype counts (9-way) instead of haplotype counts (4-way). Requires unphased diploid data.
report (bool) – Print progress.
ac_filter (bool) – Apply biallelic filter.
haplotype_matrix (HaplotypeMatrix, optional) – Pre-loaded HaplotypeMatrix (skips VCF loading and GPU transfer).
genotype_matrix (GenotypeMatrix, optional) – Pre-loaded GenotypeMatrix (skips VCF loading and GPU transfer).
accessible_bed (str, optional) – Path to a BED file defining accessible/callable regions. Variants at inaccessible positions are removed before computing statistics.
- Return type:
dict with keys ‘bins’, ‘sums’, ‘stats’, ‘pops’ (moments format).
Distance Distribution Statistics
- pg_gpu.distance_stats.pairwise_diffs(matrix, population=None, missing_data='include')[source]
Compute pairwise Hamming distances on GPU.
Accepts HaplotypeMatrix (0/1 data, single matrix multiply) or GenotypeMatrix (0/1/2 data, indicator matrix approach). Dispatches automatically.
- Parameters:
matrix (HaplotypeMatrix or GenotypeMatrix)
missing_data (str)
- Returns:
diffs
- Return type:
ndarray, float64, condensed form (n_pairs,)
- pg_gpu.distance_stats.dist_moments(matrix, population=None, missing_data='include')[source]
Compute variance, skewness, and kurtosis of pairwise distances.
Computes the distance matrix once and derives all three moments, avoiding redundant matrix multiplies.
- Parameters:
matrix (HaplotypeMatrix or GenotypeMatrix)
missing_data (str)
- Returns:
var (float)
skew (float)
kurt (float)
- pg_gpu.distance_stats.dist_var(matrix, population=None, missing_data='include')[source]
Variance of pairwise distance distribution.
Visualization
Site Frequency Spectrum
- pg_gpu.plotting.plot_sfs(s, ax=None, folded=False, scaled=False, clip_endpoints=True, log_scale=True, color=None, **kwargs)[source]
Plot a site frequency spectrum.
- Parameters:
s (array_like) – SFS array (from pg_gpu.sfs.sfs, sfs_folded, etc.).
ax (matplotlib.axes.Axes, optional) – Axes to plot on. If None, creates a new figure.
folded (bool) – Label x-axis as minor allele count.
scaled (bool) – Label y-axis as scaled count.
clip_endpoints (bool) – If True, exclude the first and last entries (monomorphic classes).
log_scale (bool) – If True, use log scale for y-axis.
color (str, optional) – Bar color.
- Returns:
ax
- Return type:
matplotlib.axes.Axes
Linkage Disequilibrium
- pg_gpu.plotting.plot_pairwise_ld(r2_matrix, ax=None, cmap='Greys', vmin=0, vmax=1, positions=None, **kwargs)[source]
Plot a pairwise LD (r-squared) matrix as a heatmap.
- Parameters:
- Returns:
ax
- Return type:
matplotlib.axes.Axes
- pg_gpu.plotting.plot_ld_decay(distances, r2_values, ax=None, bins=50, color=None, percentile=50, **kwargs)[source]
Plot LD decay curve (r-squared vs distance).
- Parameters:
distances (array_like) – Pairwise distances between variants.
r2_values (array_like) – Pairwise r-squared values.
ax (matplotlib.axes.Axes, optional)
bins (int or array_like) – Number of distance bins or explicit bin edges.
color (str, optional)
percentile (float) – Percentile of r-squared to plot per bin (default: median).
- Returns:
ax
- Return type:
matplotlib.axes.Axes
PCA and Population Structure
- pg_gpu.plotting.plot_pca(coords, ax=None, pc_x=0, pc_y=1, explained_variance=None, labels=None, palette='Set2', s=20, alpha=0.8, **kwargs)[source]
Scatter plot of PCA coordinates.
- Parameters:
coords (array_like, shape (n_samples, n_components)) – PCA coordinates (from pg_gpu.decomposition.pca).
ax (matplotlib.axes.Axes, optional)
pc_x (int) – Which principal components to plot (0-indexed).
pc_y (int) – Which principal components to plot (0-indexed).
explained_variance (array_like, optional) – Variance explained ratios for axis labels.
labels (array_like, optional) – Population labels per sample for coloring.
palette (str) – Seaborn palette name.
s (float) – Marker size.
alpha (float) – Marker transparency.
- Returns:
ax
- Return type:
matplotlib.axes.Axes
- pg_gpu.plotting.plot_pairwise_distance(dist, ax=None, labels=None, cmap='viridis', **kwargs)[source]
Plot a pairwise distance matrix as a heatmap.
- Parameters:
dist (array_like) – Condensed or square distance matrix.
ax (matplotlib.axes.Axes, optional)
labels (array_like, optional) – Sample labels for axes.
cmap (str) – Colormap.
- Returns:
ax
- Return type:
matplotlib.axes.Axes
Windowed Statistics
- pg_gpu.plotting.plot_windowed(window_start, values, ax=None, stat_name='', color=None, fill_alpha=0.15, **kwargs)[source]
Plot windowed statistics along the genome.
- Parameters:
- Returns:
ax
- Return type:
matplotlib.axes.Axes
- pg_gpu.plotting.plot_windowed_panel(results, statistics=None, figsize=None, colors=None)[source]
Plot multiple windowed statistics as stacked panels.
- Parameters:
results (dict) – Output from windowed_statistics() or windowed_statistics_fused(). Must contain ‘start’ and statistic arrays.
statistics (list of str, optional) – Which statistics to plot. If None, plots all non-metadata keys.
figsize (tuple, optional) – Figure size.
colors (list, optional) – Colors for each panel.
- Returns:
fig (matplotlib.figure.Figure)
axes (list of matplotlib.axes.Axes)
Haplotype Structure
- pg_gpu.plotting.plot_haplotype_frequencies(freqs, ax=None, palette='Paired', singleton_color='white')[source]
Plot haplotype frequencies as a horizontal stacked bar.
- Parameters:
- Returns:
ax
- Return type:
matplotlib.axes.Axes
FrequencySpectrum (SFS-Based Estimation)
- class pg_gpu.diversity.FrequencySpectrum(haplotype_matrix, population=None, missing_data='include', n_total_sites=None)[source]
Site frequency spectrum with support for variable sample sizes.
Computes derived allele counts on GPU, groups by per-site sample size, and provides theta estimation via weight vector dot products. For built-in estimators, use the scalar functions (
pi(), etc.) which are faster. This class is for custom weight functions, SFS inspection, projection, and the general Achaz variance framework.- Parameters:
haplotype_matrix (HaplotypeMatrix)
missing_data (str) – ‘include’ (default) or ‘exclude’
n_total_sites (int, optional) – Total callable sites for invariant site correction.
- theta(weights='pi', span_normalize=False, span=None)[source]
Compute a theta estimator from the SFS.
- neutrality_test(w1='pi', w2='watterson')[source]
Compute T = (theta1 - theta2) / sqrt(var) using Achaz (2009) Eq. 9.