API Reference 

property has_accessible_mask: Whether an accessible site mask is attached.

set_accessible_mask(mask_or_path, chrom=None)[source]

Attach an accessible site mask (non-destructive).

The mask filters which variants are visible through the haplotypes and positions properties. Original data is preserved and a different mask can be applied later. Returns self for chaining.

Parameters:

mask_or_path (str, path-like, numpy.ndarray, or AccessibleMask) – If a string/path, treated as a BED file path defining accessible regions. If an ndarray, wrapped as an AccessibleMask with offset equal to chrom_start. If already an AccessibleMask, assigned directly.
chrom (str, optional) – Chromosome name to extract from the BED file. Required when mask_or_path is a BED file path.

Returns:

For method chaining.

Return type:

self

remove_accessible_mask()[source]

Remove the accessible mask, restoring all original variants.

Returns self for chaining.

property has_invariant_info: Whether invariant site information is available for span normalization.

property n_callable_sites

Total callable sites in the analysis universe.

Alias for n_total_sites. When an accessible mask is set, this is the BED span (mask.total_accessible). When no mask is set but the matrix was loaded with include_invariant=True, this is the matrix length at construction. Otherwise None.

This is the denominator for per-base span normalization, and n_callable_sites = n_segregating_sites + n_invariant_sites.

property n_segregating_sites

Number of polymorphic sites in the matrix.

Counts sites where 0 < derived_count < n_valid (i.e. polymorphic among observed haplotypes), with at least 2 valid samples.

property n_invariant_sites

Number of invariant sites in the callable span, or None if unknown.

Computed as n_callable_sites - n_segregating_sites. The matrix may physically contain fewer than n_callable_sites rows (e.g. for a variants-only VCF the matrix has only the polymorphic rows and the rest are implied invariants from the accessible-mask span); in that case the returned count includes those implied invariants.

Note that num_variants (matrix row count) is generally not equal to n_segregating_sites and not equal to n_callable_sites – it is just the physical matrix length.

transfer_to_gpu()[source]: Transfer data from CPU to GPU.

transfer_to_cpu()[source]: Transfer data from GPU to CPU.

classmethod from_vcf(path, region=None, samples=None, include_invariant=False, accessible_bed=None, fields=None)[source]

Construct a HaplotypeMatrix from a VCF file.

Parameters:

path (str) – Path to VCF/BCF file (optionally gzipped + tabix-indexed).
region (str, optional) – Genomic region to load, e.g. ‘chr1:1000000-2000000’. Requires the VCF to be bgzipped and tabix-indexed.
samples (list of str, optional) – Subset of samples to load. If None, loads all samples.
include_invariant (bool) – If True, set n_total_sites from the loaded variant count.
accessible_bed (str, optional) – Path to a BED file defining accessible/callable regions.
fields (list of str, optional) – VCF FORMAT/INFO tags to load alongside GT (e.g. ['GQ', 'DP', 'MQ']). Each tag is auto-resolved: INFO matches land in hm.fields[tag] as a (n_var,) array, FORMAT matches as a (n_var, n_samples) array. Tags missing from the VCF are warned and dropped silently.

Returns:

Phased haplotype data with sample names stored.

Return type:

HaplotypeMatrix or StreamingHaplotypeMatrix

classmethod from_zarr(path, region=None, accessible_bed=None, pop_assignment=None, streaming='auto', chunk_bp=1500000, prefetch=1, backend='auto', fields=None)[source]

Construct a HaplotypeMatrix from a Zarr store.

Supports both VCZ (bio2zarr) and scikit-allel zarr layouts. Layout is auto-detected. For multi-chromosome stores, region must be specified.

Parameters:

path (str) – Path to Zarr store directory.
region (str, optional) – Genomic region ‘chrom:start-end’ to load a subset.
accessible_bed (str, optional) – Path to a BED file defining accessible/callable regions.
pop_assignment (str, numpy.ndarray, list, dict, or False, optional) –
Sample-to-population assignments. Accepted forms:
- str – path to a tab-delimited file with a header row and sample pop columns, in the format that HaplotypeMatrix.load_pop_file expects.
- numpy.ndarray or list of length n_samples – one population label per sample, in the same order as the store’s sample axis.
- dict mapping sample id to population label. Any sample not in the dict is left unassigned.
- False – skip the auto-load.
Default (None) looks for <path>.pops.tsv next to the store and loads it if present, announcing the auto-load to stderr.
streaming ({'auto', 'always', 'never'}, optional) – Whether to return a StreamingHaplotypeMatrix that iterates the store chunk-by-chunk through the GPU (suitable for large-scale stores that do not fit entirely in GPU memory). 'always' forces streaming; 'never' forces a single-shot load (and raises MemoryError if the matrix would not fit in free GPU memory). 'auto' (default) checks the projected footprint of the haplotype matrix against free GPU memory and picks streaming when the full matrix would consume more than half the device. Scikit-allel-formatted zarr always routes to the single-shot path because the only allowed format for 'streaming' is VCZ.
chunk_bp (int, optional) – Genomic chunk size in bp for the streaming path. Ignored when the data are loaded into GPU memory in one shot.
prefetch (int, optional) – Number of chunks to read ahead on a background thread while the current chunk is being processed – 0 disables read-ahead; 1 (default) overlaps one chunk’s disk read with the previous chunk’s GPU work. Ignored when the data are loaded into GPU memory in one shot.
backend ({'auto', 'host', 'kvikio'}, optional) – Streaming chunk-fetch backend. 'kvikio' decodes the store’s codec on the GPU via kvikio + nvCOMP; only works when call_genotype’s codec is in the nvCOMP-supported list (zstd, blosc, lz4, deflate) and the sample-axis chunking is bio2zarr-shaped (sample chunk smaller than the full sample axis). 'host' is the host-buffer fallback. 'auto' (default) picks 'kvikio' when both conditions hold and warns + 'host' when chunks are whole-sample-axis (the kvikio path gives no speedup at that chunking).
fields (list of str, optional) – VCF FORMAT/INFO tags to load alongside genotypes (e.g. ['GQ', 'DP', 'MQ']). Each tag is auto-resolved from the store: variant_<tag> lands in hm.fields[tag] as a (n_var,) array, call_<tag> as (n_var, n_samples). Tags missing from the store are warned and dropped silently. Not supported on the streaming path.

Return type:

to_zarr(zarr_path, format='vcz', contig_name=None)[source]

Save haplotype data to Zarr format.

Parameters:

zarr_path (str) – Output Zarr store path.
format (str) – 'vcz' (default) for bio2zarr-compatible VCZ layout, 'scikit-allel' for legacy layout.
contig_name (str, optional) – Chromosome/contig name to store in VCZ format.

static vcf_to_zarr(vcf_paths, zarr_path, worker_processes=None, icf_path=None, max_memory='4GB', show_progress=True)[source]

Convert VCF file(s) to VCZ-format Zarr store using bio2zarr.

Parameters:

vcf_paths (str or list of str) – Path(s) to VCF/BCF files (bgzipped + indexed).
zarr_path (str) – Output Zarr store path.
worker_processes (int, optional) – Number of worker processes. Defaults to os.cpu_count().
icf_path (str, optional) – Directory for intermediate columnar format files. Defaults to <zarr_path>.icf_tmp (same filesystem as output). ICF files can be 1-3x the VCF size.
max_memory (str) – Maximum memory for encoding step. Default ‘4GB’.
show_progress (bool) – Show progress bars. Default True.

load_pop_file(pop_assignment, pops=None)[source]

Load population assignments from a tab-delimited file or an already-resolved sample->population mapping.

Sets sample_sets from a file (or dict) mapping sample names to populations. Requires that sample names were stored during from_vcf / from_zarr.

Parameters:

pop_assignment (str or dict) – Either a path to a tab-delimited file with columns sample pop (header line starting with sample is skipped), or a dict mapping sample names to population labels.
pops (list of str, optional) – Populations to include. If None, includes all found populations.

classmethod from_ts(ts, device='CPU', include_invariant=False, accessible_bed=None, chrom=None)[source]

Create a HaplotypeMatrix from a tskit.TreeSequence.

Parameters:

ts (TreeSequence) – A tskit.TreeSequence object
device (str) – ‘CPU’ or ‘GPU’
include_invariant (bool) – If True, set n_total_sites from the sequence length so that calculations can account for invariant sites analytically (no extra rows stored).
accessible_bed (str) – Path to a BED file defining accessible regions.
chrom (str) – Chromosome name for BED file filtering.

Returns:

A new HaplotypeMatrix instance

Return type:

Notes

Populations declared in the tree sequence (with a name in metadata) are automatically registered in sample_sets: each pop name maps to the haplotype row indices of its samples. The no-demography case (sim_ancestry(samples=N) with a single unnamed population) leaves sample_sets unset. Users who need custom groupings can overwrite hm.sample_sets after construction.

get_matrix()[source]

Returns the haplotype matrix.

Returns:: The array representing the haplotype/genotype matrix.
Return type:: cp.ndarray

get_positions()[source]

Returns the variant positions.

Returns:: The array of positions.
Return type:: cp.ndarray

property shape

Returns the shape of the haplotype matrix.

Returns:

A tuple representing the dimensions (variants, samples): of the haplotype matrix.

Return type:

property num_variants: Returns the number of variants in the haplotype matrix.

property num_haplotypes: Returns the number of haplotypes in the haplotype matrix.

get_subset(positions)[source]

Get a subset of the haplotype matrix based on the provided positions.

Parameters:: positions – A one-dimensional array of indices to select from the haplotype matrix. This can be either a NumPy array or a CuPy array.
Returns:: A new instance containing the subset of the haplotype matrix.
Return type:: HaplotypeMatrix

get_subset_from_range(low, high)[source]

Get a subset of the haplotype matrix based on a range of positions.

Parameters:

low (int) – The lower bound of the range (inclusive).
high (int) – The upper bound of the range (exclusive).

Returns:

A new instance containing the subset of the haplotype matrix.

Return type:

apply_biallelic_filter()[source]

Apply biallelic filter to remove variants that are not strictly biallelic.

This filter matches the behavior of moments’ get_genotypes function, which uses is_biallelic_01() to remove variants that: 1. Have more than 2 alleles present in the data 2. Don’t have both reference (0) and alternate (1) alleles present

This is the actual filtering that moments does by default, not an AC filter.

Returns:: A new HaplotypeMatrix instance with filtered variants.
Return type:: HaplotypeMatrix

Note

This replicates moments’ is_biallelic_01() filtering behavior.

is_missing(axis=None)[source]

Detect missing calls (-1 values).

Parameters:: axis (int, optional) – Axis along which to compute. If None, returns boolean array same shape as haplotypes. If 0, returns missing per variant. If 1, returns missing per sample.
Returns:: missing – Boolean array indicating missing data
Return type:: array

is_called(axis=None)[source]

Detect valid (non-missing) calls.

Parameters:: axis (int, optional) – Axis along which to compute. If None, returns boolean array same shape as haplotypes. If 0, returns called per variant. If 1, returns called per sample.
Returns:: called – Boolean array indicating valid data
Return type:: array

count_missing(axis=None)[source]

Count missing calls per variant or sample.

Parameters:: axis (int, optional) – Axis along which to count. If None, returns total count. If 0, returns count per variant. If 1, returns count per sample.
Returns:: count – Count of missing calls
Return type:: int or array

count_called(axis=None)[source]

Count valid calls per variant or sample.

Parameters:: axis (int, optional) – Axis along which to count. If None, returns total count. If 0, returns count per variant. If 1, returns count per sample.
Returns:: count – Count of valid calls
Return type:: int or array

get_span(mode='auto')[source]

Get the genomic span for normalization calculations.

Parameters:: mode (str) – ‘auto’ - Use best available: accessible mask > n_total_sites > total genomic span > callable span ‘accessible’ - Use accessible base count from mask (error if no mask set) ‘per_base’ - Use total genomic span (chrom_end - chrom_start) ‘per_variant’ - Use number of variant sites ‘callable’ - Use span from first to last variant position ‘total’ - Alias for ‘per_base’ (backward compatibility) ‘sites’ - Alias for ‘per_variant’ (backward compatibility)
Returns:: span – The span to use for normalization
Return type:: int

exclude_missing_sites(populations=None)[source]

Return a new matrix with only sites that have no missing data.

Parameters:: populations (list of (str or list), optional) – If given, only require completeness within these populations. Each element is a population name (looked up in sample_sets) or a list of sample indices.
Returns:: Filtered matrix. Returns self if no missing data.
Return type:: HaplotypeMatrix
Raises:: ValueError – If no sites remain after filtering.

filter_variants_by_missing(max_missing_freq=0.1)[source]

Return a new HaplotypeMatrix with variants filtered by missing data frequency.

Parameters:: max_missing_freq (float) – Maximum allowed frequency of missing data per variant
Returns:: filtered – New HaplotypeMatrix with filtered variants
Return type:: HaplotypeMatrix

filter(variants=None, genotypes=None, drop_all_missing=True)[source]

Return a new HaplotypeMatrix with quality filters applied.

Parameters:

variants (array-like of bool, shape (n_variants,), optional) – Per-variant keep mask. Variants where False are dropped from every array on the returned matrix (haplotypes, positions, all per-variant and per-genotype fields).
genotypes (array-like of bool, shape (n_variants, n_samples), optional) – Per-genotype keep mask. Genotypes where False are set to -1 (missing) on both haplotype rows of the affected sample. fields arrays are not zeroed out – the per-genotype QC values remain available for downstream inspection.
drop_all_missing (bool, default True) – After applying both masks, drop variants whose every haplotype call is -1. Default keeps the output free of rows where the genotype mask blanked every call.

Returns:

Fresh matrix; the underlying haplotype + position arrays are new allocations (not views of self). Sample names, sample_sets, and chrom_start / chrom_end are propagated. n_total_sites is dropped because the variant axis has changed; if the caller wants span normalization on the result they should re-attach an accessibility mask explicitly. Any accessibility mask previously attached to self is not inherited for the same reason.

Return type:

Notes

filter operates on the underlying haplotype matrix (self._haplotypes), not the view exposed when an accessibility mask is attached. The QC fields arrays this method reads from were loaded against the underlying axis, so masks built from them stay aligned automatically.

summarize_missing_data()[source]

Get summary statistics about missing data patterns.

Returns:: summary – Dictionary with missing data statistics
Return type:: dict

allele_frequency_spectrum()[source]

Calculate the allele frequency spectrum for a haplotype matrix.

Note: This method is deprecated. Use diversity.allele_frequency_spectrum() instead.

Return type:: ndarray

diversity(span_normalize=True)[source]

Calculate the nucleotide diversity (π) for the haplotype matrix.

Note: This method is deprecated. Use diversity.pi() instead.

Parameters:: span_normalize (bool, optional) – If True, the result is normalized by the span of the haplotype matrix. Defaults to True.
Returns:: The nucleotide diversity (π) for the haplotype matrix.
Return type:: float

watersons_theta(span_normalize=True)[source]

Calculate Waterson’s theta for the haplotype matrix.

Note: This method is deprecated. Use diversity.theta_w() instead.

Return type:: float

Tajimas_D()[source]

Calculate Tajima’s D for the haplotype matrix.

Note: This method is deprecated. Use diversity.tajimas_d() instead.

Return type:: float

pairwise_LD_v()[source]

Pairwise linkage disequilibrium (D statistic) via matrix multiply.

Return type:: ndarray

pairwise_r2(estimator='r2')[source]

Pairwise r-squared for all variant pairs.

Parameters:: estimator (str) – 'r2' (default) – naive haplotype r² from frequency-based formula. 'rogers_huff' – Rogers-Huff r² computed on diploid 0/1/2 dosages obtained by pairing adjacent haplotypes (sample 0 = haplotypes 0,1; sample 1 = haplotypes 2,3; …). Matches scikit-allel.rogers_huff_r() ** 2.
Return type:: cupy.ndarray, float64, shape (n_variants, n_variants)

locate_unlinked(size=100, step=20, threshold=0.1)[source]

Locate variants in approximate linkage equilibrium.

Uses a sliding window approach to identify variants whose r-squared with all other variants in the window is below the threshold.

Parameters:

size (int) – Window size (number of variants).
step (int) – Number of variants to advance between windows.
threshold (float) – Maximum r-squared value to consider variants unlinked.

Returns:

True for variants in approximate linkage equilibrium.

Return type:

ndarray, bool, shape (n_variants,)

windowed_r_squared(bp_bins, percentile=50, pop=None, estimator='r2')[source]

Compute percentiles of r-squared in genomic distance bins.

Parameters:

bp_bins (array_like) – Bin edges for genomic distances in base pairs.
percentile (float or array_like) – Percentile(s) to compute within each bin.
pop (str, optional) – Population key to use.
estimator (str) – 'r2' (default) – naive haplotype r² from frequency counts. 'rogers_huff' – Rogers-Huff r² on diploid 0/1/2 dosages obtained by pairing adjacent haplotypes; matches scikit-allel.rogers_huff_r() ** 2.

Returns:

result (ndarray, shape (n_bins,) or (n_bins, n_percentiles)) – Percentile(s) of r-squared per bin.
counts (ndarray, int, shape (n_bins,)) – Number of variant pairs per bin.

tally_gpu_haplotypes(pop=None)[source]

GPU implementation of computing pairwise haplotype tallies. Automatically detects and handles missing data if present.

Parameters:

pop (str, optional) – Population key from sample_sets to use. If None, uses all samples.

Returns:

(counts, n_valid) where:

counts: Array of shape (#pairs, 4) containing [n11, n10, n01, n00] for each variant pair
n_valid: Array of shape (#pairs,) containing the number of valid haplotypes for each pair
or None if no missing data is present

Return type:

tally_gpu_haplotypes_two_pops_with_missing(pop1, pop2)[source]

GPU implementation of computing pairwise haplotype tallies for two populations with missing data support.

For each variant pair, only counts haplotypes where both variants are non-missing in both populations. Missing data is encoded as -1 in the haplotype matrix.

Parameters:

pop1 (str) – First population key from sample_sets
pop2 (str) – Second population key from sample_sets

Returns:

(counts, n_valid1, n_valid2) where:

counts: Array of shape (#pairs, 8) containing counts for both populations [n11_1, n10_1, n01_1, n00_1, n11_2, n10_2, n01_2, n00_2]
n_valid1: Array of shape (#pairs,) with valid haplotypes for pop1
n_valid2: Array of shape (#pairs,) with valid haplotypes for pop2

Return type:

tally_gpu_haplotypes_two_pops(pop1, pop2)[source]

GPU version of tallying haplotype counts between all pairs of variants for two populations. Automatically detects and handles missing data if present.

Returns:

(counts, n_valid1, n_valid2) where:

counts: Array of shape (#pairs, 8) containing counts for both populations
n_valid1: Array of shape (#pairs,) with valid haplotypes for pop1 (or None if no missing data)
n_valid2: Array of shape (#pairs,) with valid haplotypes for pop2 (or None if no missing data)

Return type:

Parameters:

pop1 (str)
pop2 (str)

compute_ld_statistics_gpu_single_pop(bp_bins, raw=False, ac_filter=True, chunk_size='auto')[source]

GPU-based LD statistics computation for a single population.

Computes DD, Dz, and pi2 statistics for variant pairs binned by distance. Only processes pairs within max(bp_bins) distance for memory efficiency.

Parameters:

bp_bins (array-like) – Array of bin boundaries in base pairs. Pairs are binned by distance into intervals [bp_bins[i], bp_bins[i+1]).
raw (bool, optional) – If True, return raw sums of statistics across pairs in each bin. If False (default), return means.
ac_filter (bool, optional) – If True (default), apply biallelic filtering before computation.
chunk_size (int or 'auto', optional) – Number of pairs to process per chunk. If ‘auto’ (default), automatically estimates optimal size based on available GPU memory. Can specify an integer for manual control.

Returns:

Dictionary mapping (bin_start, bin_end) tuples to tuples of statistics. Each bin contains (DD, Dz, pi2) values.

Return type:

Examples

>>> bp_bins = [0, 10000, 50000, 100000]
>>> stats = hm.compute_ld_statistics_gpu_single_pop(bp_bins)
>>> stats[(0.0, 10000.0)]  # (DD, Dz, pi2) for first bin

compute_ld_statistics_gpu_two_pops(bp_bins, pop1, pop2, raw=False, ac_filter=True, chunk_size='auto')[source]

GPU-based LD statistics computation for two populations.

Computes DD, Dz, and pi2 statistics for variant pairs binned by distance. Only processes pairs within max(bp_bins) distance for memory efficiency.

Parameters:

bp_bins (array-like) – Array of bin boundaries in base pairs. Pairs are binned by distance into intervals [bp_bins[i], bp_bins[i+1]).
pop1 (str) – Name of first population (must exist in sample_sets)
pop2 (str) – Name of second population (must exist in sample_sets)
raw (bool, optional) – If True, return raw sums of statistics across pairs in each bin. If False (default), return means.
ac_filter (bool, optional) – If True (default), apply biallelic filtering before computation.
chunk_size (int or 'auto', optional) – Number of pairs to process per chunk. If ‘auto’ (default), automatically estimates optimal size based on available GPU memory. Can specify an integer for manual control.

Returns:

Dictionary mapping (bin_start, bin_end) tuples to OrderedDict of statistics. Each bin contains 15 statistics: - DD_0_0, DD_0_1, DD_1_1 (D squared) - Dz_0_0_0, Dz_0_0_1, Dz_0_1_1, Dz_1_0_0, Dz_1_0_1, Dz_1_1_1 - pi2_0_0_0_0, pi2_0_0_0_1, pi2_0_0_1_1, pi2_0_1_0_1, pi2_0_1_1_1, pi2_1_1_1_1

Return type:

GenotypeMatrix or StreamingGenotypeMatrix

Examples

>>> bp_bins = [0, 10000, 50000, 100000]
>>> stats = hm.compute_ld_statistics_gpu_two_pops(bp_bins, 'pop1', 'pop2')
>>> stats[(0.0, 10000.0)]['DD_0_0']  # D^2 for pop1 in first bin

GenotypeMatrix

class pg_gpu.GenotypeMatrix(genotypes, positions, chrom_start=None, chrom_end=None, sample_sets=None, n_total_sites=None, samples=None, accessible_mask=None, fields=None)[source]

Bases: object

Diploid genotype matrix with values 0 (hom ref), 1 (het), 2 (hom alt).

Shape: (n_individuals, n_variants). Missing data encoded as -1.

Parameters:

genotypes (ndarray, int8, shape (n_individuals, n_variants)) – Alt allele count per individual per variant.
positions (ndarray, shape (n_variants,)) – Variant positions.
chrom_start (int, optional) – Chromosome boundaries.
chrom_end (int, optional) – Chromosome boundaries.
sample_sets (dict, optional) – Maps population names to lists of individual indices.

property genotypes

property positions

property accessible_mask

property device

property sample_sets

property shape

property num_variants

property num_individuals

property has_accessible_mask: Whether an accessible site mask is attached.

set_accessible_mask(mask_or_path, chrom=None)[source]

Attach an accessible site mask (non-destructive).

Returns self for chaining.

Parameters:

mask_or_path (str, path-like, numpy.ndarray, or AccessibleMask) – BED file path, boolean array, or AccessibleMask instance.
chrom (str, optional) – Chromosome name (required for BED file input).

remove_accessible_mask()[source]

Remove the accessible mask, restoring all original variants.

Returns self for chaining.

property has_invariant_info: Whether invariant site information is available for span normalization.

property n_callable_sites

Total callable sites in the analysis universe.

Alias for n_total_sites. See HaplotypeMatrix.n_callable_sites for full semantics.

property n_segregating_sites

Number of polymorphic sites in the matrix.

Counts sites where 0 < alt_count < 2 * n_valid (polymorphic among observed diploid genotypes), with at least 2 valid samples.

property n_invariant_sites

Number of invariant sites in the callable span, or None if unknown.

Computed as n_callable_sites - n_segregating_sites. See HaplotypeMatrix.n_invariant_sites for full semantics.

transfer_to_gpu()[source]

transfer_to_cpu()[source]

classmethod from_haplotype_matrix(hap_matrix)[source]

Convert a HaplotypeMatrix to a GenotypeMatrix.

Pairs consecutive haplotypes (0,1), (2,3), … as diploid individuals. Genotype = sum of paired haplotypes (0, 1, or 2).

Parameters:: hap_matrix (HaplotypeMatrix) – Haploid data. Must have even number of haplotypes.
Return type:: GenotypeMatrix

to_haplotype_matrix()[source]

Convert back to HaplotypeMatrix (expand diploid to haploid).

Each individual becomes two consecutive haplotypes. Het sites (genotype=1) are assigned as (0,1).

Return type:: HaplotypeMatrix

classmethod from_vcf(path, include_invariant=False, accessible_bed=None, fields=None)[source]

Construct from a VCF file.

Parameters:

path (str) – Path to VCF file.
include_invariant (bool) – If True, include invariant sites and set n_total_sites.
accessible_bed (str, optional) – Path to a BED file defining accessible/callable regions.
fields (list of str, optional) – VCF FORMAT/INFO tags to load (e.g. ['GQ', 'DP', 'MQ']); see HaplotypeMatrix.from_vcf for the shape contract. Arrays are sliced down to the biallelic-only variant set this constructor keeps, so per-variant fields end up shape (n_kept,) and per-genotype fields (n_kept, n_samples).

Return type:

GenotypeMatrix

classmethod from_zarr(path, region=None, accessible_bed=None, include_invariant=False, pop_assignment=None, streaming='auto', chunk_bp=1500000, prefetch=1, backend='auto', fields=None)[source]

Construct a GenotypeMatrix from a Zarr store.

Identical interface to HaplotypeMatrix.from_zarr; see that method’s docstring for the meaning of every kwarg (streaming, pop_assignment flexibility, chunk_bp, prefetch, backend, fields). The only difference is the returned type: this returns GenotypeMatrix (or StreamingGenotypeMatrix on the streaming path) where each row is a (n_indiv, ploidy) genotype call, versus HaplotypeMatrix which presents the same data as a (n_haplotypes, n_variants) phased matrix.

Parameters:

include_invariant (bool) – If True, set n_total_sites from the loaded variant count. Unique to this entry point (the haplotype matrix does not need it).
path – See HaplotypeMatrix.from_zarr.
region – See HaplotypeMatrix.from_zarr.
accessible_bed – See HaplotypeMatrix.from_zarr.
pop_assignment – See HaplotypeMatrix.from_zarr.
streaming (str) – See HaplotypeMatrix.from_zarr.
chunk_bp (int) – See HaplotypeMatrix.from_zarr.
prefetch (int) – See HaplotypeMatrix.from_zarr.
backend (str) – See HaplotypeMatrix.from_zarr.
fields (list) – See HaplotypeMatrix.from_zarr.

Return type:

filter(variants=None, genotypes=None, drop_all_missing=True)[source]

Return a new GenotypeMatrix with quality filters applied.

Parameters:

variants (array-like of bool, shape (n_variants,), optional) – Per-variant keep mask; variants where False are dropped from every array on the returned matrix.
genotypes (array-like of bool, shape (n_variants, n_samples), optional) – Per-genotype keep mask; genotypes where False are set to -1 (missing). fields entries keep their original per-genotype QC values.
drop_all_missing (bool, default True) – After applying both masks, drop variants where every individual’s call is -1.

Returns:

Fresh matrix; underlying arrays are new allocations. samples, sample_sets, chrom_start / chrom_end are preserved. See HaplotypeMatrix.filter for the accessibility-mask interaction (this method behaves the same way).

Return type:

GenotypeMatrix

to_zarr(zarr_path, format='vcz', contig_name=None)[source]

Save genotype data to Zarr format.

Parameters:

zarr_path (str) – Output Zarr store path.
format (str) – 'vcz' (default) or 'scikit-allel'.
contig_name (str, optional) – Chromosome name for VCZ format.

load_pop_file(pop_assignment, pops=None)[source]

Load population assignments from a tab-delimited file or an already-resolved sample->population mapping.

Parameters:

pop_assignment (str or dict) – Either a path to a tab-delimited file with columns sample pop, or a dict mapping sample names to population labels.
pops (list of str, optional) – Populations to include. If None, includes all found.

apply_biallelic_filter()[source]

Filter to biallelic variant sites.

Keeps variants where both ref and alt alleles are present among non-missing individuals.

Return type:: GenotypeMatrix

Biobank-Scale Streaming

A VCZ store is bio2zarr’s Zarr-on-disk encoding of a VCF: the genotype matrix is split into compressed chunks, each chunk a small array of samples by variants. For VCZ stores that do not fit entirely in GPU memory, HaplotypeMatrix.from_zarr / GenotypeMatrix.from_zarr can return a streaming view (streaming='always', or streaming='auto' with a too-large projected footprint). The streaming object iterates the chromosome chunk by chunk through every kernel that accepts a fully loaded matrix; see Biobank-Scale Streaming from VCZ for the end-to-end pattern and the VCF-to-VCZ conversion step. The reader requires diploid genotypes; haploid and polyploid stores are rejected.

class pg_gpu.streaming_matrix.StreamingHaplotypeMatrix(source, fetcher, chunk_bp, prefetch, *, align_bp=None)[source]

Bases: _StreamingMatrixBase

Chunked view over a ZarrGenotypeSource, yielding per-chunk HaplotypeMatrix instances.

Returned by HaplotypeMatrix.from_zarr when the requested matrix does not fit eagerly on the GPU. See _StreamingMatrixBase for the iteration / materialize / sample_sets contract.

property align_bp

Chunk-boundary alignment in bp.

Every chunk has a width that is a multiple of this, so a window whose size also divides align_bp is guaranteed to fit inside a single chunk. Streaming-aware kernels read this property to validate that the caller’s window_size divides it.

property chrom_end: Chunk-grid right edge (exclusive). Not the last variant position; per-chunk windows are uniform width within their chunk, so the right edge here is the chunk grid’s exclusive upper bound rather than HaplotypeMatrix’s last-inclusive convention.

property chrom_start: Chunk-grid origin. Not the first variant position – per-chunk windows are anchored to the chunk grid, so reporting the variant-based origin would be misleading for callers building a comparable eager matrix.

iter_gpu_chunks()

Yield (left, right, eager_matrix) tuples covering the source.

Each yielded eager matrix lives on the GPU and represents one genomic chunk. Empty chunks (regions with no variants, e.g. an acrocentric arm) are skipped – callers see only chunks with at least one variant.

materialize(*, region=None, sample_subset=None)

Build an eager matrix over a sub-region.

Pairwise / cross-window kernels (pairwise_r2, grm, ibs, locate_unlinked, the r^2 heatmap path) can’t be evaluated chunk-by-chunk because they need every (variant, variant) pair simultaneously. Pull the slice you want into one device-resident eager matrix and run those kernels on it.

Parameters:

region (tuple of int, optional) – (left, right) bp interval to materialize. right is exclusive. None materializes the full mappable range, which on a biobank-scale store will OOM.
sample_subset (sequence of int, optional) – Haplotype-axis indices to keep. None keeps every haplotype.

property sample_sets: Population -> sample-axis indices. Falls back to a single ‘all’ set when no pop file was resolved.

class pg_gpu.streaming_matrix.StreamingGenotypeMatrix(source, fetcher, chunk_bp, prefetch, *, align_bp=None)[source]

Bases: _StreamingMatrixBase

Chunked view over a ZarrGenotypeSource, yielding per-chunk GenotypeMatrix (dosage-coded (n_indiv, n_var)) instances.

Returned by GenotypeMatrix.from_zarr when the requested matrix does not fit eagerly on the GPU. Sample sets index the diploid axis (0..num_individuals-1), not the haplotype axis. ibs accepts this class directly; grm still needs an eager sub-region via .materialize(region=...).

property align_bp

Chunk-boundary alignment in bp.

Every chunk has a width that is a multiple of this, so a window whose size also divides align_bp is guaranteed to fit inside a single chunk. Streaming-aware kernels read this property to validate that the caller’s window_size divides it.

property chrom_end: Chunk-grid right edge (exclusive). Not the last variant position; per-chunk windows are uniform width within their chunk, so the right edge here is the chunk grid’s exclusive upper bound rather than HaplotypeMatrix’s last-inclusive convention.

property chrom_start: Chunk-grid origin. Not the first variant position – per-chunk windows are anchored to the chunk grid, so reporting the variant-based origin would be misleading for callers building a comparable eager matrix.

iter_gpu_chunks()

Yield (left, right, eager_matrix) tuples covering the source.

Each yielded eager matrix lives on the GPU and represents one genomic chunk. Empty chunks (regions with no variants, e.g. an acrocentric arm) are skipped – callers see only chunks with at least one variant.

materialize(*, region=None, sample_subset=None)

Build an eager matrix over a sub-region.

Pairwise / cross-window kernels (pairwise_r2, grm, ibs, locate_unlinked, the r^2 heatmap path) can’t be evaluated chunk-by-chunk because they need every (variant, variant) pair simultaneously. Pull the slice you want into one device-resident eager matrix and run those kernels on it.

Parameters:

region (tuple of int, optional) – (left, right) bp interval to materialize. right is exclusive. None materializes the full mappable range, which on a biobank-scale store will OOM.
sample_subset (sequence of int, optional) – Haplotype-axis indices to keep. None keeps every haplotype.

property sample_sets: Population -> sample-axis indices. Falls back to a single ‘all’ set when no pop file was resolved.

class pg_gpu.zarr_source.ZarrGenotypeSource(path, *, region=None, pop_assignment=None, contig_id=None)[source]

Bases: object

Open a VCZ store and expose chunked + subset read primitives.

Parameters:

path (str) – Filesystem path to a VCZ zarr store.
region (str, optional) – 'chrom:start-end' to restrict the source to a single sub-region. The source is single-contig regardless: a multi-contig store without region raises.
pop_assignment (str, optional) – Tab-delimited file mapping sample -> pop (same format as HaplotypeMatrix.load_pop_file). Default None looks for <path>.pops.tsv next to the store and uses it if present; emits a one-line stderr note so the auto-load isn’t invisible. Pass pop_assignment=False to disable the auto-load entirely.
contig_id (str, optional) – Pick a contig by name when region is not given and the store has multiple contigs. Ignored otherwise.

num_variants

Variants inside region (or the whole store), after contig selection.

Type:: int

num_diploids

Sample-axis length.

Type:: int

num_haplotypes

2 * num_diploids.

Type:: int

site_pos

Variant positions (host), length num_variants.

Type:: ndarray

chrom

Selected contig name.

Type:: str

chunks

call_genotype chunk shape.

Type:: tuple

pop_cols

{pop_name: ndarray[hap_axis_indices]} when a pop file was resolved, else None.

Type:: dict or None

slice_region(left, right)[source]

Read every haplotype for variants in [left, right).

Returns:

gt (ndarray, shape (n_var, num_diploids, 2), dtype int8) – Raw call_genotype block, -1 for multiallelic rows.
pos (ndarray, shape (n_var,), dtype int64) – Variant positions.

slice_region_gpu(left, right, *, cg)[source]

Like slice_region but reads through a caller-provided cg – the call_genotype zarr array on a group opened from a kvikio.zarr.GDSStore with the GPU buffer prototype active. The chunk lands on the GPU via the active buffer prototype.

cg is passed in (not opened here) because zarr caches the codec pipeline on the group at open time; the caller opens the group once after toggling zarr.config.enable_gpu() and reuses the resulting array across every chunk in the iteration. Per-call opens would cost ~ms per chunk and add nothing.

slice_subsample_gpu(left, right, hap_cols, *, cg)[source]

Like slice_subsample(to_gpu=True) but reads chunks through the caller-provided cg – a call_genotype zarr array on a group opened from a kvikio.zarr.GDSStore with the GPU buffer prototype active. Decompression happens on the GPU (nvCOMP) rather than via the synchronous host-side oindex codec pipeline, which at biobank-scale sample subsets is minutes per call.

Sample-axis subsetting works by grouping the requested diploids into contiguous runs (typically one or two for a population-subset read), reading each run as a contiguous cg[zlo:zhi, rlo:rhi, :] slab that kvikio decompresses directly to GPU memory, then doing one GPU fancy-index to assemble the (n_var, len(hap_cols)) result. Pathologically scattered subsets (every dip its own run) fall back to slice_subsample(to_gpu=True).

Returns:

gm (cupy.ndarray, shape (n_var, len(hap_cols)), dtype int8)
pos (ndarray, shape (n_var,), dtype int64)

slice_subsample(left, right, hap_cols, *, to_gpu=False)[source]

Read variants in [left, right) restricted to hap_cols.

hap_cols is an iterable of haplotype-axis indices in [0, 2 * num_diploids). Uses zarr’s oindex for the sample-axis subset; with bio2zarr-style sample chunking only the chunks containing the requested diploids are decompressed.

Parameters:

to_gpu (bool) – When True the returned gm is a cupy array on the GPU. The ploidy gather is done on the GPU regardless – this kwarg only controls whether gm gets downloaded back to host before returning. Callers that immediately upload the result (e.g. materialize) should pass to_gpu=True to avoid the round-trip.

Returns:

gm (ndarray (host) or cupy.ndarray (device), shape) – (n_var, len(hap_cols)), dtype int8 -1 for multiallelic rows.
pos (ndarray, shape (n_var,), dtype int64) – Variant positions.

iter_chunks(chunk_bp, align_bp=None, start=None)[source]

Yield (left, right) genomic intervals tiling the source.

Intervals are sized as multiples of align_bp (defaults to chunk_bp) so a caller running window-based stats can guarantee windows never straddle a chunk boundary. Empty regions at the start of the contig (e.g. acrocentric arm) are yielded as well – the caller can detect and skip them via the cheap gt.shape[0] == 0 check.

class pg_gpu.streaming_matrix.KvikioChunkFetcher(source, *, compat_mode='ON', num_threads=8)[source]

Bases: ChunkFetcher

Read chunks via kvikio.zarr.GDSStore with on-GPU codec decode.

Hardware / software requirements (the kvikio path falls back to HostChunkFetcher if these are not met):

NVIDIA driver and a CUDA-capable GPU; pg_gpu already requires both.
kvikio and nvidia-nvcomp Python packages in the same environment as pg_gpu (the project’s pixi env ships these by default).
The VCZ store’s call_genotype chunks compressed with a codec in nvCOMP’s GPU-decode list – currently zstd, blosc, lz4, or deflate. bio2zarr defaults to zstd. Other codecs surface as ValueError on construction so the caller can re-encode.
Sample-axis chunking smaller than the full sample axis. A store with one whole-sample-axis chunk per variant chunk is still readable but gets no kvikio speedup, so backend='auto' warns and falls back to the host path.

No special filesystem support is needed. The fetcher opens the store through kvikio.zarr.GDSStore and switches zarr 3 to its GPU buffer prototype so nvCOMP decodes the codec directly on the device – build_haplotype_matrix then skips the host-to-device copy in cp.asarray(gt) because gt is already a cupy array. The win is in that on-GPU decode, not in GPU Direct Storage.

By default this fetcher sets kvikio.defaults.compat_mode=ON so kvikio reads bytes through posix into bounce buffers rather than trying to use GPU Direct Storage (which would be slower on hosts without /etc/cufile.json configured because of repeated failed cufile probes). Pass compat_mode=AUTO if the storage path is known to be GDS-capable.

Construction probes the store’s call_genotype codec and raises ValueError if it’s not in the nvCOMP-supported list, rather than silently returning incorrect data through a CPU fallback. On close the zarr buffer prototype is reset so subsequent eager calls in the same process get numpy buffers back.

iter_chunks(chunks, prefetch)[source]

Yield (ci, left, right, gt, pos, t_read_s) per chunk.

Parameters:

chunks (list of (int, int)) – Half-open (left, right) bp intervals.
prefetch (int) – Read-ahead depth. 0 means serial; >=1 means a worker thread reads ahead of the consumer.

close()[source]

Reset the zarr buffer prototype.

Called automatically at the end of iter_chunks; exposed for tests and for callers that build a fetcher without immediately iterating it. There is no __del__ – Python’s GC timing is nondeterministic, so cleanup is via try/finally inside iter_chunks and explicit close() for callers that construct without iterating.

class pg_gpu.streaming_matrix.HostChunkFetcher(source)[source]

Bases: ChunkFetcher

Read chunks through the source’s host-buffer slice_region.

prefetch >= 1 spawns a daemon producer thread that fills a bounded queue with the next chunk while the consumer is computing on the current chunk. A GPU-codec-decode fetcher (kvikio + nvidia-nvcomp) is a drop-in replacement for the same ABC when the store’s codec is GPU-decodable.

iter_chunks(chunks, prefetch)[source]

Yield (ci, left, right, gt, pos, t_read_s) per chunk.

Parameters:

chunks (list of (int, int)) – Half-open (left, right) bp intervals.
prefetch (int) – Read-ahead depth. 0 means serial; >=1 means a worker thread reads ahead of the consumer.

class pg_gpu.MemoryLimitedWarning[source]

Bases: UserWarning

A load that is at risk of exhausting memory or taking far longer than the VCZ path would.

Emitted before parsing by HaplotypeMatrix.from_vcf and GenotypeMatrix.from_vcf when the VCF on disk is large enough, or has enough samples, that VCF text parsing will be slow and the full-matrix host load may not even fit. The warning text includes a copy-pastable recipe for converting the VCF to VCZ via HaplotypeMatrix.vcf_to_zarr. Silence with:

import warnings
from pg_gpu import MemoryLimitedWarning
warnings.filterwarnings("ignore", category=MemoryLimitedWarning)

pg_gpu.zarr_io.allel_zarr_to_vcz(allel_path, vcz_path, *, contig=None, region=None, variant_chunk=10000, sample_chunk=1000, progress=False)[source]

Convert a scikit-allel zarr store to vcz (bio2zarr) layout.

Streams calldata/GT in variant blocks so chromosome-scale allel stores – which the eager read_genotypes_allel would OOM on – can be converted without materializing the full genotype matrix. Writes call_genotype with bio2zarr-style sample-axis chunking so the pg_gpu streaming reader’s kvikio backend can decode chunks on the GPU.

Parameters:

allel_path (str) – Source allel store. Either a flat layout with calldata/GT at the root, or a chromosome-grouped layout where each <contig>/calldata/GT is one chromosome. Both are accepted.
vcz_path (str) – Destination vcz store path. Overwritten if it exists.
contig (str, optional) – For grouped stores, the chromosome key to read. Required when the source is grouped and region does not name a contig. For flat stores, used as the contig_id label in the output.
region (str, optional) – "chrom:start-end" (or "start-end" on a flat store) restricting the conversion to a position range.
variant_chunk (int) – Source variants read per streaming pass. Larger amortizes zarr read overhead; smaller bounds host RAM at biobank sample counts.
sample_chunk (int) – Sample-axis chunk size in the output call_genotype. Smaller chunks let the streaming reader’s kvikio + nvCOMP decode overlap with compute on biobank-scale stores.
progress (bool) – Print a one-line status per variant chunk to stderr.

Notes

Only the fields pg_gpu’s streaming reader requires are written: call_genotype, call_genotype_mask, variant_position, sample_id, contig_id, variant_contig. Other allel columns (REF, ALT, FILTER, etc.) are not preserved.

LD Statistics

Core Functions

pg_gpu.ld_statistics.dd(counts, populations=None, n_valid=None)[source]

Compute D² statistic for any population configuration.

Parameters:

counts (cp.ndarray) – Haplotype counts array: - Single population: shape (N, 4) - Two populations: shape (N, 8) - Multi-population: shape (N, 4*P)
populations (tuple of int, optional) – Population indices. None for single population, (i, j) for between populations i and j
n_valid (cp.ndarray, optional) – Valid sample counts per population. Shape depends on configuration: - Single pop: shape (N,) - Two pops: shape (N, 2) or tuple of (N,) arrays

Returns:

D² values for each locus

Return type:

cp.ndarray

pg_gpu.ld_statistics.dz(counts, populations=None, n_valid=None)[source]

Compute Dz statistic for any population configuration.

Parameters:

counts (cp.ndarray) – Haplotype counts array
populations (tuple of int, optional) – Three population indices (i, j, k) for Dz(i,j,k). None defaults to single population (0, 0, 0)
n_valid (cp.ndarray, optional) – Valid sample counts per population

Returns:

Dz values for each locus

Return type:

cp.ndarray

pg_gpu.ld_statistics.pi2(counts, populations=None, n_valid=None)[source]

Compute π₂ statistic for any population configuration.

Parameters:

counts (cp.ndarray) – Haplotype counts array
populations (tuple of int, optional) – Four population indices (i, j, k, l) for π₂(i,j,k,l). None defaults to single population (0, 0, 0, 0)
n_valid (cp.ndarray, optional) – Valid sample counts per population

Returns:

π₂ values for each locus

Return type:

cp.ndarray

pg_gpu.ld_statistics.r(counts, n_valid=None)[source]

Compute Pearson correlation coefficient r between variant pairs from haplotype counts.

Parameters:

counts (cp.ndarray, shape (N, 4)) – Haplotype counts [n11, n10, n01, n00] for each variant pair.
n_valid (cp.ndarray, optional) – Valid sample counts per pair. Shape (N,).

Returns:

Pearson r values. NaN where computation is undefined (monomorphic at either locus).

Return type:

cp.ndarray, float64, shape (N,)

pg_gpu.ld_statistics.r_squared(counts, n_valid=None)[source]

Compute r-squared (squared Pearson correlation) between variant pairs from haplotype counts.

Parameters:

counts (cp.ndarray, shape (N, 4)) – Haplotype counts [n11, n10, n01, n00] for each variant pair.
n_valid (cp.ndarray, optional) – Valid sample counts per pair. Shape (N,).

Returns:

r-squared values. NaN where computation is undefined.

Return type:

cp.ndarray, float64, shape (N,)

Convenience Functions

pg_gpu.ld_statistics.dd_within(counts, n_valid=None)[source]

Compute D² within a single population.

Convenience function equivalent to dd(counts, populations=None)

Parameters:

counts (ndarray)
n_valid (ndarray | None)

Return type:

ndarray

pg_gpu.ld_statistics.dd_between(counts, pop1_idx, pop2_idx, n_valid=None)[source]

Compute D² between two populations.

Convenience function equivalent to dd(counts, populations=(pop1_idx, pop2_idx))

Parameters:

counts (ndarray)
pop1_idx (int)
pop2_idx (int)
n_valid (ndarray | None)

Return type:

ndarray

pg_gpu.ld_statistics.compute_ld_statistics(counts, statistics=['dd', 'dz', 'pi2'], populations=None, n_valid=None)[source]

Compute multiple LD statistics in one pass.

Parameters:

counts (cp.ndarray) – Haplotype counts array
statistics (list of str) – Statistics to compute (‘dd’, ‘dz’, ‘pi2’)
populations (dict, optional) – Population configurations for each statistic. E.g., {‘dd’: (0, 1), ‘dz’: (0, 0, 1), ‘pi2’: (0, 0, 1, 1)}
n_valid (cp.ndarray, optional) – Valid sample counts per population

Returns:

Dictionary mapping statistic names to computed values

Return type:

pg_gpu.ld_statistics.zns(r2_matrix_or_matrix, missing_data='include', estimator='auto')[source]

Kelly’s ZnS: mean pairwise r-squared across all SNP pairs.

Parameters:

r2_matrix_or_matrix (ndarray, HaplotypeMatrix, or GenotypeMatrix) – Square r-squared matrix, or a matrix object (dispatches to haploid or diploid r-squared computation automatically). When a HaplotypeMatrix is passed, uses tiled computation to avoid materializing the full m×m r² matrix.
missing_data (str) – 'include' (default) uses per-site valid data for frequency computation. 'exclude' filters to sites with no missing data.
estimator (str) – 'auto' (default) uses the unbiased sigma_d2 estimator when the input is a HaplotypeMatrix, and falls back to naive r2 for pre-computed r² arrays or GenotypeMatrix inputs (where sigma_d2 is not available). 'r2' always computes naive r-squared. 'sigma_d2' always uses the unbiased multinomial projection estimators (Ragsdale & Gravel 2019), computing mean \(\sigma_D^2 = D^2/\pi_2\) per pair with falling-factorial corrections. Requires HaplotypeMatrix input.

Returns:

Mean r-squared (or mean sigma_D^2 when sigma_d2 is selected).

Return type:

pg_gpu.ld_statistics.omega(r2_matrix_or_matrix, missing_data='include', estimator='auto')[source]

Kim and Nielsen’s Omega: max ratio of within-partition to cross-partition mean LD.

For each possible SNP partition point l, splits variants into [0:l) and [l:m), computes mean r-squared within each block and between blocks. Returns max(mean_within / mean_cross).

Uses GPU prefix sums on the upper triangle to evaluate all partition points without a Python loop. Matches diploSHIC’s convention of using upper-triangle pairs only.

Parameters:

r2_matrix_or_matrix (ndarray, HaplotypeMatrix, or GenotypeMatrix) – Square r-squared matrix, or a matrix object (dispatches to haploid or diploid r-squared computation automatically).
missing_data (str) – 'include' (default) uses per-site valid data for frequency computation. 'exclude' filters to sites with no missing data.
estimator (str) – 'auto' (default) uses the unbiased sigma_d2 estimator when the input is a HaplotypeMatrix, and falls back to naive r2 for pre-computed r² arrays or GenotypeMatrix inputs (where sigma_d2 is not available). 'r2' always computes naive r-squared. 'sigma_d2' always uses unbiased \(\sigma_D^2 = D^2/\pi_2\) (Ragsdale & Gravel 2019). Requires HaplotypeMatrix input.

Returns:

Maximum omega value. Returns 0 if fewer than 5 SNPs.

Return type:

pg_gpu.ld_statistics.mu_ld(haplotype_matrix, missing_data='include')[source]

mu_LD: haplotype pattern exclusivity between left/right halves (RAiSD).

Splits variants at midpoint and measures how exclusively haplotype patterns associate across halves. Elevated at sweep boundaries where LD structure changes abruptly.

Parameters:

haplotype_matrix (HaplotypeMatrix)
missing_data (str) – ‘include’ - treat missing as wildcard in pattern matching ‘exclude’ - filter to sites with no missing data

Return type:

Population Specifications

Single population: populations=None or omit
Two populations: populations=(0, 1) for DD
Three indices: populations=(0, 0, 1) for Dz
Four indices: populations=(0, 0, 1, 1) for pi2

Diversity Statistics

Core Functions

pg_gpu.diversity.pi(haplotype_matrix, population=None, span_normalize=True, missing_data='include')[source]

Calculate nucleotide diversity (pi) for a population.

Nucleotide diversity is the average number of nucleotide differences per site between two randomly chosen sequences from the population.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
span_normalize (bool) – True (default): auto-detect best denominator (accessible bases if mask set, else genomic span). False: return raw sum.
missing_data (str) – 'include' (default) uses per-site valid data. 'exclude' filters to sites with no missing data.

Returns:

Nucleotide diversity value

Return type:

pg_gpu.diversity.theta_w(haplotype_matrix, population=None, span_normalize=True, missing_data='include')[source]

Calculate Watterson’s theta for a population.

Watterson’s theta is an estimator of the population mutation rate based on the number of segregating sites.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
span_normalize (bool) – True (default): auto-detect best denominator. False: return raw sum.
missing_data (str) – 'include' (default) uses per-site valid data. 'exclude' filters to sites with no missing data.

Returns:

Watterson’s theta value

Return type:

pg_gpu.diversity.tajimas_d(haplotype_matrix, population=None, missing_data='include')[source]

Calculate Tajima’s D statistic.

Tajima’s D tests the neutral mutation hypothesis by comparing two estimates of the population mutation rate.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
missing_data (str) – 'include' (default) uses per-site valid data with harmonic mean of per-site sample sizes for variance terms. 'exclude' filters to sites with no missing data.

Returns:

Tajima’s D value

Return type:

pg_gpu.diversity.haplotype_diversity(haplotype_matrix, population=None, missing_data='include')[source]

Calculate haplotype diversity for a population.

Haplotype diversity is defined as 1 - sum(p_i^2) where p_i is the frequency of the i-th unique haplotype in the population.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
missing_data (str) – ‘include’ - exclude haplotypes with any missing data ‘exclude’ - filter to sites with no missing data

Returns:

Haplotype diversity value

Return type:

pg_gpu.diversity.allele_frequency_spectrum(haplotype_matrix, population=None, missing_data='include')[source]

Calculate the allele frequency spectrum (AFS).

The AFS is a histogram of allele frequencies across all sites.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
missing_data (str) – ‘include’ - Calculate AFS using available data per site. ‘exclude’ - Only use sites with no missing data.

Returns:

Array where element i contains the number of sites with i derived alleles

Return type:

ndarray

pg_gpu.diversity.segregating_sites(haplotype_matrix, population=None, missing_data='include')[source]

Count the number of segregating sites.

A site is segregating if it has more than one allele present.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
missing_data (str) – ‘include’ - Count sites as segregating based on non-missing data only. ‘exclude’ - Only count sites with no missing data.

Returns:

Number of segregating sites

Return type:

int

pg_gpu.diversity.singleton_count(haplotype_matrix, population=None, missing_data='include')[source]

Count the number of singleton variants.

A singleton is a variant present in exactly one haplotype.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
missing_data (str) – ‘include’ - Count singletons based on non-missing data only. ‘exclude’ - Only count singletons at sites with no missing data.

Returns:

Number of singleton variants

Return type:

int

pg_gpu.diversity.fay_wus_h(haplotype_matrix, population=None, missing_data='include')[source]

Calculate Fay and Wu’s H statistic.

Tests for an excess of high-frequency derived alleles, which can indicate positive selection.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data

Returns:

Fay and Wu’s H value

Return type:

pg_gpu.diversity.heterozygosity_expected(haplotype_matrix, population=None, missing_data='include')[source]

Compute expected heterozygosity (gene diversity) per variant.

He = 1 - sum(p_i^2) for each variant, where p_i are allele frequencies. For biallelic sites this simplifies to He = 2*p*(1-p).

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
population (str or list, optional) – Population name or sample indices.
missing_data (str) – ‘exclude’ - NaN at sites with any missing data ‘include’ - per-site n_valid for frequency calculation

Returns:

Expected heterozygosity per variant.

Return type:

ndarray, float64, shape (n_variants,)

pg_gpu.diversity.heterozygosity_observed(haplotype_matrix, population=None, ploidy=2, missing_data='include')[source]

Compute observed heterozygosity per variant.

Assumes consecutive haplotypes belong to the same individual (standard for diploid VCF data). A site is heterozygous in an individual if the two haplotypes differ.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
population (str or list, optional) – Population name or sample indices.
ploidy (int) – Ploidy level. Default 2 (diploid).
missing_data (str) – ‘include’ - skip missing individuals per site (default) ‘exclude’ - NaN at sites with any missing data

Returns:

Observed heterozygosity per variant.

Return type:

ndarray, float64, shape (n_variants,)

pg_gpu.diversity.inbreeding_coefficient(haplotype_matrix, population=None, ploidy=2, missing_data='include')[source]

Compute Wright’s inbreeding coefficient F per variant.

F = 1 - Ho/He, where Ho is observed heterozygosity and He is expected heterozygosity.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
population (str or list, optional) – Population name or sample indices.
ploidy (int) – Ploidy level for observed heterozygosity computation.
missing_data (str) – Passed to heterozygosity_expected and heterozygosity_observed.

Returns:

Inbreeding coefficient per variant. NaN where He = 0.

Return type:

ndarray, float64, shape (n_variants,)

Convenience Functions

pg_gpu.diversity.theta_l(haplotype_matrix, population=None, span_normalize=True, missing_data='include')[source]

Compute theta_L diversity estimator.

theta_L = sum_i(i * xi_i) / (n - 1), where xi_i is the count of sites with derived allele count i. Weights variants linearly by derived allele frequency, bridging theta_pi and theta_H.

Reference: Zeng et al. (2006), Genetics 174: 1431-1439, Equation (8).

Parameters:

haplotype_matrix (HaplotypeMatrix)
population (str or list, optional)
span_normalize (bool) – True (default): auto-detect best denominator. False: return raw sum.
missing_data (str) – ‘include’ - per-site sample sizes ‘exclude’ - only sites with no missing data

Return type:

pg_gpu.diversity.normalized_fay_wus_h(haplotype_matrix, population=None, missing_data='include')[source]

Compute normalized Fay and Wu’s H (H*).

H = theta_pi - theta_H, normalized by its standard deviation under the standard neutral model. The normalization allows comparison across samples with different numbers of segregating sites.

Reference: Zeng et al. (2006), “Statistical Tests for Detecting Positive Selection by Utilizing High-Frequency Variants”, Genetics 174: 1431-1439, Equation (11).

Parameters:

haplotype_matrix (HaplotypeMatrix)
population (str or list, optional)
missing_data (str) – 'include' (default) uses per-site sample sizes with harmonic mean n for variance terms. 'exclude' filters to sites with no missing data.

Returns:

Normalized H*. Negative values indicate excess high-frequency derived alleles (directional selection signal).

Return type:

pg_gpu.diversity.zeng_e(haplotype_matrix, population=None, missing_data='include')[source]

Compute Zeng’s E test statistic.

E = theta_L - theta_W, normalized by its standard deviation.

Reference: Zeng et al. (2006), Genetics 174: 1431-1439, Equation (13).

Parameters:

haplotype_matrix (HaplotypeMatrix)
population (str or list, optional)
missing_data (str) – 'include' (default) or 'exclude'.

Return type:

pg_gpu.diversity.zeng_dh(haplotype_matrix, population=None, missing_data='include')[source]

Compute Zeng’s DH joint test statistic.

Combines Tajima’s D and Fay & Wu’s H into a single test with improved power to detect directional selection. Defined as the product D * H when both are negative, zero otherwise.

Reference: Zeng et al. (2006), “Statistical Tests for Detecting Positive Selection by Utilizing High-Frequency Variants”, Genetics 174: 1431-1439, Equation (15).

Parameters:

haplotype_matrix (HaplotypeMatrix)
population (str or list, optional)
missing_data (str) – Passed through to tajimas_d and fay_wus_h.

Returns:

DH statistic. Positive when both D and H are negative (consistent with a selective sweep).

Return type:

pg_gpu.diversity.diversity_stats(haplotype_matrix, population=None, statistics=['pi', 'theta_w', 'tajimas_d'], span_normalize=True, missing_data='include')[source]

Compute multiple diversity statistics at once.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples
statistics (list) – List of statistics to compute
span_normalize (bool) – True (default): auto-detect best denominator. False: return raw sums.
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data

Returns:

Dictionary mapping statistic names to values

Return type:

pg_gpu.diversity.neutrality_tests(haplotype_matrix, population=None, missing_data='include')[source]

Compute common neutrality test statistics.

Returns Tajima’s D, Fay and Wu’s H, and related values.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
population (str or list, optional) – Population name or list of sample indices
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data

Returns:

Dictionary with neutrality test results

Return type:

pg_gpu.diversity.theta_h(haplotype_matrix, population=None, span_normalize=True, missing_data='include')[source]

Compute theta_H (homozygosity-based diversity estimator).

theta_H = sum_i [ i^2 * S_i ] * 2 / (n*(n-1)) where S_i is the count of variants with derived allele count i. Used to compute Fay and Wu’s H.

Parameters:

haplotype_matrix (HaplotypeMatrix)
population (str or list, optional)
span_normalize (bool) – True (default): auto-detect best denominator. False: return raw sum.
missing_data (str) – ‘include’ - per-site sample sizes ‘exclude’ - only sites with no missing data

Return type:

pg_gpu.diversity.max_daf(haplotype_matrix, population=None, missing_data='include')[source]

Maximum derived allele frequency across all variants.

Parameters:

haplotype_matrix (HaplotypeMatrix)
population (str or list, optional)
missing_data (str) – ‘include’ - per-site n_valid for frequency ‘exclude’ - only sites with no missing data

Returns:

Maximum DAF in [0, 1].

Return type:

pg_gpu.diversity.haplotype_count(haplotype_matrix, population=None, missing_data='include')[source]

Count distinct haplotypes.

Parameters:

haplotype_matrix (HaplotypeMatrix)
population (str or list, optional)
missing_data (str) – ‘include’ - exclude haplotypes with any missing ‘exclude’ - filter to sites with no missing data

Return type:

int

pg_gpu.diversity.daf_histogram(matrix, n_bins=20, population=None, missing_data='include')[source]

Normalized histogram of derived allele frequencies.

Accepts HaplotypeMatrix or GenotypeMatrix. For diploid data, DAF = sum(genotypes) / (2 * n_individuals).

Parameters:

matrix (HaplotypeMatrix or GenotypeMatrix)
n_bins (int) – Number of frequency bins spanning [0, 1].
population (str or list, optional)
missing_data (str) – ‘include’ - per-site n_valid for frequency ‘exclude’ - only sites with no missing data

Returns:

hist (ndarray, float64, shape (n_bins,)) – Normalized counts (sum to 1).
bin_edges (ndarray, float64, shape (n_bins + 1,))

pg_gpu.diversity.diplotype_frequency_spectrum(genotype_matrix, population=None)[source]

Count distinct multi-locus genotype patterns (diplotypes).

Parameters:

genotype_matrix (GenotypeMatrix)
population (str or list, optional)

Returns:

freqs (ndarray, float64, sorted descending) – Diplotype frequencies.
n_diplotypes (int) – Number of distinct diplotypes.

pg_gpu.diversity.mu_var(haplotype_matrix, window_length=None, population=None)[source]

mu_VAR: SNP density statistic (RAiSD).

Number of SNPs per base pair. Elevated near sweeps due to hitchhiking effects on local variant density.

Parameters:

haplotype_matrix (HaplotypeMatrix)
window_length (float, optional) – Window length in bp. If None, uses chrom_end - chrom_start.
population (str or list, optional)

Return type:

pg_gpu.diversity.mu_sfs(haplotype_matrix, population=None, missing_data='include')[source]

mu_SFS: fraction of SNPs at SFS edges (RAiSD).

Counts singletons (DAC=1) and near-fixed variants (DAC=n-1), divided by total segregating sites. Elevated near selective sweeps.

Parameters:

haplotype_matrix (HaplotypeMatrix)
population (str or list, optional)
missing_data (str) – ‘include’ - per-site n_valid for edge classification ‘exclude’ - only sites with no missing data

Return type:

Divergence Statistics

Core Functions

pg_gpu.divergence.fst(haplotype_matrix, pop1, pop2, method='hudson', missing_data='include')[source]

Compute FST between two populations.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
pop1 (str or list) – First population (name from sample_sets or list of indices)
pop2 (str or list) – Second population (name from sample_sets or list of indices)
method (str) – FST estimation method (‘hudson’, ‘weir_cockerham’, ‘nei’)
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data

Returns:

FST value between populations

Return type:

pg_gpu.divergence.fst_hudson(haplotype_matrix, pop1, pop2, missing_data='include')[source]

Compute Hudson’s FST estimator between two populations.

Hudson (1992) estimator: FST = 1 - (Hw / Hb) where Hw is within-population diversity and Hb is between-population diversity

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
pop1 (str or list) – First population
pop2 (str or list) – Second population
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data

Returns:

Hudson’s FST estimate

Return type:

pg_gpu.divergence.fst_weir_cockerham(haplotype_matrix, pop1, pop2, missing_data='include')[source]

Compute Weir & Cockerham’s (1984) FST estimator.

This is the method of moments estimator that accounts for sampling variance. Computes all three variance components (a, b, c) including within-individual heterozygosity from paired haplotypes.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data. Consecutive haplotype pairs are treated as diploid individuals for the heterozygosity component.
pop1 (str or list) – First population
pop2 (str or list) – Second population
missing_data (str) – ‘include’ - per-site sample sizes, ratio-of-sums ‘exclude’ - Only use sites with no missing data

Returns:

Weir & Cockerham’s FST estimate

Return type:

pg_gpu.divergence.fst_nei(haplotype_matrix, pop1, pop2, missing_data='include')[source]

Compute Nei’s GST (1973).

GST = (HT - HS) / HT where HT is total heterozygosity and HS is within-subpopulation heterozygosity

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
pop1 (str or list) – First population
pop2 (str or list) – Second population
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data

Returns:

Nei’s GST estimate

Return type:

pg_gpu.divergence.dxy(haplotype_matrix, pop1, pop2, per_site=False, missing_data='include', span_normalize=True)[source]

Compute absolute divergence (Dxy) between two populations.

Dxy measures the average number of nucleotide differences between sequences from two populations.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
pop1 (str or list) – First population
pop2 (str or list) – Second population
per_site (bool) – If True, return per-site values; if False, return mean
missing_data (str) – 'include' (default) uses per-site valid data. 'exclude' filters to sites with no missing data.
span_normalize (bool) – True (default): auto-detect best denominator. False: return raw sum / sites-with-data average.

Returns:

Mean Dxy or per-site Dxy values

Return type:

float or cp.ndarray

pg_gpu.divergence.da(haplotype_matrix, pop1, pop2, missing_data='include', span_normalize=True)[source]

Compute net divergence (Da) between two populations.

Da = Dxy - (pi1 + pi2) / 2 where pi1 and pi2 are within-population diversities

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
pop1 (str or list) – First population
pop2 (str or list) – Second population
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data
span_normalize (bool) – True (default): auto-detect best denominator. False: return raw sum / sites-with-data average.

Returns:

Net divergence (Da)

Return type:

Convenience Functions

pg_gpu.divergence.divergence_stats(haplotype_matrix, pop1, pop2, statistics=['fst', 'dxy', 'da'], missing_data='include', span_normalize=True)[source]

Compute multiple divergence statistics between two populations.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
pop1 (str or list) – First population
pop2 (str or list) – Second population
statistics (list) – List of statistics to compute
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data
span_normalize (bool) – True (default): auto-detect. False: raw sum / per-site average.

Returns:

Dictionary of statistic names to values

Return type:

pg_gpu.divergence.pairwise_fst(haplotype_matrix, populations=None, method='hudson', missing_data='include')[source]

Compute pairwise FST matrix for multiple populations.

Parameters:

haplotype_matrix (HaplotypeMatrix) – The haplotype data
populations (list, optional) – List of population names. If None, uses all populations in sample_sets
method (str) – FST method to use
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data

Returns:

fst_matrix (cp.ndarray) – Pairwise FST matrix
pop_names (list) – Population names in matrix order

Return type:

Tuple[ndarray, list]

pg_gpu.divergence.pbs(haplotype_matrix, pop1, pop2, pop3, window_size, window_start=0, window_stop=None, window_step=None, normed=True, missing_data='include')[source]

Compute the Population Branch Statistic (PBS).

PBS detects genomic regions unusually differentiated in pop1 relative to pop2 and pop3, using pairwise Hudson FST estimates.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data containing all three populations.
pop1 (str or list) – Population names or sample indices.
pop2 (str or list) – Population names or sample indices.
pop3 (str or list) – Population names or sample indices.
window_size (int) – Number of variants per window.
window_start (int, optional) – Starting variant index.
window_stop (int, optional) – Stopping variant index.
window_step (int, optional) – Stride between windows. Defaults to window_size (non-overlapping).
normed (bool, optional) – If True (default), return normalized PBS (PBSn1).
missing_data (str) – ‘include’ - per-site sample sizes ‘exclude’ - only use sites with no missing data

Returns:

PBS values per window.

Return type:

ndarray, float64, shape (n_windows,)

Distance-Based Two-Population Statistics

pg_gpu.divergence.snn(haplotype_matrix, pop1, pop2, missing_data='include', distance_matrices=None)[source]

Hudson’s nearest-neighbor statistic (Snn).

For each haplotype, determines whether its nearest neighbor is from the same population. Snn is the fraction of haplotypes whose nearest neighbor is conspecific. Under panmixia Snn ~ 0.5; under population structure Snn -> 1.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop1 (str or list)
pop2 (str or list)
missing_data (str)
distance_matrices (tuple, optional) – Pre-computed (dist_between, dist_within1, dist_within2) from pairwise_distance_matrix. Avoids recomputation when calling multiple stats on the same population pair.

Returns:

Snn statistic in [0, 1].

Return type:

References

Hudson, R.R. (2000). A New Statistic for Detecting Genetic Differentiation. Genetics, 155(4), 2011-2014.

pg_gpu.divergence.dxy_min(haplotype_matrix, pop1, pop2, missing_data='include', distance_matrices=None)[source]

Minimum pairwise distance between two populations.

The Hamming distance of the closest pair of haplotypes across the two populations. Used in Gmin (Geneva et al.) and dd (Schrider et al.) statistics.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop1 (str or list)
pop2 (str or list)
missing_data (str)
distance_matrices (tuple, optional) – Pre-computed distance matrices.

Returns:

Minimum pairwise distance.

Return type:

References

Geneva, A.J. et al. (2015). A new method to scan genomes for introgression in a secondary contact model. PLoS ONE, 10(4).

pg_gpu.divergence.gmin(haplotype_matrix, pop1, pop2, missing_data='include', distance_matrices=None)[source]

Geneva’s Gmin: ratio of minimum to mean between-population distance.

Gmin = Dxy_min / Dxy_mean. Low values indicate unusually similar haplotypes across populations relative to average divergence, suggesting recent gene flow or shared ancestral variation.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop1 (str or list)
pop2 (str or list)
missing_data (str)
distance_matrices (tuple, optional) – Pre-computed distance matrices.

Returns:

Gmin ratio.

Return type:

References

Geneva, A.J. et al. (2015). A new method to scan genomes for introgression in a secondary contact model. PLoS ONE, 10(4).

pg_gpu.divergence.dd(haplotype_matrix, pop1, pop2, missing_data='include', distance_matrices=None)[source]

Relative minimum divergence (dd1, dd2).

dd1 = Dxy_min / pi1, dd2 = Dxy_min / pi2. Low values indicate that the closest between-population pair is unusually similar relative to within-population diversity.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop1 (str or list)
pop2 (str or list)
missing_data (str)
distance_matrices (tuple, optional) – Pre-computed distance matrices.

Returns:

dd1 (float) – Dxy_min / pi(pop1).
dd2 (float) – Dxy_min / pi(pop2).

Return type:

Tuple[float, float]

References

Schrider, D.R., Ayroles, J., Matute, D.R. & Kern, A.D. (2018). Supervised machine learning reveals introgressed loci in the genomes of Drosophila simulans and D. sechellia. PLoS Genetics, 14(4), e1007341. https://doi.org/10.1371/journal.pgen.1007341

pg_gpu.divergence.dd_rank(haplotype_matrix, pop1, pop2, missing_data='include', distance_matrices=None)[source]

Rank of Dxy_min in within-population pairwise distance distributions.

For each population, computes the fraction of within-population pairwise distances that are <= Dxy_min. Low values indicate the closest between-population pair is more similar than most within-population pairs, suggesting introgression.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop1 (str or list)
pop2 (str or list)
missing_data (str)
distance_matrices (tuple, optional) – Pre-computed distance matrices.

Returns:

rank1 (float) – Fraction of pop1 within-pop distances <= Dxy_min.
rank2 (float) – Fraction of pop2 within-pop distances <= Dxy_min.

Return type:

Tuple[float, float]

References

Schrider, D.R., Ayroles, J., Matute, D.R. & Kern, A.D. (2018). Supervised machine learning reveals introgressed loci in the genomes of Drosophila simulans and D. sechellia. PLoS Genetics, 14(4), e1007341. https://doi.org/10.1371/journal.pgen.1007341

pg_gpu.divergence.zx(haplotype_matrix, pop1, pop2, missing_data='include')[source]

ZnS ratio: within-population LD relative to total LD.

Zx = (ZnS_pop1 + ZnS_pop2) / (2 * ZnS_total). Values > 1 indicate stronger LD within populations than across the combined sample, consistent with population structure or recent admixture.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop1 (str or list)
pop2 (str or list)
missing_data (str)

Returns:

Zx ratio.

Return type:

References

Schrider, D.R., Ayroles, J., Matute, D.R. & Kern, A.D. (2018). Supervised machine learning reveals introgressed loci in the genomes of Drosophila simulans and D. sechellia. PLoS Genetics, 14(4), e1007341. https://doi.org/10.1371/journal.pgen.1007341

pg_gpu.divergence.pairwise_distance_matrix(haplotype_matrix, pop1, pop2, missing_data='include')[source]

Compute pairwise distance matrices between and within two populations.

Returns three cupy distance matrices: between-population, within-pop1, and within-pop2. Pre-compute once and pass to individual stats via the distance_matrices parameter to avoid redundant GPU work.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop1 (str or list)
pop2 (str or list)
missing_data (str)

Returns:

dist_between (cupy.ndarray, float64, shape (n1, n2))
dist_within1 (cupy.ndarray, float64, shape (n1, n1))
dist_within2 (cupy.ndarray, float64, shape (n2, n2))

pg_gpu.divergence.distance_based_stats(haplotype_matrix, pop1, pop2, missing_data='include')[source]

Compute all distance-based two-population statistics at once.

Shares the pairwise distance matrix computation across Snn, Gmin, dd, and dd_rank, avoiding redundant GPU work.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop1 (str or list)
pop2 (str or list)
missing_data (str)

Returns:

Keys: snn, dxy_min, gmin, dd1, dd2, dd_rank1, dd_rank2.

Return type:

Selection Scan Statistics

Haplotype-Based Selection Scans

pg_gpu.selection.ihs(haplotype_matrix, pos=None, map_pos=None, min_ehh=0.05, min_maf=0.05, include_edges=False, gap_scale=20000, max_gap=200000, is_accessible=None, population=None, missing_data='include')[source]

Compute the unstandardized integrated haplotype score (iHS).

Compares the integrated EHH for the reference vs alternate allele at each variant.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
pos (array_like, optional) – Variant positions. If None, uses haplotype_matrix.positions.
map_pos (array_like, optional) – Genetic map positions. If None, uses physical positions.
min_ehh (float, optional) – Minimum EHH below which to truncate IHH computation.
min_maf (float, optional) – Minimum minor allele frequency to compute score.
include_edges (bool, optional) – If True, report scores even when EHH does not decay below min_ehh.
gap_scale (int, optional) – Rescale gaps larger than this value.
max_gap (int, optional) – Mark gaps larger than this as invalid.
is_accessible (array_like, optional) – Genome accessibility mask.
population (str or list, optional) – Population name or sample indices.
missing_data (str) – ‘include’ - missing extends shared suffix length (default) ‘exclude’ - filter to sites with no missing data before scan

Returns:

score – Unstandardized iHS scores: log(IHH1 / IHH0).

Return type:

ndarray, float, shape (n_variants,)

pg_gpu.selection.xpehh(haplotype_matrix, pop1, pop2, pos=None, map_pos=None, min_ehh=0.05, include_edges=False, gap_scale=20000, max_gap=200000, is_accessible=None, missing_data='include')[source]

Compute the unstandardized cross-population EHH (XP-EHH) statistic.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data containing both populations.
pop1 (str or list) – First population name or sample indices.
pop2 (str or list) – Second population name or sample indices.
pos (array_like, optional) – Variant positions. If None, uses haplotype_matrix.positions.
map_pos (array_like, optional) – Genetic map positions.
min_ehh (float, optional) – Minimum EHH threshold.
include_edges (bool, optional) – Report scores even when EHH does not decay below min_ehh.
gap_scale (int, optional) – Rescale gaps larger than this value.
max_gap (int, optional) – Mark gaps larger than this as invalid.
is_accessible (array_like, optional) – Genome accessibility mask.
missing_data (str) – ‘include’ - missing extends shared suffix length (default) ‘exclude’ - filter to sites with no missing data before scan

Returns:

score – Unstandardized XP-EHH scores: log(IHH_pop1 / IHH_pop2).

Return type:

ndarray, float, shape (n_variants,)

pg_gpu.selection.nsl(haplotype_matrix, population=None, missing_data='include')[source]

Compute the unstandardized nSL statistic for each variant.

Compares the mean shared haplotype length around the reference (0) vs alternate (1) allele at each site.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
population (str or list, optional) – Population name or list of sample indices.
missing_data (str) – ‘include’ - missing extends shared suffix length (default) ‘exclude’ - filter to sites with no missing data before scan

Returns:

score – Unstandardized nSL scores: log(nSL1 / nSL0).

Return type:

ndarray, float, shape (n_variants,)

pg_gpu.selection.xpnsl(haplotype_matrix, pop1, pop2, missing_data='include')[source]

Compute the unstandardized cross-population nSL statistic.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data containing both populations.
pop1 (str or list) – First population name or sample indices.
pop2 (str or list) – Second population name or sample indices.
missing_data (str) – ‘include’ - missing extends shared suffix length (default) ‘exclude’ - filter to sites with no missing data before scan

Returns:

score – Unstandardized XP-nSL scores: log(nSL_pop1 / nSL_pop2).

Return type:

ndarray, float, shape (n_variants,)

EHH and Haplotype Homozygosity

pg_gpu.selection.ehh_decay(haplotype_matrix, truncate=False, population=None, missing_data='include')[source]

Compute EHH decay from the first variant.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
truncate (bool, optional) – If True, exclude trailing zeros from the result.
population (str or list, optional) – Population name or sample indices.
missing_data (str) – ‘include’ - missing sites skipped in prefix comparison (default) ‘exclude’ - filter to sites with no missing data

Returns:

ehh – EHH values at each variant position from the first.

Return type:

ndarray, float, shape (n_variants,)

pg_gpu.selection.garud_h(matrix, population=None, missing_data='include')[source]

Compute Garud’s H1, H12, H123, and H2/H1 statistics.

Accepts either HaplotypeMatrix (uses haplotype frequencies) or GenotypeMatrix (uses diplotype frequencies) and dispatches automatically.

Parameters:

matrix (HaplotypeMatrix or GenotypeMatrix) – Haplotype or diploid genotype data.
population (str or list, optional) – Population name or list of sample indices. If None, uses all samples.
missing_data (str) – ‘include’ - treat missing as wildcard (compatible with any allele). ‘exclude’ - filter to sites with no missing data.

Returns:

h1 (float) – Sum of squared haplotype/diplotype frequencies.
h12 (float) – H12 statistic (top two combined).
h123 (float) – H123 statistic (top three combined).
h2_h1 (float) – H2/H1 ratio indicating sweep softness.

pg_gpu.selection.moving_garud_h(haplotype_matrix, size, start=0, stop=None, step=None, population=None, missing_data='include')[source]

Compute Garud’s H statistics in moving windows of variants.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
size (int) – Window size (number of variants).
start (int, optional) – Starting variant index.
stop (int, optional) – Stopping variant index.
step (int, optional) – Step between windows. Defaults to size (non-overlapping).
population (str or list, optional) – Population name or list of sample indices.
missing_data (str) – ‘include’ - treat missing as wildcard in pattern matching ‘exclude’ - filter to sites with no missing data

Returns:

h1 (ndarray, float)
h12 (ndarray, float)
h123 (ndarray, float)
h2_h1 (ndarray, float)

Standardization

pg_gpu.selection.standardize(score)[source]

Centre scores to zero mean and scale to unit variance.

Parameters:: score (array_like, float, shape (n_variants,)) – Raw scores (e.g., unstandardized IHS or nSL).
Returns:: Standardized scores.
Return type:: ndarray, float, shape (n_variants,)

pg_gpu.selection.standardize_by_allele_count(score, aac, bins=None, n_bins=None)[source]

Standardize scores within allele frequency bins.

Parameters:

score (array_like, float, shape (n_variants,)) – Raw scores (e.g., unstandardized IHS or nSL).
aac (array_like, int, shape (n_variants,)) – Alternate allele counts for each variant.
bins (array_like, int, optional) – Allele count bin edges. Overrides n_bins.
n_bins (int, optional) – Number of bins to create. Default: max(aac) // 2.

Returns:

score_standardized (ndarray, float, shape (n_variants,)) – Standardized scores.
bins (ndarray) – Bin edges used.

Site Frequency Spectrum

Single Population

pg_gpu.sfs.sfs(haplotype_matrix, population=None, missing_data='include')[source]

Compute the unfolded site frequency spectrum.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
population (str or list, optional) – Population name or sample indices.
missing_data (str) – ‘include’ - per-site n_valid; bins by actual DAC ‘exclude’ - only sites with no missing data

Returns:

Element k = number of variants with k derived alleles.

Return type:

ndarray, int64, shape (n_chromosomes + 1,)

pg_gpu.sfs.sfs_folded(haplotype_matrix, population=None, missing_data='include')[source]

Compute the folded site frequency spectrum (minor allele counts).

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
population (str or list, optional) – Population name or sample indices.
missing_data (str)

Returns:

Element k = number of variants with minor allele count k.

Return type:

ndarray, int64, shape (n_chromosomes // 2 + 1,)

pg_gpu.sfs.sfs_scaled(haplotype_matrix, population=None, missing_data='include')[source]

Compute the scaled unfolded site frequency spectrum.

Scaling: element k is multiplied by k, yielding a constant expectation under neutrality and constant population size.

Parameters:

haplotype_matrix (HaplotypeMatrix)
population (str or list, optional)
missing_data (str)

Return type:

ndarray, float64, shape (n_chromosomes + 1,)

pg_gpu.sfs.sfs_folded_scaled(haplotype_matrix, population=None, missing_data='include')[source]

Compute the scaled folded site frequency spectrum.

Scaling: element k is multiplied by k * (n - k) / n.

Parameters:

haplotype_matrix (HaplotypeMatrix)
population (str or list, optional)
missing_data (str)

Return type:

ndarray, float64, shape (n_chromosomes // 2 + 1,)

Joint (Two-Population)

pg_gpu.sfs.joint_sfs(haplotype_matrix, pop1, pop2, missing_data='include')[source]

Compute the joint site frequency spectrum between two populations.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop1 (str or list) – Population names or sample indices.
pop2 (str or list) – Population names or sample indices.
missing_data (str)

Returns:

Element [i, j] = number of variants with i derived alleles in pop1 and j derived alleles in pop2.

Return type:

ndarray, int64, shape (n1 + 1, n2 + 1)

pg_gpu.sfs.joint_sfs_folded(haplotype_matrix, pop1, pop2, missing_data='include')[source]

Compute the folded joint site frequency spectrum.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop1 (str or list)
pop2 (str or list)
missing_data (str)

Return type:

ndarray, int64, shape (n1 // 2 + 1, n2 // 2 + 1)

pg_gpu.sfs.joint_sfs_scaled(haplotype_matrix, pop1, pop2, missing_data='include')[source]

Compute the scaled joint site frequency spectrum.

Scaling: element [i, j] is multiplied by i * j.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop1 (str or list)
pop2 (str or list)
missing_data (str)

Return type:

ndarray, float64, shape (n1 + 1, n2 + 1)

pg_gpu.sfs.joint_sfs_folded_scaled(haplotype_matrix, pop1, pop2, missing_data='include')[source]

Compute the scaled folded joint site frequency spectrum.

Scaling: element [i, j] is multiplied by i * j * (n1 - i) * (n2 - j).

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop1 (str or list)
pop2 (str or list)
missing_data (str)

Return type:

ndarray, float64, shape (n1 // 2 + 1, n2 // 2 + 1)

pg_gpu.sfs.project_joint_sfs(haplotype_matrix, pop1, pop2, target_n1, target_n2, missing_data='include')[source]

Joint SFS projected to (target_n1+1, target_n2+1) via hypergeometric sampling.

Mathematically identical to P1 @ joint_sfs(...) @ P2.T with hypergeometric projection matrices P1, P2, but applied per-variant so the (n1+1, n2+1) full histogram is never materialized. That intermediate would be 80 GB at 100k haps per population; the projected output stays small regardless of source size. Use this whenever the source size is too large for joint_sfs to allocate its bincount.

Parameters:

haplotype_matrix (HaplotypeMatrix or StreamingHaplotypeMatrix)
pop1 (str or list) – Population names or explicit sample-index lists.
pop2 (str or list) – Population names or explicit sample-index lists.
target_n1 (int) – Projection targets; each must be <= the corresponding source population size.
target_n2 (int) – Projection targets; each must be <= the corresponding source population size.
missing_data (str)

Return type:

ndarray, float64, shape (target_n1 + 1, target_n2 + 1)

Scaling and Folding Utilities

pg_gpu.sfs.scale_sfs(s)[source]: Scale a site frequency spectrum by multiplying element k by k.

pg_gpu.sfs.scale_sfs_folded(s, n)[source]: Scale a folded SFS: element k multiplied by k * (n - k) / n.

pg_gpu.sfs.scale_joint_sfs(s)[source]: Scale a joint SFS: element [i, j] multiplied by i * j.

pg_gpu.sfs.scale_joint_sfs_folded(s, n1, n2)[source]: Scale a folded joint SFS: element [i,j] * i * j * (n1-i) * (n2-j).

pg_gpu.sfs.fold_sfs(s, n)[source]

Fold an unfolded SFS.

Parameters:

s (array_like) – Unfolded SFS.
n (int) – Number of chromosomes.

Returns:

Folded SFS.

Return type:

ndarray

pg_gpu.sfs.fold_joint_sfs(s, n1, n2)[source]

Fold a joint SFS.

Parameters:

s (array_like, shape (n1 + 1, n2 + 1))
n1 (int)
n2 (int)

Returns:

Folded joint SFS.

Return type:

ndarray

Resampling (Block Jackknife and Bootstrap)

General-purpose block-resampling estimators for calibrated standard error / CI on any scalar genome-wide statistic. Both accept pre-binned per-block values (single 1D array or tuple of arrays for ratio-of-sums statistics) and a callable statistic. See examples/sweep_tajimas_d_bootstrap.py for an end-to-end CI on Tajima’s D under a completed sweep.

pg_gpu.resampling.block_jackknife(values, statistic, *, weights=None)[source]

Block-jackknife standard error for a statistic over pre-binned blocks.

Implements the delete-1 block jackknife: for each block j, the statistic is recomputed on the remaining blocks; the spread of those leave-one-out estimates gives a calibrated SE. When weights is provided, the Busing et al. (1999) weighted delete-\(m_j\) formulation is used so unequal block sizes do not skew the SE.

Parameters:

values (ndarray or tuple of ndarrays) – Per-block values. Use a single 1D array for statistics of the form statistic(vb); use a tuple (num, den, ...) for ratio-of-sums statistics where multiple per-block arrays are consumed by statistic. All arrays in the tuple must have the same length.
statistic (callable) – statistic(*values) -> float. Called once on the full data and once per block with that block masked out.
weights (ndarray, optional) – Length-n array of block sizes (e.g., SNPs per block). When None (default), blocks are treated as equally weighted and the formula reduces to the unweighted block jackknife. See Notes.

Returns:

estimate (float) – Point estimate. Unweighted case: mean of leave-one-out values (matches the current admixture jackknife). Weighted case: mean of the Busing pseudo-values.
se (float) – Jackknife standard error.
per_iter (ndarray, shape (n,)) – Per-block leave-one-out statistic values.

Notes

For uniform block sizes the weighted formulation reduces exactly to sqrt(((n-1)/n) * sum((vj - mean(vj))**2)).

Weighted formulation (Busing 1999 eq. 4-5):

\[ \begin{align}\begin{aligned}h_j = N / m_j, \qquad \phi_j = h_j \hat\theta - (h_j - 1) \hat\theta_{-j}\\\tilde\theta = \frac{1}{g} \sum_j \phi_j, \qquad \mathrm{Var}(\tilde\theta) = \frac{1}{g} \sum_j \frac{(\phi_j - \tilde\theta)^2}{h_j - 1}\end{aligned}\end{align} \]

where g is the number of blocks, m_j is the size of block j, and N = sum(m_j).

References

Busing, F. M. T. A., Meijer, E., & Van der Leeden, R. (1999). Delete-m jackknife for unequal m. Statistics and Computing, 9, 3-8.

pg_gpu.resampling.block_bootstrap(values, statistic, *, n_replicates=1000, rng=None)[source]

Block-bootstrap distribution for a statistic over pre-binned blocks.

Resamples block indices with replacement n_replicates times and evaluates statistic on each resample. When values is a tuple, the same sampled indices are applied to every array — required for ratio-of-sums statistics where numerator and denominator share block indexing (e.g. normed Patterson F3, Patterson D). To bootstrap two independent arrays, call block_bootstrap on each separately.

Parameters:

values (ndarray or tuple of ndarrays) – Per-block values. Tuple entries must have the same length.
statistic (callable) – statistic(*values) -> float.
n_replicates (int, default 1000) – Number of bootstrap replicates.
rng (int, numpy.random.Generator, or None) – Seed or Generator for reproducibility. None uses fresh entropy.

Returns:

estimate (float) – Plug-in point estimate statistic(*values) on the full data. This follows Efron & Tibshirani (1993) §6.2: the bootstrap mean is a noisy estimate of the plug-in estimate, not a replacement for it.
se (float) – Bootstrap standard error, std(replicates, ddof=1).
replicates (ndarray, shape (n_replicates,)) – Bootstrap replicate values. Use e.g. np.quantile(replicates, [0.025, 0.975]) for a percentile 95% CI.

References

Efron, B., & Tibshirani, R. J. (1993). An Introduction to the Bootstrap. Chapman & Hall/CRC.

Admixture and F-Statistics

Per-Variant Statistics

pg_gpu.admixture.patterson_f2(haplotype_matrix, pop_a, pop_b, missing_data='include')[source]

Unbiased estimator for F2(A, B), the branch length between populations.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop_a (str or list) – Population names or sample indices.
pop_b (str or list) – Population names or sample indices.
missing_data (str) – ‘include’ - per-site n_valid for frequencies ‘exclude’ - filter to sites with no missing data

Returns:

f2 – Per-variant F2 estimates.

Return type:

ndarray, float64, shape (n_variants,)

pg_gpu.admixture.patterson_f3(haplotype_matrix, pop_c, pop_a, pop_b, missing_data='include')[source]

Unbiased estimator for F3(C; A, B), the three-population admixture test.

A significantly negative F3 indicates that population C is admixed between populations A and B.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop_c (str or list) – Test population.
pop_a (str or list) – Source populations.
pop_b (str or list) – Source populations.
missing_data (str) – ‘include’ - per-site n_valid ‘exclude’ - filter to sites with no missing data

Returns:

T (ndarray, float64, shape (n_variants,)) – Un-normalized F3 estimates per variant.
B (ndarray, float64, shape (n_variants,)) – Heterozygosity estimates (2 * h_hat) for population C.

pg_gpu.admixture.patterson_d(haplotype_matrix, pop_a, pop_b, pop_c, pop_d, missing_data='include')[source]

Unbiased estimator for D(A, B; C, D), the ABBA-BABA test.

Tests for admixture between (A or B) and (C or D).

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop_a (str or list) – Population names or sample indices.
pop_b (str or list) – Population names or sample indices.
pop_c (str or list) – Population names or sample indices.
pop_d (str or list) – Population names or sample indices.
missing_data (str) – ‘include’ - per-site n_valid ‘exclude’ - filter to sites with no missing data

Returns:

num (ndarray, float64, shape (n_variants,)) – Numerator (un-normalized F4 estimates).
den (ndarray, float64, shape (n_variants,)) – Denominator.

Windowed Statistics

pg_gpu.admixture.moving_patterson_f3(haplotype_matrix, pop_c, pop_a, pop_b, size, start=0, stop=None, step=None, normed=True, missing_data='include')[source]

Estimate F3(C; A, B) in moving windows.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop_c (str or list)
pop_a (str or list)
pop_b (str or list)
size (int) – Window size (number of variants).
start (int, optional)
stop (int, optional)
step (int, optional)
normed (bool) – If True, compute normalized F3* per window.
missing_data (str)

Returns:

f3

Return type:

ndarray, float64, shape (n_windows,)

pg_gpu.admixture.moving_patterson_d(haplotype_matrix, pop_a, pop_b, pop_c, pop_d, size, start=0, stop=None, step=None, missing_data='include')[source]

Estimate D(A, B; C, D) in moving windows.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop_a (str or list)
pop_b (str or list)
pop_c (str or list)
pop_d (str or list)
size (int)
start (int, optional)
stop (int, optional)
step (int, optional)
missing_data (str)

Returns:

d

Return type:

ndarray, float64, shape (n_windows,)

Block-Jackknife Averaged

pg_gpu.admixture.average_patterson_f3(haplotype_matrix, pop_c, pop_a, pop_b, blen, normed=True, missing_data='include')[source]

Estimate F3(C; A, B) with standard error via block-jackknife.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop_c (str or list)
pop_a (str or list)
pop_b (str or list)
blen (int) – Block size (number of variants).
normed (bool) – If True, compute normalized F3*.
missing_data (str)

Returns:

f3 (float) – Overall estimate.
se (float) – Standard error.
z (float) – Z-score.
vb (ndarray) – Per-block values.
vj (ndarray) – Jackknife resampled values.

pg_gpu.admixture.average_patterson_d(haplotype_matrix, pop_a, pop_b, pop_c, pop_d, blen, missing_data='include')[source]

Estimate D(A, B; C, D) with standard error via block-jackknife.

Parameters:

haplotype_matrix (HaplotypeMatrix)
pop_a (str or list)
pop_b (str or list)
pop_c (str or list)
pop_d (str or list)
blen (int) – Block size (number of variants).
missing_data (str)

Returns:

d (float) – Overall estimate.
se (float) – Standard error.
z (float) – Z-score.
vb (ndarray) – Per-block values.
vj (ndarray) – Jackknife resampled values.

Dimensionality Reduction and Distance

PCA

pg_gpu.decomposition.pca(haplotype_matrix, n_components=10, scaler='patterson', population=None, missing_data='include')[source]

Principal Component Analysis on haplotype data.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data. Rows are haplotypes, columns are variants.
n_components (int) – Number of principal components to compute.
scaler (str or None) – Scaling method before PCA: ‘patterson’ - scale by sqrt(p*(1-p)), Patterson et al. (2006) ‘standard’ - standardize to unit variance per variant None - center only (subtract mean)
population (str or list, optional) – Population subset to use.
missing_data (str) – ‘include’ - impute missing to per-site mean ‘exclude’ - filter sites with any missing

Returns:

coords (ndarray, float64, shape (n_samples, n_components)) – Sample coordinates in PC space.
explained_variance_ratio (ndarray, float64, shape (n_components,)) – Proportion of variance explained by each component.

pg_gpu.decomposition.randomized_pca(haplotype_matrix, n_components=10, scaler='patterson', population=None, n_iter=3, random_state=None, missing_data='include')[source]

Randomized PCA using truncated SVD approximation.

Faster than full PCA for large datasets where only a few components are needed.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
n_components (int) – Number of components.
scaler (str or None) – ‘patterson’, ‘standard’, or None.
population (str or list, optional) – Population subset.
n_iter (int) – Number of power iterations for accuracy.
random_state (int, optional) – Random seed for reproducibility.
missing_data (str)

Returns:

coords (ndarray, float64, shape (n_samples, n_components))
explained_variance_ratio (ndarray, float64, shape (n_components,))

Distance

pg_gpu.decomposition.pairwise_distance(haplotype_matrix, metric='euclidean', population=None, missing_data='include')[source]

Compute pairwise distances between samples.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
metric (str) – Distance metric. Supported on GPU: ‘euclidean’, ‘cityblock’, ‘sqeuclidean’. Falls back to scipy for other metrics.
population (str or list, optional) – Population subset.
missing_data (str) – ‘include’ - mask missing, normalize per pair ‘exclude’ - filter sites with any missing

Returns:

dist – Condensed distance matrix.

Return type:

ndarray, float64, shape (n_samples * (n_samples - 1) // 2,)

pg_gpu.decomposition.pcoa(dist, n_components=None)[source]

Principal Coordinate Analysis (classical MDS).

Parameters:

dist (array_like) – Pairwise distances in condensed form (from pairwise_distance) or as a square distance matrix.
n_components (int, optional) – Number of dimensions to return. If None, returns all dimensions with positive eigenvalues.

Returns:

coords (ndarray, float64, shape (n_samples, n_components)) – Sample coordinates.
explained_variance_ratio (ndarray, float64, shape (n_components,)) – Proportion of variance explained by each axis.

Local PCA (lostruct)

GPU port of the lostruct method (Li & Ralph 2019) for detecting genomic regions where population structure differs from the chromosome-wide pattern (inversions, introgression, low-recombination regions). .. Per-window eigendecomposition is batched with a single cp.linalg.eigh over a .. stacked (n_windows, n_samples, n_samples) tensor.

For most users, pg_gpu.decomposition.lostruct() is the one-and-done entry point: it bundles per-window local PCA, the between-window Frobenius distance, MDS on that distance, and corner / outlier detection into a single call and returns a LostructResult. The lower-level functions below are exposed for callers who want intermediate access or want to recompute one stage with different parameters.

pg_gpu.decomposition.lostruct(haplotype_matrix, *, window_params=None, k=2, scaler=None, missing_data='include', population=None, batch_size=None, window_size=None, step_size=None, window_type='snp', regions=None, jackknife=False, n_blocks=10, jackknife_aggregate='mean', npc=None, normalize='L1', pc_dist_weights=1, n_components=2, corner_prop=0.05, n_corners=3, random_state=None, engine='dense-eigh', tile_size=None, oversample=8, n_subspace_iter=1)[source]

End-to-end lostruct pipeline (Li & Ralph 2019).

Runs the four-step recipe in one call: per-window local PCA, pairwise Frobenius distance between window covariance reps, classical MDS on that distance, and corner / outlier detection in the MDS embedding.

Optionally also computes the block-within-window jackknife standard error of the per-window eigenvectors (Li & Ralph 2019, Appendix); when requested the SE is computed in the same window pass as the base eigendecomposition (shared matrix prep), and is exposed via LostructResult.jackknife_se.

Parameters:

haplotype_matrix (HaplotypeMatrix)
window_params (WindowParams, optional) – Pre-built window spec. Required when window_size is not given.
k (int) – Number of PCs retained per window in the local PCA step.
scaler (see local_pca().)
missing_data (see local_pca().)
population (see local_pca().)
batch_size (see local_pca().)
window_size (int | None) – Short-hand for constructing a WindowParams.
step_size (int | None) – Short-hand for constructing a WindowParams.
window_type (str) – Short-hand for constructing a WindowParams.
regions – Short-hand for constructing a WindowParams.
jackknife (bool) – If True, also compute the delete-1 block jackknife standard error of the per-window top-k eigenvectors and expose it via LostructResult.jackknife_se. Per-window matrix preparation is shared with the base eigendecomposition, so enabling this is much cheaper than calling local_pca_jackknife separately.
n_blocks (int) – Number of jackknife blocks per window when jackknife=True. Ignored otherwise.
jackknife_aggregate ({'mean', None}) – 'mean' returns shape (n_windows, k) – per-PC standard error averaged across samples (matches the R reference). None returns the full (n_windows, k, n_samples) tensor. Ignored when jackknife=False.
npc (int, optional) – Number of PCs used by pc_dist. Defaults to k.
normalize ({'L1', 'L2', None}) – Per-window eigenvalue normalization before distance computation.
pc_dist_weights (array_like or scalar) – Sample weights forwarded to pc_dist as w.
n_components (int) – Number of MDS dimensions to compute (default 2).
corner_prop (float) – Fraction of points per corner cluster (forwarded to corners).
n_corners (int) – Number of corner clusters to detect. Set to 0 (or any non-positive value) to skip corner detection and leave LostructResult.corner_indices as None.
random_state (int, optional) – RNG seed for the MEC tie-breaks inside corners.
engine (str)
tile_size (int | None)
oversample (int)
n_subspace_iter (int)

Returns:

Bundle of the four pipeline outputs. Each piece is also reachable via the underlying functions for users who want to recompute one stage with different parameters.

Return type:

LostructResult

Relatedness and Kinship

pg_gpu.relatedness.grm(genotype_matrix_or_haplotype_matrix, population=None, missing_data='include')[source]

Compute the Genetic Relationship Matrix (GRM).

The GRM measures genome-wide allele sharing between all pairs of individuals, standardized by allele frequencies. Equivalent to GCTA’s –make-grm or PLINK2’s –make-rel.

Formula:: A_ij = (1/M) * sum_k[(g_ik - 2*p_k)(g_jk - 2*p_k) / (2*p_k*(1-p_k))]

where g is genotype (0/1/2), p is allele frequency, M is number of sites.

Parameters:

genotype_matrix_or_haplotype_matrix (GenotypeMatrix or HaplotypeMatrix) – Diploid genotypes (0/1/2) or haplotype data (auto-converted to diploid by pairing consecutive haplotypes).
population (str or list, optional) – Population subset.
missing_data (str) – ‘include’ - use per-site allele frequencies from non-missing data. ‘exclude’ - restrict to sites with no missing data.

Returns:

Symmetric GRM. Diagonal entries are individual inbreeding coefficients + 1. Off-diagonal entries are pairwise relatedness.

Return type:

ndarray, float64, shape (n_individuals, n_individuals)

pg_gpu.relatedness.ibs(genotype_matrix_or_haplotype_matrix, population=None, missing_data='include')[source]

Compute pairwise IBS (Identity by State) proportions.

IBS measures the proportion of alleles shared identical by state between all pairs of individuals. Equivalent to PLINK’s –distance ibs matrix.

For each site, IBS between individuals i and j is IBS = (2 - abs(g_i - g_j)) / 2.

The matrix contains the mean IBS across all jointly-called sites.

Parameters:

genotype_matrix_or_haplotype_matrix (GenotypeMatrix or HaplotypeMatrix) – Diploid genotypes (0/1/2) or haplotype data.
population (str or list, optional) – Population subset.
missing_data (str) – ‘include’ - use jointly non-missing sites per pair. ‘exclude’ - restrict to sites with no missing data.

Returns:

Symmetric IBS matrix. Values in [0, 1] where 1 = identical. Diagonal is always 1.

Return type:

ndarray, float64, shape (n_individuals, n_individuals)

Windowed Statistics (GPU-Native)

The windowed_analysis() convenience function automatically routes through fused CUDA kernels for maximum performance. A single kernel launch processes all windows in parallel.

pg_gpu.windowed_analysis.windowed_analysis(haplotype_matrix, window_size=50000, step_size=None, statistics=['pi'], populations=None, missing_data='include', span_normalize=True, accessible_bed=None, chrom=None, **kwargs)[source]

Convenience function for windowed analysis.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Input haplotype data
window_size (int) – Window size in base pairs
step_size (int, optional) – Step size. If None, uses window_size
statistics (list) – Statistics to compute
populations (list, optional) – Population names
missing_data (str) – ‘include’ - Use all sites, calculate from available data per site ‘exclude’ - Only use sites with no missing data
span_normalize (bool) – True (default): auto-detect best denominator per window. False: return raw sums.
accessible_bed (str, optional) – Path to a BED file defining accessible/callable regions. If provided and the matrix has no mask, loads the mask.
**kwargs – Additional arguments passed to WindowedAnalyzer
chrom (str)

Returns:

Windowed statistics results

Return type:

pd.DataFrame

pg_gpu.windowed_analysis.windowed_statistics_fused(haplotype_matrix, bp_bins, statistics=('pi', 'theta_w', 'tajimas_d', 'segregating_sites', 'singletons'), population=None, pop1=None, pop2=None, per_base=True, is_accessible=None, _win_starts=None, _win_stops=None, missing_data='include', chrom=None)[source]

GPU-native windowed statistics using fused CUDA kernels.

One kernel launch processes ALL windows in parallel. Each thread block handles one window, with threads cooperatively reducing over variants. Reads the haplotype matrix once and computes all statistics simultaneously.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
bp_bins (array_like) – Window edges in base pairs. N+1 edges define N windows.
statistics (tuple of str) – Statistics to compute. Single-pop: ‘pi’, ‘theta_w’, ‘tajimas_d’, ‘segregating_sites’, ‘singletons’. Two-pop: ‘fst’, ‘dxy’.
population (str or list, optional) – Population for single-pop statistics.
pop1 (str or list, optional) – Populations for two-pop statistics.
pop2 (str or list, optional) – Populations for two-pop statistics.
per_base (bool) – Normalize by window size in base pairs.
is_accessible (array_like, optional) – Accessibility mask for per-base normalization.
chrom (str or int, optional) – Chromosome label emitted in the chrom output column.

Returns:

Maps column names to numpy arrays of shape (n_windows,). The first six columns are always chrom, start, end, center, n_variants, window_id, followed by one column per requested statistic.

Return type:

Supported Fused Windowed Statistics

Single-population (one kernel launch for all):

pi – nucleotide diversity
theta_w – Watterson’s theta
tajimas_d – Tajima’s D
segregating_sites – count of segregating sites
singletons – count of singletons

Two-population (one kernel launch for all):

fst – Hudson’s FST (ratio of averages)
fst_hudson – alias for fst
fst_wc – Weir-Cockerham FST (haploid)
dxy – absolute divergence
da – net divergence (Dxy - mean within-pop pi)

Selection scan statistics:

garud_h1, garud_h12, garud_h123, garud_h2h1 – Garud’s H statistics per window (prefix-sum hashing + shared-memory sort)
mean_nsl – mean nSL per window (per-site nSL + scatter binning)

Structure / dimensionality reduction:

local_pca – Li & Ralph (2019) local PCA. Vector-valued per window; dispatches to pg_gpu.decomposition.local_pca() and returns a LocalPCAResult instead of the scalar-stat DataFrame. Can be combined with scalar statistics in the same call; the scalar columns are merged onto result.windows.
local_pca_jackknife – delete-1 block jackknife standard error for local PCA eigenvectors. The jackknife is block-within-window: each window is partitioned into n_blocks SNP-blocks, the local-PCA eigendecomposition is recomputed with one block left out, and the per-window standard error is taken across the leave-one-out replicates. Sign alignment between replicates and per-window aggregation follow the procedure in Li & Ralph (2019), Appendix, which also discusses interpreting the resulting standard error as a signal-to-noise diagnostic for outlier windows. Returns a LocalPCAResult with the jackknife_se field populated. When requested alongside local_pca, per-window matrix preparation is shared. Accepts n_blocks and aggregate kwargs.

Legacy Functions

pg_gpu.windowed_analysis.windowed_statistics(haplotype_matrix, bp_bins, statistics=('pi', 'theta_w', 'tajimas_d'), population=None, pop1=None, pop2=None, per_base=True, is_accessible=None, chrom=None)[source]

GPU-native windowed statistics with no Python loop over windows.

Computes per-variant values once, then aggregates into windows using GPU scatter_add operations. Dramatically faster than per-window computation for large numbers of windows.

Parameters:

haplotype_matrix (HaplotypeMatrix) – Haplotype data.
bp_bins (array_like) – Window edges in base pairs. N+1 edges define N windows.
statistics (tuple of str) – Statistics to compute. Supported: Single-population: ‘pi’, ‘theta_w’, ‘tajimas_d’, ‘segregating_sites’, ‘singletons’, ‘het_expected’ Two-population: ‘fst’, ‘dxy’
population (str or list, optional) – Population for single-pop statistics.
pop1 (str or list, optional) – Populations for two-pop statistics (fst, dxy).
pop2 (str or list, optional) – Populations for two-pop statistics (fst, dxy).
per_base (bool) – If True, normalize by window size in base pairs.
is_accessible (array_like, optional) – Boolean accessibility mask for per-base normalization.
chrom (str or int, optional) – Chromosome label emitted in the chrom output column.

Returns:

Maps column names to numpy arrays of shape (n_windows,). The first six columns are chrom, start, end, center, n_variants, window_id, followed by one column per requested statistic.

Return type: