Quick Start Guide ================= Basic Usage ----------- Loading Data ~~~~~~~~~~~~ The primary container, ``HaplotypeMatrix``, holds *phased* haplotypes (one row per haplotype, two rows per diploid sample). When loaded from a diploid VCF, the two haploid copies of each sample become two rows; the loader does not check whether your data are actually phased, so if the VCF contains unphased calls the resulting matrix is best thought of as a random-phasing of the genotypes. For unphased / diploid-aware analyses, use ``GenotypeMatrix`` (see "Phased to Unphased" below). .. code-block:: python from pg_gpu import HaplotypeMatrix # From VCF file (sample names are stored automatically). Diploid VCFs # are loaded as 2*n_samples haplotypes -- treated as phased. h = HaplotypeMatrix.from_vcf("data.vcf.gz") # Load a specific genomic region (requires tabix index) h = HaplotypeMatrix.from_vcf("data.vcf.gz", region="chr1:1000000-2000000") # Load a subset of samples h = HaplotypeMatrix.from_vcf("data.vcf.gz", samples=["ind1", "ind2", "ind3"]) # From NumPy array (n_haplotypes, n_variants) import numpy as np hap = np.array([[0, 1, 0], [1, 0, 1]], dtype=np.int8) positions = np.array([100, 200, 300]) h = HaplotypeMatrix(hap, positions, chrom_start=0, chrom_end=400) Populations can be assigned from a tab-delimited file (sample, pop) or manually via ``sample_sets``: .. code-block:: python # From a population file (uses stored sample names) h.load_pop_file("pops.txt") # Or manually h.sample_sets = { "pop1": [0, 1, 2, 3], "pop2": [4, 5, 6, 7] } Working with Zarr ~~~~~~~~~~~~~~~~~~ pg_gpu supports the VCZ zarr format (via `bio2zarr `_) as well as legacy scikit-allel zarr stores. The format is auto-detected on read. **Converting VCF to Zarr** (recommended for large datasets): .. code-block:: python # Convert VCF to VCZ zarr (uses bio2zarr, supports multicore) HaplotypeMatrix.vcf_to_zarr( "data.vcf.gz", "data.zarr", worker_processes=8, # parallel conversion ) # Load from zarr (much faster than VCF for repeated access) h = HaplotypeMatrix.from_zarr("data.zarr") **Region queries and multi-chromosome stores:** .. code-block:: python # Region queries work on all zarr layouts h = HaplotypeMatrix.from_zarr("data.zarr", region="chr1:1000000-2000000") # Chromosome-grouped stores (e.g., Ag1000G) require a region h = HaplotypeMatrix.from_zarr("ag1000g.zarr", region="3L:1-10000000") **Quick save/reload** (for intermediate results): .. code-block:: python # Save to VCZ format (default) h = HaplotypeMatrix.from_vcf("data.vcf.gz") h.to_zarr("data.zarr", contig_name="chr1") # Or legacy scikit-allel format h.to_zarr("data.zarr", format="scikit-allel") LD Statistics ~~~~~~~~~~~~~ .. code-block:: python from pg_gpu import ld_statistics # Pairwise r-squared matrix r2 = h.pairwise_r2() # Haplotype count-based statistics counts, n_valid = h.tally_gpu_haplotypes() dd_vals = ld_statistics.dd(counts, n_valid=n_valid) r_vals = ld_statistics.r(counts, n_valid=n_valid) r2_vals = ld_statistics.r_squared(counts, n_valid=n_valid) # LD pruning unlinked = h.locate_unlinked(size=100, step=20, threshold=0.1) # Windowed r-squared result, pair_counts = h.windowed_r_squared( bp_bins=[0, 1000, 5000, 10000] ) # Two-population LD stats = h.compute_ld_statistics_gpu_two_pops( bp_bins=[0, 1000, 5000, 10000], pop1="pop1", pop2="pop2" ) Diversity Statistics ~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from pg_gpu import diversity # Basic diversity pi_val = diversity.pi(h, span_normalize=True) theta = diversity.theta_w(h, span_normalize=True) tajd = diversity.tajimas_d(h) # Heterozygosity he = diversity.heterozygosity_expected(h) ho = diversity.heterozygosity_observed(h) f = diversity.inbreeding_coefficient(h) # Allele frequency spectrum afs = diversity.allele_frequency_spectrum(h) # Population-specific pi_pop1 = diversity.pi(h, population='pop1') Theta Estimators and Neutrality Tests ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ pg_gpu ships eight theta estimators and five neutrality tests. They can be called individually or batched (a single GPU pass for all requested statistics): .. code-block:: python from pg_gpu import diversity # Individual statistics diversity.pi(h, population="pop1") diversity.theta_w(h, population="pop1") diversity.tajimas_d(h, population="pop1") diversity.fay_wus_h(h, population="pop1") # Batched (single GPU pass for all) stats = diversity.diversity_stats(h, population="pop1", statistics=['pi', 'theta_w', 'theta_h', 'theta_l', 'tajimas_d']) Available estimators: ``pi``, ``theta_w``, ``theta_h``, ``theta_l``, ``eta1``, ``eta1_star``, ``minus_eta1``, ``minus_eta1_star``. Available tests: ``tajimas_d``, ``fay_wus_h``, ``normalized_fay_wus_h``, ``zeng_e``, ``zeng_dh``. Divergence Statistics ~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from pg_gpu import divergence fst = divergence.fst(h, 'pop1', 'pop2') dxy_val = divergence.dxy(h, 'pop1', 'pop2') # Population Branch Statistic (3 populations) pbs_vals = divergence.pbs(h, 'pop1', 'pop2', 'pop3', window_size=50) # Distance-based two-population statistics snn = divergence.snn(h, 'pop1', 'pop2') # Hudson (2000) dmin = divergence.dxy_min(h, 'pop1', 'pop2') # Geneva et al. (2015) g = divergence.gmin(h, 'pop1', 'pop2') # Geneva et al. (2015) dd1, dd2 = divergence.dd(h, 'pop1', 'pop2') # Schrider et al. (2018) r1, r2 = divergence.dd_rank(h, 'pop1', 'pop2') # Schrider et al. (2018) zx_val = divergence.zx(h, 'pop1', 'pop2') # Schrider et al. (2018) # Efficient: pre-compute distance matrix once, pass to multiple stats dm = divergence.pairwise_distance_matrix(h, 'pop1', 'pop2') snn = divergence.snn(h, 'pop1', 'pop2', distance_matrices=dm) g = divergence.gmin(h, 'pop1', 'pop2', distance_matrices=dm) dd1, dd2 = divergence.dd(h, 'pop1', 'pop2', distance_matrices=dm) # Or compute all distance-based stats in one call stats = divergence.distance_based_stats(h, 'pop1', 'pop2') Selection Scans ~~~~~~~~~~~~~~~ .. code-block:: python from pg_gpu import selection # Integrated haplotype score ihs_scores = selection.ihs(h) ihs_std = selection.standardize(ihs_scores) # Cross-population EHH xpehh_scores = selection.xpehh(h, 'pop1', 'pop2') # nSL (no distance weighting) nsl_scores = selection.nsl(h) # Garud's H statistics h1, h12, h123, h2_h1 = selection.garud_h(h) # EHH decay ehh = selection.ehh_decay(h) Site Frequency Spectrum ~~~~~~~~~~~~~~~~~~~~~~~ .. code-block:: python from pg_gpu import sfs # Unfolded and folded SFS s = sfs.sfs(h) s_folded = sfs.sfs_folded(h) # Scaled SFS s_scaled = sfs.sfs_scaled(h) # Joint SFS (two populations) j = sfs.joint_sfs(h, 'pop1', 'pop2') j_folded = sfs.joint_sfs_folded(h, 'pop1', 'pop2') Admixture / F-Statistics ~~~~~~~~~~~~~~~~~~~~~~~~ Patterson's F-statistics are *ratio-of-sums* statistics: each call returns the per-site numerator and denominator separately, so that resampling (jackknife / bootstrap) can be applied across sites with the correct covariance structure. Wrappers like ``average_patterson_d`` do the combine-and-jackknife step for you. .. code-block:: python from pg_gpu import admixture # Patterson's F2 (branch length) f2 = admixture.patterson_f2(h, 'pop1', 'pop2') # Patterson's F3 (admixture test): per-site numerator and denominator. # The point estimate of F3* is sum(numer) / sum(denom). f3_numer, f3_denom = admixture.patterson_f3( h, 'test_pop', 'source1', 'source2') f3_star = np.nansum(f3_numer) / np.nansum(f3_denom) # Patterson's D (ABBA-BABA): per-site numerator and denominator. d_numer, d_denom = admixture.patterson_d( h, 'popA', 'popB', 'popC', 'popD') # Block-jackknife wrapper for Patterson's D # d_hat point estimate # d_se jackknife standard error # d_z z-score # v_block per-block leave-one-out values # v_jack jackknife pseudo-values d_hat, d_se, d_z, v_block, v_jack = admixture.average_patterson_d( h, 'popA', 'popB', 'popC', 'popD', blen=100 ) Resampling (Block Jackknife and Bootstrap) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ``block_jackknife`` and ``block_bootstrap`` are general-purpose resampling estimators: you bin your data into blocks (typically genomic windows or chromosome chunks) and pass an array of per-block values plus a ``statistic`` callable that maps blocks to a scalar. They return a point estimate, a standard error, and either jackknife pseudo-values or bootstrap replicates. For *ratio-of-sums* statistics (Patterson's F-statistics, FST, etc.), the numerator and denominator must be resampled with the same block assignments. Pass a tuple ``(numer_blocks, denom_blocks)`` and write ``statistic`` to consume both arrays. .. code-block:: python import numpy as np from pg_gpu import (block_jackknife, block_bootstrap, windowed_analysis, admixture) # 1) Jackknife standard error of a windowed mean. windowed_analysis gives per-window # Tajima's D; treat each window as a block and take the mean. df = windowed_analysis(h, window_size=50_000, step_size=25_000, statistics=['tajimas_d'], window_type='bp') tajd = df['tajimas_d'].to_numpy() tajd = tajd[np.isfinite(tajd)] jack_est, jack_se, _ = block_jackknife(tajd, statistic=np.mean) # 2) Bootstrap 95% CI on the same mean. boot_est, boot_se, reps = block_bootstrap( tajd, statistic=np.mean, n_replicates=2000, rng=0) lo, hi = np.quantile(reps, [0.025, 0.975]) # 3) Ratio-of-sums pattern: bin Patterson's D numer/denom into blocks # (here, equal-size chunks of consecutive sites) and bootstrap the # ratio of sums. d_numer, d_denom = admixture.patterson_d( h, 'popA', 'popB', 'popC', 'popD') block_size = 1000 n_blocks = len(d_numer) // block_size numer_blocks = d_numer[:n_blocks * block_size].reshape(n_blocks, -1).sum(1) denom_blocks = d_denom[:n_blocks * block_size].reshape(n_blocks, -1).sum(1) ratio_est, ratio_se, reps = block_bootstrap( (numer_blocks, denom_blocks), statistic=lambda numer, denom: np.sum(numer) / np.sum(denom), n_replicates=2000, rng=0, ) PCA and Distance ~~~~~~~~~~~~~~~~ .. code-block:: python from pg_gpu import decomposition # PCA (GPU-accelerated SVD) coords, var_ratio = decomposition.pca(h, n_components=10) # Randomized PCA (faster for large datasets) coords, var_ratio = decomposition.randomized_pca(h, n_components=10) # Pairwise genetic distance dist = decomposition.pairwise_distance(h, metric='euclidean') # PCoA from distance matrix coords, var_ratio = decomposition.pcoa(dist) Local PCA / lostruct ~~~~~~~~~~~~~~~~~~~~ GPU port of Li & Ralph's (2019) ``lostruct`` method for detecting genomic regions where population structure differs from the chromosome-wide pattern (inversions, introgression, low-recombination regions). All four steps of the pipeline are available: .. code-block:: python from pg_gpu.decomposition import local_pca, pc_dist, corners, pcoa # 1. Per-window top-k eigendecomposition (GPU-batched eigh) result = local_pca(h, window_size=500, window_type='snp', k=2) # result is a LocalPCAResult with: # result.windows -> DataFrame (chrom, start, end, center, n_variants, window_id) # result.eigvals -> (n_windows, k) # result.eigvecs -> (n_windows, k, n_samples) # result.sumsq -> (n_windows,) # 2. Frobenius distance between windows' low-rank covariance reps d = pc_dist(result, npc=2, normalize='L1') # 'L1' | 'L2' | None # 3. MDS on the window-by-window distance matrix coords, _ = pcoa(d, n_components=2) # 4. Extreme-cluster detection in MDS space (Welzl MEC) corner_idx = corners(coords, prop=0.05, k=3) # Optional: delete-1 block jackknife standard error of local PCs (standalone) from pg_gpu.decomposition import local_pca_jackknife se = local_pca_jackknife(h, window_size=500, window_type='snp', k=2, n_blocks=10) # (n_windows, k) Both ``local_pca`` and ``local_pca_jackknife`` can be requested through ``windowed_analysis``. When both are requested together, per-window matrix preparation is shared for efficiency. Scalar statistics can be mixed in the same call; the scalar columns are merged onto ``result.windows``. .. code-block:: python from pg_gpu import windowed_analysis result = windowed_analysis(h, window_size=500, window_type='snp', statistics=['local_pca', 'local_pca_jackknife'], k=2, n_blocks=10) result.jackknife_se # (n_windows, k) For an end-to-end example with a simulated partial sweep, see ``examples/local_pca.py``. Windowed Statistics (GPU-Native) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Compute statistics across all genomic windows in a single GPU pass via fused CUDA kernels. The ``windowed_analysis()`` convenience function automatically routes through fused kernels when possible. .. code-block:: python from pg_gpu import windowed_analysis # Single-population diversity stats (fused: single kernel launch) results = windowed_analysis( h, window_size=100_000, statistics=['pi', 'theta_w', 'tajimas_d'] ) # results is a DataFrame with columns: chrom, start, end, center, # n_variants, window_id, pi, theta_w, tajimas_d # Two-population divergence stats results = windowed_analysis( h, window_size=100_000, statistics=['fst', 'fst_wc', 'dxy', 'da'], populations=['pop1', 'pop2'] ) # Everything at once -- diversity, divergence, and selection scans results = windowed_analysis( h, window_size=100_000, statistics=['pi', 'theta_w', 'tajimas_d', 'segregating_sites', 'fst', 'fst_wc', 'dxy', 'da', 'garud_h1', 'garud_h12', 'mean_nsl'], populations=['pop1', 'pop2'] ) Supported fused windowed statistics: - **Single-pop**: ``pi``, ``theta_w``, ``tajimas_d``, ``segregating_sites``, ``singletons`` - **Two-pop**: ``fst``, ``fst_hudson``, ``fst_wc``, ``dxy``, ``da`` - **Selection**: ``garud_h1``, ``garud_h12``, ``garud_h123``, ``garud_h2h1``, ``mean_nsl`` - **Structure**: ``local_pca`` (lostruct); returns a ``LocalPCAResult`` rather than a scalar-stat DataFrame. See the `Local PCA / lostruct`_ section. For advanced usage with custom bin edges, use ``windowed_statistics_fused()`` directly: .. code-block:: python from pg_gpu.windowed_analysis import windowed_statistics_fused result = windowed_statistics_fused( h, bp_bins=[0, 10000, 20000, 30000, 40000, 50000], statistics=('pi', 'theta_w', 'tajimas_d'), ) Phased to Unphased ~~~~~~~~~~~~~~~~~~ Use ``GenotypeMatrix`` for unphased diploid genotypes (entries 0 / 1 / 2 counting alt alleles). Many functions auto-dispatch on input type, so the same call works on either container: .. code-block:: python from pg_gpu import GenotypeMatrix # Convert from haploid (pairs consecutive haplotypes) gm = GenotypeMatrix.from_haplotype_matrix(h) # Same functions work on both types h1, h12, h123, h2h1 = selection.garud_h(gm) # uses diplotype frequencies z = ld_statistics.zns(gm) # uses genotype correlation hist, edges = diversity.daf_histogram(gm) # DAF = sum / (2*n_ind) # Distance distribution moments var, skew, kurt = distance_stats.dist_moments(gm) Custom Theta via FrequencySpectrum ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ For custom weight functions or hypergeometric SFS projection, build a ``FrequencySpectrum``. Any new theta estimator that can be written as a weighted sum over the SFS is one ``fs.theta(weight_fn)`` call away. .. code-block:: python from pg_gpu.diversity import FrequencySpectrum fs = FrequencySpectrum(h, population="pop1") fs.theta(my_custom_weight_fn) fs.project(target_n=50).theta("pi") Missing Data and Accessibility Masks ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Missing genotypes are encoded as ``-1`` and handled automatically. Two modes control how sites with missing data contribute to per-site statistics: .. code-block:: python # Missing data modes pi_include = diversity.pi(h, missing_data='include') # default: per-site valid data pi_exclude = diversity.pi(h, missing_data='exclude') # only fully genotyped sites A complementary concern is *site accessibility*: in real data, only some bases are reliably callable (high-coverage, non-repeat, etc.). Without an accessibility mask, span-normalized statistics divide by total genomic span and look spuriously low in low-callability regions. With a mask attached, both per-site sums and the denominator are restricted to accessible bases, so statistics report the true rate over the regions you can actually see. .. code-block:: python import numpy as np from pg_gpu import HaplotypeMatrix, diversity, windowed_analysis # Simulate or load a chromosome where part of the sequence is hard # to call. Build a boolean mask: True = accessible, False = excluded. accessible = np.ones(seq_length, dtype=bool) accessible[exon_start:exon_end] = False # mark a region inaccessible hm_unmasked = HaplotypeMatrix.from_ts(ts) hm_masked = HaplotypeMatrix.from_ts(ts).set_accessible_mask(accessible) # Genome-wide pi: unmasked divides by full sequence length, # masked divides by accessible bases only. diversity.pi(hm_unmasked) diversity.pi(hm_masked) # Windowed pi: fully-inaccessible windows return NaN so they don't # show up as a misleading dip; partially-accessible windows are # normalized by their accessible base count. df_unmasked = windowed_analysis(hm_unmasked, window_size=10_000, statistics=["pi"]) df_masked = windowed_analysis(hm_masked, window_size=10_000, statistics=["pi"]) For an end-to-end demo (with simulated data and a side-by-side plot) see ``examples/accessibility_mask.py``. See :doc:`missing_data` for masks-and-modes details.