Missing Data Handling ===================== pg_gpu provides support for missing data across all population genetics statistics. Missing data is encoded as ``-1`` in haplotype and genotype matrices. Missing Data Modes ------------------ Every function that operates on genetic data accepts a ``missing_data`` parameter with two options: **include** (default) Use all sites, computing statistics from observed data only. Each site uses its own sample size (``n_valid``). For haplotype identity comparisons (e.g., Garud's H), missing values are treated as wildcards compatible with any allele. **exclude** Drop entire sites that have any missing data in any sample. Only fully genotyped sites contribute to the result. Simulation testing under the standard neutral model confirms that ``include`` mode is unbiased under MCAR (missing completely at random) at missingness rates from 0--60% for pi, theta_w, theta_h, theta_l, Tajima's D, dxy, Hudson FST, da, and all SFS-based estimators. Basic Usage ----------- .. code-block:: python from pg_gpu import HaplotypeMatrix, diversity, divergence h = HaplotypeMatrix.from_vcf("data.vcf") # Default: per-site valid data pi = diversity.pi(h) # Conservative: only fully genotyped sites pi_excl = diversity.pi(h, missing_data='exclude') How It Works ------------ Consider a site with 100 haplotypes where 10 are missing (``-1``): * **include**: Computes allele frequencies from the 90 observed haplotypes. The site contributes to the result with ``n_valid = 90``. * **exclude**: The site is dropped entirely because it has missing data. For statistics that require a single sample size (e.g., Tajima's D variance formula), the harmonic mean of per-site sample sizes is used. Supported Statistics -------------------- Every public function accepts the ``missing_data`` parameter: .. list-table:: :header-rows: 1 :widths: 40 30 30 * - Function - include - exclude * - Diversity (pi, theta_w, theta_h, theta_l) - per-site n - filter sites * - Neutrality tests (tajimas_d, fay_wus_h, H*, E, DH) - per-site n, harmonic mean for variance - filter sites * - Divergence (dxy, fst_hudson, fst_weir_cockerham, da) - per-site n - filter sites * - SFS (sfs, joint_sfs, folded variants) - per-site n - filter sites * - Admixture (patterson_d, f2, f3) - per-site n - filter sites * - Selection scans (ihs, nsl, xpehh) - wildcard in shared-site length (SSL) - filter sites * - Haplotype stats (garud_h, haplotype_diversity) - wildcard match - filter sites * - Distance (pairwise_diffs, pca) - per-pair norm - filter sites * - LD (zns, omega) - per-site n - filter sites * - SFS estimators (FrequencySpectrum) - group by n - filter sites Multiallelic Sites ------------------ pg_gpu encodes haplotype alleles as integers: ``0`` = reference, ``1`` = first alternate, ``2`` = second alternate, etc. Missing data is ``-1``. For derived allele counting (used by all SFS-based statistics), any non-reference allele is treated as derived: allele values 1, 2, 3, etc. all contribute 1 to the derived allele count. This is the standard "reference vs any alternate" folding used by most population genetics tools. This means multiallelic VCF sites are handled correctly without filtering -- a site with REF=A, ALT=T,C where some samples carry the C allele will count all non-reference haplotypes as derived. If you need strictly biallelic data (e.g., for LD statistics or compatibility with other tools), use ``apply_biallelic_filter()``: .. code-block:: python h = h.apply_biallelic_filter() # keeps only 0/1 sites This is available on both ``HaplotypeMatrix`` and ``GenotypeMatrix``. Note that ``GenotypeMatrix.from_vcf()`` applies this filter automatically during loading. LD Estimator Choice ------------------- For LD statistics, ``zns()`` and ``omega()`` accept an ``estimator`` parameter independent of missing data handling: * ``estimator='auto'`` (default): use the unbiased ``sigma_d2`` estimator when the input is a ``HaplotypeMatrix``, and fall back to naive ``r2`` for pre-computed r² arrays or ``GenotypeMatrix`` inputs (where ``sigma_d2`` is not available). * ``estimator='r2'``: always use naive r-squared. Convenient when you want the "classical" estimator regardless of input type. * ``estimator='sigma_d2'``: always use the unbiased multinomial projection estimators (Ragsdale & Gravel 2019), computing :math:`\sigma_D^2 = D^2 / \pi_2` with falling-factorial corrections. More robust with small or variable sample sizes; requires a ``HaplotypeMatrix``. .. code-block:: python from pg_gpu import ld_statistics # Default 'auto': unbiased sigma_d2 when given a HaplotypeMatrix zns = ld_statistics.zns(h) # Force naive r² zns_naive = ld_statistics.zns(h, estimator='r2') # Pre-computed r² array: 'auto' falls back to 'r2' r2 = h.pairwise_r2() zns_from_array = ld_statistics.zns(r2) Haplotype Identity and Missing Data ------------------------------------ For statistics based on haplotype identity (Garud's H, haplotype diversity, haplotype count), missing values are treated as wildcards: two haplotypes match if they agree at all positions where both are non-missing. .. code-block:: python # Haplotypes [0, 1, 0, 1] and [0, -1, 0, 1] are considered identical # because they match at positions 0, 2, 3 (position 1 is missing) from pg_gpu import selection h1, h12, h123, h2_h1 = selection.garud_h(h) HaplotypeMatrix and GenotypeMatrix Utilities -------------------------------------------- The same utilities are available on both ``HaplotypeMatrix`` and ``GenotypeMatrix`` -- substitute ``gm`` for ``h`` below if you are working with diploid genotypes. .. code-block:: python # Detect and count missing data missing_per_site = h.count_missing(axis=0) missing_per_sample = h.count_missing(axis=1) # Filter by missing data frequency h_clean = h.filter_variants_by_missing(max_missing_freq=0.1) # Summary statistics summary = h.summarize_missing_data() Accessible Site Masks --------------------- Genome accessibility masks (from BED files) define which sites are callable in a sequencing experiment. This matters for normalization: if only 60% of a region is accessible, per-base diversity estimates should divide by the accessible base count, not the total span. pg_gpu integrates accessibility masks into ``HaplotypeMatrix`` and ``GenotypeMatrix`` as a non-destructive filter. When a mask is set, the ``haplotypes`` and ``positions`` properties transparently return only variants within accessible regions. The original data is preserved and the mask can be swapped or removed at any time. .. code-block:: python from pg_gpu import HaplotypeMatrix h = HaplotypeMatrix.from_vcf("data.vcf.gz") print(h.num_variants) # e.g. 50,000 # Attach a mask -- only variants in accessible regions are visible h.set_accessible_mask("accessibility.bed", chrom="3L") print(h.num_variants) # e.g. 42,000 (filtered) # n_total_sites is automatically set to the accessible base count print(h.n_total_sites) # e.g. 30,000,000 # Masks can also be set at load time h = HaplotypeMatrix.from_vcf("data.vcf.gz", accessible_bed="accessibility.bed") h = HaplotypeMatrix.from_zarr("data.zarr", region="3L:1-10000000", accessible_bed="accessibility.bed") # Remove the mask to restore all variants h.remove_accessible_mask() print(h.num_variants) # back to 50,000 **Key behaviors:** * ``set_accessible_mask()`` is non-destructive and returns ``self`` for chaining. It automatically sets ``n_total_sites`` to the count of accessible bases. * The mask covers the union of the BED's extent and the matrix's ``[chrom_start, chrom_end]`` range, so BED accessible bases that fall outside the variant range (common for variants-only VCFs) are not silently dropped. ``n_total_sites`` always equals the full BED accessible-base count. * ``get_span('accessible')`` returns the accessible base count, used for per-base normalization. This matches the denominator used by ``allel.sequence_diversity(is_accessible=...)``. * The mask stays on CPU and uses a lazy prefix-sum for O(1) range queries, so windowed analysis over many windows is efficient. Site Count Properties --------------------- After a mask is attached (or ``include_invariant=True`` was passed at load time), three properties decompose the analysis universe: * ``n_callable_sites`` -- alias for ``n_total_sites``; the BED span when masked, or the matrix length if loaded with ``include_invariant=True``. * ``n_segregating_sites`` -- polymorphic sites in the matrix (``0 < derived_count < n_valid``). * ``n_invariant_sites`` -- ``n_callable_sites - n_segregating_sites``; may include implied invariants outside the matrix when the VCF was variants-only. These satisfy ``n_callable_sites == n_segregating_sites + n_invariant_sites``. Note that ``num_variants`` is the *physical* matrix row count and is generally not equal to either ``n_segregating_sites`` (which excludes monomorphic rows) or ``n_callable_sites`` (which can include implied invariants). .. code-block:: python h.set_accessible_mask("accessibility.bed", chrom="3L") h.n_callable_sites # e.g. 30,000,000 (BED total) h.n_segregating_sites # e.g. 1,200,000 (polymorphic in matrix) h.n_invariant_sites # e.g. 28,800,000 (callable - segregating) h.num_variants # whatever rows are physically present * ``get_subset()`` and ``get_population_matrix()`` read from the filtered properties, so child matrices automatically contain only accessible variants. **Interaction with missing data modes:** Accessibility masks and missing data modes are complementary. The mask controls *which variants are visible* (a site-level filter based on genome quality), while ``missing_data`` controls *how missing genotypes at visible sites are handled* (a sample-level concern). Both can be active simultaneously: .. code-block:: python h = HaplotypeMatrix.from_vcf("data.vcf.gz", accessible_bed="mask.bed") pi = diversity.pi(h) # uses accessible mask + per-site valid counts Span Normalization ------------------ Rate estimators (pi, theta_w, dxy, etc.) accept a ``span_normalize`` parameter that controls *how results are expressed*. This is orthogonal to missing data handling. ``span_normalize`` accepts ``True`` or ``False``: * ``True`` (default): auto-detect the best denominator. If an accessible mask is set, divides by ``mask.total_accessible`` (the BED span). Otherwise divides by the genomic span (1-based inclusive, ``chrom_end - chrom_start + 1``). * ``False``: return raw sum (used internally by composite statistics like Tajima's D, and by advanced users who need custom normalization). .. code-block:: python # Per base pair (default -- auto-detects best denominator) pi = diversity.pi(h) # With accessible mask: auto uses accessible bases h.set_accessible_mask("mask.bed", chrom="3L") pi = diversity.pi(h) # per accessible base, automatically # Raw sum (no normalization) pi_raw = diversity.pi(h, span_normalize=False) Test statistics (Tajima's D, Fay-Wu's H, FST) do not accept ``span_normalize`` -- they are dimensionless by definition. SFS Projection -------------- When samples are missing at different sites the per-site sample size varies, which complicates statistics that are sensitive to sample size (notably theta estimators). *Hypergeometric projection* re-expresses an observed SFS as the SFS that would have been seen if every site had been called in exactly ``target_n`` randomly chosen samples. The projection is unbiased and lets you build a single, comparable SFS from data with mixed sample sizes -- and to compare populations that were sequenced to different depths. ``FrequencySpectrum`` supports it following Marth et al. (2004) / the implementation used in :math:`\partial a \partial i` (Gutenkunst et al. 2009): .. code-block:: python from pg_gpu.diversity import FrequencySpectrum fs = FrequencySpectrum(h, population="pop1") fs_proj = fs.project(target_n=50) # project down to n=50 pi_proj = fs_proj.theta("pi") # any theta on the projected SFS Component-Level Access ---------------------- For advanced use cases (e.g., custom windowed aggregation), raw pairwise difference and comparison counts are available via separate functions: .. code-block:: python from pg_gpu.diversity import pi_components from pg_gpu.divergence import dxy_components # Within-population: (total_diffs, total_comps, total_missing, n_sites) diffs, comps, missing, n = pi_components(h.haplotypes) pi_manual = diffs / comps # Between-population: (total_diffs, total_comps, n_sites) pop1_haps = h.haplotypes[h.sample_sets['pop1']] pop2_haps = h.haplotypes[h.sample_sets['pop2']] diffs, comps, n = dxy_components(pop1_haps, pop2_haps) dxy_manual = diffs / comps Best Practices -------------- 1. **Use include mode** (default) for most analyses. It uses all available data at each site and is unbiased under MCAR. 2. **Use exclude mode** when you need all samples to be comparable at exactly the same sites (e.g., for certain LD analyses or when missingness is non-random). 3. **Use FrequencySpectrum** for theta estimators and neutrality tests when you want proper handling of variable sample sizes via group-by-n or SFS projection. 4. **The default ``estimator='auto'`` for ``zns()`` and ``omega()`` uses the unbiased ``sigma_d2`` estimator on ``HaplotypeMatrix`` inputs**, which corrects the upward bias inherent in naive :math:`r^2` -- particularly important with small or variable sample sizes. Pass ``estimator='r2'`` explicitly if you want the classical naive estimator instead (e.g. for backward comparison with a previous analysis). 5. **Check missingness patterns** before analysis with ``summarize_missing_data()`` and consider filtering sites with very high missing rates. Example Workflow ---------------- .. code-block:: python from pg_gpu import HaplotypeMatrix, diversity, divergence from pg_gpu.windowed_analysis import windowed_analysis # Load data with accessible mask h = HaplotypeMatrix.from_zarr("data.zarr", region="3L:1-10000000", accessible_bed="accessibility.bed") # Inspect missing data summary = h.summarize_missing_data() print(f"Missing: {summary['missing_freq_overall']:.1%}") # Filter extreme missingness h = h.filter_variants_by_missing(max_missing_freq=0.5) # Scalar statistics (auto-normalized by accessible bases) pi = diversity.pi(h, population="pop1") tajd = diversity.tajimas_d(h, population="pop1") dxy = divergence.dxy(h, 'pop1', 'pop2') fst = divergence.fst_hudson(h, 'pop1', 'pop2') # Windowed analysis (accessible mask propagates per-window) df = windowed_analysis(h, window_size=50_000, statistics=['pi', 'theta_w', 'tajimas_d', 'fst', 'dxy'], populations=['pop1', 'pop2'])