Examples ======== Short, copy-paste-ready snippets that each demonstrate one feature. For longer walk-throughs that pair simulated data with narrative explanation -- one page per packaged script under ``examples/`` -- see :doc:`tutorials`. Minimal Workflow ---------------- A small end-to-end example: load a VCF, assign populations, and compute one statistic from each major module. .. code-block:: python from pg_gpu import HaplotypeMatrix, ld_statistics, diversity, selection import numpy as np # Load VCF data h = HaplotypeMatrix.from_vcf("example.vcf") # Define populations h.sample_sets = { "CEU": [0, 1, 2, 3, 4], "YRI": [5, 6, 7, 8, 9] } # Diversity pi_ceu = diversity.pi(h, population="CEU") pi_yri = diversity.pi(h, population="YRI") # Divergence from pg_gpu import divergence fst = divergence.fst(h, "CEU", "YRI") # LD counts, n_valid = h.tally_gpu_haplotypes() r2 = ld_statistics.r_squared(counts, n_valid=n_valid) # Selection scans ihs_scores = selection.ihs(h, population="CEU") Two-Population LD ----------------- Computes the Hill-Robertson moments-LD statistics (:math:`D^2`, :math:`Dz`, :math:`\pi_2`, and the related two-locus moments) binned by physical distance, using all within-population and between-population sample pairs from ``pop1`` and ``pop2``. Returns a per-bin DataFrame with one column per statistic. .. code-block:: python # Memory-efficient chunked computation stats = h.compute_ld_statistics_gpu_two_pops( bp_bins=np.array([0, 1000, 5000, 10000, 50000]), pop1="CEU", pop2="YRI", chunk_size='auto' ) Batch Statistics ---------------- Fused operations may be used to compute multiple summary statistics in a single GPU pass, more efficiently than if they were computed independently. Pass an iterable of statistic names; the kernel reuses the intermediate haplotype-tally counts across all of them. .. code-block:: python # Multiple LD statistics in one call (single GPU launch) results = ld_statistics.compute_ld_statistics( counts, statistics=['dd', 'dz', 'pi2', 'r_squared'], ) Integration with moments ------------------------ The two-locus moments-LD statistics computed below are exactly the quantities used by `moments.LD `_ for demographic-model fitting (Ragsdale & Gravel 2019). Because they are pairwise across loci they are expensive to compute on a CPU -- which is exactly where GPU acceleration shines: pg_gpu can replace ``moments.LD.Parsing.compute_ld_statistics()`` as a drop-in backend that returns the same dictionary structure. .. code-block:: python from pg_gpu import HaplotypeMatrix h = HaplotypeMatrix.from_vcf("data.vcf") h.sample_sets = {"pop1": list(range(10))} # Compute LD statistics with GPU acceleration # chunk_size='auto' adapts to available GPU memory ld_stats = h.compute_ld_statistics_gpu_single_pop( bp_bins=[0, 1000, 5000], chunk_size='auto' ) For a complete end-to-end workflow using pg_gpu as a drop-in replacement for ``moments.LD.Parsing.compute_ld_statistics()``, see the :doc:`tutorials/moments_integration` tutorial. LD Pruning ---------- LD pruning thins a SNP set down to (approximately) unlinked variants -- useful before PCA, GRM construction, or any analysis whose theory assumes independent sites. ``locate_unlinked`` slides a window of ``size`` variants across the matrix and, within each window, drops variants whose pairwise :math:`r^2` exceeds ``threshold``; ``step`` controls the window stride. The companion ``windowed_r_squared`` returns the empirical LD-decay curve as a function of physical distance, which is useful for choosing a sensible threshold. .. code-block:: python # Find variants in approximate linkage equilibrium unlinked = h.locate_unlinked(size=100, step=20, threshold=0.1) h_pruned = h.get_subset(np.where(unlinked)[0]) print(f"Kept {np.sum(unlinked)} of {h.num_variants} variants") # Windowed r-squared decay (median r^2 in each bp bin) bins = np.arange(0, 100001, 1000) r2_decay, counts = h.windowed_r_squared(bins, percentile=50) See ``examples/ld_blocks.py`` for an end-to-end demo that uses pairwise :math:`r^2` to detect LD-block boundaries. Selection Scan Pipeline ----------------------- .. code-block:: python from pg_gpu import selection # iHS with standardization by allele count ihs_raw = selection.ihs(h) # Get allele counts for binned standardization dac = np.sum(h.haplotypes.get(), axis=0) ihs_std, bins = selection.standardize_by_allele_count(ihs_raw, dac) # Cross-population scans xpehh_scores = selection.xpehh(h, "CEU", "YRI") xpehh_std = selection.standardize(xpehh_scores) # Garud's H in sliding windows h1, h12, h123, h2_h1 = selection.moving_garud_h(h, size=200, step=50) SFS and Admixture ----------------- .. code-block:: python from pg_gpu import sfs, admixture # Joint SFS jsfs = sfs.joint_sfs(h, "CEU", "YRI") # Patterson's D with block-jackknife d, se, z, vb, vj = admixture.average_patterson_d( h, "popA", "popB", "popC", "popD", blen=100 ) print(f"D = {d:.4f}, standard error = {se:.4f}, Z = {z:.2f}") PBS (Population Branch Statistic) --------------------------------- .. code-block:: python from pg_gpu import divergence # Detect selection in pop1 relative to pop2 and pop3 pbs_vals = divergence.pbs(h, "pop1", "pop2", "pop3", window_size=50) # Un-normalized PBS pbs_raw = divergence.pbs(h, "pop1", "pop2", "pop3", window_size=50, normed=False) PCA and Dimensionality Reduction --------------------------------- .. code-block:: python from pg_gpu import decomposition # GPU-accelerated PCA (up to 56x faster than allel) coords, var_ratio = decomposition.pca(h, n_components=10, scaler='patterson') # Randomized PCA for very large datasets coords, var_ratio = decomposition.randomized_pca( h, n_components=10, random_state=42) # Pairwise genetic distance (batched for memory safety) dist = decomposition.pairwise_distance(h, metric='euclidean') # PCoA from distance matrix coords, var_ratio = decomposition.pcoa(dist, n_components=5) # Population-specific PCA coords_ceu, _ = decomposition.pca(h, n_components=5, population="CEU") GPU-Native Windowed Statistics ------------------------------ Compute multiple statistics across thousands of windows without Python loops: .. code-block:: python from pg_gpu.windowed_analysis import windowed_statistics import numpy as np # Define windows across 1Mb region bp_bins = np.arange(0, 1_000_001, 10_000) # 100 windows of 10kb # Compute 4 diversity stats in one GPU pass result = windowed_statistics( h, bp_bins, statistics=('pi', 'theta_w', 'tajimas_d', 'segregating_sites') ) # Results are numpy arrays, one value per window print(f"Mean pi: {np.nanmean(result['pi']):.6f}") print(f"Mean Tajima's D: {np.nanmean(result['tajimas_d']):.3f}") # Windowed FST between populations result = windowed_statistics( h, bp_bins, statistics=('pi', 'fst', 'dxy'), pop1='CEU', pop2='YRI' ) For end-to-end command-line workflows (``vcf_to_zarr.py``, ``genome_scan.py``), see :doc:`workflows`. Quality-Aware Filtering ----------------------- Load VCF / VCZ FORMAT and INFO arrays alongside the genotype matrix, build per-variant and per-genotype masks from them, and write the filtered survivors back as a clean VCZ that round-trips the same arrays. The full walk-through is in :doc:`tutorials/qc_fields`; the recipe in one block: .. code-block:: python from pg_gpu import HaplotypeMatrix # Bare VCF tags; pg_gpu auto-resolves INFO vs FORMAT from the source. hm = HaplotypeMatrix.from_vcf("cohort.vcf.gz", fields=["MQ", "QD", "GQ", "DP"]) # or, equivalently, from a bio2zarr-encoded VCZ: # hm = HaplotypeMatrix.from_zarr("cohort.vcz", # fields=["MQ", "QD", "GQ", "DP"]) # Shape disambiguates per-variant from per-genotype. hm.fields["MQ"] # (n_var,) — INFO hm.fields["GQ"] # (n_var, n_samples) — FORMAT # Plot directly: plt.hist(hm.fields["GQ"].ravel(), bins=50) # Mask-based filter. ``variants`` drops rows; ``genotypes`` sets # per-cell calls to -1 (both allele rows on the haplotype side). hm_clean = hm.filter( variants=(hm.fields["MQ"] >= 40.0) & (hm.fields["QD"] >= 2.0), genotypes=(hm.fields["GQ"] >= 20) & (hm.fields["DP"] >= 10), drop_all_missing=True, ) # Persist. The surviving QC arrays round-trip into the new store. hm_clean.to_zarr("cohort.clean.vcz", format="vcz", contig_name="chr1") Missing Data ------------ .. code-block:: python from pg_gpu import diversity # Check missing data summary summary = h.summarize_missing_data() print(f"Missing: {summary['missing_freq_overall']:.1%}") # Different strategies pi_include = diversity.pi(h, missing_data='include') pi_exclude = diversity.pi(h, missing_data='exclude') # LD statistics handle missing data automatically counts, n_valid = h.tally_gpu_haplotypes() dd_vals = ld_statistics.dd(counts, n_valid=n_valid)