Local PCA / lostruct

Packaged script: examples/local_pca.py

Run it from the repo root:

pixi run python examples/local_pca.py
pixi run python examples/local_pca.py --window 300 --seed 7
pixi run python examples/local_pca.py --s 0.05 --end-freq 0.3

Background

Genome-wide PCA gives one picture of population structure averaged over the whole genome. Local PCA computes one PCA per genomic window and asks where along the genome the local picture deviates from the chromosome-wide pattern. Outlier regions tend to flag features whose local genealogy differs from the genome average:

large structural variants (inversions, segmental duplications);
recent introgression tracts;
low-recombination regions where lineage sorting has run further;
completed or partial selective sweeps.

Li & Ralph (2019) introduced this idea in the lostruct R package. pg_gpu ports the four-step pipeline to the GPU, with the per-window eigendecomposition fused into a single batched cupy.linalg.eigh call – the original chokepoint at chromosome scale.

What the script does

The packaged demo runs the full pipeline on simulated data where the “truth” is known and the lostruct outliers are expected to land on the sweep:

Simulate. A 10 Mb chromosome under msprime with SweepGenicSelection at the midpoint, configured as a partial sweep that reaches frequency 0.5 (so the local population structure differs from the rest of the chromosome but the variant itself is still segregating).
Per-window local PCA. Run windowed_analysis with the local_pca statistic, which does a single GPU pass over all windows and returns a LocalPCAResult containing the top-\(k\) eigenvalues and eigenvectors per window.
Window-by-window distance. pc_dist computes the Frobenius distance between windows’ low-rank covariance representations, producing an \(n_{\text{windows}} \times n_{\text{windows}}\) distance matrix.
MDS embedding. Classical MDS (pcoa) reduces the window-by-window distances to a 2-D embedding.
Cluster + outlier detection. 1-D k-means on MDS1 partitions windows into neutral / linked / sweep regimes, ordered by how far each cluster’s centroid sits from the chromosome-wide median. corners independently picks the extreme windows on the MDS embedding – a complementary, distribution-free approach used by the original lostruct paper.

The output figure (local_pca.png) plots the MDS scatter coloured by regime alongside MDS1 and Garud’s H12 along the chromosome. The sweep cluster on the MDS plot and the H12 peak both land on the sweep focal site, validating the pipeline against an independent selection summary.

Python API

The whole pipeline in one call:

from pg_gpu import HaplotypeMatrix, lostruct

res = lostruct(hm, window_size=500, step_size=250,
               window_type='snp', k=2)

res.local_pca      # LocalPCAResult: per-window eigvals / eigvecs / windows
res.distance       # (n_windows, n_windows) Frobenius distance
res.mds            # (n_windows, n_components) MDS coordinates
res.corner_indices # (n_per_corner, n_corners) outlier-window indices

Pass jackknife=True to also compute the block-within-window jackknife standard error of the per-window eigenvectors – useful as a signal-to-noise diagnostic for filtering noisy windows before downstream analysis. The standard error is computed in the same window pass as the base eigendecomposition, so this is much cheaper than calling local_pca_jackknife separately:

res = lostruct(hm, window_size=500, step_size=250,
               window_type='snp', k=2,
               jackknife=True, n_blocks=10)
res.jackknife_se   # (n_windows, k) per-PC standard error

If you want intermediate access – e.g. to swap in a different distance metric, or recompute MDS at a different n_components without redoing the per-window eigendecomposition – you can call the four constituent functions directly. lostruct() is a thin wrapper over them:

from pg_gpu import HaplotypeMatrix, windowed_analysis
from pg_gpu.decomposition import pc_dist, pcoa, corners

# `local_pca` statistic routes through the GPU-batched eigh pipeline
result = windowed_analysis(
    hm, window_size=500, step_size=250,
    statistics=['local_pca'], window_type='snp', k=2)

dist = pc_dist(result, npc=2, normalize='L1')
mds, _ = pcoa(dist, n_components=2)
extremes = corners(mds, prop=0.05, k=3)

# With jackknife standard error (shares per-window matrix prep with local_pca)
result = windowed_analysis(
    hm, window_size=500, step_size=250,
    statistics=['local_pca', 'local_pca_jackknife'],
    window_type='snp', k=2, n_blocks=10)
result.jackknife_se  # (n_windows, k) standard error array

Why it’s useful as a template

Local PCA is most useful for chromosomes-scale exploratory work – “are there any regions of this chromosome that look genuinely different from the rest?” – where you don’t have a specific candidate region in mind. A few directions to extend the template:

Windowing strategy. SNP-count windows (window_type='snp') give roughly equal-information windows; bp windows (window_type='bp') make it easier to interpret distances along the genome at the cost of variable per-window SNP counts. Both are routed through the same fused kernel.
Jackknife standard error. Add 'local_pca_jackknife' to the statistics list to get per-window standard error on the eigenvalues at almost no extra cost (the per-window matrix prep is shared). Useful for filtering noisy windows before downstream MDS.
Other distance metrics. pc_dist accepts normalize='L1' | 'L2' | None; sweeping the choice and watching the MDS embedding is a good way to check whether outliers are robust.