Local PCA / lostruct
Packaged script: examples/local_pca.py
Run it from the repo root:
pixi run python examples/local_pca.py
pixi run python examples/local_pca.py --window 300 --seed 7
pixi run python examples/local_pca.py --s 0.05 --end-freq 0.3
Background
Genome-wide PCA gives one picture of population structure averaged over the whole genome. Local PCA computes one PCA per genomic window and asks where along the genome the local picture deviates from the chromosome-wide pattern. Outlier regions tend to flag features whose local genealogy differs from the genome average:
large structural variants (inversions, segmental duplications);
recent introgression tracts;
low-recombination regions where lineage sorting has run further;
completed or partial selective sweeps.
Li & Ralph (2019)
introduced this idea in the lostruct R package. pg_gpu ports
the four-step pipeline to the GPU, with the per-window
eigendecomposition fused into a single batched cupy.linalg.eigh
call – the original chokepoint at chromosome scale.
What the script does
The packaged demo runs the full pipeline on simulated data where the “truth” is known and the lostruct outliers are expected to land on the sweep:
Simulate. A 10 Mb chromosome under
msprimewithSweepGenicSelectionat the midpoint, configured as a partial sweep that reaches frequency 0.5 (so the local population structure differs from the rest of the chromosome but the variant itself is still segregating).Per-window local PCA. Run
windowed_analysiswith thelocal_pcastatistic, which does a single GPU pass over all windows and returns aLocalPCAResultcontaining the top-\(k\) eigenvalues and eigenvectors per window.Window-by-window distance.
pc_distcomputes the Frobenius distance between windows’ low-rank covariance representations, producing an \(n_{\text{windows}} \times n_{\text{windows}}\) distance matrix.MDS embedding. Classical MDS (
pcoa) reduces the window-by-window distances to a 2-D embedding.Cluster + outlier detection. 1-D k-means on MDS1 partitions windows into neutral / linked / sweep regimes, ordered by how far each cluster’s centroid sits from the chromosome-wide median.
cornersindependently picks the extreme windows on the MDS embedding – a complementary, distribution-free approach used by the original lostruct paper.
The output figure (local_pca.png) plots the MDS scatter coloured
by regime alongside MDS1 and Garud’s H12 along the chromosome. The
sweep cluster on the MDS plot and the H12 peak both land on the
sweep focal site, validating the pipeline against an independent
selection summary.
Python API
The whole pipeline in one call:
from pg_gpu import HaplotypeMatrix, lostruct
res = lostruct(hm, window_size=500, step_size=250,
window_type='snp', k=2)
res.local_pca # LocalPCAResult: per-window eigvals / eigvecs / windows
res.distance # (n_windows, n_windows) Frobenius distance
res.mds # (n_windows, n_components) MDS coordinates
res.corner_indices # (n_per_corner, n_corners) outlier-window indices
Pass jackknife=True to also compute the block-within-window
jackknife standard error of the per-window eigenvectors – useful as a
signal-to-noise diagnostic for filtering noisy windows before
downstream analysis. The standard error is computed in the same window
pass as the base eigendecomposition, so this is much cheaper than
calling local_pca_jackknife separately:
res = lostruct(hm, window_size=500, step_size=250,
window_type='snp', k=2,
jackknife=True, n_blocks=10)
res.jackknife_se # (n_windows, k) per-PC standard error
If you want intermediate access – e.g. to swap in a different
distance metric, or recompute MDS at a different n_components
without redoing the per-window eigendecomposition – you can call the
four constituent functions directly. lostruct() is a thin wrapper
over them:
from pg_gpu import HaplotypeMatrix, windowed_analysis
from pg_gpu.decomposition import pc_dist, pcoa, corners
# `local_pca` statistic routes through the GPU-batched eigh pipeline
result = windowed_analysis(
hm, window_size=500, step_size=250,
statistics=['local_pca'], window_type='snp', k=2)
dist = pc_dist(result, npc=2, normalize='L1')
mds, _ = pcoa(dist, n_components=2)
extremes = corners(mds, prop=0.05, k=3)
# With jackknife standard error (shares per-window matrix prep with local_pca)
result = windowed_analysis(
hm, window_size=500, step_size=250,
statistics=['local_pca', 'local_pca_jackknife'],
window_type='snp', k=2, n_blocks=10)
result.jackknife_se # (n_windows, k) standard error array
Why it’s useful as a template
Local PCA is most useful for chromosomes-scale exploratory work – “are there any regions of this chromosome that look genuinely different from the rest?” – where you don’t have a specific candidate region in mind. A few directions to extend the template:
Windowing strategy. SNP-count windows (
window_type='snp') give roughly equal-information windows; bp windows (window_type='bp') make it easier to interpret distances along the genome at the cost of variable per-window SNP counts. Both are routed through the same fused kernel.Jackknife standard error. Add
'local_pca_jackknife'to thestatisticslist to get per-window standard error on the eigenvalues at almost no extra cost (the per-window matrix prep is shared). Useful for filtering noisy windows before downstream MDS.Other distance metrics.
pc_distacceptsnormalize='L1' | 'L2' | None; sweeping the choice and watching the MDS embedding is a good way to check whether outliers are robust.