Admixture Detection End-to-End
Packaged script: examples/admixture_detection.py
Run it from the repo root:
pixi run python examples/admixture_detection.py
pixi run python examples/admixture_detection.py --length 20_000_000 --samples 20
Background
Patterson’s \(D\) (also called the ABBA-BABA statistic) is the
canonical genome-wide test for gene flow between populations. Given a
rooted four-population topology \((((A, B), C), D)\) – where
\(D\) is the outgroup – \(D\) measures the excess of
ABBA site patterns (derived in B and C, ancestral in A and D)
relative to BABA patterns (derived in A and C, ancestral in B
and D). Under the strict tree topology with no gene flow,
\(\mathbb{E}[D] = 0\). A positive \(D\) is evidence of
admixture from C into B (or vice versa); a negative \(D\) flips
the direction.
Genome-wide \(D\) is a ratio of sums (numerator and denominator each summed across sites), so the appropriate uncertainty measure is not a per-site standard error but a block jackknife: drop one contiguous chromosomal block at a time, recompute \(\hat D\), and use the variance of the leave-one-out estimates. This handles the correlation between linked sites correctly.
What the script does
Simulate two 4-population
msprimetree sequences sharing the same null topology – one without admixture (the “null” scenario) and one with a 10 % \(C \to B\) admixture pulse (the “admixed” scenario).Load each tree sequence into a
HaplotypeMatrixviaHaplotypeMatrix.from_ts.Run
admixture.average_patterson_don each, which computes the per-site numerator and denominator of \(D\), partitions them into contiguous blocks, and returns the point estimate, the block-jackknife standard error, the z-score, and the per-block leave-one-out values.Plot the per-block \(D\) distribution and the point estimates with their 95 % confidence intervals (point estimate \(\pm\,1.96\) standard errors) for both scenarios on a single figure.
The expected (and observed) outcome: the null scenario’s CI brackets zero, while the admixed scenario’s CI excludes zero with a clearly negative \(\hat D\) consistent with \(C \to B\) gene flow. Also, you can see the per-block \(D\) distribution is shifted to the left (more negative values) in the admixed scenario, showing the genome-wide signal of admixture.
Why it’s useful as a template
The structure – simulate (or load) two scenarios -> compute the same statistic on each with a block-jackknife wrapper -> compare the intervals – is the core of any null-vs-alternative power analysis or calibration check. A few directions to extend:
Different statistic.
average_patterson_dhas a siblingaverage_patterson_f3for the F3 admixture test; the two share the ratio-of-sums structure, so the wrapper code is interchangeable.Different demographic model. Edit the
msprimemodel to add bottlenecks, varying admixture proportions, ghost populations, etc. This is also the right structure for calibrating a statistic on simulated null data before applying it to your real data.Block size sensitivity. The script uses a fixed
blen=100but real-data analyses typically check that the estimated standard error is stable across a range of block sizes (1-10 Mb is a common target). Sweepingblenand overlaying the resulting CIs is a one-line change.