Admixture Detection End-to-End

Packaged script: examples/admixture_detection.py

Run it from the repo root:

pixi run python examples/admixture_detection.py
pixi run python examples/admixture_detection.py --length 20_000_000 --samples 20

Background

Patterson’s \(D\) (also called the ABBA-BABA statistic) is the canonical genome-wide test for gene flow between populations. Given a rooted four-population topology \((((A, B), C), D)\) – where \(D\) is the outgroup – \(D\) measures the excess of ABBA site patterns (derived in B and C, ancestral in A and D) relative to BABA patterns (derived in A and C, ancestral in B and D). Under the strict tree topology with no gene flow, \(\mathbb{E}[D] = 0\). A positive \(D\) is evidence of admixture from C into B (or vice versa); a negative \(D\) flips the direction.

Genome-wide \(D\) is a ratio of sums (numerator and denominator each summed across sites), so the appropriate uncertainty measure is not a per-site standard error but a block jackknife: drop one contiguous chromosomal block at a time, recompute \(\hat D\), and use the variance of the leave-one-out estimates. This handles the correlation between linked sites correctly.

What the script does

Simulate two 4-population msprime tree sequences sharing the same null topology – one without admixture (the “null” scenario) and one with a 10 % \(C \to B\) admixture pulse (the “admixed” scenario).
Load each tree sequence into a HaplotypeMatrix via HaplotypeMatrix.from_ts.
Run admixture.average_patterson_d on each, which computes the per-site numerator and denominator of \(D\), partitions them into contiguous blocks, and returns the point estimate, the block-jackknife standard error, the z-score, and the per-block leave-one-out values.
Plot the per-block \(D\) distribution and the point estimates with their 95 % confidence intervals (point estimate \(\pm\,1.96\) standard errors) for both scenarios on a single figure.

The expected (and observed) outcome: the null scenario’s CI brackets zero, while the admixed scenario’s CI excludes zero with a clearly negative \(\hat D\) consistent with \(C \to B\) gene flow. Also, you can see the per-block \(D\) distribution is shifted to the left (more negative values) in the admixed scenario, showing the genome-wide signal of admixture.

Why it’s useful as a template

The structure – simulate (or load) two scenarios -> compute the same statistic on each with a block-jackknife wrapper -> compare the intervals – is the core of any null-vs-alternative power analysis or calibration check. A few directions to extend:

Different statistic. average_patterson_d has a sibling average_patterson_f3 for the F3 admixture test; the two share the ratio-of-sums structure, so the wrapper code is interchangeable.
Different demographic model. Edit the msprime model to add bottlenecks, varying admixture proportions, ghost populations, etc. This is also the right structure for calibrating a statistic on simulated null data before applying it to your real data.
Block size sensitivity. The script uses a fixed blen=100 but real-data analyses typically check that the estimated standard error is stable across a range of block sizes (1-10 Mb is a common target). Sweeping blen and overlaying the resulting CIs is a one-line change.