Admixture Detection End-to-End
==============================

Packaged script: ``examples/admixture_detection.py``

Run it from the repo root:

.. code-block:: bash

   pixi run python examples/admixture_detection.py
   pixi run python examples/admixture_detection.py --length 20_000_000 --samples 20

Background
----------

Patterson's :math:`D` (also called the ABBA-BABA statistic) is the
canonical genome-wide test for gene flow between populations. Given a
rooted four-population topology :math:`(((A, B), C), D)` -- where
:math:`D` is the outgroup -- :math:`D` measures the excess of
``ABBA`` site patterns (derived in B and C, ancestral in A and D)
relative to ``BABA`` patterns (derived in A and C, ancestral in B
and D). Under the strict tree topology with no gene flow,
:math:`\mathbb{E}[D] = 0`. A positive :math:`D` is evidence of
admixture from C into B (or vice versa); a negative :math:`D` flips
the direction.

Genome-wide :math:`D` is a ratio of sums (numerator and denominator
each summed across sites), so the appropriate uncertainty measure is
not a per-site standard error but a *block jackknife*: drop one
contiguous chromosomal block at a time, recompute :math:`\hat D`, and
use the variance of the leave-one-out estimates. This handles the
correlation between linked sites correctly.

What the script does
--------------------

1. Simulate two 4-population ``msprime`` tree sequences sharing the
   same null topology -- one without admixture (the "null" scenario)
   and one with a 10 % :math:`C \to B` admixture pulse (the
   "admixed" scenario).
2. Load each tree sequence into a ``HaplotypeMatrix`` via
   ``HaplotypeMatrix.from_ts``.
3. Run ``admixture.average_patterson_d`` on each, which computes the
   per-site numerator and denominator of :math:`D`, partitions them
   into contiguous blocks, and returns the point estimate, the
   block-jackknife standard error, the z-score, and the per-block
   leave-one-out values.
4. Plot the per-block :math:`D` distribution and the point estimates
   with their 95 % confidence intervals (point estimate
   :math:`\pm\,1.96` standard errors) for both scenarios on a single
   figure.

The expected (and observed) outcome: the null scenario's CI brackets
zero, while the admixed scenario's CI excludes zero with a clearly
negative :math:`\hat D` consistent with :math:`C \to B` gene flow.
Also, you can see the per-block :math:`D` distribution is shifted to the left (more negative values) in the admixed scenario, showing the genome-wide signal of admixture.

Why it's useful as a template
-----------------------------

The structure -- *simulate (or load) two scenarios -> compute the same
statistic on each with a block-jackknife wrapper -> compare the
intervals* -- is the core of any null-vs-alternative power analysis or
calibration check. A few directions to extend:

* **Different statistic.** ``average_patterson_d`` has a sibling
  ``average_patterson_f3`` for the F3 admixture test; the two share
  the ratio-of-sums structure, so the wrapper code is interchangeable.
* **Different demographic model.** Edit the ``msprime`` model to add
  bottlenecks, varying admixture proportions, ghost populations, etc.
  This is also the right structure for calibrating a statistic on
  simulated null data before applying it to your real data.
* **Block size sensitivity.** The script uses a fixed ``blen=100``
  but real-data analyses typically check that the estimated standard
  error is stable across a range of block sizes (1-10 Mb is a common
  target). Sweeping ``blen`` and overlaying the resulting CIs is a
  one-line change.