Bootstrap CI on Tajima's D under a Sweep
========================================

Packaged script: ``examples/sweep_tajimas_d_bootstrap.py``

Run it from the repo root:

.. code-block:: bash

   pixi run python examples/sweep_tajimas_d_bootstrap.py
   pixi run python examples/sweep_tajimas_d_bootstrap.py --seed 7 --n-replicates 5000

Background
----------

Tajima's D contrasts two estimators of the population mutation rate
:math:`\theta`: the average pairwise diversity :math:`\hat{\pi}` and
Watterson's segregating-sites estimator :math:`\hat\theta_W`. Under
neutrality at equilibrium, :math:`\mathbb{E}[D] = 0`. A recent hard
sweep removes intermediate-frequency variation in the linked region
and leaves an excess of singletons as new mutations enter on the
sweep haplotype, so :math:`\hat\pi` decreases relative to
:math:`\hat\theta_W` and :math:`D` becomes negative in the local
neighborhood of the selected site.

Detecting the negative-D dip is the easy part. The harder part is
quantifying how confident you are that the dip is real, given that
windows from the same chromosome are correlated and that the mean
over a small number of windows has substantial sampling variance. The
block bootstrap addresses this directly: resample whole windows (so
LD between adjacent windows is preserved) and use the resulting
distribution of the per-region mean as an empirical sampling
distribution.

What the script does
--------------------

1. Simulate a 10 Mb chromosome under ``msprime`` with
   ``SweepGenicSelection`` targeting fixation at the chromosome
   midpoint. The default selection coefficient (``s=0.05``) gives a
   strong sweep with a clear local D dip.
2. Compute windowed Tajima's D with ``windowed_analysis``, taking
   each window as a "block" for the bootstrap.
3. Define two regions of interest: a *sweep-local* window around the
   selected site and a *distal* window on the same chromosome that
   should look neutral.
4. Use ``block_bootstrap`` (default 2000 replicates) on the
   per-window D values in each region to obtain 95 % confidence
   intervals on the mean Tajima's D.
5. Plot the per-window D track with the two regions highlighted, the
   bootstrap distributions side-by-side, and the point estimates with
   their CIs.

Under a completed sweep the sweep-local CI excludes zero (D is
significantly negative) while the distal CI brackets zero. The output
figure (``sweep_tajimas_d_bootstrap.png``) makes the contrast
visible at a glance.

Why it's useful as a template
-----------------------------

The recipe -- *windowed_analysis* -> *region selection* ->
*block_bootstrap on the per-window means* -- generalizes to any
windowed scalar statistic and any candidate region. Swap
``tajimas_d`` for ``fst``, ``garud_h12``, ``mean_nsl``, ``dxy`` etc.
to put a confidence interval on the effect size in any region of
interest, with no change to the surrounding code.

A few things to keep in mind when adapting the template:

* The bootstrap "block" is a window. Block size is therefore the
  ``window_size`` argument to ``windowed_analysis``. It should be
  large compared to the per-site LD scale -- if windows are too
  small, sample-to-sample correlation between blocks will deflate
  the bootstrap CI. For better-calibrated CIs in regions with very
  uneven recombination (or where the LD-decay scale is unknown),
  partition the chromosome into LD blocks first (see
  :doc:`ld_blocks`) and pass those as the bootstrap unit instead of
  fixed-size bp windows.
* ``block_bootstrap`` accepts a tuple ``(numer, denom)`` for
  ratio-of-sums statistics (FST, F3 / D, etc.); pass per-window
  numerator and denominator instead of a per-window scalar.
* For very small effects the default ``n_replicates=2000`` may give
  noisy CIs; bump to 5000-10000 for publication figures.