Bootstrap CI on Tajima’s D under a Sweep

Packaged script: examples/sweep_tajimas_d_bootstrap.py

Run it from the repo root:

pixi run python examples/sweep_tajimas_d_bootstrap.py
pixi run python examples/sweep_tajimas_d_bootstrap.py --seed 7 --n-replicates 5000

Background

Tajima’s D contrasts two estimators of the population mutation rate \(\theta\): the average pairwise diversity \(\hat{\pi}\) and Watterson’s segregating-sites estimator \(\hat\theta_W\). Under neutrality at equilibrium, \(\mathbb{E}[D] = 0\). A recent hard sweep removes intermediate-frequency variation in the linked region and leaves an excess of singletons as new mutations enter on the sweep haplotype, so \(\hat\pi\) decreases relative to \(\hat\theta_W\) and \(D\) becomes negative in the local neighborhood of the selected site.

Detecting the negative-D dip is the easy part. The harder part is quantifying how confident you are that the dip is real, given that windows from the same chromosome are correlated and that the mean over a small number of windows has substantial sampling variance. The block bootstrap addresses this directly: resample whole windows (so LD between adjacent windows is preserved) and use the resulting distribution of the per-region mean as an empirical sampling distribution.

What the script does

Simulate a 10 Mb chromosome under msprime with SweepGenicSelection targeting fixation at the chromosome midpoint. The default selection coefficient (s=0.05) gives a strong sweep with a clear local D dip.
Compute windowed Tajima’s D with windowed_analysis, taking each window as a “block” for the bootstrap.
Define two regions of interest: a sweep-local window around the selected site and a distal window on the same chromosome that should look neutral.
Use block_bootstrap (default 2000 replicates) on the per-window D values in each region to obtain 95 % confidence intervals on the mean Tajima’s D.
Plot the per-window D track with the two regions highlighted, the bootstrap distributions side-by-side, and the point estimates with their CIs.

Under a completed sweep the sweep-local CI excludes zero (D is significantly negative) while the distal CI brackets zero. The output figure (sweep_tajimas_d_bootstrap.png) makes the contrast visible at a glance.

Why it’s useful as a template

The recipe – windowed_analysis -> region selection -> block_bootstrap on the per-window means – generalizes to any windowed scalar statistic and any candidate region. Swap tajimas_d for fst, garud_h12, mean_nsl, dxy etc. to put a confidence interval on the effect size in any region of interest, with no change to the surrounding code.

A few things to keep in mind when adapting the template:

The bootstrap “block” is a window. Block size is therefore the window_size argument to windowed_analysis. It should be large compared to the per-site LD scale – if windows are too small, sample-to-sample correlation between blocks will deflate the bootstrap CI. For better-calibrated CIs in regions with very uneven recombination (or where the LD-decay scale is unknown), partition the chromosome into LD blocks first (see LD Block Partitioning) and pass those as the bootstrap unit instead of fixed-size bp windows.
block_bootstrap accepts a tuple (numer, denom) for ratio-of-sums statistics (FST, F3 / D, etc.); pass per-window numerator and denominator instead of a per-window scalar.
For very small effects the default n_replicates=2000 may give noisy CIs; bump to 5000-10000 for publication figures.