Accessibility Masks and Windowed Diversity
==========================================

Packaged script: ``examples/accessibility_mask.py``

Run it from the repo root:

.. code-block:: bash

   pixi run python examples/accessibility_mask.py
   pixi run python examples/accessibility_mask.py --window 20_000

Background
----------

Per-base summary statistics like :math:`\pi` divide a sum of
pairwise differences (numerator) by a count of bases (denominator).
On real sequencing data, only a fraction of bases are reliably
*callable*: repetitive regions are masked, low-coverage stretches
fail filtering, ambiguous mapping yields gaps. If a windowed scan
divides by the *total* base count (callable + uncallable), the
denominator includes positions where variation could never be
observed and per-bp diversity looks artificially low in
low-callability regions. This is a frequent confounder for genome
scans -- selection signatures and accessibility deserts produce the
same downward dip.

An *accessibility mask* fixes this by restricting both the variants
visible to the analysis and the denominator of the per-bp
normalization to the callable subset. ``set_accessible_mask`` does
both in one call: it filters the matrix's variants to those falling
in accessible regions, and it tells span normalization to use the
accessible-base count (rather than the genomic span) as the
denominator. Both ``windowed_analysis`` and the scalar diversity /
divergence functions read this through automatically.

What the script does
--------------------

This tutorial sets up a controlled scenario where the "truth" is
known, then verifies that the mask recovers it:

1. Simulate a 1 Mb chromosome under ``msprime`` with a 200 kb central
   block of 100x lower mutation rate -- a stand-in for a low-callability
   region. The genealogy is identical inside and outside the block, so
   the per-site pattern of coalescence is the same; only the *probability of
   observing* a mutation differs. Outside the block, per-bp
   diversity is :math:`4 N_e \mu_{\text{high}}`.
2. Compute windowed :math:`\pi` twice from the same simulation:

   * once with no accessibility mask, and
   * once with the low-mutation block flagged inaccessible via an
     in-memory ``numpy`` boolean array passed to
     ``set_accessible_mask``.

3. Plot both traces on the same axes with the excluded region shaded.

Without the mask, :math:`\pi` drops misleadingly across the
low-callability block -- mutations are simply rarer, so :math:`\pi`/bp
looks lower even though the genealogy is unchanged. With the mask:

* Fully-inaccessible windows return ``NaN`` and render as visual
  gaps -- the scan is silent where it shouldn't speak rather than
  reporting a spurious value.
* Partially-inaccessible windows divide their numerator by the
  accessible-base count, so the flanking :math:`\pi` sits at the
  expected :math:`4 N_e \mu_{\text{high}}` value.

Why it's useful as a template
-----------------------------

This example is applicable to any region where the
*probability of observing variation* differs from the surrounding
genome (repeat masks, low-coverage regions, hard-to-map duplications,
centromeres) biases per-bp summary statistics unless the denominator
is corrected. ``set_accessible_mask`` accepts:

* A BED file path. The path is parsed, the BED 0-based half-open
  intervals are converted to a 1-based mask aligned to the matrix
  positions, and the BED's full extent (not just the matrix range)
  is honored.
* A NumPy boolean array, useful for simulated data or for masks
  built programmatically at runtime.
* A pre-built ``AccessibleMask`` instance, useful for sharing one
  mask across multiple matrices.

For the underlying machinery -- mask construction, span
normalization rules, and the ``n_callable_sites`` /
``n_segregating_sites`` / ``n_invariant_sites`` decomposition --
see :doc:`../missing_data`.