Accessibility Masks and Windowed Diversity

Packaged script: examples/accessibility_mask.py

Run it from the repo root:

pixi run python examples/accessibility_mask.py
pixi run python examples/accessibility_mask.py --window 20_000

Background

Per-base summary statistics like \(\pi\) divide a sum of pairwise differences (numerator) by a count of bases (denominator). On real sequencing data, only a fraction of bases are reliably callable: repetitive regions are masked, low-coverage stretches fail filtering, ambiguous mapping yields gaps. If a windowed scan divides by the total base count (callable + uncallable), the denominator includes positions where variation could never be observed and per-bp diversity looks artificially low in low-callability regions. This is a frequent confounder for genome scans – selection signatures and accessibility deserts produce the same downward dip.

An accessibility mask fixes this by restricting both the variants visible to the analysis and the denominator of the per-bp normalization to the callable subset. set_accessible_mask does both in one call: it filters the matrix’s variants to those falling in accessible regions, and it tells span normalization to use the accessible-base count (rather than the genomic span) as the denominator. Both windowed_analysis and the scalar diversity / divergence functions read this through automatically.

What the script does

This tutorial sets up a controlled scenario where the “truth” is known, then verifies that the mask recovers it:

  1. Simulate a 1 Mb chromosome under msprime with a 200 kb central block of 100x lower mutation rate – a stand-in for a low-callability region. The genealogy is identical inside and outside the block, so the per-site pattern of coalescence is the same; only the probability of observing a mutation differs. Outside the block, per-bp diversity is \(4 N_e \mu_{\text{high}}\).

  2. Compute windowed \(\pi\) twice from the same simulation:

    • once with no accessibility mask, and

    • once with the low-mutation block flagged inaccessible via an in-memory numpy boolean array passed to set_accessible_mask.

  3. Plot both traces on the same axes with the excluded region shaded.

Without the mask, \(\pi\) drops misleadingly across the low-callability block – mutations are simply rarer, so \(\pi\)/bp looks lower even though the genealogy is unchanged. With the mask:

  • Fully-inaccessible windows return NaN and render as visual gaps – the scan is silent where it shouldn’t speak rather than reporting a spurious value.

  • Partially-inaccessible windows divide their numerator by the accessible-base count, so the flanking \(\pi\) sits at the expected \(4 N_e \mu_{\text{high}}\) value.

Why it’s useful as a template

This example is applicable to any region where the probability of observing variation differs from the surrounding genome (repeat masks, low-coverage regions, hard-to-map duplications, centromeres) biases per-bp summary statistics unless the denominator is corrected. set_accessible_mask accepts:

  • A BED file path. The path is parsed, the BED 0-based half-open intervals are converted to a 1-based mask aligned to the matrix positions, and the BED’s full extent (not just the matrix range) is honored.

  • A NumPy boolean array, useful for simulated data or for masks built programmatically at runtime.

  • A pre-built AccessibleMask instance, useful for sharing one mask across multiple matrices.

For the underlying machinery – mask construction, span normalization rules, and the n_callable_sites / n_segregating_sites / n_invariant_sites decomposition – see Missing Data Handling.