Accessibility Masks and Windowed Diversity
Packaged script: examples/accessibility_mask.py
Run it from the repo root:
pixi run python examples/accessibility_mask.py
pixi run python examples/accessibility_mask.py --window 20_000
Background
Per-base summary statistics like \(\pi\) divide a sum of pairwise differences (numerator) by a count of bases (denominator). On real sequencing data, only a fraction of bases are reliably callable: repetitive regions are masked, low-coverage stretches fail filtering, ambiguous mapping yields gaps. If a windowed scan divides by the total base count (callable + uncallable), the denominator includes positions where variation could never be observed and per-bp diversity looks artificially low in low-callability regions. This is a frequent confounder for genome scans – selection signatures and accessibility deserts produce the same downward dip.
An accessibility mask fixes this by restricting both the variants
visible to the analysis and the denominator of the per-bp
normalization to the callable subset. set_accessible_mask does
both in one call: it filters the matrix’s variants to those falling
in accessible regions, and it tells span normalization to use the
accessible-base count (rather than the genomic span) as the
denominator. Both windowed_analysis and the scalar diversity /
divergence functions read this through automatically.
What the script does
This tutorial sets up a controlled scenario where the “truth” is known, then verifies that the mask recovers it:
Simulate a 1 Mb chromosome under
msprimewith a 200 kb central block of 100x lower mutation rate – a stand-in for a low-callability region. The genealogy is identical inside and outside the block, so the per-site pattern of coalescence is the same; only the probability of observing a mutation differs. Outside the block, per-bp diversity is \(4 N_e \mu_{\text{high}}\).Compute windowed \(\pi\) twice from the same simulation:
once with no accessibility mask, and
once with the low-mutation block flagged inaccessible via an in-memory
numpyboolean array passed toset_accessible_mask.
Plot both traces on the same axes with the excluded region shaded.
Without the mask, \(\pi\) drops misleadingly across the low-callability block – mutations are simply rarer, so \(\pi\)/bp looks lower even though the genealogy is unchanged. With the mask:
Fully-inaccessible windows return
NaNand render as visual gaps – the scan is silent where it shouldn’t speak rather than reporting a spurious value.Partially-inaccessible windows divide their numerator by the accessible-base count, so the flanking \(\pi\) sits at the expected \(4 N_e \mu_{\text{high}}\) value.
Why it’s useful as a template
This example is applicable to any region where the
probability of observing variation differs from the surrounding
genome (repeat masks, low-coverage regions, hard-to-map duplications,
centromeres) biases per-bp summary statistics unless the denominator
is corrected. set_accessible_mask accepts:
A BED file path. The path is parsed, the BED 0-based half-open intervals are converted to a 1-based mask aligned to the matrix positions, and the BED’s full extent (not just the matrix range) is honored.
A NumPy boolean array, useful for simulated data or for masks built programmatically at runtime.
A pre-built
AccessibleMaskinstance, useful for sharing one mask across multiple matrices.
For the underlying machinery – mask construction, span
normalization rules, and the n_callable_sites /
n_segregating_sites / n_invariant_sites decomposition –
see Missing Data Handling.