Quality-Aware Filtering: GQ, DP, MQ from VCF/VCZ

Packaged script: examples/vcf_qc_filter.py

Run it from the repo root:

pixi run python examples/vcf_qc_filter.py
pixi run python examples/vcf_qc_filter.py --min-ac 8 --min-aa-prob 0.95
pixi run python examples/vcf_qc_filter.py --region X:8000000-12000000

Background

Every routine call set comes with per-variant and per-genotype quality annotations – INFO/MQ, INFO/QD, FMT/GQ, FMT/DP, FMT/AD, and so on.

bio2zarr preserves all those FORMAT and INFO fields when it writes a VCZ; the arrays sit next to call_genotype on disk. The fields= kwarg on HaplotypeMatrix.from_vcf / from_zarr (and the same on GenotypeMatrix) surfaces them to Python; filter() applies the masks; to_zarr round-trips the survivors so the resulting “clean” VCZ is self-describing and re-loadable with the same fields= set.

This tutorial walks through that pipeline end to end on the real Anopheles X-chromosome VCZ that ships under examples/data/ – a 5.3 M-variant, 100-diploid store that bio2zarr preserved with all its INFO and FORMAT arrays. The packaged script reads a 4 Mb window so a single GPU run takes a couple of seconds.

What the script does

Opens a 4 Mb window of examples/data/gamb.X.phased.n100.vcz with fields=['AC', 'AAProb', 'PQ']. AC lands as per-variant (n_var,); AAProb is per-variant (n_var, 1) (bio2zarr’s shape for INFO with Number=A); PQ is per-genotype (n_var, n_samples). The reader auto-resolves which kind each is.
Summarizes each field with one np.percentile call. The arrays are plain numpy, so any matplotlib / seaborn snippet works for plotting – e.g. plt.hist(hm.fields['AC'], bins=50) for the allele-count distribution.
Builds a per-variant mask from AC and AAProb – drop singletons / doubletons (AC >= 4) and keep only confidently polarized sites (AAProb >= 0.9).
Constructs a per-genotype mask placeholder (PQ >= -1). The gamb fixture carries -1 everywhere in call_PQ because the source VCF didn’t populate phasing quality, so this mask is True for every call. On a VCZ derived from a VCF that DID carry FMT/GQ or FMT/DP the same line would read (hm.fields['GQ'] >= 20) & (hm.fields['DP'] >= 10).
Runs hm.filter and writes the survivor to a fresh VCZ. The surviving variant_* / call_* arrays are preserved in the clean store; a reload with the same fields= set returns them byte-exact.

A typical run on the default settings looks like:

Opening examples/data/gamb.X.phased.n100.vcz
  region: X:1000000-5000000
  fields: ['AC', 'AAProb', 'PQ']
  loaded 1,041,651 variants x 100 diploids

Field summaries (5 / 25 / 50 / 75 / 95 percentiles):
        AC: shape=(1041651,),     5/25/50/75/95% = 0.000 / 0.000 / 0.000 / 1.000 / 4.000
    AAProb: shape=(1041651, 1),   5/25/50/75/95% = 0.991 / 1.000 / 1.000 / 1.000 / 1.000
        PQ: shape=(1041651, 100), 5/25/50/75/95% = -1.000 / -1.000 / -1.000 / -1.000 / -1.000

Filtering: variants kept where AC >= 4 and AAProb >= 0.9
  variants kept: 52,826 / 1,041,651 (5.1 %)
  after filter: 52,826 variants
Round-trip OK -- reloaded fields match the filtered matrix.

Recipe

The whole pipeline reduces to four method calls:

from pg_gpu import HaplotypeMatrix

# 1. Load with quality fields. Bare VCF tags; pg_gpu auto-resolves
#    INFO vs FORMAT from the VCF header (or zarr layout).
hm = HaplotypeMatrix.from_zarr(
    "cohort.vcz", fields=["MQ", "QD", "GQ", "DP"], streaming="never",
)

# 2. Inspect. Per-variant fields are (n_var,); per-genotype are
#    (n_var, n_samples). Shape tells you which is which.
hm.fields["MQ"]   # shape (n_var,)
hm.fields["GQ"]   # shape (n_var, n_samples)

# 3. Filter. ``variants`` drops rows; ``genotypes`` sets cells to -1
#    on the haplotype matrix (both allele rows for each sample).
hm_clean = hm.filter(
    variants=(hm.fields["MQ"] >= 40.0) & (hm.fields["QD"] >= 2.0),
    genotypes=(hm.fields["GQ"] >= 20) & (hm.fields["DP"] >= 10),
    drop_all_missing=True,
)

# 4. Persist. The surviving QC arrays round-trip into the new VCZ
#    under the same field names.
hm_clean.to_zarr("cohort.clean.vcz", format="vcz", contig_name="chr1")

The same four lines work against a VCF source – swap the loader:

hm = HaplotypeMatrix.from_vcf(
    "cohort.vcf.gz", fields=["MQ", "QD", "GQ", "DP"],
)
# ... filter + to_zarr same as above

For very large VCFs the recommended path is “encode once via pg_gpu.zarr_io.vcf_to_zarr (bio2zarr under the hood, which preserves every FORMAT and INFO field by default), then iterate on thresholds against the VCZ.” Re-encoding a multi-GB VCF dwarfs every other step in the workflow; doing it once and tuning filters read-side is a real time saver.

What gets read

For each bare tag, pg_gpu probes the source for the per-variant entry first and falls back to per-genotype:

VCZ (bio2zarr output): variant_<tag> then call_<tag>.
scikit-allel zarr: variants/<tag> then calldata/<tag> (works for both flat and chromosome-grouped layouts).
VCF (allel.read_vcf): the INFO section first, then FORMAT, resolved via allel.read_vcf_headers.

Tags missing from the source emit a UserWarning and are silently dropped from hm.fields. The shape of each returned array disambiguates per-variant (ndim=1) from per-genotype (ndim=2), so a single dict can hold both kinds without parallel namespaces.

When you load with region="chr1:1_000_000-2_000_000", the QC arrays are sliced down to the same variant range as hm.haplotypes – no manual realignment.

Filter semantics

hm.filter(variants=..., genotypes=..., drop_all_missing=True):

variants – (n_var,) bool. False rows are dropped from every array on the returned matrix (haplotypes, positions, every entry in fields).
genotypes – (n_var, n_samples) bool. False cells set the haplotype matrix to -1 at that position (both allele rows for the same sample). fields arrays are not zeroed out – the QC values stay accessible for downstream inspection.
drop_all_missing – after applying both masks, drop variants whose every call is -1. Default True; turn off when you want the genotype mask to keep variants with at least one valid call.

The returned matrix is a fresh allocation, not a view. samples, sample_sets, and chrom_start / chrom_end are propagated; n_total_sites and any attached accessibility mask are not (the variant axis changed; if you need span normalization on the result, re-attach an accessibility mask explicitly).

Current Limitations

The streaming opener (streaming='always') does not yet accept fields=; both raise NotImplementedError when combined.
to_zarr(format='scikit-allel') does not round-trip hm.fields yet; combining the two raises so values are not silently dropped. Stick to format='vcz' (the default) for the clean output.