Quality-Aware Filtering: GQ, DP, MQ from VCF/VCZ
Packaged script: examples/vcf_qc_filter.py
Run it from the repo root:
pixi run python examples/vcf_qc_filter.py
pixi run python examples/vcf_qc_filter.py --min-ac 8 --min-aa-prob 0.95
pixi run python examples/vcf_qc_filter.py --region X:8000000-12000000
Background
Every routine call set comes with per-variant and per-genotype quality
annotations – INFO/MQ, INFO/QD, FMT/GQ, FMT/DP, FMT/AD,
and so on.
bio2zarr preserves all those FORMAT and INFO fields when it
writes a VCZ; the arrays sit next to call_genotype on disk. The
fields= kwarg on HaplotypeMatrix.from_vcf / from_zarr (and
the same on GenotypeMatrix) surfaces them to Python; filter()
applies the masks; to_zarr round-trips the survivors so the
resulting “clean” VCZ is self-describing and re-loadable with the same
fields= set.
This tutorial walks through that pipeline end to end on the real
Anopheles X-chromosome VCZ that ships under examples/data/ – a
5.3 M-variant, 100-diploid store that bio2zarr preserved with all its
INFO and FORMAT arrays. The packaged script reads a 4 Mb window so a
single GPU run takes a couple of seconds.
What the script does
Opens a 4 Mb window of
examples/data/gamb.X.phased.n100.vczwithfields=['AC', 'AAProb', 'PQ'].AClands as per-variant(n_var,);AAProbis per-variant(n_var, 1)(bio2zarr’s shape for INFO withNumber=A);PQis per-genotype(n_var, n_samples). The reader auto-resolves which kind each is.Summarizes each field with one
np.percentilecall. The arrays are plain numpy, so any matplotlib / seaborn snippet works for plotting – e.g.plt.hist(hm.fields['AC'], bins=50)for the allele-count distribution.Builds a per-variant mask from
ACandAAProb– drop singletons / doubletons (AC >= 4) and keep only confidently polarized sites (AAProb >= 0.9).Constructs a per-genotype mask placeholder (
PQ >= -1). The gamb fixture carries-1everywhere incall_PQbecause the source VCF didn’t populate phasing quality, so this mask is True for every call. On a VCZ derived from a VCF that DID carryFMT/GQorFMT/DPthe same line would read(hm.fields['GQ'] >= 20) & (hm.fields['DP'] >= 10).Runs
hm.filterand writes the survivor to a fresh VCZ. The survivingvariant_*/call_*arrays are preserved in the clean store; a reload with the samefields=set returns them byte-exact.
A typical run on the default settings looks like:
Opening examples/data/gamb.X.phased.n100.vcz
region: X:1000000-5000000
fields: ['AC', 'AAProb', 'PQ']
loaded 1,041,651 variants x 100 diploids
Field summaries (5 / 25 / 50 / 75 / 95 percentiles):
AC: shape=(1041651,), 5/25/50/75/95% = 0.000 / 0.000 / 0.000 / 1.000 / 4.000
AAProb: shape=(1041651, 1), 5/25/50/75/95% = 0.991 / 1.000 / 1.000 / 1.000 / 1.000
PQ: shape=(1041651, 100), 5/25/50/75/95% = -1.000 / -1.000 / -1.000 / -1.000 / -1.000
Filtering: variants kept where AC >= 4 and AAProb >= 0.9
variants kept: 52,826 / 1,041,651 (5.1 %)
after filter: 52,826 variants
Round-trip OK -- reloaded fields match the filtered matrix.
Recipe
The whole pipeline reduces to four method calls:
from pg_gpu import HaplotypeMatrix
# 1. Load with quality fields. Bare VCF tags; pg_gpu auto-resolves
# INFO vs FORMAT from the VCF header (or zarr layout).
hm = HaplotypeMatrix.from_zarr(
"cohort.vcz", fields=["MQ", "QD", "GQ", "DP"], streaming="never",
)
# 2. Inspect. Per-variant fields are (n_var,); per-genotype are
# (n_var, n_samples). Shape tells you which is which.
hm.fields["MQ"] # shape (n_var,)
hm.fields["GQ"] # shape (n_var, n_samples)
# 3. Filter. ``variants`` drops rows; ``genotypes`` sets cells to -1
# on the haplotype matrix (both allele rows for each sample).
hm_clean = hm.filter(
variants=(hm.fields["MQ"] >= 40.0) & (hm.fields["QD"] >= 2.0),
genotypes=(hm.fields["GQ"] >= 20) & (hm.fields["DP"] >= 10),
drop_all_missing=True,
)
# 4. Persist. The surviving QC arrays round-trip into the new VCZ
# under the same field names.
hm_clean.to_zarr("cohort.clean.vcz", format="vcz", contig_name="chr1")
The same four lines work against a VCF source – swap the loader:
hm = HaplotypeMatrix.from_vcf(
"cohort.vcf.gz", fields=["MQ", "QD", "GQ", "DP"],
)
# ... filter + to_zarr same as above
For very large VCFs the recommended path is “encode once via
pg_gpu.zarr_io.vcf_to_zarr (bio2zarr under the hood, which
preserves every FORMAT and INFO field by default), then iterate on
thresholds against the VCZ.” Re-encoding a multi-GB VCF dwarfs every
other step in the workflow; doing it once and tuning filters
read-side is a real time saver.
What gets read
For each bare tag, pg_gpu probes the source for the per-variant entry first and falls back to per-genotype:
VCZ (
bio2zarroutput):variant_<tag>thencall_<tag>.scikit-allel zarr:
variants/<tag>thencalldata/<tag>(works for both flat and chromosome-grouped layouts).VCF (
allel.read_vcf): the INFO section first, then FORMAT, resolved viaallel.read_vcf_headers.
Tags missing from the source emit a UserWarning and are silently
dropped from hm.fields. The shape of each returned array
disambiguates per-variant (ndim=1) from per-genotype (ndim=2),
so a single dict can hold both kinds without parallel namespaces.
When you load with region="chr1:1_000_000-2_000_000", the QC arrays
are sliced down to the same variant range as hm.haplotypes –
no manual realignment.
Filter semantics
hm.filter(variants=..., genotypes=..., drop_all_missing=True):
variants–(n_var,)bool. False rows are dropped from every array on the returned matrix (haplotypes, positions, every entry infields).genotypes–(n_var, n_samples)bool. False cells set the haplotype matrix to-1at that position (both allele rows for the same sample).fieldsarrays are not zeroed out – the QC values stay accessible for downstream inspection.drop_all_missing– after applying both masks, drop variants whose every call is-1. DefaultTrue; turn off when you want the genotype mask to keep variants with at least one valid call.
The returned matrix is a fresh allocation, not a view. samples,
sample_sets, and chrom_start / chrom_end are propagated;
n_total_sites and any attached accessibility mask are not (the
variant axis changed; if you need span normalization on the result,
re-attach an accessibility mask explicitly).
Current Limitations
The streaming opener (
streaming='always') does not yet acceptfields=; both raiseNotImplementedErrorwhen combined.to_zarr(format='scikit-allel')does not round-triphm.fieldsyet; combining the two raises so values are not silently dropped. Stick toformat='vcz'(the default) for the clean output.