Quality-Aware Filtering: GQ, DP, MQ from VCF/VCZ ================================================ Packaged script: ``examples/vcf_qc_filter.py`` Run it from the repo root: .. code-block:: bash pixi run python examples/vcf_qc_filter.py pixi run python examples/vcf_qc_filter.py --min-ac 8 --min-aa-prob 0.95 pixi run python examples/vcf_qc_filter.py --region X:8000000-12000000 Background ---------- Every routine call set comes with per-variant and per-genotype quality annotations -- ``INFO/MQ``, ``INFO/QD``, ``FMT/GQ``, ``FMT/DP``, ``FMT/AD``, and so on. bio2zarr preserves all those FORMAT and INFO fields when it writes a VCZ; the arrays sit next to ``call_genotype`` on disk. The ``fields=`` kwarg on ``HaplotypeMatrix.from_vcf`` / ``from_zarr`` (and the same on ``GenotypeMatrix``) surfaces them to Python; ``filter()`` applies the masks; ``to_zarr`` round-trips the survivors so the resulting "clean" VCZ is self-describing and re-loadable with the same ``fields=`` set. This tutorial walks through that pipeline end to end on the real Anopheles X-chromosome VCZ that ships under ``examples/data/`` -- a 5.3 M-variant, 100-diploid store that bio2zarr preserved with all its INFO and FORMAT arrays. The packaged script reads a 4 Mb window so a single GPU run takes a couple of seconds. What the script does -------------------- 1. Opens a 4 Mb window of ``examples/data/gamb.X.phased.n100.vcz`` with ``fields=['AC', 'AAProb', 'PQ']``. ``AC`` lands as per-variant ``(n_var,)``; ``AAProb`` is per-variant ``(n_var, 1)`` (bio2zarr's shape for INFO with ``Number=A``); ``PQ`` is per-genotype ``(n_var, n_samples)``. The reader auto-resolves which kind each is. 2. Summarizes each field with one ``np.percentile`` call. The arrays are plain numpy, so any matplotlib / seaborn snippet works for plotting -- e.g. ``plt.hist(hm.fields['AC'], bins=50)`` for the allele-count distribution. 3. Builds a per-variant mask from ``AC`` and ``AAProb`` -- drop singletons / doubletons (``AC >= 4``) and keep only confidently polarized sites (``AAProb >= 0.9``). 4. Constructs a per-genotype mask placeholder (``PQ >= -1``). The gamb fixture carries ``-1`` everywhere in ``call_PQ`` because the source VCF didn't populate phasing quality, so this mask is True for every call. On a VCZ derived from a VCF that DID carry ``FMT/GQ`` or ``FMT/DP`` the same line would read ``(hm.fields['GQ'] >= 20) & (hm.fields['DP'] >= 10)``. 5. Runs ``hm.filter`` and writes the survivor to a fresh VCZ. The surviving ``variant_*`` / ``call_*`` arrays are preserved in the clean store; a reload with the same ``fields=`` set returns them byte-exact. A typical run on the default settings looks like: .. code-block:: text Opening examples/data/gamb.X.phased.n100.vcz region: X:1000000-5000000 fields: ['AC', 'AAProb', 'PQ'] loaded 1,041,651 variants x 100 diploids Field summaries (5 / 25 / 50 / 75 / 95 percentiles): AC: shape=(1041651,), 5/25/50/75/95% = 0.000 / 0.000 / 0.000 / 1.000 / 4.000 AAProb: shape=(1041651, 1), 5/25/50/75/95% = 0.991 / 1.000 / 1.000 / 1.000 / 1.000 PQ: shape=(1041651, 100), 5/25/50/75/95% = -1.000 / -1.000 / -1.000 / -1.000 / -1.000 Filtering: variants kept where AC >= 4 and AAProb >= 0.9 variants kept: 52,826 / 1,041,651 (5.1 %) after filter: 52,826 variants Round-trip OK -- reloaded fields match the filtered matrix. Recipe ------ The whole pipeline reduces to four method calls: .. code-block:: python from pg_gpu import HaplotypeMatrix # 1. Load with quality fields. Bare VCF tags; pg_gpu auto-resolves # INFO vs FORMAT from the VCF header (or zarr layout). hm = HaplotypeMatrix.from_zarr( "cohort.vcz", fields=["MQ", "QD", "GQ", "DP"], streaming="never", ) # 2. Inspect. Per-variant fields are (n_var,); per-genotype are # (n_var, n_samples). Shape tells you which is which. hm.fields["MQ"] # shape (n_var,) hm.fields["GQ"] # shape (n_var, n_samples) # 3. Filter. ``variants`` drops rows; ``genotypes`` sets cells to -1 # on the haplotype matrix (both allele rows for each sample). hm_clean = hm.filter( variants=(hm.fields["MQ"] >= 40.0) & (hm.fields["QD"] >= 2.0), genotypes=(hm.fields["GQ"] >= 20) & (hm.fields["DP"] >= 10), drop_all_missing=True, ) # 4. Persist. The surviving QC arrays round-trip into the new VCZ # under the same field names. hm_clean.to_zarr("cohort.clean.vcz", format="vcz", contig_name="chr1") The same four lines work against a VCF source -- swap the loader: .. code-block:: python hm = HaplotypeMatrix.from_vcf( "cohort.vcf.gz", fields=["MQ", "QD", "GQ", "DP"], ) # ... filter + to_zarr same as above For very large VCFs the recommended path is "encode once via ``pg_gpu.zarr_io.vcf_to_zarr`` (bio2zarr under the hood, which preserves every FORMAT and INFO field by default), then iterate on thresholds against the VCZ." Re-encoding a multi-GB VCF dwarfs every other step in the workflow; doing it once and tuning filters read-side is a real time saver. What gets read -------------- For each bare tag, pg_gpu probes the source for the per-variant entry first and falls back to per-genotype: * **VCZ** (``bio2zarr`` output): ``variant_`` then ``call_``. * **scikit-allel zarr**: ``variants/`` then ``calldata/`` (works for both flat and chromosome-grouped layouts). * **VCF** (``allel.read_vcf``): the INFO section first, then FORMAT, resolved via ``allel.read_vcf_headers``. Tags missing from the source emit a ``UserWarning`` and are silently dropped from ``hm.fields``. The shape of each returned array disambiguates per-variant (``ndim=1``) from per-genotype (``ndim=2``), so a single dict can hold both kinds without parallel namespaces. When you load with ``region="chr1:1_000_000-2_000_000"``, the QC arrays are sliced down to the same variant range as ``hm.haplotypes`` -- no manual realignment. Filter semantics ---------------- ``hm.filter(variants=..., genotypes=..., drop_all_missing=True)``: * ``variants`` -- ``(n_var,)`` bool. False rows are dropped from every array on the returned matrix (haplotypes, positions, every entry in ``fields``). * ``genotypes`` -- ``(n_var, n_samples)`` bool. False cells set the haplotype matrix to ``-1`` at that position (both allele rows for the same sample). ``fields`` arrays are *not* zeroed out -- the QC values stay accessible for downstream inspection. * ``drop_all_missing`` -- after applying both masks, drop variants whose every call is ``-1``. Default ``True``; turn off when you want the genotype mask to keep variants with at least one valid call. The returned matrix is a fresh allocation, not a view. ``samples``, ``sample_sets``, and ``chrom_start`` / ``chrom_end`` are propagated; ``n_total_sites`` and any attached accessibility mask are not (the variant axis changed; if you need span normalization on the result, re-attach an accessibility mask explicitly). Current Limitations --------------------- * The streaming opener (``streaming='always'``) does not yet accept ``fields=``; both raise ``NotImplementedError`` when combined. * ``to_zarr(format='scikit-allel')`` does not round-trip ``hm.fields`` yet; combining the two raises so values are not silently dropped. Stick to ``format='vcz'`` (the default) for the clean output.