Workflows
The utils/ directory ships standalone command-line scripts for
common end-to-end workflows. They are thin wrappers around the
library API, so reading their source is also a good way to see how the
pieces fit together for a real analysis.
Run any of them via pixi run to pick up the GPU-enabled
environment:
pixi run python utils/<script>.py --help
vcf_to_zarr.py – VCF to VCZ Conversion
Converts a bgzipped, indexed VCF (or BCF) into a bio2zarr VCZ store. Subsequent loads
via HaplotypeMatrix.from_zarr are dramatically faster than parsing
the VCF on every run, which is the usual reason to do this step once
up front.
pixi run python utils/vcf_to_zarr.py input.vcf.gz output.zarr
pixi run python utils/vcf_to_zarr.py input.vcf.gz output.zarr --workers 8
Arguments:
vcf– path to bgzipped/indexed VCF (.vcf.gz) or BCF.zarr– output zarr store path.--workers N– number of worker processes (default: all CPUs).
genome_scan.py – End-to-End Genome Scan
A complete chromosome-scan workflow: load VCF or zarr data, optionally assign populations, compute windowed diversity / divergence / Garud’s H plus scalar summaries on the GPU, and write a multi-panel scan figure (PDF). Two populations trigger automatic computation of divergence statistics (FST, dxy) and the joint SFS; a single population produces a diversity-only scan.
# Single-population scan from a VCF, restricted to one chromosome
pixi run python utils/genome_scan.py data.vcf.gz --region 3R
# Two-population scan from a zarr store, with population assignments
pixi run python utils/genome_scan.py data.zarr \
--region 3R --pop-file pops.tsv
# With an accessibility mask and custom windows
pixi run python utils/genome_scan.py data.vcf.gz \
--region 3R --pop-file pops.tsv \
--accessible-bed mask.bed \
--window-size 50000 --step-size 10000
The --pop-file argument expects a tab-delimited file with columns
sample and pop – one row per VCF sample.
Arguments:
input– VCF (.vcf,.vcf.gz,.bcf) or zarr store.--region– chromosome orchrom:start-end.--pop-file– tab-delimited (sample, pop). Optional.--accessible-bed– BED file of accessible regions. Optional.--window-size– bp window size (default: 100,000).--step-size– bp step between windows (default: 10,000).-o,--output– output figure path (default:genome_scan.pdf).
Adding Your Own Workflow
The scripts in utils/ are deliberately short – under ~250 lines
each – and follow the same shape: parse arguments, build a
HaplotypeMatrix, optionally attach a mask and population
assignments, hand the matrix to windowed_analysis (or a scalar
statistics function), and write the result. Forking one of them is
usually the fastest way to assemble a custom scan.