NOTE: This step is optional and currently only implemented for two-way admixed populations. If your analyses do not require long-range haplotypes, you can skip this step to save compute time.
This page describes scripts to detect and correct switch errors in phased data using local ancestry, helping to recover long-range tracts. For context, see Figure 1 in our manuscript. A switch error occurs when ancestries swap strands within a ~1 cM window, conditioned on heterozygous ancestry dosage. This tract recovery is currently limited to 2-way admixed populations, but all downstream analyses are compatible with multi-way admixed cohorts.
Note: Tract recovery is not required for analyses that do not use haplotypes, such as standard GWAS.
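The switch-error definition above can be illustrated with a toy sketch. It is not the code used by the scripts on this page. It assumes two-way ancestry coded as 0/1 per window and a hypothetical `max_gap_cm` parameter standing in for the ~1 cM window. Windows with homozygous ancestry are skipped, and a switch is flagged when two consecutive heterozygous windows have opposite strand orientation.

```python
def find_switches(hap0, hap1, cm_pos, max_gap_cm=1.0):
    """Return window indices where the two ancestries appear to swap strands.

    hap0, hap1: per-window ancestry codes (0 or 1) for the two haplotypes.
    cm_pos:     genetic position (cM) of each window.
    max_gap_cm: hypothetical threshold standing in for the ~1 cM window.
    """
    switches = []
    prev = None  # index of the most recent heterozygous-ancestry window
    for i, (a, b) in enumerate(zip(hap0, hap1)):
        if a == b:
            continue  # homozygous ancestry carries no strand information
        if (prev is not None
                and hap0[prev] != a
                and cm_pos[i] - cm_pos[prev] <= max_gap_cm):
            switches.append(i)
        prev = i
    return switches


hap0 = [0, 0, 1, 1, 1]
hap1 = [1, 1, 0, 0, 1]
cm_pos = [0.0, 0.5, 1.0, 1.5, 2.0]
print(find_switches(hap0, hap1, cm_pos))  # [2]
```

In this toy example the ancestries swap strands between the second and third windows, so window index 2 is reported.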
The first step identifies strand flips in local ancestry calls.
Input:
.msp.tsv file
Script:
unkink_2way_mspfile.py
Output:
Detected switch locations (*_switches)
Unkinked local ancestry calls (*_unkinked)
Think of this as “unkinking a garden hose”: the script recovers long-range tracts disrupted by phasing and records the locations of detected strand switches.
Example usage:
./unkink_2way_mspfile.py --msp FILENAME_STEM
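To make the "unkinking" idea concrete, here is a minimal sketch that is not the actual `unkink_2way_mspfile.py` logic. It assumes per-window ancestry codes for the two haplotypes and a list of window indices where a switch was detected. Every window after an odd number of switches has its two strands swapped.

```python
def unkink(hap0, hap1, switches):
    """Swap the two haplotypes' ancestry calls after each detected switch.

    hap0, hap1: per-window ancestry codes for the two haplotypes.
    switches:   window indices at which a strand switch begins (assumed input).
    """
    flip_at = set(switches)
    flipped = False
    new0, new1 = [], []
    for i, (a, b) in enumerate(zip(hap0, hap1)):
        if i in flip_at:
            flipped = not flipped  # each switch toggles the orientation
        if flipped:
            a, b = b, a
        new0.append(a)
        new1.append(b)
    return new0, new1


new0, new1 = unkink([0, 0, 1, 1, 1], [1, 1, 0, 0, 1], switches=[2])
print(new0)  # [0, 0, 0, 0, 1]
print(new1)  # [1, 1, 1, 1, 1]
```

After the swap, the broken tracts on each strand join into longer continuous tracts, which is the effect the garden-hose analogy describes.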
Next, switch errors are corrected in the phased genotype (VCF) file to improve haplotype continuity and the long-range tract distribution. The switches file generated in the previous step determines which positions need to be flipped in the VCF file.
Input:
*_switches file from Step 1
Script:
unkink_2way_genofile.py
Example usage:
./unkink_2way_genofile.py --switches SWITCHES_FILE --genofile INPUT_VCF
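The correction step above can be pictured with the following sketch. It is an illustration under stated assumptions, not the implementation in `unkink_2way_genofile.py`: the layout of the `*_switches` file is not described here, so the example simply takes a list of base-pair switch positions for one sample and flips each phased genotype that lies after an odd number of switches.

```python
from bisect import bisect_right


def flip_phased_gt(gt):
    """Swap the two alleles of a phased diploid genotype, e.g. '0|1' -> '1|0'."""
    left, right = gt.split("|")
    return f"{right}|{left}"


def apply_switches(records, switch_positions):
    """Flip genotypes located after an odd number of switch positions.

    records:          list of (position, phased_genotype) tuples for one sample.
    switch_positions: base-pair positions where a strand switch starts
                      (hypothetical format, assumed for this sketch).
    """
    switch_positions = sorted(switch_positions)
    corrected = []
    for pos, gt in records:
        n_switches = bisect_right(switch_positions, pos)
        corrected.append((pos, flip_phased_gt(gt) if n_switches % 2 else gt))
    return corrected


records = [(100, "0|1"), (200, "1|0"), (300, "0|1"), (400, "1|1")]
print(apply_switches(records, switch_positions=[150, 350]))
# [(100, '0|1'), (200, '0|1'), (300, '1|0'), (400, '1|1')]
```

Counting switches with `bisect_right` keeps the parity logic explicit: genotypes between the first and second switch are flipped, and those after the second switch return to the original orientation.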
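Notes: The local ancestry input is expected to be .msp.tsv files from RFMix. The genotype script expects phased VCF genotypes (| instead of /), and stripping INFO and FORMAT annotations is recommended for parsing:
bgzip file.vcf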
tabix -p vcf file.vcf.gz
bcftools annotate -x INFO,FORMAT file.vcf.gz > stripped_file.vcf
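Before running the genotype step, it can help to confirm that the VCF really is phased. The helper below is a hypothetical check and not part of the scripts on this page. It assumes a standard tab-delimited VCF data line with a FORMAT column containing `GT`, and reports which samples still use the unphased `/` separator.

```python
def unphased_samples(vcf_line):
    """Return indices of samples whose GT on this VCF data line uses '/'."""
    fields = vcf_line.rstrip("\n").split("\t")
    fmt_keys = fields[8].split(":")   # FORMAT column, e.g. "GT" or "GT:DP"
    gt_idx = fmt_keys.index("GT")
    bad = []
    for i, sample in enumerate(fields[9:]):
        gt = sample.split(":")[gt_idx]
        if "/" in gt:
            bad.append(i)
    return bad


line = "chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1/0\t1|1"
print(unphased_samples(line))  # [1]
```

An empty list for every data line means all genotypes are phased. Any reported index points to a sample that may need re-phasing before its switches can be corrected.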