Tractor-tutorial

NOTE: This step is optional and currently only implemented for two-way admixed populations. If your analyses do not require long-range haplotypes, you can skip this step to save compute time.

This page describes scripts that use local ancestry calls to detect and correct switch errors in phased data, which helps recover long-range tracts. For context, see Figure 1 in our manuscript. Here, a switch error is a point where the ancestries swap strands within a ~1 cM window, conditional on the ancestry dosage being heterozygous (one copy of each ancestry). Tract recovery is currently limited to two-way admixed populations, but all downstream analyses are compatible with multi-way admixed cohorts.

Note: Tract recovery is not required for analyses that do not use haplotypes, such as standard GWAS.

Step 1: Detecting and correcting switch errors in local ancestry

The first step identifies strand flips in local ancestry calls.

Think of this as “unkinking a garden hose”: the script recovers long-range tracts disrupted by phasing and records the locations of detected strand switches.

Example usage:

./unkink_2way_mspfile.py --msp FILENAME_STEM
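
To make the idea of a strand flip concrete, here is a minimal Python sketch on toy data. It is not the logic of the script above: the function name `unkink_two_way`, the list-based input (one ancestry call, 0 or 1, per window and per haplotype), and the rule of comparing only adjacent windows are all assumptions made for illustration.

```python
def unkink_two_way(hap0, hap1):
    """Toy strand-flip detection for two-way local ancestry calls.

    hap0, hap1: per-window ancestry calls (0 or 1) for the two haplotypes.
    This input layout is a hypothetical simplification, not the MSP format.
    Returns the corrected calls and the window indices where a flip was seen.
    """
    if len(hap0) != len(hap1):
        raise ValueError("both haplotypes need the same number of windows")

    fixed0, fixed1, switches = [], [], []
    flipped = False  # True while we are downstream of an odd number of flips

    for i, (a, b) in enumerate(zip(hap0, hap1)):
        if i > 0:
            prev = (hap0[i - 1], hap1[i - 1])
            # A flip: this window is heterozygous for ancestry and the two
            # ancestries sit on the opposite strands compared with the
            # previous window.
            if a != b and prev == (b, a):
                flipped = not flipped
                switches.append(i)
        if flipped:
            a, b = b, a  # undo the flip for this window
        fixed0.append(a)
        fixed1.append(b)

    return fixed0, fixed1, switches


# Toy example: the ancestries swap strands at window 2 and swap back at window 4.
toy0, toy1, toy_switches = unkink_two_way([0, 0, 1, 1, 0, 0],
                                          [1, 1, 0, 0, 1, 1])
print(toy0, toy1, toy_switches)
# [0, 0, 0, 0, 0, 0] [1, 1, 1, 1, 1, 1] [2, 4]
```

After correction, each haplotype carries one continuous tract, and the recorded window indices play the role of the switch locations that the next step consumes.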

Step 2: Correcting switch errors in phased genotypes

Next, switch errors are corrected in the phased genotype (VCF) file to improve haplotype continuity and the long-range tract distribution. The switches file generated in the previous step determines which positions need to be flipped in the VCF file.

Example usage:

./unkink_2way_genofile.py --switches SWITCHES_FILE --genofile INPUT_VCF
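
As a rough illustration of what flipping positions means for phased genotypes, the Python sketch below swaps the two alleles of every diploid phased call that lies downstream of an odd number of switch points. The function name `flip_phased_genotypes`, the `(position, genotype)` tuples, and the representation of switches as a plain list of positions are hypothetical. The actual switches file format is not described on this page, so treat this as a sketch of the concept rather than of the script.

```python
from bisect import bisect_right


def flip_phased_genotypes(variants, switch_positions):
    """Toy strand correction for phased diploid genotypes.

    variants: list of (position, genotype) tuples, genotype such as "0|1".
    switch_positions: positions where the strand orientation toggles
        (a hypothetical stand-in for the contents of the switches file).
    """
    switch_positions = sorted(switch_positions)
    corrected = []
    for pos, gt in variants:
        # Number of switch points at or before this variant.
        n_switches = bisect_right(switch_positions, pos)
        # Only phased diploid calls ("a|b") are swapped; anything else is
        # passed through unchanged.
        if n_switches % 2 == 1 and gt.count("|") == 1:
            left, right = gt.split("|")
            gt = f"{right}|{left}"
        corrected.append((pos, gt))
    return corrected


toy_variants = [(100, "0|1"), (200, "1|0"), (300, "0|1"), (400, "1|1"), (500, "0|1")]
print(flip_phased_genotypes(toy_variants, [150, 450]))
# [(100, '0|1'), (200, '0|1'), (300, '1|0'), (400, '1|1'), (500, '0|1')]
```

Variants between the two switch points are swapped, and those before the first or after the second keep their original orientation.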

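Notes

The commands below compress a VCF with bgzip, index it with tabix, and use bcftools to strip all INFO fields and all FORMAT fields other than GT, writing the result as an uncompressed VCF: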

bgzip file.vcf
tabix -p vcf file.vcf.gz
bcftools annotate -x INFO,FORMAT file.vcf.gz > stripped_file.vcf