Pipeline Summary

The pipeline is organized into three independent modules:

Module 1: Phasing (using SHAPEIT5)
Module 2: Local Ancestry Inference (choice of RFMix2, GNomix, FLARE)
Module 3: Tractor GWAS

This page describes how to launch the Tractor NXF Workflow for the complete pipeline, which runs all steps from Module 1 through Module 3.

Input Expectations
Parameters Documentation
- Complete Workflow Parameters (Module 1, 2 and 3)
  - Initialization inputs
  - SHAPEIT5 specific inputs
  - Local Ancestry Inference
  - Pre-Tractor-specific inputs
  - Tractor-specific inputs
Launching an NXF Workflow (Assume RFMix2 for LAI)
Output/Results Overview
- Directory Structure

Input Expectations

1. Input Data Requirements

The workflow expects the input VCF file to be QC’d. While exact QC steps may vary depending on your dataset, we recommend:
- Filter for sample/variant missingness
- Filter for high-quality variants
- Filter for common variants, i.e. apply a MAF threshold (0.5%–1%) for GWAS
- Optionally, keep only high-quality SNPs (no indels)
While SHAPEIT5 can phase rare variants, this workflow is designed to phase common variants only using SHAPEIT5’s phase_common.
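
As a rough illustration of the variant-level recommendations above, the following sketch prints a bcftools command that keeps biallelic SNPs passing missingness and MAF thresholds. It assumes bcftools is installed; the function name, file names, and thresholds (5% missingness, 1% MAF) are placeholders chosen for this example and are not prescribed by the workflow. Sample-level missingness is not handled here and would need a separate step.

```bash
# Illustrative QC sketch (not part of the workflow). Assumes bcftools is installed.
# Thresholds and file names are placeholders; adapt them to your dataset.
qc_command() {
    local in_vcf="$1" out_vcf="$2" max_missing="${3:-0.05}" min_maf="${4:-0.01}"
    # -m2 -M2 -v snps : keep biallelic SNPs only (drops indels)
    # -i '...'        : keep variants that pass the missingness and MAF thresholds
    printf "bcftools view -m2 -M2 -v snps -i 'F_MISSING<%s && MAF>=%s' -Oz -o %s %s\n" \
        "$max_missing" "$min_maf" "$out_vcf" "$in_vcf"
}

# Print the command for review; pipe the output to `bash` to execute it.
qc_command cohort.raw.vcf.gz cohort.qc.vcf.gz
```

Printing the command first makes it easy to review the filter expression before running it on a large file.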

2. Chromosome-Specific Design

Many reference files (genetic map files, chunk files) are chromosome-specific, so the workflow is designed to operate on one chromosome at a time. Please split your QC’d input VCF file by chromosome.
To analyze all chromosomes, launch a separate workflow job for each one.
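
One possible way to split a VCF by chromosome is sketched below. It prints one bcftools command per chromosome rather than running them. It assumes the input is bgzipped and indexed (for example with `bcftools index -t`) and that contigs are named `1` to `22`; if your VCF uses a `chr` prefix, adjust the region accordingly. The function name and output naming are hypothetical.

```bash
# Illustrative sketch: print one `bcftools view` command per chromosome.
# Assumes an indexed, bgzipped VCF with contig names "1".."22".
split_commands() {
    local in_vcf="$1" out_prefix="$2" chr="${3:-1}" last="${4:-22}"
    while [ "$chr" -le "$last" ]; do
        # -r restricts the output to a single chromosome (requires an index)
        printf 'bcftools view -r %s -Oz -o %s.chr%s.vcf.gz %s\n' \
            "$chr" "$out_prefix" "$chr" "$in_vcf"
        chr=$((chr + 1))
    done
}

# Review the commands, then pipe them to `bash` to run them.
split_commands cohort.qc.vcf.gz cohort.qc
```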

3. Output Naming

Use the argument --output_prefix to specify the desired output name.
The workflow automatically appends the chromosome number to the output prefix before saving, so you do not need to include it in the prefix yourself.

4. Dependencies and Installation

Ensure that all required dependencies and tools are installed, tested, and in your PATH.
See Installation instructions.

5. Required Reference Files

Download the following reference files before running the workflow:

SHAPEIT5: Genomic chunk file, Genetic map file.
RFMix2/GNomix: Genetic map file. SHAPEIT5 genetic map files have been adapted to meet requirements of these tools.
FLARE: Genetic map file (Must be in PLINK format, see FLARE README)
Reference panels for LAI: Sample map files for 2-way AFR–EUR and 3-way AFR–EUR–AMR panels, based on the TGP–HGDP joint-call dataset, are available here.
TGP–HGDP joint-call dataset: Released as part of Koenig et al. 2024 in GRCh38 format (TBD: where). We have lifted this dataset over to GRCh37 using GATK Picard’s liftOver tool and re-phased it using SHAPEIT5 on a set of filtered variants. We are working on making this version publicly available. Until then, please reach out if you need access.

Parameters Documentation

A default of false in the Nextflow workflow simply means that the parameter is not passed to the underlying tool. The tool may still apply its own default. For example, the Tractor step’s --chunksize argument defaults to 10000; leaving the workflow parameter at false lets Tractor use that value.

Complete Workflow Parameters (Module 1, 2 and 3)

Initialization inputs

Flag	Required / Optional (Default)	Description
`--outdir`	Required	Path to the output directory. Can be the same for all chromosomes.
`--output_prefix`	Required	Prefix for naming output files. The chromosome number will be automatically appended before saving.
`--mode`	Required	Workflow mode to run. Use `complete` for the full workflow. Other modes are listed in the Modular Workflow Execution section.
`--lai_tool`	Required (except in `phasing_only` or `tractor_only` mode)	Choice of LAI method: `rfmix2`, `gnomix`, or `flare`.

SHAPEIT5 specific inputs

Flag	Required/Optional (Default)	Description	Original Tool Equivalent Flag
`--input_vcf`	Required	Path to input VCF file	`--input`
`--chunkfile`	Required	Path to genomic chunk file	Each chunk within the chunkfile will be passed to `--region`
`--genetic_map`	Required	Path to SHAPEIT5-compatible Genetic Map File	`--map`
`--reference_vcf`	Optional	Path to reference VCF file	`--reference`
`--filter_maf`	Optional	float value of MAF threshold	`--filter-maf`

For links to SHAPEIT5 genomic chunk and genetic map files, see the Required Reference Files section.

RFMix2 specific inputs

The RFMix2 repository is available through its GitHub link.

Flag	Required/Optional (Default)	Description	Original Tool Equivalent Flag
`--rfmix2_ref_vcf`	Required	Path to Reference VCF	`--reference-file`
`--rfmix2_sample_map`	Required	Path to Sample Map file	`--sample-map`
`--rfmix2_genetic_map`	Required	Path to RFMix2-compatible Genetic Map file	`--genetic-map`
`--reanalyze_ref`	Optional	true,false - Whether reference file should be reanalyzed	`--reanalyze-reference`
`--em_iterations`	Optional	int value of number of EM iterations	`-e`
`--crf_spacing`	Optional	CRF spacing (# of SNPs)	`-c`
`--rf_window_size`	Optional	RF window size (class estimation window size)	`-s`
`--node_size`	Optional	Terminal node size for RF trees	`-n`
`--trees`	Optional	Number of trees in RF to estimate population class probability	`-t`

GNomix specific inputs

The GNomix repository is available through its GitHub link.
The config.yaml file from this repository is used to define model parameters (--gnomix_config).
- Advanced users may edit this file if they wish to change model parameters.

Flag	Required/Optional	Description	Original Tool Argument
`--gnomix_dir`	Required	Path to GNomix repository (must contain gnomix.py)	Path to GNomix’s Github repository
`--gnomix_config`	Required	Path to GNomix config file (in gnomix directory)	config.yaml file as found in Github repository
`--gnomix_ref_vcf`	Required	Path to Reference VCF	`reference_file`
`--gnomix_sample_map`	Required	Path to Sample Map file	`sample_map_file`
`--gnomix_genetic_map`	Required	Path to GNomix-compatible Genetic Map file	`genetic_map_file`
`--gnomix_phase`	Required	true, false - whether GNomix phasing correction should be performed	`phase`

If the user opts for the GNomix-based workflow, the model is trained from scratch before ancestry estimates are predicted, i.e. a previously generated model cannot be reused with this option.

If phasing correction is enabled with --gnomix_phase true, a new VCF with corrected phase is generated. The ancestry estimates correspond to this corrected VCF file, hence this file is subsequently used in the extract_tracts step.

FLARE specific inputs

The FLARE repository is available through its GitHub link.
Optional arguments that are not set fall back to FLARE’s own defaults.

Flag	Required/Optional (Default)	Description	Original Tool Equivalent Flag
`--flare_dir`	Required	Path to FLARE directory (must contain flare.jar)	FLARE’s Github repository
`--flare_ref_vcf`	Required	Path to Reference VCF	`ref`
`--flare_sample_map`	Required	Path to Sample Map File	`ref-panel`
`--flare_genetic_map`	Required	Path to FLARE-compatible Genetic Map file	`map`
`--flare_array`	Optional (false)	true,false - Is input data a SNP array?	`array`
`--flare_min_maf`	Optional	Min. MAF in reference VCF for a marker to be included	`min-maf`
`--flare_min_mac`	Optional	Min. MAC in reference VCF for a marker to be included	`min-mac`
`--flare_probs`	Optional (false)	true,false - Output posterior ancestry probabilities?	`probs`
`--flare_gen`	Optional	No. of generations since admixture	`gen`
`--flare_model`	Optional	Path to model parameter file	`model`
`--flare_em`	Optional (true)	true,false - Should EM algorithm be used?	`em`
`--flare_gt_samples_include`^*	Optional	Path to list of samples to be included	`gt-samples`
`--flare_gt_samples_exclude`^*	Optional	Path to list of samples to be excluded	`gt-samples=^`
`--flare_gt_ancestries`	Optional	Path to file containing ancestry proportions	`gt-ancestries`
`--flare_excludedmarkers`	Optional	Path to file with markers to be excluded	`excludemarkers`
`--flare_seed`	Optional	int, sets seed for random number generation	`seed`

When the workflow runs FLARE, it always passes three optional arguments with the following default values: --flare_array "false", --flare_probs "false", and --flare_em "true". These match FLARE v0.5.3’s own defaults.

^*Only one of these arguments can be used.
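
A small guard in the launch script can catch the case where both sample filters are set by mistake. The sketch below is illustrative only; the function name is hypothetical, and the variable names mirror those used in the SLURM example script on this page.

```bash
# Guard for a launch script: allow at most one of the two FLARE sample filters.
check_flare_sample_filters() {
    local include="$1" exclude="$2"
    if [ -n "$include" ] && [ -n "$exclude" ]; then
        echo "ERROR: set only one of --flare_gt_samples_include / --flare_gt_samples_exclude" >&2
        return 1
    fi
    return 0
}

flare_gt_samples_include=""    # path to a list of samples to keep, or empty
flare_gt_samples_exclude=""    # path to a list of samples to drop, or empty
check_flare_sample_filters "$flare_gt_samples_include" "$flare_gt_samples_exclude" || exit 1
```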

Pre-Tractor-specific inputs

This step runs extract_tracts for RFMix2/GNomix output, or extract_tracts_flare for FLARE output.

Flag	Required/Optional (Default)	Description	Original Tool Equivalent Flag
`--num_ancs`	Required	Number of ancestries in this dataset	`--num-ancs`
`--output_vcf`	Optional	true,false - Whether ancestry-specific VCF files should be generated.	`--output-vcf`
`--compress_output`	Optional	true,false - Whether output files should be compressed.	`--compress-output`

Tractor specific inputs

Flag	Required/Optional (Default)	Description	Original Tool Equivalent Flag
`--phenotype`	Required	Path to phenotype file	`--phenofile`
`--phenocol`	Required^*	Name of Phenotype column to be used	`--phenocol`
`--phenolist_file`	Required^*	Path to list of phenotypes to be analyzed. If provided, `--phenocol` flag is ignored	If provided, each phenotype will be iterated through `--phenocol` argument
`--covarcollist`	Required	List of covariates, separated by commas	`--covarcollist`
`--regression_method`	Required	linear, logistic	`--method`
`--sampleidcol`	Optional	Name of Sample ID column to be used	`--sampleidcol`
`--chunksize`	Optional	Number of variants to process on each thread	`--chunksize`
`--totallines`	Optional	Total number of variants	`--totallines`

By default, the phenotype file must have IID/#IID as the sample ID column and y as the phenotype column. Use --sampleidcol and --phenocol to override these.

^*If the phenotype file contains multiple phenotypes, list them in a separate file and pass it with --phenolist_file (skipping the --phenocol argument).
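
Before launching, it can help to confirm that every phenotype named in the phenolist file is actually a column in the phenotype file. The sketch below is one way to do that with awk. It assumes a whitespace- or tab-delimited phenotype file with a header row and a phenolist file with one phenotype name per line; the function name and the example column names are hypothetical.

```bash
# Sketch: print phenotype names from the phenolist file that are NOT columns
# in the phenotype file. Prints nothing when every phenotype is found.
missing_pheno_columns() {
    local pheno_file="$1" phenolist_file="$2"
    # First file: remember the header names (line 1 only).
    # Second file: print each non-empty name that was not seen in the header.
    awk 'NR==FNR { if (FNR==1) for (i=1; i<=NF; i++) seen[$i]=1; next }
         NF && !($1 in seen) { print $1 }' "$pheno_file" "$phenolist_file"
}

# Small self-contained demonstration with made-up column names.
tmp="$(mktemp -d)"
printf 'IID\ty\tbmi\tage\tsex\n' > "$tmp/pheno.txt"
printf 'y\nbmi\n' > "$tmp/phenolist.txt"
missing="$(missing_pheno_columns "$tmp/pheno.txt" "$tmp/phenolist.txt")"
if [ -z "$missing" ]; then
    echo "all phenotypes found"
else
    echo "missing phenotype columns: $missing" >&2
fi
rm -rf "$tmp"
```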

Launching an NXF Workflow (Assume RFMix2 for LAI)

Nextflow scripts use a different syntax from ordinary bash scripts. In Nextflow (which follows Groovy), // is used for comments.

Running a Nextflow Workflow

To run a Nextflow workflow, you need two things:

A main script (main.nf) that defines the workflow. This is provided in the workflows/ directory in the TractorWorkflow repository.
A configuration file (nextflow.config) that specifies resources and execution settings.

Example command

nextflow run workflows/main.nf \
    -c workflows/nextflow.config \
    -profile local \
    --outdir /path/to/output/tractor_run1 \
    --output_prefix "output1" \
    --mode "complete" \
    --lai_tool "rfmix2" \
    <add other relevant arguments>

Example config file (minimal profile)

// nextflow.config
profiles {
    local {
        process {
            executor = 'local'
            cpus     = 4
            memory   = '16.GB'
            time     = '12h'
        }
    }
}

What’s happening here?

Nextflow runs main.nf, which contains the main workflow, and will automatically take care of running all necessary jobs (Phasing, LAI, Tractor).
The workflow uses a specific nextflow.config file that defines resources (CPUs, memory, runtime, etc.) and execution settings. Every user should adapt this file to the resources available in their environment. We discuss the config file in detail later in this documentation.
The -profile local flag tells Nextflow to use the local profile defined in the config file.
- Config files can contain multiple profiles (e.g., local for testing, slurm for HPC clusters).
- You can define different profiles for different executors depending on your environment.
Additional parameters as defined above are passed as --param_name value (e.g. --outdir /path/to/output/tractor_run1 or --lai_tool "rfmix2").
- At minimum, you must provide all required arguments; others are optional depending on your use case.

Example implementation on SLURM

In this section, we provide a more detailed walkthrough of how to run the workflow on a High-Performance Computing (HPC) system with the SLURM job scheduler.

Let’s assume the following:

We will run the complete workflow, i.e. --mode "complete"
We will perform Local Ancestry Inference (LAI) using RFMix2, i.e. --lai_tool "rfmix2"
The workflow will be run across 22 chromosomes
- Since the workflow is chromosome-specific, each chromosome will be run as a separate job in its own run directory.
Execution will happen on an HPC cluster managed by SLURM.

Steps for the code below

Iterate across all chromosomes
Load all required software modules - will vary by systems (e.g. Java, Nextflow, any software/tool dependencies)
Create separate run/work directories for each chromosome run
Define required and optional parameters (e.g. output prefix, mode, LAI tool)
Launch Nextflow workflow (with all necessary arguments)

SLURM Job Script Example

We recommend testing this workflow with a test dataset, which includes a README describing all the files.

When you are ready to run analyses on your own dataset, please review the Run Readiness Checklist.

#!/usr/bin/bash
#SBATCH --time=5-00:00:00
#SBATCH --mem=4G
#SBATCH --cpus-per-task=1
#SBATCH --partition=<ATTN: add-partition-name>
#SBATCH --job-name=launch_nxf_tractor
#SBATCH --output=launch_nxf_tractor_%a_%j.out
#SBATCH --array=1-22

# Step 1: Iterate across all chromosomes
#   The #SBATCH --array directive launches 22 independent jobs, one per array ID
# Capture the chromosome number from the SLURM array task ID.
chr=${SLURM_ARRAY_TASK_ID}

# Step 2: Load all necessary modules (not required if in PATH)
## We only load java here, but you can load all necessary modules
module load java/jdk-21.0.2
## Print versions (for documentation)
java -version && echo
nextflow info && echo && echo

# ATTN: Update variables and paths in Step 3, 4 and 5

# Step 3: Create separate run/work directories for each run/chromosome
basedir="/path/to/project_dir"
workdir="${basedir}/nxf_runs/work_chr${chr}"                # Work directory for chromosome ${chr}
rundir="${basedir}/nxf_runs/run_chr${chr}"                  # Run directory for chromosome ${chr}

## Create run directories and navigate to them to launch the workflow.
mkdir -p ${rundir}
cd ${rundir}

# Step 4: Define all required and optional variables for each step
workflow_dir="/path/softwares/TractorWorkflow/workflows/"  # Location of the Nextflow workflow
outdir="${basedir}/results1"
output_prefix="output_prefix"
workflow_mode="complete"          # Select any of the modes described here: https://atkinson-lab.github.io/TractorWorkflow/docs/documentation/modular_workflow_execution.html
lai_tool="rfmix2"                 # not required if "phasing_only" or "tractor_only"

## SHAPEIT5 (Mandatory Arguments)
shapeit5_input_vcf="/path/to/test_data/admixed_cohort/ASW.unphased.vcf.gz"
shapeit5_chunkfile="/path/to/TractorWorkflow/resources/genomic_chunks/chunks_fullchromosome/chunks_chr${chr}.txt"
shapeit5_genetic_map="/path/to/shapeit5/resources/maps/b37/chr${chr}.b37.gmap.gz"

## SHAPEIT5 (Optional Arguments)
# shapeit5_ref_vcf=""
# shapeit5_filter_maf=""

# NOTE: Update LAI-related arguments below to match your chosen LAI tool.  
# NOTE: Mandatory arguments must be provided exactly as described in
#       the Parameters section above
# NOTE: Step 5 in this example shows arguments for RFMix2.  
#       If you are using GNomix or FLARE, replace these with the correct arguments.  
# NOTE: If you need additional (optional) arguments, add them in Step 5 as well.  

## RFMix2 (Mandatory Arguments)
rfmix2_ref_vcf="/path/to/test_data/references/TGP_HGDP_QC_hg19_chr${chr}.vcf.gz"
rfmix2_sample_map="/path/to/test_data/references/YRI_GBR_samplemap.txt"
rfmix2_genetic_map="/path/to/TractorWorkflow/resources/genetic_maps/shapeit5_genetic_map_b37_LAIformat1.txt"

## RFMix2 (Optional Arguments)
# rfmix2_reanalyze_ref=""
# rfmix2_em_iterations=""
# rfmix2_crf_spacing=""
# rfmix2_rf_window_size=""
# rfmix2_node_size=""
# rfmix2_trees=""

## GNomix (Mandatory Arguments)
# gnomix_dir="/path/software/gnomix"
# gnomix_config="${gnomix_dir}/config.yaml"
# gnomix_ref_vcf="/path/to/test_data/references/TGP_HGDP_QC_hg19_chr${chr}.vcf.gz"
# gnomix_sample_map="/path/to/test_data/references/YRI_GBR_samplemap.txt"
# gnomix_genetic_map="/path/to/TractorWorkflow/resources/genetic_maps/shapeit5_genetic_map_b37_LAIformat1.txt"
# gnomix_phase="false"

## FLARE (Mandatory Arguments)
# flare_dir="/path/softwares/flare"
# flare_ref_vcf="/path/to/test_data/references/TGP_HGDP_QC_hg19_chr${chr}.vcf.gz"
# flare_sample_map="/path/to/test_data/references/YRI_GBR_samplemap.txt"
# flare_genetic_map="/path/beagle_genetic_map_files/plink.chr${chr}.GRCh37.map"

# FLARE (Optional Arguments)
# flare_array="false"            # FLARE’s default (v0.5.3) is false (https://github.com/browning-lab/flare)
# flare_min_maf=""
# flare_min_mac=""
# flare_probs="false"            # FLARE’s default (v0.5.3) is false (https://github.com/browning-lab/flare)
# flare_gen=""
# flare_model=""
# flare_em="true"                # FLARE’s default (v0.5.3) is true (https://github.com/browning-lab/flare)
# flare_gt_samples_include=""    # Either flare_gt_samples_include or flare_gt_samples_exclude can be used.
# flare_gt_samples_exclude=""    # Either flare_gt_samples_include or flare_gt_samples_exclude can be used.
# flare_gt_ancestries=""
# flare_excludedmarkers=""
# flare_seed=""

## Extract Tracts (Mandatory Arguments)
extracts_num_ancs=2

## Extract Tracts (Optional Arguments)
# extracts_output_vcf=""
# extracts_compress=""

## Tractor (Mandatory Arguments)
tractor_phenotype="/path/to/test_data/phenotype/Phe_linear_covars_mod1.txt"
tractor_covarcollist="age,sex"
tractor_regression_method="linear"

## Tractor (one of the following arguments is mandatory, unless the phenotype column is named "y", which is the default)
# tractor_phenocol=""
tractor_phenolist_file="/path/to/test_data/phenotype/Phe_linear_covars_mod1_phenolist.txt"

## Tractor (Optional Arguments)
# tractor_sampleidcol=""
# tractor_chunksize=""
# tractor_totallines=""


# Step 5: Launch the Nextflow workflow
nextflow run ${workflow_dir}/main.nf \
-c ${workflow_dir}/nextflow.config \
-profile slurm \
-ansi-log false \
-resume \
-work-dir ${workdir} \
--outdir ${outdir} \
--output_prefix ${output_prefix} \
--mode ${workflow_mode} \
--lai_tool ${lai_tool} \
--input_vcf ${shapeit5_input_vcf} \
--chunkfile ${shapeit5_chunkfile} \
--genetic_map ${shapeit5_genetic_map} \
--rfmix2_ref_vcf ${rfmix2_ref_vcf} \
--rfmix2_sample_map ${rfmix2_sample_map} \
--rfmix2_genetic_map ${rfmix2_genetic_map} \
--num_ancs ${extracts_num_ancs} \
--phenotype ${tractor_phenotype} \
--phenolist_file ${tractor_phenolist_file} \
--covarcollist ${tractor_covarcollist} \
--regression_method ${tractor_regression_method}
# ATTN: Optional arguments can be added here. Only mandatory arguments have been included above.
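
A pre-flight check placed just before the `nextflow run` command can report missing inputs up front instead of partway through a run. The sketch below is an optional addition, not part of the workflow; the function name is hypothetical, and the commented usage relies on the variables defined in Step 4 of the script above.

```bash
# Sketch of a pre-flight check: report every missing or empty input file at once.
require_files() {
    local missing=0 f
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "ERROR: missing or empty input: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example usage with the variables defined in Step 4:
# require_files "$shapeit5_input_vcf" "$shapeit5_chunkfile" "$shapeit5_genetic_map" \
#     "$rfmix2_ref_vcf" "$rfmix2_sample_map" "$rfmix2_genetic_map" \
#     "$tractor_phenotype" "$tractor_phenolist_file" || exit 1
```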

Explanation of SLURM Directives

Note that these SLURM directives apply only to the job that launches the Tractor workflow; the workflow in turn schedules its own SLURM jobs, whose resources must be configured by the user. The launch job itself therefore does not need much CPU or memory.

Flag (`#SBATCH`)	Description
`--time=5-00:00:00`	Maximum runtime for the job (here: 5 days). Adjust based on pipeline needs.
`--mem=4G`	Memory allocated to the launch job (per node); 2 GB may be sufficient.
`--cpus-per-task=1`	Allocates 1 CPU per task.
`--partition=<ATTN: add-partition-name>`	Specifies the partition or queue for job submission.
`--job-name=launch_nxf_tractor`	Custom name for the job.
`--output=launch_nxf_tractor_%a_%j.out`	File to store job logs, `%a`: array index (chr no.) and `%j`: job ID.
`--array=1-22`	Runs the job as an array i.e. 22 jobs are launched, with one task per chromosome (1–22).
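
If you also want to run the launch script outside of a job array (for example, interactively for a single chromosome), the chromosome number can be resolved with a fallback. The sketch below is one possible approach; `resolve_chr` is a hypothetical helper, not part of the workflow.

```bash
# Sketch: take the chromosome from the SLURM array index, or from the first
# command-line argument when the script is not running inside a job array.
resolve_chr() {
    local chr="${SLURM_ARRAY_TASK_ID:-${1:-}}"
    case "$chr" in
        [1-9]|1[0-9]|2[0-2])
            echo "$chr" ;;
        *)
            echo "ERROR: expected a chromosome number between 1 and 22, got '${chr}'" >&2
            return 1 ;;
    esac
}

# In the launch script, this would replace chr=${SLURM_ARRAY_TASK_ID}:
# chr="$(resolve_chr "$@")" || exit 1
```

To rerun only some chromosomes, the array range can also be overridden at submission time, for example `sbatch --array=21-22 <script>`, since sbatch command-line options take precedence over `#SBATCH` directives in the script.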

Explanation of Nextflow directives

Flag	Description
`-c nextflow.config`	Path to the config file. Users must edit config files to match system requirements.
`-profile slurm`	Use the `slurm` profile from the config file.
`-ansi-log false`	Turns off colored/ANSI logging (cleaner logs).
`-resume`	Reuses cached results when the same parameters are used, instead of starting over. Useful when a job fails in the middle of the workflow.
`-work-dir ${workdir}`	Directory for temporary and intermediate files.

Key Parameters in the Nextflow Command

The example above only includes mandatory parameters.

Optional parameters can be added directly to the nextflow run command (in Step 5).

For example:

For RFMix2: --reanalyze_ref "true" --em_iterations 1

Tractor GWAS for multiple phenotypes in the phenotype file: --phenolist_file list_of_phenotypes.txt

list_of_phenotypes.txt lists one phenotype per line.

The choice of optional parameters depends on your analysis goals. We recommend reviewing the documentation for the original tools (SHAPEIT5, RFMix2, Tractor) to select the most appropriate options for your workflow.
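
One convenient pattern for the launch script is to collect optional arguments in a bash array and skip any that are left empty, so the `nextflow run` command itself does not need to be edited each time. This is a sketch under that assumption; `add_opt` and `extra_args` are hypothetical names, while the flags and variable names are those used earlier on this page.

```bash
# Sketch: collect optional arguments in a bash array, skipping empty values.
extra_args=()
add_opt() {
    local flag="$1" value="${2:-}"
    if [ -n "$value" ]; then
        extra_args+=("$flag" "$value")
    fi
}

rfmix2_reanalyze_ref="true"
rfmix2_em_iterations="1"
rfmix2_crf_spacing=""        # left empty, so the flag is not passed

add_opt --reanalyze_ref "$rfmix2_reanalyze_ref"
add_opt --em_iterations "$rfmix2_em_iterations"
add_opt --crf_spacing   "$rfmix2_crf_spacing"

# Show what would be appended to the `nextflow run` command.
echo "${extra_args[@]}"
```

In the launch script, `"${extra_args[@]}"` would then be added as the final line of the `nextflow run` command.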

`config` file for Tractor NXF Workflow

We provide an example configuration file for running Tractor with the SLURM job scheduler, which is one of the common schedulers on HPC systems.

If you are using a different scheduler (e.g. PBS, SGE, LSF), please refer to the Nextflow executor documentation for guidance on adapting this configuration to your environment. More information on additional config file options is available in the config documentation.

Reminder: Nextflow scripts use Groovy syntax, not Bash.
Nextflow uses // for comments instead of #.

Below is an example Tractor Nextflow config file with descriptions for each section for SLURM and local modes:

// ============================================================
//               NEXTFLOW CONFIGURATION TEMPLATE               
// ============================================================
//
// This file tells Nextflow where to find Java, how to run your jobs,
// and how much CPU, memory, and time to give each step of the workflow.
//
// IMPORTANT: You MUST edit this file to fit your system
// before running the workflow. Everyone’s setup is a little different.
//
// - If you are running on your **laptop**, you can set everything
//   up locally (ignore the SLURM profile).
// - If you are running on an **HPC cluster with SLURM**, you need
//   to update the SLURM section (see below).
//
// If you have a different executor on HPC than SLURM, you can create
// and use a new profile. Reference: https://www.nextflow.io/docs/latest/executor.html


// ---- Java Path ----
// Nextflow needs Java to run.
// If you’re not sure about the path, run: `which java`
// Example output: /path/to/java/jdk-21.0.2/bin/java
NXF_JAVA_HOME = '/path/to/java/jdk-21.0.2/'

// ---- Reports ----
// Nextflow will automatically create these reports in each run directory.
// They are helpful for debugging and seeing how the workflow ran.
dag.enabled = true
report.enabled = true
timeline.enabled = true

trace {
    enabled = true
    // you can add "script" to log the code that was executed. Can be helpful to confirm that the correct code was run for each of the steps.
    fields = 'task_id,hash,native_id,process,tag,name,status,exit,module,container,cpus,time,disk,memory,attempt,submit,start,complete,duration,realtime,queue,%cpu,%mem,rss,vmem,peak_rss,peak_vmem,rchar,wchar,syscr,syscw,read_bytes,write_bytes,vol_ctxt,inv_ctxt,env,workdir,scratch,error_action'
}

// ===========================================================
// PROFILES
// ===========================================================
//
// A "profile" is a set of rules for where and how jobs run for each step.
// 
// You can make your own profile if needed. For example:
//   - local (laptop/desktop)
//   - slurm (HPC cluster)
//
// To use a profile, run:
//   nextflow run main.nf -profile slurm
//
// The example below is for SLURM. If you don’t use SLURM,
// you can either delete this or copy it to make your own "local" profile.
// ===========================================================

// ATTN: User must edit all sections below, and uncomment certain code as required.

profiles {

    // profile: slurm
    slurm {
        // Use Conda to manage dependencies (recommended).
        // Set this to your own Conda path.
        conda {
            enabled = true
            cacheDir = '/path/anaconda3/envs'
        }
        process {
            // Tell Nextflow to use SLURM as the job scheduler.
            executor = 'SLURM'

            // ===========================================================
            // EXAMPLES FOR DIFFERENT TOOLS / STEPS
            //
            // IMPORTANT:
            // - You must set cpus, memory, time, and queue to match 
            //   what your cluster allows.
            //
            // - "withLabel" means the settings apply to that step.
            //    - Phasing: shapeit5, shapeit5_ligate
            //    - LAI: rfmix2, gnomix or flare
            //    - Pre-Tractor: extract_tracts (for rfmix2,gnomix); extract_tracts_flare (for flare)
            //    - Tractor: run_tractor
            //
            // - Different tools may require different software dependencies.
            //   These can be managed with Conda environments. You can specify
            //   the path to the environment for each step if needed.
            //
            // - You may also define an error handling strategy (see the 
            //   run_tractor example below).
            // ===========================================================

            // Module 1: Phasing
            withLabel: shapeit5 {
                // conda = ''                          // Path to conda environment to use. Uncomment for use.
                cpus           = 4                     // e.g. 4 CPUs
                memory         = '16.GB'               // e.g. 16 GB RAM
                queue          = 'partition-name-here' // Replace with your SLURM partition names
                time           = '12h'                 // e.g. 12 hours. For 2 days, you can use '2d'
                // Can provide any additional cluster arguments using clusterOptions
                // Check out: https://www.nextflow.io/docs/latest/executor.html#slurm
                // Example:
                // clusterOptions = '--exclude=<node-list>'
            }

            // Module 1: Phasing
            withLabel: shapeit5_ligate {
                // conda = ''
                cpus           = 1
                memory         = '10.GB'
                queue          = 'partition-name-here'
                time           = '1h'
                // clusterOptions = ''
            }

            // Module 2: LAI w/ RFMix2
            withLabel: rfmix2 {
                // conda = ''
                cpus           = 4
                memory         = '16.GB'
                queue          = 'partition-name-here'
                time           = '1d'
                // clusterOptions = ''
            }

            // Module 2: LAI w/ GNomix
            withLabel: gnomix {
                // conda = '/home/username/anaconda3/envs/py3_gnomix'
                cpus           = 8
                memory         = '32.GB'
                queue          = 'partition-name-here'
                time           = '1d'
                // clusterOptions = ''
            }

            // Module 2: LAI w/ FLARE
            withLabel: flare {
                // conda = ''
                cpus           = 4
                memory         = '16.GB'
                queue          = 'partition-name-here'
                time           = '12h'
                // clusterOptions = ''
            }

            // Module 3: Pre-Tractor
            withLabel: extract_tracts {
                // conda = '/home/username/anaconda3/envs/py3_tractor'
                cpus           = 1
                memory         = '10.GB'
                queue          = 'partition-name-here'
                time           = ''
                // clusterOptions = ''
            }

            // Module 3: Pre-Tractor (for FLARE)
            withLabel: extract_tracts_flare {
                // conda = '/home/username/anaconda3/envs/py3_tractor'
                cpus           = 1
                memory         = '10.GB'
                queue          = 'partition-name-here'
                time           = ''
                // clusterOptions = ''
            }

            // Module 3: Tractor
            withLabel: run_tractor {
                // conda = '/home/username/anaconda3/envs/py3_tractor'
                cpus           = { 4     * task.attempt }      // retry with more CPUs, based on errorStrategy
                memory         = { 20.GB * task.attempt }      // retry with more RAM, based on errorStrategy
                queue          = 'partition-name-here'
                time           = ''
                // clusterOptions = ''
                // errorStrategy  = 'retry'
                errorStrategy  = {task.attempt <= 2 ? 'retry' : 'finish'}
                maxRetries     = 3
            }
        }
    }

    // profile: local
    local {
        conda {
            enabled  = true
            cacheDir = '/path/to/your/anaconda3/envs'   // ATTN: update this
        }
        process {
            // Runs locally
            executor = 'local'

            // ATTN: Update resources for all steps (based on your local resources)
            //       Reference: https://www.nextflow.io/docs/latest/executor.html#local

            // Module 1: Phasing
            withLabel: shapeit5 {
                // conda = ''                          // Path to conda environment, update to use
                cpus           = 4                     // e.g. 4 CPUs
                memory         = '16.GB'               // e.g. 16 GB RAM
                time           = '12h'                 // e.g. 12 hours. Use '2d' for 2 days.
            }
            // Module 1: Phasing
            withLabel: shapeit5_ligate {
                // conda = ''
                cpus           = 1
                memory         = '10.GB'
            }
            // Module 2: LAI w/ RFMix2
            withLabel: rfmix2 {
                // conda = ''
                cpus           = 4
                memory         = '16.GB'
            }
            // Module 2: LAI w/ GNomix
            withLabel: gnomix {
                // conda = '/home/username/anaconda3/envs/py3_gnomix'
                cpus           = 4
                memory         = '16.GB'
            }
            // Module 2: LAI w/ FLARE
            withLabel: flare {
                // conda = ''
                cpus           = 4
                memory         = '16.GB'
            }
            // Module 3: Pre-Tractor
            withLabel: extract_tracts {
                // conda = '/home/username/anaconda3/envs/py3_tractor'
                cpus           = 1
                memory         = '10.GB'
            }
            // Module 3: Pre-Tractor (for FLARE)
            withLabel: extract_tracts_flare {
                // conda = '/home/username/anaconda3/envs/py3_tractor'
                cpus           = 1
                memory         = '10.GB'
            }
            // Module 3: Tractor
            withLabel: run_tractor {
                // conda = '/home/username/anaconda3/envs/py3_tractor'
                cpus           = { 4     * task.attempt }      // retry with more CPUs, based on errorStrategy
                memory         = { 10.GB * task.attempt }      // retry with more RAM, based on errorStrategy
                errorStrategy  = {task.attempt <= 1 ? 'retry' : 'finish'}
                maxRetries     = 2
            }
        }
    }

}

Helpful Nextflow Documentation

Execution Environments

Using conda

Using docker/containers

SLURM executor

Other supported executors

Process Directives

Overview of Directives

cpus, memory, time, queue (partition in SLURM)

Advanced

Dynamic computing resources

Output/Results Overview

Once the Tractor NXF Workflow completes running, the output/results directory will be organized as follows. Most of these files are symbolic links to files within the work directory, except for the final summary statistics in the 5_run_tractor directory, which are the definitive output files that the user would need.
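
Because most result files are symbolic links into the work directory, deleting or moving the work directory leaves them dangling. The sketch below lists any broken links under a results directory; `broken_links` is a hypothetical helper name.

```bash
# Sketch: list symbolic links under a directory whose targets no longer exist.
broken_links() {
    # With -L, find follows links; anything still reported as type "l"
    # is a link that could not be resolved.
    find -L "$1" -type l
}

# Example usage with the output directory from the launch script:
# broken_links "${outdir}"
```

If you need a standalone copy of the results before cleaning up the work directory, copying with `cp -rL` dereferences the links and copies the underlying files.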

The directory structure below assumes the workflow has been run for chromosomes 1 to 22 (one run per chromosome), with two ancestries in the dataset.

Directory Structure

1_chunks_phased
2_chunks_ligated
3_lai
4_extract_tracts
5_run_tractor

In all directory listings, [1-22] indicates chromosome numbers 1 to 22.

*.log files capture the standard output and error messages from each run. We strongly recommend reviewing these logs to ensure that no warnings or errors occurred.
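
As one possible way to review the logs in bulk, the sketch below walks a results directory and lists every log line that mentions "error" or "warning". The function name scan_logs, the keyword list, and the "results" path are assumptions for illustration. A simple keyword match can miss tool-specific messages, so it complements a manual read rather than replacing it.

```python
from pathlib import Path


def scan_logs(results_dir, keywords=("error", "warning")):
    """Return {log path: [(line number, line text), ...]} for lines containing a keyword."""
    hits = {}
    for log_path in sorted(Path(results_dir).rglob("*.log")):
        flagged = []
        # errors="replace" avoids crashing on stray non-UTF-8 bytes in a log.
        with open(log_path, errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                lowered = line.lower()
                if any(word in lowered for word in keywords):
                    flagged.append((number, line.rstrip("\n")))
        if flagged:
            hits[str(log_path)] = flagged
    return hits


if __name__ == "__main__":
    # "results" is a placeholder for your own output directory.
    for path, lines in scan_logs("results").items():
        for number, text in lines:
            print(f"{path}:{number}: {text}")
```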

1_chunks_phased

SHAPEIT5 supports phasing chromosomes in smaller genomic segments (“chunks”), which can later be ligated into full chromosomes.
The --chunkfile parameter specifies the coordinates for chunking. It is designed so that successive chunks cover overlapping genetic regions.
Phasing can be performed on an entire chromosome at once or split into multiple chunks. For large datasets, chunk-based phasing is strongly recommended for efficiency.
This directory contains the chunk-level phased outputs (.bcf), their index files (.csi), and log files.

    1_chunks_phased
    ├── output_prefix_[1-22].chunk_0.shapeit5_common.bcf[.csi]
    ├── output_prefix_[1-22].chunk_0.shapeit5_common.log
    ├── output_prefix_[1-22].chunk_1.shapeit5_common.bcf[.csi]
    ├── output_prefix_[1-22].chunk_1.shapeit5_common.log
    ├── ...
    ├── output_prefix_[1-22].chunk_n.shapeit5_common.bcf[.csi]
    └── output_prefix_[1-22].chunk_n.shapeit5_common.log
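
Before moving on to ligation, it can help to confirm that every chunk produced all three files. The sketch below assumes the naming pattern shown in the listing above; the function name incomplete_chunks is hypothetical. It reports chunks with a missing .bcf, index, or log, including gaps in the chunk numbering below the highest index found.

```python
import re
from pathlib import Path

SUFFIXES = (".bcf", ".bcf.csi", ".log")


def incomplete_chunks(phased_dir, prefix, chrom):
    """Return {chunk index: [missing suffixes]} for one chromosome's phased chunks."""
    phased_dir = Path(phased_dir)
    stem = f"{prefix}_{chrom}.chunk_"
    pattern = re.compile(re.escape(stem) + r"(\d+)\.shapeit5_common\.")
    # Collect every chunk index that appears in at least one file name.
    indices = set()
    for entry in phased_dir.iterdir():
        match = pattern.match(entry.name)
        if match:
            indices.add(int(match.group(1)))
    missing = {}
    # Check every index from 0 up to the highest one seen, so gaps are reported too.
    for index in range(max(indices, default=-1) + 1):
        base = f"{stem}{index}.shapeit5_common"
        # exists() follows symbolic links, so a broken link counts as missing.
        absent = [s for s in SUFFIXES if not (phased_dir / (base + s)).exists()]
        if absent:
            missing[index] = absent
    return missing
```

For example, incomplete_chunks("1_chunks_phased", "output_prefix", 22) returns an empty dict when every chunk for chromosome 22 is complete. It cannot detect chunks missing beyond the highest index present, since the expected chunk count is not known from the directory alone.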

2_chunks_ligated

Once data is phased in chunks, these chunks are ligated (merged) to reconstruct complete chromosomes.
This directory contains the ligated phased data for each chromosome in vcf.gz format, along with the corresponding index and log files.
The chunking scheme (--chunkfile) includes overlapping regions between adjacent chunks. The log files report the switch rate, which measures the consistency of phasing across these overlaps. Values above 80–90% are typically expected and indicate good phasing quality.
- See the SHAPEIT5 ligate documentation for details.

    2_chunks_ligated
    ├── list_ligate.[1-22].txt
    ├── output_prefix_[1-22].shapeit5_common_ligate.vcf.gz[.csi]
    └── output_prefix_[1-22].shapeit5_common_ligate.log

3_lai

The contents of the directory depend on the LAI tool used.
RFMix2 and GNomix: The .msp files contain the ancestry estimates. These, together with the phased VCFs from the previous step, are used in the next step, extract_tracts. If --gnomix_phase true is used, GNomix produces a corrected VCF, and that VCF is used with the ancestry estimates in the next step instead.
FLARE: Ancestry estimates are annotated directly within the VCF alongside the genotypes and saved as *.anc.vcf.gz. Because genotypes and ancestry estimates are in the same file, this file alone is sufficient for the extract_tracts_flare step.

Output for RFMix2:

    3_lai
    ├── output_prefix_[1-22].fb.tsv
    ├── output_prefix_[1-22].msp.tsv
    ├── output_prefix_[1-22].rfmix.Q
    ├── output_prefix_[1-22].sis.tsv
    └── output_prefix_[1-22].rfmix2.log
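
To check which numeric code corresponds to which ancestry before running extract_tracts, the header of the .msp.tsv file can be inspected. The sketch below assumes the usual RFMix2 layout: a first line of the form "#Subpopulation order/codes: AFR=0<TAB>EUR=1" and a second line with six position columns followed by one column per haplotype. Both the layout and the function name read_msp_header are assumptions here, so please verify them against your own file.

```python
def read_msp_header(path):
    """Return (ancestry codes, haplotype column names) from an .msp.tsv header."""
    with open(path) as handle:
        codes_line = handle.readline().rstrip("\n")
        columns_line = handle.readline().rstrip("\n")
    # Assumed first line: "#Subpopulation order/codes: AFR=0<TAB>EUR=1"
    _, _, code_text = codes_line.partition(":")
    codes = {}
    for item in code_text.split():
        name, _, code = item.partition("=")
        codes[int(code)] = name
    # Assumed second line: #chm, spos, epos, sgpos, egpos, n snps, then haplotypes.
    # Split on tabs only, because the "n snps" column name contains a space.
    haplotypes = columns_line.split("\t")[6:]
    return codes, haplotypes
```

The returned dictionary maps each integer code to its population label, which indicates which ancestry the anc0, anc1, ... outputs of the next step refer to.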

Output for GNomix:

    3_lai
    ├── output_prefix_[1-22].msp
    ├── output_prefix_[1-22].fb
    ├── output_prefix_[1-22]_config.yaml        # Model parameters used
    ├── output_prefix_[1-22]_generated_data
    ├── output_prefix_[1-22]_models
    ├── output_prefix_[1-22]_phased.vcf         # Generated only if gnomix_phase="true"
    └── output_prefix_[1-22].log

Output for FLARE:

    3_lai
    ├── output_prefix_[1-22].anc.vcf.gz
    ├── output_prefix_[1-22].flare.log
    ├── output_prefix_[1-22].log
    └── output_prefix_[1-22].model

4_extract_tracts

This step extracts ancestry-specific tracts using the ancestry calls and genotypic data.
Two files per ancestry are generated: .dosage.txt for dosage information and .hapcount.txt for haplotype counts.
These files can be reused as starting points for analyzing additional phenotypes in the same dataset with Tractor GWAS, avoiding the need to repeat earlier steps like phasing and LAI.
The *.vcf files will only be generated if --output_vcf "true" is used.
The following output file structure assumes 2-way admixture; additional files will be generated for datasets with more than two ancestries.

    4_extract_tracts
    ├── output_prefix_[1-22].shapeit5_common_ligate.anc0.dosage.txt
    ├── output_prefix_[1-22].shapeit5_common_ligate.anc0.hapcount.txt
    ├── output_prefix_[1-22].shapeit5_common_ligate.anc0.vcf
    ├── output_prefix_[1-22].shapeit5_common_ligate.anc1.dosage.txt
    ├── output_prefix_[1-22].shapeit5_common_ligate.anc1.hapcount.txt
    ├── output_prefix_[1-22].shapeit5_common_ligate.anc1.vcf
    └── output_prefix_[1-22].extract_tracts.log
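
To make the naming pattern concrete for more than two ancestries, the sketch below generates the file names one would expect for a given chromosome. It assumes that further ancestries continue the anc0, anc1 numbering (anc2, anc3, and so on) and that the shapeit5_common_ligate infix shown above applies to your run. The function name expected_tract_files is hypothetical.

```python
def expected_tract_files(prefix, chrom, n_ancestries=2, output_vcf=False):
    """List the extract_tracts file names expected for one chromosome."""
    base = f"{prefix}_{chrom}.shapeit5_common_ligate"
    suffixes = ["dosage.txt", "hapcount.txt"]
    if output_vcf:
        # Per-ancestry VCFs are only written when --output_vcf "true" is used.
        suffixes.append("vcf")
    names = [
        f"{base}.anc{ancestry}.{suffix}"
        for ancestry in range(n_ancestries)
        for suffix in suffixes
    ]
    names.append(f"{prefix}_{chrom}.extract_tracts.log")
    return names


# Example: a 3-way admixed dataset, chromosome 22, without per-ancestry VCFs.
for name in expected_tract_files("output_prefix", 22, n_ancestries=3):
    print(name)
```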

5_run_tractor

Tractor GWAS summary statistics are generated for the phenotypes specified in the phenolist_file, covering all chromosomes.
samples_excluded_from_phenotype.txt is generated only if some samples present in the phenotype file are not present in the hapcount/dosage files.

    5_run_tractor
    ├── output_prefix_pheno1_[1-22]_sumstats.txt
    ├── output_prefix_pheno1_[1-22].run_tractor.log
    └── samples_excluded_from_phenotype.txt
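
Because the workflow is run once per chromosome, the summary statistics arrive as one file per chromosome. A minimal sketch for stacking them into a single genome-wide table is shown below. It assumes the file naming from the listing above and that every file starts with the same single header line; the function name combine_sumstats and the output path are hypothetical.

```python
from pathlib import Path


def combine_sumstats(tractor_dir, prefix, pheno, out_path, chroms=range(1, 23)):
    """Concatenate per-chromosome summary statistics, keeping a single header.

    Returns the number of data rows written.
    """
    tractor_dir = Path(tractor_dir)
    header = None
    n_rows = 0
    with open(out_path, "w") as out:
        for chrom in chroms:
            path = tractor_dir / f"{prefix}_{pheno}_{chrom}_sumstats.txt"
            if not path.exists():
                print(f"skipping missing file: {path.name}")
                continue
            with open(path) as handle:
                first = handle.readline().rstrip("\n")
                if header is None:
                    header = first
                    out.write(header + "\n")
                elif first != header:
                    raise ValueError(f"header mismatch in {path.name}")
                for line in handle:
                    if not line.strip():
                        continue  # ignore blank lines
                    # Normalize line endings so files without a final newline stack cleanly.
                    out.write(line.rstrip("\n") + "\n")
                    n_rows += 1
    return n_rows
```

For example, combine_sumstats("5_run_tractor", "output_prefix", "pheno1", "pheno1_all_chr_sumstats.txt") would write one combined table for pheno1 and return the row count.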