VCF decoration

This workflow is used in IGSR for annotating and validating the VCF files produced in our project. More specifically, the pipeline takes the phased VCF generated after running the PyHive::PipeConfig::Shapeit.pm pipeline and adds the following annotations:

  • Allele frequency for a particular variant
  • Allele frequency in the different populations analyzed
  • Total number of alternate alleles in called genotypes
  • Total number of alleles in called genotypes
  • Number of samples with data
  • Information on whether a variant is within the exon pull-down target boundaries
  • Approximate read depth

Additionally, the workflow will check that the VCF is in a valid format.
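
Once a decorated VCF has been produced, the added INFO annotations can be inspected with bcftools. The sketch below is only illustrative: the file name is taken from the "Pipeline output" section, and the exact INFO tag names (AC, AN, AF, DP, etc.) are assumptions that may differ between pipeline versions, so check the header of your own output:

    # List the INFO definitions added to the header (tag names may differ)
    bcftools view -h chr20.GRCh38.phased.vcf.gz | grep '^##INFO'

    # Print a few records with some commonly used annotation tags
    bcftools query -f '%CHROM\t%POS\t%INFO/AC\t%INFO/AN\t%INFO/AF\t%INFO/DP\n' \
        chr20.GRCh38.phased.vcf.gz | head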

Dependencies

  • Nextflow

This pipeline uses a workflow management system named Nextflow. This software can be downloaded from:

https://www.nextflow.io/

  • Tabix and bgzip

    Tabix and bgzip are part of the HTSlib project, which can be downloaded from:

    https://www.htslib.org/

  • BCFTools
  • BEDTools
  • vcf-validator

    This tool can be obtained from:

    https://github.com/EBIvariation/vcf-validator

  • IGSR-analysis code base

The scripts needed to run this workflow can be downloaded by cloning the IGSR-analysis GitHub repository from:

https://github.com/igsr/igsr_analysis.git
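
As a quick sanity check before running the workflow, you can confirm that the dependencies are installed and clone the code base. The commands below are only a sketch; binary names and locations will depend on your own installation, and the path to the vcf-validator binary is set later in nextflow.config:

    # Check that the required tools are available on PATH
    nextflow -version
    tabix --version
    bgzip --version
    bcftools --version
    bedtools --version

    # Clone the IGSR-analysis code base; its path is referred to below as $IGSR_CODEBASE
    git clone https://github.com/igsr/igsr_analysis.git
    export IGSR_CODEBASE=$PWD/igsr_analysis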

How to run the pipeline

  • First, you need to create a nextflow.config file that Nextflow will use to set the required parameters. Here is an example of such a file:

    params.sample_panel='/homes/ernesto/lib/igsr_analysis_master/igsr_analysis/SUPPORTING/integrated_allsamples.20180619.superpopulations.panel' // path to the sample panel file linking each sample to its population/superpopulation
    params.pops='EAS,EUR,AFR,AMR,SAS' // comma separated list of populations or superpopulations that will be used for the annotation
    params.exome='/nfs/production/reseq-info/work/ernesto/isgr/VARIANT_CALLING/VARCALL_ALLGENOME_13022017/COMBINING/ANNOTATION/output_1000G_Exome.v1.ensembl.bed' // path to .BED file with coordinates of the exomes
    params.tabix='/nfs/production/reseq-info/work/ernesto/bin/anaconda3/bin/tabix' // path to tabix binary
    params.igsr_root='/nfs/production/reseq-info/work/ernesto/isgr/SCRATCH/17_09_2018/lib/igsr_analysis/' // folder containing the igsr codebase downloaded from https://github.com/igsr/igsr_analysis.git
    params.vcf_validator='/nfs/production/reseq-info/work/ernesto/bin/vcf_validator/vcf_validator_linux' // path to vcf_validator binary
    params.bcftools_folder='~/bin/bcftools-1.6/' // folder containing the bcftools binary
    
  • Then, you can run the pipeline with:

    nextflow -c nextflow.config run $IGSR_CODEBASE/scripts/VCF/ANNOTATION/decorate.nf --phased_vcf chr20.unannotated.phased.vcf.gz --ann_vcf chr20.ann.unphased.vcf.gz --region 20:1-64444167
    
Where:
  • -c option allows you to specify the path to the nextflow.config file
  • $IGSR_CODEBASE is the folder containing the igsr codebase downloaded from https://github.com/igsr/igsr_analysis.git
  • --phased_vcf is the phased VCF generated after running the PyHive::PipeConfig::INTEGRATION::Shapeit.pm pipeline, which will be decorated in this workflow. You will need to create a tabix index for this VCF (see the indexing example after this list)
  • --ann_vcf is the unphased VCF generated by the PyHive::PipeConfig::INTEGRATION::VCFIntegrationGATKUG.pm pipeline, which contains the 'INFO/DP' (depth) annotation for each site. You will also need to create a tabix index for this VCF
  • --region is the region that will be analyzed
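
Both input VCFs must be bgzip-compressed and tabix-indexed. A minimal sketch, using the file names from the example command above:

    # Create .tbi indexes for the two input VCFs
    tabix -p vcf chr20.unannotated.phased.vcf.gz
    tabix -p vcf chr20.ann.unphased.vcf.gz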

Pipeline output

This workflow will create a folder named results/ with two output files:

  • chr20.GRCh38.phased.vcf.gz
    This is the final annotated VCF
  • chr20.vcf.validation.txt
    This contains the output of vcf-validator
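
As an example of how you might check the run once it finishes (file names follow the chr20 example above and will differ for other regions):

    # List the pipeline output
    ls results/

    # Peek at the first annotated records and at the validation report
    zcat results/chr20.GRCh38.phased.vcf.gz | grep -v '^##' | head
    cat results/chr20.vcf.validation.txt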