VCF decoration

This workflow is used in IGSR for annotating and validating the VCF files produced in our project. More specifically, the pipeline takes the phased VCF generated after running the PyHive::PipeConfig::Shapeit.pm pipeline and adds the following annotations:

  • Allele frequency for a particular variant
  • Allele frequency in the different populations analyzed
  • Total number of alternate alleles in called genotypes
  • Total number of alleles in called genotypes
  • Number of samples with data
  • Information on whether a variant is within the exon pull-down target boundaries
  • Approximate read depth

Additionally, the workflow will check that the VCF is in a valid format.
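
Once a decorated VCF has been produced, the added INFO annotations can be inspected with bcftools. The sketch below is only illustrative: the file name is taken from the "Pipeline output" section, and the exact INFO tag names (AC, AN, AF, DP, etc.) are assumptions that may differ between pipeline versions, so check the header of your own output:

    # List the INFO definitions added to the header (tag names may differ)
    bcftools view -h chr20.GRCh38.phased.vcf.gz | grep '^##INFO'

    # Print a few records with some commonly used annotation tags
    bcftools query -f '%CHROM\t%POS\t%INFO/AC\t%INFO/AN\t%INFO/AF\t%INFO/DP\n' \
        chr20.GRCh38.phased.vcf.gz | head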

Dependencies

  • Nextflow

This pipeline uses a workflow management system named Nextflow. This software can be downloaded from:

https://www.nextflow.io/

  • Tabix and bgzip

    Tabix and bgzip are part of the HTSlib project, which can be downloaded from:

    https://www.htslib.org/

  • BCFTools
  • BEDTools
  • vcf-validator

    This tool can be obtained from:

    https://github.com/EBIvariation/vcf-validator

  • IGSR-analysis code base

The scripts needed to run this workflow can be downloaded by cloning the IGSR-analysis GitHub repository from:

https://github.com/igsr/igsr_analysis.git
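
As a quick sanity check before running the workflow, you can confirm that the dependencies are installed and clone the code base. The commands below are only a sketch; binary names and locations will depend on your own installation, and the path to the vcf-validator binary is set later in nextflow.config:

    # Check that the required tools are available on PATH
    nextflow -version
    tabix --version
    bgzip --version
    bcftools --version
    bedtools --version

    # Clone the IGSR-analysis code base; its path is referred to below as $IGSR_CODEBASE
    git clone https://github.com/igsr/igsr_analysis.git
    export IGSR_CODEBASE=$PWD/igsr_analysis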

How to run the pipeline

  • First, you need to create a nextflow.config file that Nextflow will use to set the required parameters. Here is an example of such a file:

    params.sample_panel='/homes/ernesto/lib/igsr_analysis_master/igsr_analysis/SUPPORTING/integrated_allsamples.20180619.superpopulations.panel' // path to the sample panel file linking each sample to its population/superpopulation
    params.pops='EAS,EUR,AFR,AMR,SAS' // comma separated list of populations or superpopulations that will be used for the annotation
    params.exome='/nfs/production/reseq-info/work/ernesto/isgr/VARIANT_CALLING/VARCALL_ALLGENOME_13022017/COMBINING/ANNOTATION/output_1000G_Exome.v1.ensembl.bed' // path to .BED file with coordinates of the exomes
    params.tabix='/nfs/production/reseq-info/work/ernesto/bin/anaconda3/bin/tabix' // path to tabix binary
    params.igsr_root='/nfs/production/reseq-info/work/ernesto/isgr/SCRATCH/17_09_2018/lib/igsr_analysis/' // folder containing the igsr codebase downloaded from https://github.com/igsr/igsr_analysis.git
    params.vcf_validator='/nfs/production/reseq-info/work/ernesto/bin/vcf_validator/vcf_validator_linux' // path to vcf_validator binary
    params.bcftools_folder='~/bin/bcftools-1.6/' // folder containing the bcftools binary
    
  • Then, you can run the pipeline with:

    nextflow -c nextflow.config run $IGSR_CODEBASE/scripts/VCF/ANNOTATION/decorate.nf --phased_vcf chr20.unannotated.phased.vcf.gz --ann_vcf chr20.ann.unphased.vcf.gz --region 20:1-64444167
    
Where:
  • -c option allows you to specify the path to the nextflow.config file
  • $IGSR_CODEBASE is the folder containing the igsr codebase downloaded from https://github.com/igsr/igsr_analysis.git
  • --phased_vcf is the phased VCF generated after running the PyHive::PipeConfig::INTEGRATION::Shapeit.pm pipeline, which will be decorated in this workflow. You will need to create a tabix index for this VCF (see the indexing example after this list)
  • --ann_vcf is the unphased VCF generated by the PyHive::PipeConfig::INTEGRATION::VCFIntegrationGATKUG.pm pipeline, which contains the 'INFO/DP' (depth) annotation for each site. You will also need to create a tabix index for this VCF
  • --region is the region that will be analyzed
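
Both input VCFs must be bgzip-compressed and tabix-indexed. A minimal sketch, using the file names from the example command above:

    # Create .tbi indexes for the two input VCFs
    tabix -p vcf chr20.unannotated.phased.vcf.gz
    tabix -p vcf chr20.ann.unphased.vcf.gz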

Pipeline output

This workflow will create a folder named results/ with two output files:

  • chr20.GRCh38.phased.vcf.gz
    This is the final annotated VCF
  • chr20.vcf.validation.txt
    This contains the output of vcf-validator
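
As an example of how you might check the run once it finishes (file names follow the chr20 example above and will differ for other regions):

    # List the pipeline output
    ls results/

    # Peek at the first annotated records and at the validation report
    zcat results/chr20.GRCh38.phased.vcf.gz | grep -v '^##' | head
    cat results/chr20.vcf.validation.txt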