VCF decoration¶
This workflow is used in IGSR for annotating and validating the VCF files produced in our project. More specifically, the pipeline takes the phased VCF generated after running the PyHive::PipeConfig::Shapeit.pm pipeline and adds the following annotations:
- Allele frequency for a particular variant
- Allele frequency in the different populations analyzed
- Total number of alternate alleles in called genotypes
- Total number of alleles in called genotypes
- Number of samples with data
- Information on whether a variant is within the exon pull down target boundaries
- Approximate read depth
Additionally, the workflow will check that the VCF is in a valid format.
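The genotype-derived annotations above (allele number, alternate allele count, allele frequency, samples with data) can typically be recomputed with the bcftools fill-tags plugin; the sketch below is illustrative and may differ from the pipeline's exact commands, and the file names are placeholders:

```shell
# Skip gracefully when bcftools or the placeholder input is unavailable,
# so this sketch can be pasted as-is.
command -v bcftools >/dev/null 2>&1 || exit 0
[ -f input.vcf.gz ] || exit 0

# Recompute AN (total alleles), AC (alternate allele count),
# AF (allele frequency) and NS (samples with data) from the genotypes.
# A samples-to-population file (-S) would additionally produce
# per-population AF_<POP> tags.
bcftools +fill-tags input.vcf.gz -Oz -o annotated.vcf.gz -- -t AN,AC,AF,NS
```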
Dependencies¶
- Nextflow
This pipeline uses a workflow management system named Nextflow. This software can be downloaded from:
- Tabix and bgzip
Tabix and bgzip are part of the HTSlib project, which can be downloaded from:
- BCFTools
Downloadable from:
- BEDTools
Downloadable from:
- vcf-validator
This tool can be obtained from:
- IGSR-analysis code base
The scripts needed to run this workflow can be downloaded by cloning the IGSR-analysis github repo from:
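For example, the repository can be cloned and exposed to the run command via an environment variable (the variable name follows the usage shown below):

```shell
# Clone the IGSR-analysis repository (needs network access; skip if offline)
git clone https://github.com/igsr/igsr_analysis.git || exit 0
# Point $IGSR_CODEBASE at the checkout for use in the nextflow run command
export IGSR_CODEBASE="$PWD/igsr_analysis"
```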
How to run the pipeline¶
First, you need to create a nextflow.config file that Nextflow will use to set the required variables. Here is an example of one of these files:

```
params.sample_panel='/homes/ernesto/lib/igsr_analysis_master/igsr_analysis/SUPPORTING/integrated_allsamples.20180619.superpopulations.panel'
params.pops='EAS,EUR,AFR,AMR,SAS' // comma-separated list of populations or superpopulations that will be used for the annotation
params.exome='/nfs/production/reseq-info/work/ernesto/isgr/VARIANT_CALLING/VARCALL_ALLGENOME_13022017/COMBINING/ANNOTATION/output_1000G_Exome.v1.ensembl.bed' // path to .BED file with coordinates of the exomes
params.tabix='/nfs/production/reseq-info/work/ernesto/bin/anaconda3/bin/tabix' // path to tabix binary
params.igsr_root='/nfs/production/reseq-info/work/ernesto/isgr/SCRATCH/17_09_2018/lib/igsr_analysis/' // folder containing the igsr codebase downloaded from https://github.com/igsr/igsr_analysis.git
params.vcf_validator='/nfs/production/reseq-info/work/ernesto/bin/vcf_validator/vcf_validator_linux' // path to vcf_validator binary
params.bcftools_folder='~/bin/bcftools-1.6/' // folder containing the bcftools binary
```
Then, you can start your pipeline by doing:
```
nextflow -c nextflow.config run $IGSR_CODEBASE/scripts/VCF/ANNOTATION/decorate.nf --phased_vcf chr20.unannotated.phased.vcf.gz --ann_vcf chr20.ann.unphased.vcf.gz --region 20:1-64444167
```
Where:
- -c allows you to specify the path to the nextflow.config file
- $IGSR_CODEBASE is the folder containing the igsr codebase downloaded from https://github.com/igsr/igsr_analysis.git
- --phased_vcf is the phased VCF generated after running the PyHive::PipeConfig::INTEGRATION::Shapeit.pm pipeline, which will be decorated by this workflow. You will need to create a tabix index for this VCF
- --ann_vcf is the unphased VCF generated by the PyHive::PipeConfig::INTEGRATION::VCFIntegrationGATKUG.pm pipeline, which contains the 'INFO/DP' (depth) annotation for each particular site. You will need to create a tabix index for this VCF
- --region is the region that will be analyzed
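The tabix indexes mentioned above can be created as follows; this sketch uses the file names from the example command and skips files that are not present:

```shell
# Skip gracefully when tabix is not installed, so the sketch runs as-is.
command -v tabix >/dev/null 2>&1 || exit 0

# Both VCFs must already be bgzip-compressed; tabix writes a .tbi index
# next to each file, which the pipeline needs to fetch the --region.
for vcf in chr20.unannotated.phased.vcf.gz chr20.ann.unphased.vcf.gz; do
    [ -f "$vcf" ] || continue
    tabix -p vcf "$vcf"
done
</imports>
```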
Pipeline output¶
This workflow will create a folder named results/ with 2 output files:
- chr20.GRCh38.phased.vcf.gz: the final annotated VCF
- chr20.vcf.validation.txt: contains the output of the vcf-validator
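A quick way to sanity-check the output, assuming bcftools is available and the run has completed, is to confirm that the INFO annotations were added to the header of the decorated VCF and to review the validator report:

```shell
# Skip gracefully when bcftools or the output files are not present.
command -v bcftools >/dev/null 2>&1 || exit 0
[ -f results/chr20.GRCh38.phased.vcf.gz ] || exit 0

# List the INFO fields declared in the header; the annotations added by
# the workflow (allele counts, frequencies, depth, etc.) should appear here.
bcftools view -h results/chr20.GRCh38.phased.vcf.gz | grep '^##INFO'

# The validation report states whether the VCF is in a valid format.
cat results/chr20.vcf.validation.txt
```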