ML-based workflow to filter a VCF

Filtering the spurious variants from a callset is a common task in variation studies using sequencing data.

Variant discovery methods are not perfect and will produce a certain number of false positive calls, especially if the sequencing data is noisy or the depth of coverage is not enough to distinguish a real variant from a sequencing artifact.

This is why a method for identifying these false variants is necessary. Different methods have been developed for filtering, and at the time of writing I would say that the most widely used is GATK VQSR, which works really well and is especially relevant for filtering the calls obtained with the GATK callers (UnifiedGenotyper [UG] and HaplotypeCaller [HC]).

VQSR relies on a sophisticated model that needs to be trained with the annotation profiles generated by UG or HC for the different variant sites, and it also depends on the existence of reference datasets specially formatted for use with VQSR.

The problem arises when you need to filter a call set obtained from a non-GATK caller and do not have the variant annotations required by VQSR, or when you are analysing variation data from a non-human organism for which no VQSR-formatted reference call set exists.

If you find yourself in this situation you might find this pipeline useful.

Foundation of the filtering

This pipeline implements a supervised Machine Learning (ML) model in order to solve a binary classification problem. It is supervised because it trains the model with a gold-standard call set for which we already know which variant sites are real, and it is a binary classification problem because we have multiple numerical independent variables (annotation values for each of the variant sites) that are used to predict or classify a binary outcome (is a certain site a real variant?). This particular type of problem can be modelled using a logistic regression binary classifier, and more specifically our pipeline uses the implementation from the Scikit-learn Python library.
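
To make the idea concrete, here is a minimal sketch of a logistic regression binary classifier built with Scikit-learn. It is not the pipeline's actual code: the annotation names (DP, QUAL, MQ), the synthetic values and the use of StandardScaler are all assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a gold-standard call set. The three columns are
# hypothetical annotations (DP, QUAL, MQ); the values are made up.
rng = np.random.default_rng(0)
n = 200
real_sites = rng.normal(loc=[30.0, 60.0, 55.0], scale=5.0, size=(n, 3))
false_sites = rng.normal(loc=[10.0, 20.0, 30.0], scale=5.0, size=(n, 3))

X = np.vstack([real_sites, false_sites])
y = np.array([1] * n + [0] * n)  # 1 = real variant, 0 = false positive

# Scaling the annotations first is an assumption of this sketch; it keeps
# annotations with large numeric ranges from dominating the fit.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X, y)

# Classify two unseen sites: one resembling the real variants,
# one resembling the false positives.
new_sites = np.array([[32.0, 58.0, 54.0], [9.0, 18.0, 31.0]])
labels = model.predict(new_sites)
prob_real = model.predict_proba(new_sites)[:, 1]  # P(site is a real variant)
```

The probabilities returned by predict_proba could be used to pick a stricter or more lenient cutoff than the default 0.5 used by predict.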

This pipeline needs to be run in different stages:

  1. Recursive Feature Elimination (RFE) stage (optional). This pipeline uses the Scikit-learn RFE implementation, which works by recursively removing features (annotations), building a logistic regression model using the remaining attributes and calculating the model accuracy. RFE is able to work out the combination of n attributes that contribute most to the prediction.
  2. Training the ML model for the SNPs and INDELs independently
  3. Applying the fitted model generated in step 2 to the VCF that you want to filter
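
The three stages above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the pipeline's real interface: the annotation names, the synthetic data, the make_sites helper and the choice of n = 2 are all hypothetical, and a real run would read the annotations from VCF files instead.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical annotation names: the first two carry signal in this toy
# data set, the last two are pure noise.
FEATURES = ["DP", "MQ", "NOISE_A", "NOISE_B"]


def make_sites(rng, n_per_class):
    """Build a synthetic training table (1 = real variant, 0 = false positive)."""
    real = rng.normal(loc=[2.0, 2.0, 0.0, 0.0], scale=1.0, size=(n_per_class, 4))
    false = rng.normal(loc=[0.0, 0.0, 0.0, 0.0], scale=1.0, size=(n_per_class, 4))
    X = np.vstack([real, false])
    y = np.array([1] * n_per_class + [0] * n_per_class)
    return X, y


rng = np.random.default_rng(42)
models = {}
for variant_type in ("snps", "indels"):
    X, y = make_sites(rng, 300)
    scaler = StandardScaler().fit(X)
    X_scaled = scaler.transform(X)

    # Stage 1 (optional): RFE keeps the n = 2 annotations that
    # contribute most to the prediction.
    rfe = RFE(LogisticRegression(max_iter=1000), n_features_to_select=2)
    rfe.fit(X_scaled, y)
    selected = [name for name, keep in zip(FEATURES, rfe.support_) if keep]

    # Stage 2: train one model per variant type on the selected annotations.
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_scaled[:, rfe.support_], y)
    models[variant_type] = (scaler, rfe.support_, selected, clf)

# Stage 3: apply the fitted SNP model to sites from the call set to filter.
scaler, support, selected_snps, clf = models["snps"]
new_sites = np.array([[2.5, 2.2, 0.1, -0.3], [-0.4, 0.1, 0.2, 0.5]])
predictions = clf.predict(scaler.transform(new_sites)[:, support])
```

Keeping the scaler and the RFE mask together with each fitted model means that the sites to be filtered go through exactly the same transformation as the training data.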
  • Recursive Feature Elimination

This step will


This page is under construction