BCFtools WGS variant filtering pipeline¶
In order to run this workflow we need to do the following:
- Preparing the environment
Modify your
$PYTHONPATHto include the required libraries:export PYTHONPATH=${ehive_dir}/wrappers/python3/:$PYTHONPATHModify your
$PERL5LIBto include the required libraries:export PERL5LIB=${ehive_dir}/modules/:${igsr_analysis_dir}/:${PERL5LIB}Modify your
$PATHto include the location of the eHive scripts:export PATH=${ehive_dir}/scripts/:${PATH}
Install dependency
- Clone repo by doing
git clone https://github.com/igsr/igsr_analysis.gitin the desired folderpip install ${igsr_analysis_dir}/dist/igsr_analysis-0.91.dev0.tar.gz- Modify
$PYTHONPATHto add the folder where your pip installs the Python packagesAnd you are ready to go!
Conventions used in this section
${igsr_analysis_dir}is the folder where you have clonedhttps://github.com/igsr/igsr_analysis.git${ehive_dir}is the folder where you have clonedhttps://github.com/Ensembl/ensembl-hive.git
- Databases
The pipeline uses two databases. They may be on different servers or the same server.
2.1 The ReseqTrack database
The pipeline queries a
ReseqTrackdatabase to find the VCF that will be filtered by the pipeline. It will also add file metadata for the final filtered VCF.In order to create a
ReseqTrackdatabase use the followingcommands:
mysql -h <hostname> -P <portnumber> -u <username> -p???? -e "create database testreseqtrack" # where testreseqtrack # is the name you want # to give to the ReseqTrack DB mysql -h <hostname> -P <portnumber> -u <username> -p???? testreseqtrack < $RESEQTRACK/sql/table.sql mysql -h <hostname> -P <portnumber> -u <username> -p???? testreseqtrack < $RESEQTRACK/sql/views.sql
- Conventions used in this section:
$RESEQTRACKis the folder where you have clonedhttps://github.com/EMBL-EBI-GCA/reseqtrack.git2.2 The Hive database
This is database is used by the Hive code to manage the pipeline and job submission etc. The pipeline will be created automatically when you run theinit_pipeline.plscript. Write access is needed to this database.
- Initialise the pipeline
The pipeline is initialised with the hive script
init_pipeline.pl. Here is an example of how to initialise a pipeline:init_pipeline.pl PyHive::PipeConfig::FILTER::VCFilterSamtoolsWGS \ -pipeline_url mysql://g1krw:$DB_PASS@mysql-rs-1kg-prod:4175/hive_dbname \ -db testreseqtrack \ -pwd $DB_PASS \ -hive_force_init 1The first argument is the the module that defines this pipeline. Then
-pipeline_urlcontrols the Hive database connection details, in this example:g1krw= username $DB_PASS= password mysql-rs-1kg-prod= hostname 4175= Port number hive_dbname= Hive DB nameThen
-dbis the name of the Reseqtrack database name used in the section 2.1-pwdis the ReseqTrack DB passwordThe rest of the options are documented in the PyHive::PipeConfig::FILTER::VCFilterSamtoolsWGS module file. You will probably want to override the defaults for many of these options so take a look.
- Seeding the pipeline
In order to seed the pipeline with the VCF file that will be analyzed use the hive script
seed_pipeline.pl:seed_pipeline.pl \ -url mysql://g1krw:$DB_PASS@mysql-rs-1kg-prod:4175/hive_dbname \ -logic_name find_files \ -input_id "{ 'file' => '/path/to/file/input_file.txt' }"Where
-urlcontrols the Hive database connection details and/path/to/file/input_file.txtcontains the filename of the VCF to be analyzed. This file must exist in the ReseqTrack database
- Sync the hive database
This should always be done before [re]starting a pipeline:
Run e.g.:
beekeeper.pl -url mysql://g1krw:{password}@mysql-g1k:4175/my_hive_db_name -syncwhere
-urlare the details of your hive database. Look at the output frominit_pipeline.plto see what your url is.
- Run the pipeline
Run e.g.:
beekeeper.pl -url mysql://g1krw:{password}@mysql-g1k:4175/my_hive_db_name -loop &Note the ‘&’ makes it run in the background.
Look at the pod for
beekeeper.plto see the various options. E.g. you might want to use the-hive_log_dirflag so that alloutput/errorgets recorded in files.While the pipeline is running, you can check the ‘progress’ view of the hive database to see the current status. If a job has failed, check the msg view.