BayesPI-BAR2: a new Python package for predicting functional non-coding mutations in cancer patient cohorts

This work was supported by the Norwegian Cancer Society, Oslo University Hospital, and the South-Eastern Norway Regional Health Authority.

Authors

Kirill Batmanov1, Jan Delabie2, and Junbai Wang1

1Department of Pathology, Norwegian Radium Hospital, PO Box 4953 Nydalen, 0424 Oslo, Norway

2Department of Pathology, University Health Network, Toronto, Ontario, Canada

Introduction

BayesPI-BAR2 [4] is a package designed to predict how non-coding somatic mutations in cancer samples affect protein-DNA binding at the mutated site. Changes in the binding of transcription factors to mutated regulatory sequences can disrupt gene regulation, which may promote tumorigenesis. BayesPI-BAR2 takes into account the possibility that several nearby mutations affect the binding of the same protein. The predicted effects are tested for significance in the given patient cohort, and only those that appear in patient samples more frequently than expected by chance are reported.

Getting started

BayesPI-BAR2 is written in Python 2. It includes our BayesPI2 [1][2] software in binary form, which is available for Linux and OS X operating systems. Here is the full list of dependencies:

You can use the pip install scipy matplotlib command to install the Python libraries. bedtools and samtools are included in many Linux repositories.
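
Before running the demos, it can help to confirm that the dependencies named above are visible from your environment. The following is a minimal sketch and not part of the package: the helper name find_missing is hypothetical, and it checks only the tools mentioned in this section (scipy, matplotlib, bedtools, samtools), which may not be the complete dependency list.

```python
import importlib

try:
    from shutil import which  # Python 3.3+
except ImportError:
    from distutils.spawn import find_executable as which  # Python 2


def find_missing(modules, executables):
    """Return (missing_modules, missing_executables) as two lists.

    A module is missing if it cannot be imported; an executable is
    missing if it is not found on the PATH.
    """
    missing_modules = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing_modules.append(name)
    missing_executables = [exe for exe in executables if which(exe) is None]
    return missing_modules, missing_executables


if __name__ == "__main__":
    # Names taken from the text above; extend the lists as needed.
    mods, exes = find_missing(["scipy", "matplotlib"], ["bedtools", "samtools"])
    print("Missing Python libraries: " + (", ".join(mods) or "none"))
    print("Missing command line tools: " + (", ".join(exes) or "none"))
```

Run the script with the same interpreter you intend to use for BayesPI-BAR2, since libraries installed for one Python version are not visible to another.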

The BayesPI-BAR2 package can be downloaded here.

To test the basic functionality, go to the demo/melanoma_small folder and run the command python melanoma_small_pipeline.py. After downloading the reference human genome, the test pipeline should complete without errors in a few minutes and produce the result file, data/skin_cancer_small/out/foreground/block_0_5_1295228_1295253/result.tsv, with several ETS factors mentioned in it.
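
To check the outcome of the test run, the result file can be inspected with a few lines of Python. This is a minimal sketch under stated assumptions: the path is copied from the text above and is assumed to be relative to the current working directory, and nothing is assumed about the columns beyond the file being tab-separated. The helper name read_tsv is hypothetical.

```python
import os

# Path copied from the text above; where it is rooted is an assumption,
# so adjust it to match your installation.
RESULT_PATH = os.path.join(
    "data", "skin_cancer_small", "out", "foreground",
    "block_0_5_1295228_1295253", "result.tsv")


def read_tsv(path, max_rows=10):
    """Return up to max_rows non-empty rows of a tab-separated file."""
    rows = []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\r\n")
            if not line:
                continue  # skip blank lines
            rows.append(line.split("\t"))
            if len(rows) >= max_rows:
                break
    return rows


if __name__ == "__main__":
    if os.path.isfile(RESULT_PATH):
        for row in read_tsv(RESULT_PATH):
            print("\t".join(row))
    else:
        print("Result file not found: " + RESULT_PATH)
```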

Package contents

The package has four subfolders:

The main package is a set of command line tools residing in the python folder. Run python <tool_name>.py --help to see the full usage information for a particular tool. The detailed description of every tool is here.

Skin cancer demonstration pipeline

The package includes an example analysis pipeline which reproduces the known result about mutations in the TERT gene promoter that create binding sites for ETS family transcription factors. The pipeline calls the main package tools in appropriate sequence, reporting the progress of the computation.

To run the pipeline, go to the demo/melanoma_full folder and run the following commands:

  1. python get_and_preprocess_data.py to download the input and reference data and preprocess it into the right format.
  2. python bayespi_bar2_pipeline.py to execute the main pipeline code. This will take about one full day of computation on a multi-core machine. The computation speed can be greatly improved if you run the pipeline on a cluster which supports the SLURM queue manager. Edit the parallel_options.txt file in the same folder to specify the desired parallelization configuration. Check the help of bayespi_bar.py from the main package to learn about the parallelization options.
  3. python make_plots.py to make the heatmaps for the significantly affected transcription factors in the foreground blocks.
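
The three steps above can also be chained from a small driver script that stops at the first failing step. This is only a sketch: run_steps is a hypothetical helper, not part of the package, and the interpreter name "python" in the usage comment is assumed to resolve to the Python 2 installation that BayesPI-BAR2 requires.

```python
import subprocess

# Script names taken from the numbered steps above.
STEPS = [
    "get_and_preprocess_data.py",
    "bayespi_bar2_pipeline.py",
    "make_plots.py",
]


def run_steps(commands):
    """Run each command in order and stop at the first failure.

    Returns the index of the first command that exited with a
    non-zero status, or -1 if every command succeeded.
    """
    for index, command in enumerate(commands):
        print("Running: " + " ".join(command))
        if subprocess.call(command) != 0:
            return index
    return -1


# Usage, from inside the demo/melanoma_full folder:
#     failed = run_steps([["python", script] for script in STEPS])
#     if failed >= 0:
#         print("Step failed: " + STEPS[failed])
```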

The main pipeline script, bayespi_bar2_pipeline.py, is designed to be robust to interruptions. If the pipeline execution is interrupted at any point, simply run the script again, and it will resume the calculation from the point where it was interrupted. You can see the progress of the computation as well as the main pipeline parameters in the log file, whose location is printed on the screen when the pipeline starts.
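
For users building their own pipelines, one common way to get this kind of resume behaviour is to record each finished step with a marker file and skip marked steps on the next run. The sketch below illustrates that general pattern only; it is an assumption for illustration and does not describe how bayespi_bar2_pipeline.py is actually implemented. The names run_resumable and state_dir are hypothetical.

```python
import os


def run_resumable(steps, state_dir):
    """Run (name, function) steps in order, skipping completed ones.

    A step counts as completed when its marker file exists in
    state_dir. Returns the names of the steps executed in this call.
    """
    if not os.path.isdir(state_dir):
        os.makedirs(state_dir)
    executed = []
    for name, function in steps:
        marker = os.path.join(state_dir, name + ".done")
        if os.path.exists(marker):
            continue  # finished in an earlier run
        function()
        # Write the marker only after the step returned without raising,
        # so an interrupted step is repeated on the next run.
        with open(marker, "w") as handle:
            handle.write("done\n")
        executed.append(name)
    return executed
```

With this pattern, a step that was interrupted halfway is rerun from its beginning, so each step should be safe to repeat.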

The get_and_preprocess_data.py script will download about 2 GB of data necessary for the pipeline. Here is the full list of additional files that will be downloaded:

Customizing the pipeline

The bayespi_bar2_pipeline.py script is the starting point for users wishing to use BayesPI-BAR2 to process their own datasets. The instructions for customizing the default pipeline can be found here.

References

  1. BayesPI - a new model to study protein-DNA interactions: a case study of condition-specific protein binding parameters for Yeast transcription factors. Wang J, Morigen. BMC Bioinformatics. 2009 Oct 20;10:345. doi: 10.1186/1471-2105-10-345. [PMID:19857274]
  2. Quality versus accuracy: result of a reanalysis of protein-binding microarrays from the DREAM5 challenge by using BayesPI2 including dinucleotide interdependence. Wang J. BMC Bioinformatics. 2014 Aug 27;15:289. doi: 10.1186/1471-2105-15-289. [PMID:25158938]
  3. BayesPI-BAR: a new biophysical model for characterization of regulatory sequence variations. Wang J, Batmanov K. Nucleic Acids Res. 2015 Dec 2;43(21):e147. doi: 10.1093/nar/gkv733 [PMID:26202972]
  4. Integrative whole-genome sequence analysis reveals roles of regulatory mutations in BCL6 and BCL2 in follicular lymphoma. Batmanov K, Wang W, Bjoras M, Delabie J, Wang J. Sci Rep. 2017 Aug 1;7(1):7040. doi: 10.1038/s41598-017-07226-4. [PMID:28765546]
  5. BayesPI-BAR2: a Python package for predicting functional non-coding mutations in cancer patient cohorts. Batmanov K, Delabie J, Wang J. Frontiers in Genetics.