The PRODORIC project was initiated in 2001 during the founding of the Bioinformatics Competence Center Braunschweig "Intergenomics". The project is funded by the German Research Foundation (Deutsche Forschungsgemeinschaft, DFG), and PRODORIC2 is maintained at the Institute of Microbiology of the Technical University of Braunschweig.
PRODORIC is an acronym for PROkaryotIC Database Of Gene Regulation (the term "doric" refers to Greek columns). The database is an integrated approach to providing information about molecular networks in prokaryotes, with a focus on pathogenic organisms. In the long term it aims to be a resource for modeling and visualizing protein-host interactions. Furthermore, it is intended as a platform for analyzing high-throughput data from proteomics and transcriptomics experiments (systems biology).
Currently, PRODORIC2 contains information about operon and promoter structures, including a large collection of transcription factor binding sites (TFBSs). If an appropriate number of regulatory binding sites is available, a position weight matrix (PWM) and a sequence logo are provided, which can be used to predict new binding sites with the Virtual Footprint tool. The TFBS data are collected manually by screening the original scientific literature.
Annotation and Content
All entries of PRODORIC2 are generated by manually screening the literature, with the exception of core data such as genomic annotations available at NCBI. PRODORIC2 now also includes TFBSs from high-throughput experiments such as RNA-seq, ChIP-seq, or microarrays. Where possible, such sites are also confirmed by bioinformatic approaches and/or additional direct experiments such as EMSA. PRODORIC2 locally stores the core data for the vast majority of sequenced and closed bacterial genomes. The last strain update took place in 2015.
PRODORIC2 focuses strongly on the curation of TFBSs and associated data. TFBSs are directly linked to their respective genomes, promoters, and regulated genes. They are organized in multiple alignment profiles and used to construct PWMs. Currently, more than 3900 TFBSs extracted directly from the scientific literature are stored in PRODORIC2. Based on these data, a large library of over 300 profiles and PWMs is provided. In many cases, species-specific PWMs of the respective regulators are available, which allow a more specific prediction of potential new TFBSs.
Since PRODORIC2 is a genome-based database, it locally stores whole bacterial genome sequences in GenBank format, including the annotated ORFs. For this purpose, the respective data available at the NCBI server and the genome reviews present at the EBI were downloaded, and the corresponding accession numbers were imported into PRODORIC2. These fundamental data are referred to as core data, as they describe the major transcription units. All data were organized into a SQLite3 relational structure to model various biological features and molecular interactions as accurately as possible. These include special microbial features such as operons and typical prokaryotic promoter structures.
In this figure the central part of the database is shown schematically as a UML model.
Virtual Footprint is a bioinformatic search tool for DNA pattern recognition. It was designed specifically to analyze transcription factor binding sites in whole bacterial genomes and the underlying regulatory networks. A pattern is defined by a position weight matrix, and a large library of over 300 bacterial PWMs is provided. Furthermore, the program can filter the results according to their genomic context: matches in coding regions can be excluded, the size of the upstream region (distance to the start codon) can be defined, and hits without genomic context (such as palindromic hits on the opposite strand) can be filtered out. The result is a list of potential binding sites and corresponding genes that defines the whole regulon.
Core Sensitivity/Core Score
When the individual weights of a position weight matrix are summed to an overall score, less conserved positions can offset well-conserved positions, which can lead to an over-evaluation of matches and, consequently, to an accumulation of false-positive predictions. To avoid this, we implemented a core pattern that consists of the most conserved positions in a PWM. The pre-set value in PRODORIC2 is 5 positions. The core sensitivity is pre-set to 0.9, which means that the core threshold score is chosen so that 90% of the binding sites used for the PWM are recovered. See also Sensitivity/Score.
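One plausible reading of the core check can be sketched in Python. This is an illustrative sketch, not the Virtual Footprint implementation: it assumes that "most conserved" can be ranked by the summed weight of a PWM column and that a match must pass both a core threshold and an overall threshold. The function names, the toy PWM values, and the thresholds are all hypothetical.

```python
def core_positions(pwm, core_size=5):
    """Indices of the core_size columns with the largest summed weight.

    Assumption: summed column weight is used as a proxy for conservation.
    """
    ranked = sorted(range(len(pwm)), key=lambda l: sum(pwm[l].values()), reverse=True)
    return sorted(ranked[:core_size])


def core_score(pwm, core, seq):
    """Sum of the weights at the core positions only."""
    return sum(pwm[l].get(seq[l], 0.0) for l in core)


def full_score(pwm, seq):
    """Sum of the weights over all positions."""
    return sum(column.get(base, 0.0) for column, base in zip(pwm, seq))


def passes(pwm, core, seq, core_threshold, full_threshold):
    """A candidate is kept only if it passes the core and the overall threshold."""
    return (core_score(pwm, core, seq) >= core_threshold
            and full_score(pwm, seq) >= full_threshold)


# Toy 7-column PWM with made-up weights; bases not listed have weight 0.
toy_pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.2},
    {"T": 2.0},
    {"T": 2.0},
    {"G": 1.5, "T": 0.2},
    {"A": 2.0},
    {"C": 1.2, "T": 0.3},
    {"A": 0.3, "G": 0.2},
]

core = core_positions(toy_pwm, core_size=5)
good = passes(toy_pwm, core, "ATTGACA", core_threshold=8.0, full_threshold=8.5)
weak = passes(toy_pwm, core, "ATTTATA", core_threshold=8.0, full_threshold=6.0)
```

In this toy example the second candidate would pass the overall threshold but is rejected because its core positions score too low, which is the effect the core pattern is meant to achieve.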
Genomes and Replicons (preselected)
Various sequenced genomes and replicons present in the PRODORIC2 database can be selected for the analyses. The genomes in PRODORIC2 are updated every 3-6 months.
Hide Hits without Genomic Context
Palindromic matches are usually found on both strands. This option removes the hit that is oriented antisense to the downstream gene.
Maximal Distance to Gene
The maximal allowed distance of a match to a downstream gene (transcriptional start) used in the regulon analysis.
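The distance filter can be illustrated with a short Python sketch. It is a simplification under stated assumptions: genes are reduced to a single start coordinate and a strand, a hit counts as upstream only if it lies entirely before the gene start in the gene's reading direction, and every gene within the limit is reported. The `Gene` class, the function names, and the coordinates are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    name: str
    start: int   # coordinate of the gene start (hypothetical simplification)
    strand: str  # "+" or "-"


def distance_to_gene(hit_start, hit_end, gene):
    """Distance from a hit to the gene start, or None if the hit is not upstream."""
    if gene.strand == "+":
        distance = gene.start - hit_end    # gene lies to the right of the hit
    else:
        distance = hit_start - gene.start  # reverse-strand gene lies to the left
    return distance if distance > 0 else None


def genes_within(hit_start, hit_end, genes, max_distance):
    """Names and distances of all genes whose start is within max_distance of the hit."""
    result = []
    for gene in genes:
        distance = distance_to_gene(hit_start, hit_end, gene)
        if distance is not None and distance <= max_distance:
            result.append((gene.name, distance))
    return result


genes = [Gene("geneA", 1000, "+"), Gene("geneB", 400, "-"), Gene("geneC", 5000, "+")]
nearby = genes_within(700, 715, genes, max_distance=350)
```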
Position Weight Matrix (PWM)
PWMs offer a sensitive way to represent the similarity to a degenerate DNA pattern, e.g. a transcription factor binding site. They are built on the basis of a set of aligned known sequences:
f(b,l) is the frequency of each base b at position l in the aligned binding sites (Schneider et al., 1986). We account for nucleotide bias by using a linear correction of noise (Schreiber & Brown, 2002). Because of this background model, the number of matches can differ depending on whether a sequence is uploaded or chosen directly in the system. The reason is that the GC content of a genome influences the scoring of a match. For uploaded sequences in promoter analysis, the actual GC content is not considered, because those sequences are usually too short; a GC content of 50% is assumed instead. This difference in scoring can result in different matches, especially for lower-scoring matches. The position weight matrix m(b,l) is afterwards generated by:
This is equivalent to the individual letter size of a sequence logo (Schneider & Stephens, 1990). For the case f(b,l)=0 we additionally introduced a penalty function dependent on the sample size n instead of using pseudo-scores:
A similarity score is calculated by applying the PWM to a sequence. This is simply done by summing up the corresponding individual weights m(b,l) to an overall score (see numbers on the right of the alignment).
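The construction and scoring steps can be sketched in Python. This is a minimal sketch under stated assumptions, not the Virtual Footprint implementation. It assumes that each weight is the base frequency multiplied by the information content of its column, I(l) = 2 + sum over b of f(b,l) * log2 f(b,l), which corresponds to the letter height in a sequence logo without small-sample correction. It omits the background (GC content) correction and simply assigns a weight of 0 where f(b,l)=0 instead of applying the sample-size-dependent penalty. The function names and the example sites are illustrative.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_pwm(sites):
    """Build a PWM (one {base: weight} dict per column) from aligned sites."""
    n = len(sites)
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all aligned sites must have the same length")
    pwm = []
    for l in range(length):
        counts = Counter(site[l] for site in sites)
        freqs = {b: counts[b] / n for b in BASES}
        # Shannon entropy of the column; terms with f = 0 contribute nothing.
        entropy = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        information = 2.0 - entropy  # bits, maximum 2 for DNA
        # Assumed weight: frequency times column information (logo letter height).
        pwm.append({b: freqs[b] * information for b in BASES})
    return pwm


def score(pwm, seq):
    """Similarity score: sum of the individual weights m(b,l) along seq."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal PWM width")
    return sum(column.get(base, 0.0) for column, base in zip(pwm, seq))


def best_match(pwm, sequence):
    """Slide the PWM along sequence and return (best score, offset)."""
    width = len(pwm)
    windows = (
        (score(pwm, sequence[i:i + width]), i)
        for i in range(len(sequence) - width + 1)
    )
    return max(windows)


sites = ["TTGACA", "TTGACA", "TTGATA", "TTTACA"]  # made-up aligned sites
pwm = build_pwm(sites)
consensus_score = score(pwm, "TTGACA")
top_score, top_offset = best_match(pwm, "GGCCTTGACAGGCC")
```

A fully conserved column contributes the maximum weight of 2 for its base, while a mixed column contributes less, so mismatches at conserved positions lower the score most.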
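Sensitivity/Score
This value is used to adjust the accuracy of a position weight matrix search by calculating an appropriate threshold score. Sensitivity (Sn) is defined as the rate of true positives (TP) at a given threshold score: Sn = TP / (TP + FN), where FN is the number of false negatives, i.e. known binding sites that score below the threshold.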
Example: a value of 0.8 means that the threshold score is chosen so that 80% of the binding sites used for the PWM are recovered. This option is pre-set to 0.8 in the current Virtual Footprint version.
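The relation between sensitivity and threshold score can be sketched as follows. This is an assumption-laden illustration rather than the tool's actual procedure: it takes the scores of the known binding sites as given, and picks the highest threshold at which at least the requested fraction of them is still recovered. The function names and the example scores are made up.

```python
import math


def threshold_for_sensitivity(site_scores, sensitivity):
    """Highest threshold at which at least `sensitivity` of the sites score >= threshold."""
    if not site_scores:
        raise ValueError("at least one site score is required")
    if not 0 < sensitivity <= 1:
        raise ValueError("sensitivity must be in (0, 1]")
    ranked = sorted(site_scores, reverse=True)
    # Number of sites that must be recovered; rounding guards against float noise.
    needed = math.ceil(round(sensitivity * len(ranked), 9))
    return ranked[needed - 1]


def recovered_fraction(site_scores, threshold):
    """Sn = TP / (TP + FN) for the known sites at the given threshold."""
    true_positives = sum(1 for s in site_scores if s >= threshold)
    return true_positives / len(site_scores)


known_site_scores = [9.1, 8.7, 8.4, 8.0, 7.9, 7.5, 7.2, 6.8, 5.1, 3.0]
threshold = threshold_for_sensitivity(known_site_scores, 0.8)
sn = recovered_fraction(known_site_scores, threshold)
```

With these ten example scores, a sensitivity of 0.8 leaves the two lowest-scoring known sites below the threshold; raising the sensitivity lowers the threshold and admits more, typically weaker, matches.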
Show only Intergenic Hits
This option restricts genome-wide matches to non-coding (intergenic) regions.
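As a rough illustration, such a filter could look like the following Python sketch. It assumes that coding regions are given as inclusive (start, end) coordinate pairs and that a hit counts as intergenic only if it does not overlap any of them. The function name and all coordinates are hypothetical.

```python
def is_intergenic(hit_start, hit_end, coding_regions):
    """True if the hit overlaps none of the (start, end) coding regions."""
    return all(hit_end < start or hit_start > end for start, end in coding_regions)


coding_regions = [(100, 900), (1200, 2400)]  # made-up gene coordinates
hits = [(50, 65), (880, 905), (1000, 1015), (1300, 1315)]

# Keep only hits that lie entirely in non-coding sequence.
intergenic_hits = [hit for hit in hits if is_intergenic(hit[0], hit[1], coding_regions)]
```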
Web Interface and Technology
Access to PRODORIC2 and its associated tools is free for non-commercial use by individuals at academic institutions.
We cannot guarantee the accuracy of any data or databases nor the suitability of databases, software and services for any purpose.
The authors of PRODORIC2 are not responsible for the content of links to other websites.
Please cite PRODORIC2 and results obtained by the use of the database as:
- Eckweiler, D., Dudek, C.-A., Hartlich, J., Brötje, D., & Jahn, D. (2017): PRODORIC2: the bacterial gene regulation database in 2018. Nucleic Acids Res. https://doi.org/10.1093/nar/gkx1091.