[Figure: four schematic panels showing a distinct TATA box core promoter, a distinct INR core promoter, a composite core promoter, and a null core promoter.]

Four different setups of a polymerase II core promoter. Graphical symbols as in Figure 4.1.

At first sight, the only common denominator of transcription initiation within a promoter would seem to be the presence of at least one core promoter element within a certain region. This assumption is wrong. Both the spacing and the sequential order of elements within the core promoter module are of utmost importance, regardless of the presence or absence of individual elements (this holds as a rule, although there appear to be some exceptions). Moreover, many distinct promoters have requirements for specific upstream or downstream elements and will function with their specific TFs. Moving the initiator, the TATA box and, to some extent, upstream elements can have profound effects on promoter function. For example, insertion of just a few nucleotides between the TATA box and an upstream TF binding site (MyoD) in the desmin gene promoter cuts expression levels by more than half [18]. Moreover, the promoter structure can affect later stages of gene expression such as splicing [19]. It was also shown for the rat beta-actin promoter that a few mutations around the transcription start site (i.e. within the initiator) can render that gene subject to translational control [20].

As a final note, the mere concept of one general TATA box and one general INR is an oversimplification. There are several clearly distinguishable TATA boxes in different promoter classes [21, 22, 23, 24], and the same is true for the INR region, which also has several functionally distinct implementations such as the glucocorticoid-responsive INR in the murine thymidine kinase gene [25], the C/EBP binding INR in the hepatic growth factor gene promoter [26], or the YY-1 binding INR [27].

Most of the principles of variability and restrictions detailed above for the core promoter modules are also true for other promoter modules that modify transcriptional efficiency rather than determining the start point of transcription as the core promoter does. The bottom line is that the vast majority of alternative arrangements of the elements that can be seen in a particular promoter might not contribute to the function of the promoter. Module-induced restrictions are not necessarily obvious from the primary sequences. Figure 4.3 shows a schematic pol II promoter with the initiation complex assembled, which illustrates that it matters where a specific protein is bound to the DNA in order to allow for proper assembly of the molecular jigsaw puzzle of the initiation complex. This is not immediately obvious from inspection of promoter sequences because there exist several (but a strictly limited set of) alternative solutions to the assembly problem. As complicated as Figure 4.3 may appear, it still ignores all aspects of chromatin rearrangements and nucleosomal positions, which also play an important role in transcription regulation. Stein et al. [28] have detailed an example of the profound influence of these effects on promoter-protein complex assembly and function in a study of the osteocalcin promoter. However, chromatin-related effects are not yet considered in any of the promoter prediction methods. Therefore, we do not go into any more detail here.


[Figure: schematic promoter with a distal promoter/enhancer region and a proximal promoter, each carrying TF binding sites, and the assembled initiation complex (label TFIIA).]

Transcription initiation complex bound to a schematic promoter

Bioinformatics models for the analysis and detection of regulatory regions

Algorithms used to analyze and detect regulatory regions are necessarily based on some kind of usually simplified model of what a regulatory region should look like. All of these models inevitably compromise between accuracy with respect to the biological model (the standard of truth) and computational feasibility. For example, a computational model based on a priori three-dimensional structure prediction derived from molecular dynamics using sophisticated force fields may be the most accurate model for a region, but it cannot be used for the analysis of real data because of its excessive demand on computational resources. On the other hand, a model based on simple sequence similarities detected by IUPAC consensus sequences (see Chapter 3 for a detailed definition) can easily be used on a PC, but the results will usually not match the biological truth in an acceptable manner.

Statistical models

It was noted several years ago that promoters, and most likely also other regulatory regions like enhancers, contain more transcription factor binding sites than non-regulatory sequences. An analysis of the relative frequencies of such sites within a sliding window can therefore yield some information about the potential regulatory character of a stretch of DNA; this is the prototype of simple statistical models. Several programs exist that rely to some extent on this type of statistics.

Mixed models

Of course, it is quite clear from section 4.1 that a purely statistical model is an oversimplification that will adversely affect the accuracy of prediction despite its attractive ease of implementation. Therefore, mixed models are also used that take at least some regional information into consideration and can be seen as statistical models split into compartments. Within the compartments solely statistical features are considered, but promoter organization is somewhat reflected by the arrangement of the compartments, which represent different promoter regions.

Organizational models

The last category consists of models that try to closely follow the organizational principles of real regulatory regions. To accomplish that, individual promoter elements like transcription factor binding sites, as well as their relative order and distances, are encoded in a formal model which reflects the setup of a single promoter or a small group of functionally similar promoters. Although these models match the biological situation best, they are not yet suitable for widespread application. High-quality sets of training sequences are required to generate such models because many parameters concerning the nature, order and distance of elements have to be determined. This is far from an easy task and limits the application of such models to relatively few examples that require elaborate analysis.
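The idea of an organizational model, elements in a defined order and within a defined distance range, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the `Module` class, the regular-expression encoding of the elements (the first is the IUPAC string TGASTCA written as a regular expression, the second is a toy TATA-like pattern), and the gap limits are hypothetical and do not reproduce any published program.

```python
import re
from dataclasses import dataclass


@dataclass
class Module:
    """Two elements that must occur in this order, separated by a bounded gap."""
    upstream: str      # regular expression for the upstream element
    downstream: str    # regular expression for the downstream element
    min_gap: int       # minimum number of nucleotides between the elements
    max_gap: int       # maximum number of nucleotides between the elements


def occurrences(seq, pattern):
    """Return (start, end) of every match, including overlapping ones."""
    regex = re.compile(pattern)
    hits, pos = [], 0
    while True:
        match = regex.search(seq, pos)
        if match is None:
            return hits
        hits.append((match.start(), match.end()))
        pos = match.start() + 1


def find_module(seq, module):
    """Return (upstream_start, downstream_start) pairs that satisfy the module."""
    seq = seq.upper()
    downstream_hits = occurrences(seq, module.downstream)
    found = []
    for up_start, up_end in occurrences(seq, module.upstream):
        for down_start, _ in downstream_hits:
            gap = down_start - up_end
            if module.min_gap <= gap <= module.max_gap:
                found.append((up_start, down_start))
    return found


# Hypothetical module: TGASTCA followed 5-15 nucleotides later by TATAAA.
toy_module = Module(upstream="TGA[CG]TCA", downstream="TATAAA",
                    min_gap=5, max_gap=15)
print(find_module("GGTGACTCAACGTACGTACTATAAAGG", toy_module))
```

The same two elements in the opposite order, or too far apart, produce no match, which is exactly the property that a purely statistical count of elements cannot express.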

122 | 4 Analyzing Regulatory Regions in Genomes 4.3

Methods for element detection

Detection of transcription factor binding sites

Transcription factor binding sites are the most important elements within regulatory DNA regions like promoters or enhancers. The majority of the known transcription factors recognize short DNA stretches of about 10-15 nucleotides in length that show different degrees of internal variation. Successful detection of protein binding sites in DNA sequences always relies on precompiled descriptions of individual binding sites. Such descriptions are usually derived from a training set of four or more authentic binding sites. However, the criteria applied for the decision whether a site is authentic or not vary considerably among authors of different publications. One of the first approaches to define protein binding sites used IUPAC consensus sequences, which indicate the predominant nucleotide or nucleotide combination at each position in a set of example sequences. The IUPAC string TGASTCA indicates that the first three positions are most frequently T, G, and A while the fourth position may be C or G followed by T, C, and A in most cases. IUPAC consensus sequences became very popular as they are extremely easy to define from even a small set of sequences and their definition does not require more than a pencil and a sheet of paper.
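A minimal sketch of how an IUPAC consensus such as TGASTCA can be matched against a sequence is given below. The function name `iupac_matches` and the optional mismatch tolerance are assumptions made for illustration; the code table follows the standard IUPAC nucleotide ambiguity codes.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC", "B": "CGT", "D": "AGT",
    "H": "ACT", "V": "ACG", "N": "ACGT",
}


def iupac_matches(seq, consensus, max_mismatches=0):
    """Return (position, matched window, mismatches) for each accepted window."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - len(consensus) + 1):
        window = seq[i:i + len(consensus)]
        # A position mismatches if the base is not allowed by the IUPAC code.
        mismatches = sum(base not in IUPAC[code]
                         for base, code in zip(window, consensus))
        if mismatches <= max_mismatches:
            hits.append((i, window, mismatches))
    return hits


print(iupac_matches("AATGACTCAGGTGAGTCATT", "TGASTCA"))
# A site with T at the fourth position is rejected by the strict consensus:
print(iupac_matches("TGATTCA", "TGASTCA"))
print(iupac_matches("TGATTCA", "TGASTCA", max_mismatches=1))
```

The second call shows the all-or-nothing character of a consensus string: a single deviating position is enough to reject a window unless mismatches are explicitly tolerated.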

However, IUPAC consensus sequences strongly depend on the sequence set used for definition. The final IUPAC consensus sequence remains arbitrary depending on the rules used to determine the consensus. Cavener defined some rules that we have used at GSF for several years now and, in our experience, IUPAC consensus sequences defined that way can be useful [29]. However, IUPAC consensus sequences may reject biologically functional binding sites due to a single mismatch (or an ill-defined IUPAC sequence).

The concept of nucleotide weight matrix (NWM) descriptions was developed in the 1980s as an alternative to IUPAC strings [30, 31]. However, although weight matrices proved to be generally superior to IUPAC strings, their biggest disadvantage is the absolute requirement for predefined matrices, which are more complicated to construct than IUPAC strings and require specific software. This delayed widespread use of weight matrices for almost a decade although the methods were available in principle. They remained mostly unused because only a few special matrices had been defined [32]. The situation changed when, in 1995, two (overlapping) matrix libraries for TF sites were compiled and became widely available almost simultaneously [33, 34]. Matrix Search [34] transformed the TRANSFAC database as completely as possible (starting at two binding sites for one factor) into matrices using a log-odds scoring approach. The MatInspector library is largely based on a stringent selection from the matrix table of the TRANSFAC database, including the matrices derived from the ConsInspector library [35, 36, 37] and several genuine matrices. The Information Matrix Database (IMD) was compiled from the TRANSFAC matrix table and the TFD. The MatInspector library is updated with each new TRANSFAC release while IMD has not been updated recently. Available matrix detection programs have been reviewed recently [36] and a comparison of these methods by application to a test set of sequences was published [37]. For convenience, Table 4.1 summarizes which methods for the detection of transcription factor binding sites are available on the Internet, with emphasis on programs featuring a WWW-interface.
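To make the contrast with IUPAC strings concrete, the following sketch builds a log-odds weight matrix from a few aligned sites and scans a sequence with it. It is a generic illustration and not the scoring scheme of Matrix Search or MatInspector: the pseudocount, the uniform background of 0.25 per nucleotide, the threshold, and all function names are assumptions. Input is assumed to contain only A, C, G and T.

```python
import math


def build_log_odds_matrix(sites, pseudocount=1.0, background=None):
    """Build one {base: log2(p / background)} column per aligned position."""
    background = background or {base: 0.25 for base in "ACGT"}
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all binding sites must have the same length")
    matrix = []
    for pos in range(length):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos].upper()] += 1
        total = sum(counts.values())
        matrix.append({base: math.log2((counts[base] / total) / background[base])
                       for base in "ACGT"})
    return matrix


def score_window(matrix, window):
    """Sum the column scores of the bases found in the window."""
    return sum(column[base] for column, base in zip(matrix, window.upper()))


def scan(seq, matrix, threshold):
    """Return (position, score) for every window scoring at least threshold."""
    width = len(matrix)
    hits = []
    for i in range(len(seq) - width + 1):
        score = score_window(matrix, seq[i:i + width])
        if score >= threshold:
            hits.append((i, score))
    return hits


# Toy training set of four sites (hypothetical, for illustration only).
toy_sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TGAGTCA"]
toy_matrix = build_log_odds_matrix(toy_sites)
print(scan("AATGACTCAGG", toy_matrix, threshold=5.0))
```

Unlike the consensus string, the matrix assigns a graded score: a window such as TGATTCA scores lower than TGACTCA but still far higher than an unrelated sequence, so the decision is moved into an adjustable threshold.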

Detection of structural elements

Regulatory sequences are associated with several other individual elements or sequence properties in addition to the factor binding sites. Among these are secondary structure elements like the HIV-1 TAR region (Trans-Activating Region), which constitutes an RNA enhancer [38], cruciform DNA structures [39], or simple direct repeats [40]. Three-dimensional structures like curved DNA [41] also influence promoter function. Most of these elements can be detected by computer-assisted sequence analysis [42, 43, 44, 45], but none of them is really promoter specific and all such elements can be found frequently outside of promoters. The promoter or enhancer function arises from the combination of several elements that need to cooperate to exert transcription control, which none of them can achieve alone. This also illustrates the main problem of promoter recognition: it is necessary to compile several individually weak signals into a composite signal, which then indicates a potential promoter, without being overwhelmed by the combinatorial complexity of potential element combinations.

Assessment of other elements

Several methods employ statistical measures of sequence composition to include features of regulatory sequences that cannot be described by the three types discussed above. These include frequencies of oligonucleotides (dinucleotides, trinucleotides, and hexamers are used most frequently), CpG islands (CG dinucleotides are usually underrepresented in mammalian genomes except in parts of coding and regulatory sequences; CpG islands are regions where the dinucleotide is not underrepresented [46]), and periodicity of weak sequence patterns (AA, TT, etc.). Definitions of such elements are usually too weak to allow any significant contribution to current prediction programs. However, this situation might well change due to the unprecedented amounts of continuous genomic sequences that become available in the course of the current genome sequencing projects.

Internet accessible methods to detect promoter elements (transcription factor binding sites)

Programs: TF Search
Addresses: //www.gsf.de/cgi-bin/matsearch.pl; //genomatix.gsf.de/cgi-bin/matinspector/matinspector.pl; //bimas.dcrt.nih.gov/molbio/signal/
Libraries: MatInspector matrix library (includes TRANSFAC matrices); IUPAC consensus library; IMD matrix library (TRANSFAC + TFD); TRANSFAC matrices; TRANSFAC matrices
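The statement that CpG islands are regions where the CG dinucleotide is not underrepresented can be expressed as an observed/expected ratio. The sketch below is one common way to compute it; the formula (number of CG dinucleotides times sequence length, divided by the product of the C and G counts) and the thresholds of 0.6 for the ratio and 0.5 for the G+C fraction are widely used conventions that are assumed here and are not taken from this chapter. Function names, window size and step are likewise illustrative.

```python
def cpg_observed_expected(seq):
    """Observed/expected CpG ratio: (#CG * length) / (#C * #G)."""
    seq = seq.upper()
    c_count, g_count = seq.count("C"), seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0
    return seq.count("CG") * len(seq) / (c_count * g_count)


def cpg_windows(seq, window=200, step=50, min_ratio=0.6, min_gc=0.5):
    """Return (start, end, ratio) for windows passing both assumed thresholds."""
    seq = seq.upper()
    if not seq:
        return []
    hits = []
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        part = seq[start:start + window]
        gc_fraction = (part.count("C") + part.count("G")) / len(part)
        ratio = cpg_observed_expected(part)
        if gc_fraction >= min_gc and ratio >= min_ratio:
            hits.append((start, start + len(part), ratio))
    return hits


# Toy sequence: an AT-rich flank, a CG-rich core, and another AT-rich flank.
toy = "AT" * 100 + "CG" * 100 + "AT" * 100
for start, end, ratio in cpg_windows(toy, window=200, step=100):
    print(start, end, round(ratio, 2))
```

On real genomic DNA the ratio is typically well below 1 outside islands, which is why a window-based scan of this kind can flag candidate regions without any knowledge of binding sites.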

Analysis of regulatory regions

Basically two different tasks can be distinguished in the analysis of regulatory regions. The first task is analysis aiming at the definition of common features based on sets of known regulatory sequences. This is a prerequisite for the definition of descriptions suitable for large-scale application for prediction of potential regulatory regions within new anonymous sequences which can be regarded as the second task.

Training set selection

One of the most important steps in comparative sequence analysis is the selection of suitable training sets of sequences. If a training set of promoters consists only of constitutively expressed sequences (constant level of expression, no or little regulation), little can be learned about any kind of tissue-specific expression regardless of the methods applied. Likewise, inclusion of too many wrong sequences (e.g., sequences that are not promoters at all) may prevent any meaningful analysis. Although this appears a bit trivial at first, it is a real issue when data are scarce and less well-characterized sequences have to be used.

Control sets known not to be functionally similar to the training sets are about as important as the training sets themselves. However, true negative regions are even scarcer than known regulatory regions. Negative often means just "no positive functions found" which can also be due to failures or simply means that the sequences have not been tested at all. Therefore, statistical negative control sequences are often required. Random sequences can be easily generated but often are of limited use, as they do not represent several important features of natural DNA correctly. This includes under-represented features (e.g., CpG islands), asymmetric features (e.g., strand specificity), or repetitive DNA elements. Selection of appropriate control sequences can be a major effort, but is also crucial for the validity of the evaluation of any method. Common problems with controls are either known or unknown biases in the control set or circularity problems, i.e. the training and the test sets of sequences are related or overlap. The availability of large continuous stretches of genomic DNA from the genome sequencing projects constantly improves this situation. Genomic sequences should always be the first choice for controls as they reflect the natural situation.

Statistical and biological significance

The quality of sequence pattern recognition is often optimized to improve the correlation of the methods with the data (positive and negative training sets). However, in most cases it is not possible to collect sufficient data to perform a rigorous correlation analysis. Therefore, bioinformatics methods often rely on statistical analysis of their training sequences and optimize for statistically most significant features. Unfortunately, this kind of optimization does not always reflect the evolutionary optimization of regulatory sequences that is always optimizing several features at once.

The dynamics of biological function often necessitates suboptimal solutions. For example, real sequences usually do not contain binding sites with the highest affinity for their cognate protein because binding AND dissociation of the protein are required for proper function. The perfect binding site would interfere with the dissociation and is therefore strongly selected against.

Context dependency

The biological significance of any sequence element is defined by the regulatory function it can elicit. This is usually dependent on a functional context rather than being a property of individual elements. Therefore, statistical significance of the features or scores of individual elements is neither necessary nor sufficient to indicate biological significance. Recognition of the functional context in an essentially linear molecule like DNA can be achieved by correlation analysis of individual elements, which became an important part of all semi-statistical or specific modeling approaches discussed below.

Methods for detection of regulatory regions

There are several methods available for the prediction of regulatory DNA regions in new sequence data. Table 4.2 lists available methods with a special focus on programs that provide a WWW-interface. Unfortunately, there is no "one-does-it-all" method and all methods have their individual strong and weak points. A program doing an excellent job in one case might be a complete failure in another case where other methods are successful. Therefore, we will describe a range of methods, and the order of discussion does not imply any ranking. We will rather follow the functional hierarchy that appears to apply to the different regulatory regions.

Available promoter/promoter regions prediction tools

Column headings: WWW-accessible; Comments

Promoter prediction

Programs: Promoter Scan; TSSG/TSSW; Core Promoter Finder; Audic/Claverie; PromFind
Addresses:
http://transfac.gbf.de/dbsearch/funsitep/fsp.html
http://www-hgc.lbl.gov/projects/promoter.html
http://beagle.colorado.edu/~chenq/Hypertexts/PromFD.html
[email protected]
http://biosci.umn.edu/software/proscan/promoterscan.htm
http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html
http://sciclio.cshl.org/genefinder/CPROMOTER/human.htm
[email protected]
ftp://iubio.bio.indiana.edu/molbio/ibmpc
ftp://arthur.epm.ornl.gov/pub/xgrail
Comments: includes proximal promoter; core promoter; includes proximal promoter; for further information; includes proximal promoter; includes proximal promoter; triplet & hexamer frequencies; discrimination analysis; Markov models; hexamer frequencies; for further information; core promoter in gene context

Promoter module/region recognition

Programs: FastM; TargetFinder; Muscle-specific regions; Xlandscape
Addresses:
genomatix.gsf.de/cgi-bin/fastm2/fastm.pl
http://www.gsf.de/biodv/genomeinspector.html
ftp://ariane.gsf.de/pub/unix/genomeinspector/
[email protected]
ftp://beagle.colorado.edu/pub/Landscape/xland.vl.tar.Z
Comments: module of 2 TF sites, uses MatInspector library; module of 1 TF site combined with 1 annotated feature; correlation analysis, e.g. 2 TF binding sites; contact for download; word frequencies, not promoter specific


Types of regulatory regions

Matrix attachment regions

A chromatin loop is the region of chromosomal DNA located between two contact points of the DNA with the nuclear matrix, marked by so-called Scaffold/Matrix Attachment Regions (S/MARs). The nuclear matrix is a mesh of proteins lining the inner surface of the nuclear envelope. Transcriptional regulation requires the association of DNA with this nuclear matrix, which retains a variety of regulatory proteins. S/MARs are composed of several elements including transcription factor binding sites, AT-rich stretches, potential cruciform DNA, and DNA-unwinding regions, to name a few of the most important S/MAR elements. There is an excellent review on chromatin domains and prediction of MAR sequences by Boulikas [47], explaining S/MARs and their elements in detail. Kramer et al. [48] published a method to detect potential S/MAR elements in sequences and made the method available via WWW (http://www.ncgr.org/MAR-search/) [49]. Their method is based on a statistical compilation of the occurrence of a variety of S/MAR features (called rules). Regions that accumulate sufficient matches to these rules are predicted as potential S/MAR regions. The specificity of the method depends critically on the sequence context of the potential S/MAR sequences. Therefore, results are difficult to evaluate by comparisons. However, so far this is the only method available to predict S/MAR regions. We developed another approach, called SMARTest, to define especially AT-rich MARs; it is available on the web at http://genomatix.gsf.de. SMARTest is based on a library of MAR-associated nucleotide weight matrices and determines S/MARs independent of any larger sequence context. Therefore, the method is suitable to test isolated S/MAR fragments. MARFinder and SMARTest are complementary and should be seen in combination rather than as alternatives.

Enhancers/silencers

Enhancers are regulatory regions that can significantly boost the level of transcription from a responsive promoter regardless of their orientation and distance with respect to the promoter, as long as they are located within the same chromatin loop. Silencers are basically identical to enhancers and follow the same requirements but exert a negative effect on promoter activities. At present there are no specific programs to detect enhancers and silencers. However, programs designed to detect the internal organization of promoters are probably also suitable to detect enhancers and silencers, since these elements often show an internal organization similar to that of promoters.

Promoters

Promoters are DNA regions capable of specific initiation of transcription (start of RNA synthesis) and consist of three basic regions (see Figure 4.2). The part determining the exact nucleotide for transcriptional initiation is called the core promoter and is the stretch of DNA sequence where the RNA polymerase and its cofactors assemble on the promoter.

The region immediately upstream of the core promoter is called the proximal promoter and usually contains a number of transcription factor binding sites responsible for the assembly of an activation complex. This complex in turn recruits the polymerase complex. It is generally accepted that most proximal promoter elements are located within a stretch of about 250-500 nucleotides upstream of the actual transcription start site (TSS).

The third part of the promoter is located even further upstream and is called the distal promoter. This region usually regulates the activity of the core and the proximal promoter and also contains transcription factor binding sites. However, distal promoter regions and enhancers exhibit no principal differences. If a distal promoter region acts independently of position and orientation, it is called an enhancer.

Programs for recognition of regulatory sequences

There are several ways promoter recognition tools can be categorized. We will focus on the main principles and intended usage of the programs rather than technical details, as this will also be the main interest of experimentally working scientists. Two generally distinct approaches have been used so far in order to achieve in silico promoter recognition [50]. The majority of programs focused on general promoter recognition, which represents the first category. One group of programs in this category (see below) concentrates on recognition of core promoter properties and infers promoter location solely on that basis while the other group consists of programs that take into account also the proximal promoter region of about 250-300 nucleotides upstream of the TSS. General recognition models are usually based on training sets derived from the Eukaryotic Promoter Database (EPD [51]) and various sets of sequences without known promoter activities. The EPD is an excellent collection of DNA sequences that fulfil two conditions: They have been shown experimentally to function as promoters and the transcription start site is known. The beauty of these approaches is their generality which does not require any specific knowledge about a particular promoter in order to make a prediction. This appears ideal for the analysis of anonymous sequences for which no a priori knowledge is available. The bad news is that the specificity of all such general approaches implemented so far is limited. The inevitable huge burden of false-positive predictions in longer sequences precludes large scale application of such approaches to genomic sequences.
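To give a feeling for the scale of this problem, the following minimal sketch estimates the expected number of matches from a given match rate. The rate range used (about one match per 10 000-30 000 nucleotides) is the figure quoted for general promoter predictors later in this chapter; the function name and the 1 Mb example length are illustrative assumptions, and the calculation treats matches as evenly spread along the sequence.

```python
def expected_matches(sequence_length, nucleotides_per_match):
    """Expected number of predictions if one match occurs per given stretch."""
    return sequence_length / nucleotides_per_match


# Hypothetical example: a 1 Mb genomic region.
length = 1_000_000
low = expected_matches(length, 30_000)
high = expected_matches(length, 10_000)
print(f"about {low:.0f} to {high:.0f} predicted promoters per megabase")
```

Because a megabase of genomic DNA contains far fewer genuine promoters than this, most of these predictions would be false positives under the stated assumptions.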

The second category of tools aims at specific promoter recognition relying on more detailed features of promoter subsets like combinations of individual elements. The beauty of this approach is excellent specificity, which is extremely helpful if only promoters of a certain class are of interest or megabases of sequences have to be analyzed. The bad news here is limited applicability, i.e., each promoter group or class requires a specifically predefined model before sequences can be analyzed for these promoters. This may result in a huge number of false negatives in large-scale analysis.

We will briefly discuss individual methods in these two categories with emphasis on the implementation of the biological principles of promoter features. Recently, a practical comparison of the majority of available tools based on general promoter models has been carried out, which has shown that none of these methods is clearly superior to its peers [50]. Therefore, I will not go into details about performance of the methods here (see Chapter 3 for details).

Programs based on statistical models (general promoter prediction)

These programs aim at the detection of pol II promoters by a precompiled general promoter model that is part of the method. Despite the complicated modular structure of promoters outlined above there is a solid rational basis for this general model. All promoters must have a functional core promoter module often containing a TATA box which is the prime target of the majority of the general promoter prediction tools. This is also one of the reasons why some programs confine their analysis to the core promoter region which avoids problems with the much more diverse proximal regions. General models that include the proximal region consequently treat this part of the promoter as a purely statistical problem of TF binding site accumulations, sometimes fine tuned by some sort of weighting based on occurrence frequencies of TF binding sites in promoters as compared to a negative sequence set.
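The purely statistical treatment of the proximal region described above, accumulation of TF binding sites weighted by how much more often each site occurs in promoters than in a negative set, can be sketched as follows. This is an illustration under stated assumptions and not the algorithm of any particular program: the log2 frequency ratio with a pseudocount, the window and step sizes, the site names and all counts are hypothetical, and the two count tables are assumed to come from equal amounts of sequence.

```python
import math


def site_weights(promoter_counts, control_counts, pseudocount=1.0):
    """log2 ratio of site occurrences in promoters versus the negative set."""
    weights = {}
    for site in set(promoter_counts) | set(control_counts):
        in_promoters = promoter_counts.get(site, 0) + pseudocount
        in_controls = control_counts.get(site, 0) + pseudocount
        weights[site] = math.log2(in_promoters / in_controls)
    return weights


def window_scores(site_hits, seq_length, weights, window=250, step=50):
    """Sum the weights of all predicted sites falling into each window.

    site_hits is a list of (position, site_name) pairs produced by any
    binding-site detection method.
    """
    scores = []
    for start in range(0, max(seq_length - window, 0) + 1, step):
        end = start + window
        score = sum(weights.get(name, 0.0)
                    for pos, name in site_hits if start <= pos < end)
        scores.append((start, score))
    return scores


# Hypothetical counts and site predictions for illustration.
weights = site_weights({"SP1": 7, "AP1": 1}, {"SP1": 1, "AP1": 3})
hits = [(10, "SP1"), (40, "SP1"), (60, "AP1"), (400, "AP1")]
for start, score in window_scores(hits, 500, weights, window=100, step=100):
    print(start, score)
```

Note that the score depends only on which sites fall into a window, not on their order or spacing, which is precisely the limitation of this class of models.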

However, the cost of generality without exception is a huge number of false positive predictions (typically about one match in 10 000-30 000 nucleotides). Sacrifice of a considerable percentage of true promoters (30% or more) is also a necessity to maintain at least some specificity. There also is inevitably no clue what kind of promoter was detected in case of a match.

PromoterScan

Several of the general promoter prediction programs followed the basic design of Prestridge who used the Eukaryotic Promoter Database (EPD) by Bucher [51] to train his software for promoter recognition. His program PromoterScan was the first published method to tackle this problem [52].

He utilized primate non-promoter sequences from GenBank as a negative training set and included the proximal promoter region in the prediction. The program uses individual profiles for the TF binding sites indicative of their relative frequency in promoters to accumulate scores for DNA sequences analyzed. PromoterScan employs the SignalScan IUPAC library of TF binding sites [53], introducing a good deal of biological knowledge into the method, although modular organization of the proximal region is necessarily ignored. Results of the first version were combined with the Bucher NWM for the TATA box, which served as a representation of the core promoter module [32].

There is now a new version of PromoterScan available, PromoterScan 2.0. This new version is supposed to provide more information from inside the "black box" which a promoter used to be for version 1.0, and it is also able to compare a predicted promoter to EPD promoter sequences on the basis of the pattern of TF sites. Although this moves promoter comparisons with PromoterScan 2.0 effectively closer to the specific recognition of individual elements, this approach is not used for the initial promoter prediction. PromoterScan is available via WWW, which is a definite advantage for occasional promoter testing.

PromFD

This program [54] extends the model behind PromoterScan in two ways. One major difference is the use of the IMD matrix library [34] instead of the SignalScan IUPAC strings. The other feature is the inclusion of patterns of strings of 5-10 nucleotides in length that were found to be overrepresented in the training set of promoters. Basically, the intrinsic model is the same as in PromoterScan, but the overrepresented strings may account for some so far unknown binding sites that are missing in the libraries. PromFD requires local installation and is not available via a WWW-interface. There have been no updates so far.

TSSG/TSSW

Two other methods (TSSG and TSSW) that also include proximal promoter regions are available via WWW and share the basic algorithm. They center on detection of a TATA box as the most prominent part of the core promoter. Promoter prediction is then based on the score of the TATA box and on nucleotide triplet distributions around the putative TSS. In analogy to PromFind (see below), hexamer frequencies in three 100-nucleotide-wide windows upstream of the putative TSS are also considered. These data are combined with potential TF binding sites, which are predicted either based on a TFD-derived compilation by Prestridge (in TSSG) or on the TRANSFAC database [11] (in TSSW). Fickett and Hatzigeorgiou did not report significant improvements in the predictive capabilities by inclusion of triplets and hexamers [50]. However, the general aim of these programs is not independent promoter prediction but to assist in finding exon-intron structures of complete genes.

XLandscape

The group of Stormo developed another method, called Xlandscape, which essentially determines nucleotide strings of various lengths (called words) that are specifically associated with promoters, exons or introns [55]. A score for a sequence is then determined from the different promoter, exon, and intron scores, indicating whether a particular region is likely to belong to one of these groups. Although this method was not developed for promoter prediction and completely ignores even the core promoter module, it fares about as well as all other general promoter prediction programs. The program requires local installation.

PromFind

This program [56] is similar to Xlandscape. It relies on the difference in hexamer frequencies between promoters and regions outside of promoters which is a more restricted view as compared to the landscaping approach. The advantage is less computational complexity. The program was also trained on sequences from the EPD and corresponding coding and non-coding regions outside the promoters taken from GenBank. Any region in which the ratio of promoter to non-promoter hexamer-frequencies reaches a threshold is considered and only the region where this measure is maximal is defined as the promoter. This again relies on pure statistics disregarding any organizational features of promoters. I am not aware of any updates since the initial publication.
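The principle described here, scoring windows by the ratio of promoter to non-promoter hexamer frequencies and reporting only the best window above a threshold, can be sketched as follows. This is not the published PromFind implementation: the mean log2 ratio as the window measure, the pseudocount, the window and step sizes, the threshold, and all names are assumptions chosen for illustration. The toy example uses words of length 3 instead of hexamers to keep the training data small.

```python
import math
from collections import Counter


def count_words(seqs, k=6):
    """Count all overlapping words of length k in a set of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        counts.update(seq[i:i + k] for i in range(len(seq) - k + 1))
    return counts


def window_score(window, promoter_words, control_words, k=6, pseudocount=1.0):
    """Mean log2 ratio of promoter to non-promoter word frequencies."""
    n_possible = 4 ** k
    promoter_total = sum(promoter_words.values()) + pseudocount * n_possible
    control_total = sum(control_words.values()) + pseudocount * n_possible
    n_words = len(window) - k + 1
    if n_words <= 0:
        return 0.0
    total = 0.0
    for i in range(n_words):
        word = window[i:i + k]
        f_promoter = (promoter_words[word] + pseudocount) / promoter_total
        f_control = (control_words[word] + pseudocount) / control_total
        total += math.log2(f_promoter / f_control)
    return total / n_words


def best_window(seq, promoter_words, control_words,
                window=300, step=10, threshold=0.0, k=6):
    """Return (start, score) of the maximal window at or above threshold."""
    seq = seq.upper()
    best = None
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        score = window_score(seq[start:start + window],
                             promoter_words, control_words, k)
        if score >= threshold and (best is None or score > best[1]):
            best = (start, score)
    return best


# Toy training data (hypothetical) with words of length 3.
toy_promoter_words = count_words(["GCGCGCGCGC"], k=3)
toy_control_words = count_words(["ATATATATAT"], k=3)
toy_seq = "ATATATATAT" + "GCGCGCGCGC" + "ATATATATAT"
print(best_window(toy_seq, toy_promoter_words, toy_control_words,
                  window=10, step=5, threshold=0.0, k=3))
```

As in the method described above, at most one region per sequence is reported, so a sequence containing two promoters would yield only the stronger one.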

NNPP

This program [57] utilizes time delay neural networks to locate a TATA box combined with an initiator region and thus is a representative of the second subgroup, which focuses on the minimum promoter region. Although the program does allow for variable spacing between the elements, distinct and null promoters in particular will pose problems in principle for this method, as it includes some modeling of a TATA box. An improved version of the program was published recently and the program features a lower number of false negatives (about 1 match/kb). This is good news if a short region of DNA is to be analyzed (length should be less than 2 kb). However, long DNA regions pose a problem, as the number of false positives becomes overwhelming. The program is available via a WWW-interface (http://www-hgc.lbl.gov/projects/promoter.html).

CorePromoter Search

Michael Zhang [58] published a new method to detect TATA-box-containing core promoters by discriminant analysis. This method is available via a WWW interface, which restricts the sequence length to 1 kb. Core Promoter Search and NNPP are alternative implementations of a similar general promoter model and can be applied in parallel.

Programs utilizing mixed models

These programs also rely on statistical promoter models but include, directly or indirectly, some organizational features of promoters. This places them between the purely statistical models and attempts to approximate the biologically important structured organization of promoters.

FunSiteP

This program [59] also takes into account proximal promoter regions and utilizes a collection of TF binding sites [60] with which a promoter set taken from EMBL (472 promoters) was analyzed. From this analysis a weight matrix of TF binding site localization was derived, representing regions in promoters with lower or higher concentrations of TF binding sites. FunSiteP not only reports potential promoter matches but also assigns them to one of seven promoter classes. These were taken from Bucher's definition (from EPD) and consist of small nuclear RNAs; structural proteins; storage and transport proteins; enzymes; hormones, growth factors, and regulatory proteins; stress- or defense-related proteins; and unclassified proteins. Although these classes, defined by biological function, are very broad, they represent an attempt towards more specific promoter recognition. FunSiteP is also available via a WWW interface.

Audic/Claverie

A program designed by Audic and Claverie [61] uses Markov models of vertebrate promoters, again generated by training on the EPD and on non-promoter sequences outside of the promoters. Markov models in principle allow for the inclusion of organizational features, and the nature of the training set determines whether this becomes part of the model or not. We describe the program in this section because the many different promoters in EPD most likely cause these Markov models to be more general than specific.

Programs relying on organizational models

This category of methods introduces the functional context in the form of heuristic rules or tries to learn the context from comparative sequence analysis.

These methods emphasize specific modeling of promoters or promoter substructures rather than general recognition. Therefore, it is not possible to directly assess the promoter prediction capabilities of these methods. However, in many cases recognizing a common substructure between promoters can be very helpful, especially for experimental design.

FastM

This method was derived from the program ModelGenerator [22] and takes advantage of the existence of NWM libraries. It can be accessed via a WWW interface (http://www.genomatix.de "free services") and allows for a straightforward definition of any module of two TF binding sites by simple selection from the MatInspector Library [33]. This enables definition and detection of a wide variety of synergistic TF binding site pairs. These pairs often are functional promoter modules conferring a specific transcriptional function to a promoter [10, 62]. FastM models of two binding sites can successfully identify promoters sharing such composite elements but are not promoter specific, because composite elements can also be located in enhancers or similar structures. A commercial version of FastM is available that enables definition of complete, highly specific promoter class models including up to 10 individual elements.

TargetFinder

Another approach aimed at modeling promoter substructures consisting of two distinct elements is TargetFinder [63]. This method combines TF binding sites with features extracted from the annotation of a database sequence to allow selective identification of sequences containing both features within a defined length. The advantage is that TargetFinder also follows the module-based philosophy but allows inclusion of features that have been annotated by experimental work and for which no search algorithm exists. Naturally, this excludes analysis of new anonymous sequences. The program is accessible via a WWW interface (http://gcg.tigem.it/TargetFinder.html).

It should be mentioned here that Fickett also employed the idea of a two TF binding site module to successfully detect a subclass of muscle-specific regulatory sequences governed by a combination of MEF2 and MyoD [64]. However, this was also a very specific approach and no general tool resulted from that work. The MEF2/MyoD model can be used to define a corresponding module with FastM. Wasserman and Fickett [65] recently published a modeling approach based on clustering of a preselected set of NWM (defined in this study) correlated with muscle-specific gene expression. They were able to detect about 25% of the muscle-specific regulatory regions in sequences outside their training set and more than 60% in their training set. They classify their method as regulatory module detection. However, their results suggest that they probably detect a collection of different, more specific modules with respect to the definition given above. Although the method is not promoter-specific and the specificity is moderate, it is a very interesting approach with potential for further development. The authors will make the non-commercial software used in their approach available on request (contact J. Fickett).
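The two-site module concept can be sketched in a few lines. The following is an illustration only and does not reproduce FastM or Fickett's method: it replaces weight matrices with simplified IUPAC consensus strings (a MEF2-like site written as YTAWWWWTAR and an E-box written as CANNTG, both used here as stand-ins), searches one strand and one element order only, and uses hypothetical function names and distance limits.

```python
import re

# IUPAC nucleotide codes used by the consensus strings in this sketch.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "W": "[AT]", "S": "[CG]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def find_sites(seq, consensus):
    """Start positions of all (possibly overlapping) consensus matches."""
    pattern = "".join(IUPAC[c] for c in consensus)
    # A lookahead makes re.finditer report overlapping matches as well.
    return [m.start() for m in re.finditer(f"(?={pattern})", seq.upper())]


def find_modules(seq, first, second, min_gap, max_gap):
    """Return (pos_first, pos_second) pairs forming a two-site module.

    The second site must lie downstream of the first; the gap is counted
    from the end of the first site to the start of the second.
    """
    second_sites = find_sites(seq, second)
    modules = []
    for p1 in find_sites(seq, first):
        end1 = p1 + len(first)
        for p2 in second_sites:
            if min_gap <= p2 - end1 <= max_gap:
                modules.append((p1, p2))
    return modules


MEF2_LIKE = "YTAWWWWTAR"  # simplified stand-in, not a weight matrix
E_BOX = "CANNTG"          # simplified stand-in for a MyoD binding site

example = "GG" + "CTAAAAATAG" + "ACGTACGT" + "CAGCTG" + "TT"
print(find_modules(example, MEF2_LIKE, E_BOX, min_gap=5, max_gap=15))
```

The distance window is what turns two individually frequent sites into a much rarer module: tightening `max_gap` below the actual spacing in the example removes the match even though both sites are still present.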

Generally, this group of methods achieves much higher specificity than programs following general models. However, the price for this increase in specificity is usually restriction of the promoter models to a small subset (class) of promoters.

The organizational model of histone H1 promoters

The specific modeling of a promoter class can be demonstrated with a well-known example from cell-cycle regulation. Histone genes are required during DNA replication, and they show up as a group of coregulated genes in array analyses [66]. This suggests that they might have a common promoter structure. Histone H1 gene promoters, for example, can be found in the Eukaryotic Promoter Database (EPD, [51]).

The model for Histone H1 genes (Figure 4.4) was based on only 9 training sequences taken from homology group 17 of the EPD. They share a 100 bp fragment around the transcription start site but not much similarity elsewhere in the 600 nucleotides (standard length of EPD promoter sequences). A model containing 5 different transcription factor binding sites was derived from the set of sequences shown in Figure 4.4 using GEMS Launcher (Genomatix Software, Munich).

Notably, the model generated contains one factor, E2F, that is known to be involved in cell cycle regulation of genes [68]. The model appears to be very selective: only one match per 12 million base pairs of the mammalian sections of the EMBL database was found. Almost all matches (total of 59) are known histone H1 genes, except in the human section, where the majority of matches (25) are within anonymous sequences. However, given the extraordinary specificity throughout the mammalian sections, it seems safe to assume that most of the unknown matches within the human database section actually identify new, so far unannotated genes that are subject to a regulation similar to that of histones (there is no direct evidence that these are histone genes). This is a working hypothesis that needs to be verified.

Organizational model of histone H1 promoters containing six transcription factor binding sites.

Annotation of large genomic sequences

Almost all of the methods discussed above were developed before the databases started to be filled with sequence contigs exceeding 100 000 nucleotides in length. The complete human genome draft now contains more than 3 billion nucleotides. This changes the paradigm for sequence annotation. While complete annotation remains an important goal, specific annotation becomes mandatory when even individual sequences exceed the capabilities of researchers for manual inspection. Annotation of genomic sequences has to be fully automatic in order to keep pace with the rate of generation of new sequences. Simultaneously, annotations are embedded into a large natural context rather than residing within relatively short isolated stretches of DNA. This has several quite important consequences.

The balance between sensitivity and specificity

I will confine the discussion here to regulatory regions, but the problems are general. A very sensitive approach will minimize the number of false negative predictions and thus is oriented towards a complete annotation. However, this inevitably requires accepting large numbers of false positive hits, which easily outnumber the true positive predictions by one order of magnitude.

In order to avoid this, methods can be designed to yield the utmost specificity (e.g., specific promoter modeling as discussed above). Here, the catch is an inevitably high number of false-negative results, which may miss 70% to 90% of the true positive regions.

A little thought experiment demonstrates the dilemma of current methods for annotation of sequences. Assume we are analyzing a region of 3 billion bp of contiguous human genomic DNA (the total genome). The human genome is estimated to contain about 60 000 ± 30 000 promoters. On average, general promoter prediction programs detect one promoter per 1000 to 10 000 nucleotides according to the respective authors. Assuming an optimistic value (1/10 000), this would result in about 300 000 predictions. Given the true match rate determined by Fickett and Hatzigeorgiou [50] of less than 20% on average, the result would be 20 000 true predictions versus 280 000, or more than 90%, false positives.

On the other hand, specific methods were shown to produce more than 50% true positive matches in their total output [21] but recognize just a small fraction of all promoters. A single specific model like the actin class model [21] matches about once every 2.5 million bp and thus would yield a total of 1200 matches, 600 of which can be expected to be true. This is great in terms of specificity, but losing more than 90% of the true promoters present is certainly far from what sequence annotation is aiming at. The existence of 100 or, most likely, even more promoter models of the specificity of the actin model would be required to achieve specific recognition of most of the promoters present within the genome. Definition of the required number of specific models based on current technologies is not a feasible task. Therefore, new developments have already been initiated to overcome the current obstacles.
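The arithmetic behind both estimates can be retraced in a short sketch. All input figures are taken from the two paragraphs above; the variable names are illustrative, the central estimate of 60 000 promoters is used as the reference, and the final step assumes (optimistically) that no two specific models would find the same promoter.

```python
GENOME_BP = 3_000_000_000      # genome size assumed in the text
PROMOTERS = 60_000             # central promoter estimate quoted in the text

# General programs: optimistic rate of one prediction per 10 000 bp.
general_predictions = GENOME_BP // 10_000
general_true = 20_000          # true predictions as quoted in the text
general_false = general_predictions - general_true
general_false_fraction = general_false / general_predictions

# One specific model: one match per 2.5 million bp, about half of them true.
specific_matches = GENOME_BP // 2_500_000
specific_true = specific_matches // 2
specific_coverage = specific_true / PROMOTERS

# Models of comparable yield needed to reach the central estimate,
# assuming that the models never overlap in the promoters they detect.
models_needed = PROMOTERS // specific_true

print(f"general: {general_predictions} predictions, "
      f"{general_false_fraction:.0%} false positives")
print(f"specific: {specific_matches} matches, {specific_true} true, "
      f"{specific_coverage:.0%} of all promoters")
print(f"models needed: {models_needed}")
```

With these inputs a single specific model recovers about 1% of the estimated promoters, which is where the figure of roughly 100 models comes from; any overlap between models pushes the required number higher.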

The numbers mentioned above are necessarily very rough estimates. However, two- or three-fold variations would not change the general results. It is quite evident that functional promoter analysis in laboratories is capable of dealing with several hundred or even a few thousand predicted regions, while testing several hundred thousand or even millions of predicted regions remains out of reach. It is safe to assume that further improvement of laboratory high throughput technologies and enhancement of the specificity of promoter recognition in silico will meet somewhere in the future to close the gap in our knowledge about the functional regulation of the genome. It is also quite clear from past and present developments that bioinformatics will probably cover significantly more than half of that path.

The larger context

There will be help on the way towards more specific modeling of functional regions in genomic sequences. The almost unlimited natural context of regions in genomic sequences will allow for completely new approaches to comparative sequence analysis, which has already proven to be the most powerful approach in bioinformatics of genomic sequences (e.g., in detection of new sequence elements). Comparative analysis will be instrumental in determining the anatomy of regulatory networks including MARs, LCRs, enhancers, and silencers in addition to the promoter sequences. It can be safely assumed that context-sensitive sequence analysis will prevail in the long run over any methods dealing with short isolated sequences.

Aspects of comparative genomics

The context information is by no means restricted to other regions within the same genome. Approaches based on comparative genomics employing sequence information from two or more genomes proved to be very powerful as they introduce phylogenetic aspects into the analysis. Phylogenetic footprints have already shown this [67]. This way, discrimination between noise and functionally conserved regions can be improved resulting in easier detection of regions useful for predictive efforts.
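A minimal sketch of how a phylogenetic footprint might be extracted from two aligned sequences is given below. It is an illustration under stated assumptions rather than a published method: the input is taken to be a precomputed pairwise alignment with `-` as the gap character, and the function name, window length, and identity threshold are hypothetical choices.

```python
def conserved_windows(aln_a, aln_b, window=10, min_identity=0.9):
    """Return merged (start, end) alignment intervals that are conserved.

    A window counts as conserved when the fraction of identical,
    non-gap columns is at least min_identity. Overlapping or adjacent
    conserved windows are merged into one interval (end is exclusive).
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # True for alignment columns where both species carry the same base.
    match = [a == b and a != "-" for a, b in zip(aln_a.upper(), aln_b.upper())]
    regions = []
    for start in range(len(match) - window + 1):
        if sum(match[start:start + window]) / window >= min_identity:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)
            else:
                regions.append((start, end))
    return regions
```

Regions returned by such a scan are candidates only; conservation between closely related species may simply reflect insufficient divergence time, which is why the choice of genomes matters as much as the threshold.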

Analysis of data sets from high throughput methods

Another field to which the bioinformatics of regulatory DNA regions can be expected to contribute significantly is the analysis of results from high throughput experiments in expression analysis (e.g., all forms of expression arrays). Due to the discontinuous nature of regulatory regions, there is no way of deducing common regulatory features directly from the expression data, which are usually based on coding regions. However, the general availability of the corresponding genomic regulatory regions for many (and very soon all) of the genes analyzed in an expression array experiment enables attempts to elucidate the genomic structures underlying common expression patterns of genes. Expression arrays (described in detail in Chapter 6, Volume 2) directly deliver information on which genes are expressed where under the conditions tested. However, they cannot provide any clue as to why this happens or how the same genes would behave under yet untested conditions. Identification of functional features by comparative sequence analysis (e.g., promoter modules) can reveal different functional subgroups of promoters despite common regulation under specific conditions. Consequently, the detection of known functional modules can suggest expression patterns under yet untested conditions. Moreover, the organizational structures of promoters can also be used to identify additional potential target genes, either within the same organism or in other genomes via comparative genomics. Given the exponential number of possibilities for combinations of conditions, bioinformatics of regulatory sequences will also become instrumental for the rational design of expression arrays as well as for selection of experimental conditions.


The experimental dissection of functional mechanisms of transcription control has gained an enormous momentum during recent years. The ever rising number of publications on this topic bears witness to this development, which found one hallmark manifestation in the introduction of a new section in the Journal of Molecular and Cellular Biology entirely devoted to analysis of transcription control. The complex interleaved networks of transcription control certainly represent one of the cornerstones on which to build our understanding of how life functions, in terms of embryonic development, tissue differentiation, and maintenance of the shape and fitness of adult organisms throughout life. This is also the reason why both the experimental analysis and the bioinformatics of transcription control will move more and more into the focus of medical/pharmaceutical research. A considerable number of diseases are directly or indirectly connected to alterations in cellular transcription programs (e.g., most forms of cancer). Furthermore, many drugs influence transcription control via signaling pathways (triggering transcription factors), which could also be connected to certain side effects of drugs. The human genome sequencing project will provide us with a complete catalog of the components of a human, probably within a year. This will constitute a blueprint of the material basis of a human. However, only the analysis of the regulatory part of the genome, the corresponding expression patterns, and the complex metabolic networks will provide deeper insight into how the complex machinery called life is working. Definition and detection of regulatory regions by bioinformatics will contribute to this part of the task.


1 Werner, T. (1999) Identification and characterization of promoters in eukaryotic DNA sequences. Mammalian Genome 10, 168-175.

2 Kodadek, T. (1998) Mechanistic parallels between DNA replication, recombination and transcription. Trends Biochem. Sci. 23, 79-83.

3 Rothenberg, E. V., Ward, S. B. (1996) A dynamic assembly of diverse transcription factors integrates activation and cell-type information for interleukin 2 gene regulation. Proc. Natl. Acad. Sci. USA 93, 9358-9365.

4 Ioshikhes, I., Bolshoy, A., Derenshteyn, K., Borodovsky, M., Trifonov, E. N. (1996) Nucleosome DNA sequence pattern revealed by multiple alignment of experimentally mapped sequences. J. Mol. Biol. 262, 129-139.

5 Sloan, L. S., Schepartz, A. (1998) Sequence determinants of the intrinsic bend in the cyclic AMP response element. Biochemistry 37, 7113-7118.

6 Yamauchi, M., Ogata, Y., Kim, R. H., Li, J. J., Freedman, L. P., Sodek, J. (1996) AP-1 regulation of the rat bone sialoprotein gene transcription is mediated through a TPA response element within a glucocorticoid response unit in the gene promoter. Matrix Biol. 15, 119-130.

7 Sap, J., Munoz, A., Schmitt, J., Stunnenberg, H., Vennstrom, B. (1989) Repression of transcription mediated at a thyroid hormone response element by the v-erb-A oncogene product. Nature 340, 242-244.

8 Boutillier, A. L., Monnier, D., Lorang, D., Lundblad, J. R., Roberts, J. L., Loeffler, J. P. (1995) Corticotropin-releasing hormone stimulates proopiomelanocortin transcription by cFos-dependent and -independent pathways: Characterization of an AP1 site in exon 1. Molecular Endocrinol 9, 745-755.

9 Bergers, G., Graninger, P., Braselmann, S., Wrighton, C., Busslinger, M. (1995) Transcriptional activation of the fra-1 gene by AP-1 is mediated by regulatory sequences in the first intron. Mol Cell Biol 15, 3748-3758.

10 Kel, O. V., Romaschenko, A. G., Kel, A. E., Wingender, E., Kolchanov, N. A. (1995) A compilation of composite regulatory elements affecting gene transcription in vertebrates. Nucleic Acids Res 23, 4097-4103.

11 Heinemeyer, T., Wingender, E., Reuter, I., Hermjakob, H., Kel, A. E., Kel, O. V., Ignatieva, E. V., Ananko, E. A., Podkolodnaya, O. A., Kolpakov, F. A., Podkolodny, N. L., Kolchanov, N. A. (1998) Databases on transcriptional regulation: TRANSFAC, TRRD and COMPEL. Nucleic Acids Res 26, 362-367.

12 Arnone, M. I., Davidson, E. H. (1997) The hardwiring of development: organization and function of genomic regulatory systems. Development 124, 1851-1864.

13 Zawel, L., Reinberg, D. (1995) Common themes in assembly and function of eukaryotic transcription complexes. Annu. Rev. Biochem. 64, 533-561.

14 Conaway, J. W., Conaway, R. C. (1991) Initiation of Eukaryotic Messenger RNA Synthesis. J Biol Chem 266, 17721-17724.

15 Gilinger, G., Alwine, J. C. (1993) Transcriptional activation by simian virus-40 large T-Antigen - requirements for simple promoter structures containing either TATA or initiator elements with variable upstream factor binding sites. J Virol 67, 6682-6688.

16 Colgan, J., Manley, J. L. (1995) Cooperation between core promoter elements influences transcriptional activity in vivo. Proc Natl Acad Sci USA 92, 1955-1959.

17 Novina, C. D., Roy, A. L. (1996) Core promoters and transcriptional control. Trends in Genetics 12, 351-355.

18 Li, H., Capetanaki, Y. (1994) An E box in the desmin promoter cooperates with the E-box and MEF-2 sites of a distal enhancer to direct muscle-specific transcription. EMBO J 13, 3580-3589.

19 Cramer, P., Pesce, C. G., Baralle, F. E., Kornblihtt, A. R. (1997) Functional association between promoter structure and transcript alternative splicing. Proc Natl Acad Sci USA 94, 11456-11460.

20 Biberman, Y., Meyuhas, O. (1997) Substitution of just five nucleotides at and around the transcription start site of rat beta-actin promoter is sufficient to render the resulting transcript a subject for translational control. FEBS Lett 405, 333-336.

21 Frech, K., Quandt, K., Werner, T. (1998) Muscle actin genes: A first step towards computational classification of tissue specific promoters. In Silico Biol. 1, 129-138.

22 Frech, K., Danescu-Mayer, J., Werner, T. (1997a) A novel method to develop highly specific models for regulatory units detects a new LTR in GenBank which contains a functional promoter. J Mol Biol 270, 674-687.

23 Frech, K., Werner, T. (1996a) Specific modelling of regulatory units in DNA sequences. In Pacific Symposium on Biocomputing '97, R. B. Altman, A. K. Dunker, L. Hunter and T. E. Klein eds. (World Scientific), 151-162.

24 Frech, K., Brack-Werner, R., Werner, T. (1996b) Common modular structure of lentivirus LTRs. Virology 224, 256-267.

25 Rhee, K., Thompson, E. A. (1996) Glucocorticoid regulation of a transcription factor that binds an initiator-like element in the murine thymidine kinase (Tk-1) promoter. Molecular Endocrinol 10, 1536-1548.

26 Jiang, J. G., Zarnegar, R. (1997) A novel transcriptional regulatory region within the core promoter of the hepatocyte growth factor gene is responsible for its inducibility by cytokines via the C/EBP family of transcription factors. Mol Cell Biol 17, 5758-5770.

27 Usheva, A., Shenk, T. (1996) YY1 transcriptional initiator: Protein interactions and association with a DNA site containing unpaired strands. Proc Natl Acad Sci USA 93, 13571-13576.

28 Stein, G. S., Vanwijnen, A. J., Stein, J., Lian, J. B., Montecino, M. (1995) Contributions of nuclear architecture to transcriptional control. Int. Rev. Cytol. 162A, 251-278.

29 Cavener, D. R. (1987) Comparison of the consensus sequence flanking translational start sites in Drosophila and vertebrates. Nucleic Acids Res 15, 1353-1361.

30 Staden, R. (1984) Computer methods to locate signals in nucleic acid sequences. Nucleic Acids Res. 12, 505-519.

31 Stormo, G. D., Hartzell III, G. W. (1989) Identifying protein-binding sites from unaligned DNA fragments. Proc Natl Acad Sci USA 86, 1183-1187.

32 Bucher, P. (1990) Weight matrix description of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J Mol Biol 212, 563-578.

33 Quandt, K., Frech, K., Karas, H., Wingender, E., Werner, T. (1995) MatInd and MatInspector: New fast and versatile tools for detection of consensus matches in nucleotide sequence data. Nucleic Acids Res 23, 4878-4884.

34 Chen, Q. K., Hertz, G. Z., Stormo, G. D. (1995) MATRIX SEARCH 1.0: A computer program that scans DNA sequences for transcriptional elements using a database of weight matrices. Comp Appl Biosci 11, 563-566.

35 Frech, K., Herrmann, G., Werner, T. (1993) Computer-assisted prediction, classification, and delimitation of protein binding sites in nucleic acids. Nucleic Acids Res. 21, 1655-1664.

36 Frech, K., Quandt, K., Werner, T. (1997b) Finding protein-binding sites in DNA sequences: The next generation. Trends Biochem Sci 22, 103-104.

37 Frech, K., Quandt, K., Werner, T. (1997c) Software for the analysis of DNA sequence elements of transcription. Comp Appl Biosci 13, 89-97.

38 Bohjanen, P. R., Liu, Y., Garcia-Blanco, M. A. (1997) TAR RNA decoys inhibit Tat-activated HIV-1 transcription after preinitiation complex formation. Nucleic Acids Res 25, 4481-4486.

39 Wang, W. D., Chi, T. H., Xue, Y. T., Zhou, S., Kuo, A., Crabtree, G. R. (1998) Architectural DNA binding by a high-mobility-group/kinesin-like subunit in mammalian SWI/SNF-related complexes. Proc Natl Acad Sci USA 95, 492-498.

40 Bell, P. J. L., Higgins, V. J., Dawes, I. W., Bissinger, P. H. (1997) Tandemly repeated 147 bp elements cause structural and functional variation in divergent MAL promoters of Saccharomyces cerevisiae. Yeast 13, 1135-1144.

41 Kim, J., Klooster, S., Shapiro, D. J. (1995) Intrinsically bent DNA in a eukaryotic transcription factor recognition sequence potentiates transcription activation. J Biol Chem 270, 1282-1288.

42 Chetouani, F., Monestie, P., Thebault, P., Gaspin, C., Michot, B. (1997) ESSA: an integrated and interactive computer tool for analysing RNA secondary structure. Nucleic Acids Res 25, 3514-3522.

43 Schuster, P., Stadler, P. F., Renner, A. (1997) RNA structures and folding: From conventional to new issues in structure predictions. Curr Op Struct Biol 7, 229-235.

44 Nakaya, A., Yamamoto, K., Yonezawa, A. (1995) RNA secondary structure prediction using highly parallel computers. Comp Appl Biosci 11, 685-692.

45 Nielsen, D. A., Novoradovsky, A., Goldman, D. (1995) SSCP primer design based on single-strand DNA structure predicted by a DNA folding program. Nucleic Acids Res 23, 2287-2291.

46 Shago, M., Giguere, V. (1996) Isolation of a novel retinoic acid-responsive gene by selection of genomic fragments derived from CpG-island-enriched DNA. Mol Cell Biol 16, 4337-4348.

47 Boulikas, T. (1996) Common structural features of replication origins in all life forms. J Cell Biochem 60, 297-316.

48 Kramer, J. A., Singh, G. B., Krawetz, S. A. (1996) Computer-assisted search for sites of nuclear matrix attachment. Genomics 33, 305-308.

49 Singh, G. B., Kramer, J. A., Krawetz, S. A. (1997) Mathematical model to predict regions of chromatin attachment to the nuclear matrix. Nucleic Acids Res. 25, 1419-1425.

50 Fickett, J. W., Hatzigeorgiou, A. C. (1997) Eukaryotic promoter recognition. Genome Res 7, 861-878.

51 Perier, R. C., Junier, T., Bucher, P. (1998) The Eukaryotic Promoter Database EPD. Nucleic Acids Res 26, 353-357.

52 Prestridge, D. S. (1995) Predicting Pol II promoter sequences using transcription factor binding sites. J Mol Biol 249, 923-932.

53 Prestridge, D. S. (1991) SIGNAL SCAN: a computer program that scans DNA sequences for eukaryotic transcriptional elements. Comp Appl Biosci 7, 203-206.

54 Chen, Q. K., Hertz, G. Z., Stormo, G. D. (1997) PromFD 1.0: A computer program that predicts eukaryotic pol II promoters using strings and IMD matrices. Comp Appl Biosci 13, 29-35.

55 Levy, S., Compagnoni, L., Myers, E. W., Stormo, G. D. (1998) Xlandscape: the graphical display of word frequencies in sequences. Bioinformatics 14, 74-80.

56 Hutchinson, G. B. (1996) The prediction of vertebrate promoter regions using differential hexamer frequency analysis. Comp Appl Biosci 12, 391-398.

57 Reese, M. G., Harris, N. L., Eeckman, F. H. (1996) Large Scale Sequencing Specific Neural Networks for Promoter and Splice Site Recognition. In: Proceedings of the Pacific Symposium on Biocomputing, Eds Hunter L., Klein T.

58 Zhang, M. Q. (1998) Identification of human gene core promoters in silico. Genome Res. 8, 319-326.

59 Kondrakhin, Y. V., Kel, A. E., Kolchanov, N. A., Romashchenko, A. G., Milanesi, L. (1995) Eukaryotic promoter recognition by binding sites for transcription factors. Comp Appl Biosci 11, 477-488.

60 Faisst, S., Meyer, S. (1992) Compilation of Vertebrate-Encoded Transcription factors. Nucleic Acids Res 20, 3-26.

61 Audic, S., Claverie, J. M. (1997) Detection of Eukaryotic promoters using Markov transition matrices. Comp Chem 21, 223-227.

62 Klingenhoff, A., Frech, K., Quandt, K., Werner, T. (1999) Functional promoter modules can be detected by formal models independent of overall nucleotide sequence similarity. Bioinformatics 15, 180-186.

63 Lavorgna, G., Guffanti, A., Borsani, G., Ballabio, A., Boncinelli, E. (1999) TargetFinder: searching annotated sequence databases for target genes of transcription factors. Bioinformatics 15, 172-173.

64 Fickett, J. W. (1996) Coordinate positioning of MEF2 and myogenin binding sites (Reprinted from Gene-Combis, vol 172, pg GC19-GC32, 1996). Gene 172, GC19-GC32.

65 Wasserman, W. W., Fickett, J. W. (1998) Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 278, 167-181.

66 Eisen, M. B., Spellman, P. T., Brown, P. O., Botstein, D. (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95, 14863-14868.

67 Duret, L., Bucher, P. (1997) Searching for regulatory elements in human noncoding sequences. Curr. Op. Struct. Biol. 7, 399-406.

68 Yang, R., Muller, C., Huynh, V., Fung, Y. K., Yee A. S., Koeffler, H. P. (1999) Functions of cyclin A1 in the cell cycle and its interactions with transcription factor E2F-1 and the Rb family of proteins. Mol. Cell. Biol. 19, 2400-2407.
