Multiple gene prediction using the HMM approach

A gene sequence can be considered as a succession of segments xi representing exons, introns, 5'- and 3'-untranslated regions and the like. These segments can be considered as different sequence states. There are 35 states describing the eukaryotic gene model when both the direct and reverse chains are considered as possible gene locations (Figure 3.8). Current gene prediction approaches do not consider non-coding 5'- and 3'-exons (and introns): they do not code for any protein sequence, and the absence of protein-coding characteristics lowers the accuracy of their prediction. The remaining 27 states include 6 exon states (first, last, single and 3 types of internal exons, one for each of the 3 possible reading frames) and 7 non-coding states (3 intron, non-coding 5'- and 3'-regions, promoter and poly-A) in each chain, plus the non-coding intergenic region. These 27 states are connected with solid arrows in Figure 3.8; the eight unused states are connected with broken arrows.
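The state count above can be checked with a short enumeration. This is only an illustrative sketch: the state labels are hypothetical names chosen here, not identifiers from any gene prediction program.

```python
# Enumerate the states used for prediction, following the counts in the text.
# All labels below are illustrative assumptions, not names from a real program.
EXON_STATES = ["first", "last", "single", "internal_0", "internal_1", "internal_2"]
NONCODING_STATES = [
    "intron_0", "intron_1", "intron_2",
    "noncoding_5prime", "noncoding_3prime", "promoter", "poly_A",
]
STRANDS = ["+", "-"]  # direct and reverse chains


def used_states():
    """Return the per-chain exon and non-coding states plus the intergenic state."""
    states = [
        f"{name}{strand}"
        for strand in STRANDS
        for name in EXON_STATES + NONCODING_STATES
    ]
    states.append("intergenic")  # shared by both chains
    return states


print(len(used_states()))  # (6 + 7) * 2 + 1 = 27
```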

Fig. 3.8 Different sequence states and transitions in the eukaryotic gene model. Ei and Ii are different exon and intron states, respectively (i = 0, 1, 2 reflect the 3 possible different ORFs). E marks non-coding exons, and I5/I3 are the 5'- and 3'-introns adjacent to non-coding exons.

The predicted gene structure can be considered as an ordered set of state/sequence pairs, φ = {(q1, x1), (q2, x2), ..., (qk, xk)}, called the parse, such that the probability P(X, φ) of generating X according to φ is maximal over all possible parses (or some score is optimal in some meaningful sense, i.e., best explains the observations [93]):

$$
P(X, \varphi) = P(q_1) \left[ \prod_{i=1}^{k-1} P(x_i \mid l(x_i), q_i)\, P(l(x_i) \mid q_i)\, P(q_{i+1}, q_i) \right] P(x_k \mid l(x_k), q_k)\, P(l(x_k) \mid q_k)
$$

where k is the number of segments in the parse; P(q1) denotes the initial state probability; and P(xi | l(xi), qi) P(l(xi) | qi) and P(qi+1, qi) are the independent joint probabilities of generating the subsequence xi of length l(xi) in state qi and of transiting to state qi+1, respectively.
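To make the parse probability defined above concrete, the following minimal sketch evaluates it in log space for a toy two-state model. The function name `parse_log_prob`, the helpers `iid` and `geometric`, and all numerical parameters are assumptions made for illustration only; they are not taken from Genscan, Fgenesh or any other program.

```python
import math


def parse_log_prob(parse, init, trans, length, emit):
    """Return log P(X, phi) for parse = [(state, segment), ...].

    init[q]:            initial state probability P(q)
    trans[(q, q_next)]: probability of transiting from q to q_next
    length[q](l):       length probability P(l | q)
    emit[q](x):         segment probability P(x | l(x), q)
    """
    logp = math.log(init[parse[0][0]])
    for i, (q, x) in enumerate(parse):
        logp += math.log(emit[q](x)) + math.log(length[q](len(x)))
        if i + 1 < len(parse):  # the last segment has no outgoing transition
            logp += math.log(trans[(q, parse[i + 1][0])])
    return logp


def iid(base_probs):
    """Segment probability under independent, identically distributed bases."""
    return lambda x: math.prod(base_probs[b] for b in x)


def geometric(p):
    """Geometric length distribution: P(l) = (1 - p) * p**(l - 1), l >= 1."""
    return lambda l: (1 - p) * p ** (l - 1)


# Toy model (all values hypothetical): N = non-coding, C = coding.
init = {"N": 1.0}
trans = {("N", "C"): 1.0, ("C", "N"): 1.0}
length = {"N": geometric(0.5), "C": lambda l: {3: 0.5, 6: 0.5}.get(l, 0.0)}
emit = {
    "N": iid({"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}),
    "C": iid({"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}),
}
parse = [("N", "AT"), ("C", "GCG"), ("N", "A")]
print(math.exp(parse_log_prob(parse, init, trans, length, emit)))
```

Working in log space avoids numerical underflow on long sequences. A parse containing a segment of impossible length (probability 0) makes `math.log` raise an error in this sketch.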

Successive states of this HMM generate output according to a Markov process with the inclusion of an explicit state duration density. A simple technique based on dynamic programming for finding the optimal parse (the single best state sequence) is the Viterbi algorithm [92]. The algorithm requires on the order of N²D²L calculations, where N is the number of states, D is the longest duration and L is the sequence length [93]. Burge [28] introduced a helpful technique that reduces the number of states and simplifies the computations by modeling the length of non-coding states with a geometric distribution. We briefly consider the gene-finding algorithm that uses this technique, which was initially implemented in the Genscan program [32] and later used in the Fgenesh program [21]. Since any valid parse consists only of an alternating series of non-coding (N) and coding (C) states, NCNCNC...NCN, we need only 11 variables, corresponding to the different types of N states. At each step, corresponding to some sequence position, we select the maximum joint probability of either continuing in the current state or moving to another non-coding state through a coding state (taken from a precomputed list of possible coding states) that ends at the analyzed sequence position.
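The idea of keeping variables only for non-coding states can be sketched as follows. This is a simplified illustration written for this text, not the Genscan or Fgenesh implementation. It assumes that each base emitted in a non-coding state contributes its staying probability times its emission probability, that leaving the state contributes one minus the staying probability, and that each precomputed coding candidate carries a single log score combining its transition, length and content terms. The names `best_parse`, `nc_states` and `candidates` are hypothetical.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


def best_parse(seq, nc_states, candidates):
    """Best-scoring N C N ... N parse, keeping one variable per non-coding state.

    nc_states:  {name: (init_prob, stay_prob, {base: emission_prob})}
    candidates: (start, end, prev_n, next_n, log_score) tuples for coding
                segments seq[start:end + 1]; log_score is assumed to combine
                the transition-in, length, content and transition-out terms.
    Returns (log_score, [(start, end), ...]) for the chosen coding segments.
    """
    L = len(seq)
    ending_at = {}
    for cand in candidates:
        ending_at.setdefault(cand[1], []).append(cand)

    gamma = [{} for _ in range(L)]  # gamma[j][i]: best log score ending in i at j
    back = [{} for _ in range(L)]   # back[j][i]: None (stayed in i) or candidate used

    for i, (init, stay, emit) in nc_states.items():
        gamma[0][i] = log(init) + log(stay) + log(emit[seq[0]])
        back[0][i] = None

    for j in range(1, L):
        for i, (_, stay, emit) in nc_states.items():
            step = log(stay) + log(emit[seq[j]])
            best, choice = gamma[j - 1][i] + step, None  # option 1: continue in i
            for cand in ending_at.get(j - 1, []):        # option 2: arrive via coding
                start, _, prev_n, next_n, log_score = cand
                if next_n != i or start < 1:
                    continue
                leave = log(1.0 - nc_states[prev_n][1])
                score = gamma[start - 1][prev_n] + leave + log_score + step
                if score > best:
                    best, choice = score, cand
            gamma[j][i], back[j][i] = best, choice

    # Traceback from the best final non-coding state.
    state = max(gamma[L - 1], key=gamma[L - 1].get)
    score, j, segments = gamma[L - 1][state], L - 1, []
    while j > 0:
        cand = back[j][state]
        if cand is None:
            j -= 1
        else:
            segments.append((cand[0], cand[1]))
            state, j = cand[2], cand[0] - 1
    return score, segments[::-1]


# Toy input: one non-coding state and one GC-rich candidate at positions 4..9.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
nc_states = {"N": (1.0, 0.9, uniform)}
candidates = [(4, 9, "N", "N", math.log(0.5 * 0.4 ** 6))]
score, segments = best_parse("AAAAGCGCGCAAAA", nc_states, candidates)
print(segments, score)  # segments is [(4, 9)]
```

Because the non-coding lengths are geometric, continuing in a state needs only the score at the previous position, so no maximization over non-coding durations is required. The explicit length distributions enter only through the precomputed coding candidates.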

Let us define γi(j) as the best score (highest joint probability) of the optimal parse of the subsequence s1,j that ends in state qi at position j. Suppose we have a set Aj of coding states {ck} of lengths {dk}, starting at positions {mk} and ending at position j, which have the previous states {bk}. The length distribution of state ck is denoted by fck(d). The searching procedure can be stated as follows:

Initialization:

$$
\gamma_i(1) = \pi_i\, P_i(S_1)\, p_i, \quad i = 1, \ldots, 11.
$$

Recursion:
