Annotation of sequences from genome sequencing projects

The first task in analyzing these sequences is to find the genes. Knowledge of the genes opens a new way of performing biological studies, called 'functional genomics'. The next problem is to find out what all these new genes do and how they interact and are regulated [104]. Comparisons between genes of different genomes can provide additional insights into the details of gene structure and function.

We cannot predict all gene components exactly, because our knowledge of the complex biological processes and signals that regulate gene expression is limited. In this respect, computer analysis of the genetically well-studied Adh region of D. melanogaster by several gene-finding approaches [105] gives us a unique opportunity to define the reliability and limitations of our predictions and to develop a strategy for correctly interpreting prediction results when analyzing new genomic sequences. The predictions were evaluated against two standards: one based on previously unreleased high-quality full-length cDNA sequences (std1), and a second based on the set of annotations generated as part of an in-depth study of the region by a group of Drosophila experts (std3). The performance of several accurate annotations is presented in Table 3.17.

Table 3.17. Performance of several programs on the Adh region of Drosophila. The std3 set contains 222 genes and 909 exons; the std1 set contains 43 genes and 123 exons. The annotated exons are taken from the set presented by the organizers of the Genome Annotation Assessment Project (GASP1) at the time of the initial data analysis. Pe - number of predicted exons, Ce - number of correctly predicted exons, Pg - number of predicted genes, Cg - number of correctly predicted genes, Sn - sensitivity (in %), Sp - specificity (in %). At the exon level, the second figure shows sensitivity when both exactly predicted and overlapped exons are counted.

| Level | Measure | CGG1 | Fgenesh | Fgenesh pruned | Genie | Genie EST |
|---|---|---|---|---|---|---|
| Base | Sn std1 | 89 | 98 | 98 | 96 | 97 |
| Base | Sn std3 | 87 | 92 | 88 | 79 | 79 |
| Base | Sp std3 | 77 | 71 | 86 | 92 | 91 |
| Exon | Pe | 1115 | 1671 | 979 | 786 | 849 |
| Exon | Ce std1 | 80 | 100 | 100 | 86 | 95 |
| Exon | Sn std1 | 65/89 | 81/97 | 81/97 | 70 | 77 |
| Exon | Ce std3 | 544 | 601 | 565 | 447 | 470 |
| Exon | Sn std3 | 60/82 | 66/89 | 62/82 | 49 | 52 |
| Exon | Sp std3 | 49 | 36 | 58 | 57 | 52 |
| Gene | Pg | 288 | 530 | 262 | 241 | 246 |
| Gene | Cg std1 | 22 | 31 | 31 | 24 | 28 |
| Gene | Sn std1 | 51 | 72 | 72 | 56 | 65 |
| Gene | Cg std3 | 102 | 108 | 106 | 86 | 92 |
| Gene | Sn std3 | 46 | 49 | 48 | 39 | 41 |
| Gene | Sp std3 | 36 | 20 | 39 | 37 | 38 |

The CGG1 annotation comprised the non-ambiguous gene set. The genes were included by the following rules (in descending priority):

a) all genes that were predicted by Fgenesh+;

b) genes predicted identically by both the Fgenes (human parameters) and Fgenesh (Drosophila parameters) programs;

c) in regions of overlapping (but not exactly coinciding) predictions, only one predicted gene was included, with priority given to the gene producing the longer protein.
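The three rules above can be read as a simple priority merge. The following Python fragment is only a minimal sketch of that reading, not the procedure actually used to build CGG1: the `Gene` record, the `overlaps` test and the `combine` function are hypothetical names introduced for illustration, strand is ignored, and genes predicted by only one program are admitted under rule c, which the text does not specify.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Gene:
    """Hypothetical record for one predicted gene (not from the source)."""
    exons: Tuple[Tuple[int, int], ...]  # coding exons as sorted (start, end) pairs
    protein_len: int                    # length of the predicted protein
    source: str                         # program that produced the prediction

    @property
    def start(self) -> int:
        return self.exons[0][0]

    @property
    def end(self) -> int:
        return self.exons[-1][1]


def overlaps(a: Gene, b: Gene) -> bool:
    """True if the genomic spans of two genes share at least one position."""
    return a.start <= b.end and b.start <= a.end


def combine(fgenesh_plus: List[Gene], fgenes: List[Gene],
            fgenesh: List[Gene]) -> List[Gene]:
    # Rule a: every Fgenesh+ prediction is accepted.
    selected = list(fgenesh_plus)

    def add_if_free(gene: Gene) -> None:
        # Accept a gene only if it does not overlap anything already chosen.
        if not any(overlaps(gene, s) for s in selected):
            selected.append(gene)

    # Rule b: genes with identical exon structure in Fgenes and Fgenesh.
    fgenesh_structures = {g.exons for g in fgenesh}
    for gene in fgenes:
        if gene.exons in fgenesh_structures:
            add_if_free(gene)

    # Rule c: among the remaining overlapping predictions keep one gene,
    # preferring the one that encodes the longer protein.
    remaining = sorted(fgenes + fgenesh,
                       key=lambda g: g.protein_len, reverse=True)
    for gene in remaining:
        add_if_free(gene)

    return sorted(selected, key=lambda g: g.start)
```

Processing the sources in priority order and rejecting anything that overlaps an already accepted gene keeps the result free of conflicting gene models, which is one plausible meaning of a "non-ambiguous" set.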

Fgenesh+ was used to improve the accuracy of prediction for 49 genes. Of these, 37 were predicted using D. melanogaster's own proteins already deposited in protein databases. Analysis of these predictions demonstrates that even in such cases predicting an accurate gene structure may not be trivial, although in most cases Fgenesh+ improved the prediction accuracy relative to ab initio methods.

The CGG1 annotation predicts about 87% of the real coding nucleotides and has only about 23% false positives (some of which may turn out to be coding, given the absence of experimental data in many regions). 89% of the exons are predicted exactly or with overlap. These data show that ab initio predictions can provide information about practically all protein-coding genes (just 13% of the coding region was not predicted) and can serve as a reasonable basis for further experimental analysis.
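The percentages quoted here follow from simple ratios of counts. As a minimal sketch, assuming that sensitivity is the share of annotated items that were predicted correctly and that specificity is the share of predictions that are correct (so that the false-positive fraction is 100 minus Sp, as the 77% and 23% figures suggest), the exon-level CGG1 values on std3 can be recomputed from the counts reported for Table 3.17. The function names are illustrative only.

```python
def sensitivity(correct: int, annotated: int) -> float:
    """Percentage of annotated items (exons, genes) predicted correctly."""
    return 100.0 * correct / annotated


def specificity(correct: int, predicted: int) -> float:
    """Percentage of predicted items that are correct."""
    return 100.0 * correct / predicted


# Exon-level counts for the CGG1 annotation on the std3 set (Table 3.17).
annotated_exons = 909   # exons in std3
predicted_exons = 1115  # Pe
correct_exons = 544     # Ce std3

print(round(sensitivity(correct_exons, annotated_exons)))  # 60
print(round(specificity(correct_exons, predicted_exons)))  # 49
```

Both results agree with the exact-match exon-level Sn and Sp values listed for CGG1 on std3.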

It is interesting to note that the use of two programs provided stable prediction accuracy on both sets (std1 and std3). The Genie program, for example, demonstrated a 20% decrease in sensitivity (Table 3.17). Because there is no version of Fgenes with all parameters computed on Drosophila genes, the optimal variant of automatic annotation was performed using only the Fgenesh program.

It was found that the Fgenesh pruned predictions provided the best annotation accuracy for the 2.9 Mb Adh sequence. In this simple variant, all low-scoring genes (with an average gene score less than 15) were discarded from the set of predicted genes. This yields prediction of 98% of the coding nucleotides on the set of 43 verified genes, and of 88% of the coding nucleotides with only 14% false positives on the 222-gene set (where a significant part of the genes was derived from Genscan predictions). The results demonstrate that most of the annotated genes in std3 are at least partially covered by predictions. For example, just 5 genes from the std3 set do not overlap with Fgenesh predictions (two of them are also included in the std1 set). Of these 5 genes, 4 are located inside introns of other genes and 4 are single-exon genes. So one limitation of current gene-finding programs is that they cannot detect nested genes, i.e. genes located inside introns of other genes; this is one direction for future improvement of gene-finding software.

Another drawback of current gene-finding programs is that predictions of terminal exons are generally much worse than those of internal ones. This results in some actual genes being split and/or multiple genes being joined into a single one. Several examples of such situations can be clearly seen with the InfoGene Java viewer [15] (Figure 3.11), developed to present information about gene structures described in GenBank (collecting information about a gene from many entries) or annotated using gene prediction programs. On the std1 set, Fgenesh predicts all internal exons correctly (100%), while only 72% of initial exons and 77% of terminal exons are predicted correctly. Thus, better prediction of terminal exons, and the related problem of better recognition of the beginnings (transcription start sites) and ends (poly-A sites) of genes, are other areas for possible future improvement of gene-finding programs.

Figure 3.11. InfoGene viewer [15] visualization of Fgenes and Fgenesh predictions in the Adh region. The lower panel presents the annotation fragments that are marked in the upper panel. The last level presents the std3 manual annotation. Coding exons are marked in red and introns in grey. Green triangles show the start of transcription and red ones mark the poly-A signal.
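The pruning step that produces the "Fgenesh pruned" set amounts to a score filter. The fragment below is a minimal sketch under the assumption that each predicted gene already carries its average gene score; the text does not say how that average is computed, and the `PredictedGene` record and `prune` function are hypothetical names, not part of Fgenesh.

```python
from typing import Iterable, List, NamedTuple


class PredictedGene(NamedTuple):
    """Hypothetical record; 'score' is the average gene score, assumed given."""
    name: str
    score: float


def prune(genes: Iterable[PredictedGene],
          min_score: float = 15.0) -> List[PredictedGene]:
    """Discard genes whose average score is below the threshold (15 in the text)."""
    return [g for g in genes if g.score >= min_score]


predictions = [
    PredictedGene("gene_a", 8.2),   # low-scoring, removed
    PredictedGene("gene_b", 15.0),  # exactly at the threshold, kept
    PredictedGene("gene_c", 41.7),  # kept
]
print([g.name for g in prune(predictions)])  # ['gene_b', 'gene_c']
```

In Table 3.17 this kind of filter reduces the number of predicted genes from 530 to 262 while raising base-level specificity on std3 from 71% to 86%.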
