Structural characteristics of eukaryotic genes

A gene is a fragment of nucleic acid sequence that carries the information specifying a particular polypeptide or RNA molecule. In eukaryotes, genes lie in a linear array on chromosomes, each of which consists of a long molecule of duplex DNA and chromatin proteins (mostly histones, which form a structure called a nucleosome). The complex of DNA and proteins (chromatin) can maintain genes in an inactive state by restricting access to RNA polymerase and its accessory factors. To activate a gene, the chromatin encompassing that gene and its control region must be modified to permit transcription [1]. The principal steps in the expression of protein-coding genes are transcription and post-transcriptional processing of messenger RNA precursors, including 5'-capping, 3'-polyadenylation and splicing. The order in which these events occur is not entirely clear: some splicing events could take place during transcription [2]. The processing events produce the mature mRNA in the nucleus, which is then transported to the cytoplasm for translation. The mature mRNA includes sequences called exons that encode the protein product according to the rules of the genetic code. However, the gene sequence often includes non-coding regions: introns, which are removed from the primary transcript during RNA splicing, and 5'- and 3'-untranslated regions. A model of a typical protein-coding gene and the stages of its expression are presented in Figure 3.1.
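As a rough illustration of the splicing and translation steps described above, the following minimal sketch removes one intron from a toy primary transcript and translates the result. The sequence, the exon coordinates, the function names `splice` and `translate`, and the four-entry codon table are all invented for this example; capping, polyadenylation and untranslated regions are left out.

```python
# Toy illustration only: sequences, coordinates and names are hypothetical.

# Deliberately partial codon table covering just the codons used below.
CODONS = {"AUG": "M", "GCC": "A", "UUU": "F", "UAA": "*"}


def splice(pre_mrna, exons):
    """Join exon segments (0-based, end-exclusive) of a primary transcript."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """Translate codon by codon from position 0 until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODONS[mrna[i:i + 3]]
        if amino_acid == "*":  # stop codon ends translation
            break
        protein.append(amino_acid)
    return "".join(protein)


# Primary transcript = exon 1 + intron (starts with GU, ends with AG) + exon 2.
exon1 = "AUGGCC"
intron = "GUAAGUCCCUUUCAG"
exon2 = "UUUUAA"
pre_mrna = exon1 + intron + exon2

exon_coords = [(0, len(exon1)), (len(exon1) + len(intron), len(pre_mrna))]
mature_mrna = splice(pre_mrna, exon_coords)
protein = translate(mature_mrna)
print(mature_mrna, protein)
```

In a real gene the exon coordinates are what a gene prediction program has to infer; here they are simply given.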

Knowledge about structural gene characteristics is accumulated in the GenBank and EMBL nucleotide sequence databases, where one gene can be described in dozens of entries assigned to partially sequenced gene regions, alternative splicing forms or mRNAs. By processing GenBank [3] data, a gene-centered database, InfoGene, was created [4, 5]; it contains descriptions of known genes and their basic functional signals and structural components. All major organisms are presented in separate divisions. The Human InfoGene division contains about 21 000 genes (including 16 000 partially sequenced genes), 53 435 coding regions, 83 488 exons and about 58 000 donor and acceptor splice sites. Table 3.1 shows the major structural characteristics of Human, Mouse, D. melanogaster, C. elegans, S. cerevisiae and A. thaliana genes deposited in GenBank, release 119.

Bioinformatics - From Genomes to Drugs. Volume I: Basic Technologies. Edited by Thomas Lengauer Copyright © 2002 WILEY-VCH Verlag GmbH, Weinheim ISBN: 3-527-29988-2

Fig. 3.1 Model of eukaryotic gene structure and gene expression stages. BEs are boundary elements that bind proteins to prevent enhancer effects on outside genes. MARs are matrix attachment regions.

We do not observe any significant difference in the size of protein-coding mRNAs among different types of organisms, but genes are often larger in vertebrates and especially in primates. Human coding exons are much shorter than the genes that contain them. The average size of an exon is about 200 bp, which is close to the DNA length associated with the nucleosome particle. Coding and non-coding exons can be as short as several bases and as long as thousands of bases. Usually protein-coding exons occupy just a few percent of the gene size. Different kinds of repeats cover 41% of sequenced human DNA, and coding exons account for only 2-3% of the genomic sequence.

The structural characteristics of eukaryotic genes considered above create two major problems in computational gene identification.

1. Low-quality recognizers will generate many false positive predictions, the number of which is comparable with the number of true exons (Figure 3.2);

2. Small exons (1-20 bp) cannot be recognized using any composition-based coding measure of the kind that is often successful for prokaryotes.

We therefore need gene prediction approaches that rely substantially on the recognition of functional signals encoded in the gene sequence.
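The first problem can be made concrete with a small back-of-the-envelope calculation. The sketch below is not taken from the chapter: the number of candidate windows, the sensitivity and the false positive rate are assumed values chosen only for illustration, and the function name `expected_predictions` is hypothetical. The only figure borrowed from the text is the coding share of the sequence, set to 2.5% to fall within the quoted 2-3%.

```python
def expected_predictions(n_coding, n_noncoding, sensitivity, false_positive_rate):
    """Return expected (true positives, false positives, precision)."""
    true_pos = n_coding * sensitivity
    false_pos = n_noncoding * false_positive_rate
    precision = true_pos / (true_pos + false_pos)
    return true_pos, false_pos, precision


# Assumption: 1,000,000 candidate windows, 2.5% of them truly coding.
n_coding = 25_000
n_noncoding = 975_000

# Assumption: the recognizer finds 90% of coding windows and
# mislabels 2% of non-coding windows as coding.
tp, fp, precision = expected_predictions(
    n_coding, n_noncoding, sensitivity=0.90, false_positive_rate=0.02
)
print(f"true positives:  {tp:.0f}")
print(f"false positives: {fp:.0f}")
print(f"precision:       {precision:.2f}")
```

Under these assumed rates the expected false positives (about 19 500) are of the same order as the expected true positives (about 22 500), even though only 2% of non-coding windows are misclassified.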

The main information about exon location is encoded in splice site sequences. In the next section we will consider essential characteristics of splice sites. Other functional signals, such as promoters, poly-A signals, and start and stop sites of translation, will be discussed later together with the recognition functions for them that are used in many gene prediction programs.

Table 3.1 Structural characteristics of genes in eukaryotic model organisms. The numbers reflect gene characteristics described in GenBank (Release 119, 2000), which might deviate from the average parameters for organisms. Gene numbers were calculated for DNA loci only. Many long genes have partially sequenced introns; therefore the actual average gene size is bigger. The range (or maximal) and average values are shown. For donor and acceptor sites the percentage of annotated canonical (GT-AG) splice pairs is provided.

| | Homo sapiens | Mus musculus | Drosophila melanogaster | C. elegans | S. cerevisiae | Arabidopsis thaliana |
|---|---|---|---|---|---|---|
| CDS/partial | 53435/29404 | 24527/13060 | 20314/1510 | 20634/526 | 12635/1016 | 31194/1461 |
| Exons/partial | 83488/21342 | 24508/7913 | 66960/19343 | 122951/38293 | 13572/13127 | 145942/42844 |
| Genes/partial | 20791/16141 | 7428/5573 | 17435/1154 | 19658/1263 | 12513/1098 | 28346/1023 |
| Alternative splicing | 2167, 10.4% | 749, 10% | 1785, 10% | 1194, 6.1% | 598, 4.7% | 227, 0.1% |
| Genes without introns | 1552, 7.4% | 748, 10% | 3583, 20% | 669, 3% | 11070, 88.5% | 5776, 19.7% |
| Number of exons (max, average) | 117, 5.7 | 64, 4.72 | 50, 3.88 | 52, 6.1 | 3, 1.03 | 78, 5.1 |
| Exon length, bp (range, average) | 1-1088, 201.6 | 1-6642, 207.1 | 6-10785, 419.5 | 1-14975, 22125 | 1-7471, 1500.0 | 3-75916, 192.0 |
| Intron length, bp (max, average) | 259776, 2203.5 | 42573, 818.3 | 205244, 613.7 | 19397, 244.0 | 7317, 300 | 118637, 174.4 |
| Gene length, bp (max, average) | 401910, 9033 | 150523, 3963 | 155515, 2854 | 45315, 2624 | 14733, 1462 | 170191, 2027 |
| Donor sites (number, % canonical) | 58707, 98.0% | 6225, 96.9% | 49592, 98.0% | 102872, 99.5% | 471, 93.0% | 117658, 99.2% |
| Acceptor sites (number, % canonical) | 58112, 98.53% | 5627, 97.5% | 49602, 97.9% | 102933, 99.7% | 475, 95.6% | 121917, 96.9% |

Fig. 3.2 The huge amount of non-coding DNA in the human genome can significantly increase the number of false positive coding exon predictions. The score (x-axis) reflects the value of a given classifying function used to assign a sequence region to the coding or non-coding class. The y-axis presents the number of sequence fragments with a particular score. Labels in the figure: Bacterial DNA, Coding regions, Non-coding regions, False positives.
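Because nearly all annotated splice pairs in the GenBank data summarized here are canonical (GT-AG), a naive first step toward locating introns is to enumerate GT and AG dinucleotides. The sketch below does only that; it is not one of the recognition functions discussed in this chapter, and the function names, the toy sequence and the `min_len` threshold are assumptions made for illustration.

```python
def find_dinucleotide(seq, motif):
    """Return 0-based start positions of a two-base motif in seq."""
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == motif]


def candidate_introns(seq, min_len=4):
    """Pair every GT with every downstream AG.

    Returns (start, end) slices such that seq[start:end] begins with GT
    and ends with AG. min_len is an arbitrary illustrative threshold,
    not a biological constant.
    """
    donors = find_dinucleotide(seq, "GT")
    acceptors = find_dinucleotide(seq, "AG")
    return [
        (d, a + 2)
        for d in donors
        for a in acceptors
        if a + 2 - d >= min_len
    ]


toy = "CCGTAAGTTTCAGCC"  # invented sequence
for start, end in candidate_introns(toy):
    print(start, end, toy[start:end])
```

Even this 15-base toy sequence yields three candidate introns from two GT and two AG occurrences. In a sequence of uniformly random bases, any given dinucleotide is expected at roughly 1 in 16 positions, so the dinucleotides alone are far from sufficient to identify real splice sites.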
