Prediction of donor and acceptor splice junctions
Recognition of RNA splice sites by the spliceosome is very precise [50-52], indicating the presence of specific signals for their function. Splice site patterns are mainly defined by nucleotides at the ends of introns, because deletions of large parts of an intron often do not affect splice site selection [7, 53]. A sequence of 8 nucleotides is highly conserved at the boundary between an exon and an intron (donor or 5'-splice site), and a sequence of 4 nucleotides, preceded by a pyrimidine-rich region, is highly conserved at the boundary between an intron and an exon (acceptor or 3'-splice site). A third, less conserved sequence of about 5-8 nucleotides, containing an adenosine residue, lies within the intron, usually between 10 and 50 nucleotides upstream of the acceptor splice site (branch site). These sequences provide specific molecular signals by which the RNA splicing machinery can select the precise splice sites. Many efforts have been made to analyze the sequences around these conserved regions [7, 18, 54]. It was shown that their consensus sequences differ slightly between different classes of organisms [18, 55], and that important information may be provided by sequences outside the short conserved regions. Scoring schemes based on consensus sequences or weight matrices, which take into account information on open reading frames, the free energy of base-pairing of RNA with snRNA and other peculiarities, yield an accuracy of about 80% for the prediction of splice site positions [56, 57]. More accurate prediction is achieved by neural network algorithms [58, 59]. The optimal network architecture (a 15-nucleotide window and 40 hidden units for donor sites, and a 41-nucleotide window and 20 hidden units for acceptor sites) has an accuracy of about 94% (111/118) for predicting donor and 85% (100/118) for predicting acceptor splice sites in a test set of 118 examples.
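To make the idea of a weight-matrix scoring scheme concrete, the following is a minimal sketch in Python. It is not the method of the cited works: the function names, the uniform background frequency of 0.25, the pseudocount and the four toy 8-nucleotide donor windows are all assumptions made for illustration, and the sketch ignores open reading frame and base-pairing energy information.

```python
import math

BASES = "ACGT"

def build_weight_matrix(sites, pseudocount=1.0):
    """Build a log-odds weight matrix from aligned, equal-length sites.

    Assumes a uniform background (0.25 per base); this is an
    illustrative simplification, not a value taken from the text.
    """
    length = len(sites[0])
    matrix = []
    for pos in range(length):
        counts = {b: pseudocount for b in BASES}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        matrix.append({b: math.log2((counts[b] / total) / 0.25) for b in BASES})
    return matrix

def score(matrix, window):
    """Sum the per-position log-odds weights for one candidate window."""
    return sum(column[base] for column, base in zip(matrix, window))

# Hypothetical aligned donor windows (toy data, not from the cited sets).
toy_donors = ["AGGTAAGT", "AGGTGAGT", "TGGTAAGA", "AGGTAAGT"]
wm = build_weight_matrix(toy_donors)

consensus_like = score(wm, "AGGTAAGT")  # resembles the toy consensus
random_like = score(wm, "TTCCACCA")     # unrelated window
```

A candidate window is accepted as a site when its score exceeds a threshold; choosing that threshold trades missed true sites against false positives.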
Using joint coding/noncoding and splice site classification, a prediction accuracy of about 95% with a low level of false positive sites was obtained. This work provides a good benchmark for testing new predictive algorithms, because the authors analysed large learning (331 examples) and testing (118 examples) sets. Because practically all donor sites contain the conserved dinucleotide GT and all acceptor sites contain AG, Mural et al. selected splice sites using the preferences of tabulated triplets in and around authentic splice junctions and in and around pseudo-junctions, which contain a GT or an AG dinucleotide but are not splice sites. The authors obtained a good accuracy of 91% for donor and 94% for acceptor splice site prediction in primate genes. However, these results were based on small learning (about 150) and test (about 50) sets of splice junctions.
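The distinction between authentic junctions and pseudo-junctions can be sketched as follows. This is an illustrative example only: the function names, the toy sequence and the annotated positions are hypothetical, and positions are 0-based indices of the first base of the dinucleotide.

```python
def find_dinucleotide(seq, dinuc):
    """Return 0-based start positions of every occurrence of a dinucleotide."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 1) if seq[i:i + 2] == dinuc]

def split_sites(seq, dinuc, annotated):
    """Separate annotated (authentic) positions from pseudo-sites."""
    positions = find_dinucleotide(seq, dinuc)
    authentic = [p for p in positions if p in annotated]
    pseudo = [p for p in positions if p not in annotated]
    return authentic, pseudo

# Hypothetical toy sequence and annotation (not real gene data).
toy_seq = "ATGGTAAGTCCGTTTTTCAGGAGTAG"
donors, pseudo_donors = split_sites(toy_seq, "GT", annotated={3})
acceptors, pseudo_acceptors = split_sites(toy_seq, "AG", annotated={18})
```

Even in this short toy sequence the pseudo-sites outnumber the annotated ones, which is why a classifier trained on the dinucleotide alone would produce many false positives.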
An overall view of the difference in triplet composition between splice and pseudosplice sequences is shown in Figure 3.6. This figure clearly demonstrates the various functional parts of splice sites. Only short regions around the splice junctions show a great difference in triplet composition, and their sequences are usually used as the only determinants of donor or acceptor splice site positions. However, dissimilarity can also be seen in many other regions: for donor sites, the coding region and the G-rich intron region may be distinguished; for acceptor sites, the G-rich intron region, the branch point region, the poly(T/C) tract and the coding region.
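A per-position comparison of triplet composition of this kind could be computed roughly as in the sketch below. The names and toy windows are hypothetical, and the summary statistic (half the sum of absolute frequency differences at each position) is an assumption chosen for illustration; the text does not state how the columns of the figure were computed beyond scaling to equal quantities of sites and pseudosites.

```python
from collections import Counter

def triplet_counts(windows, pos):
    """Count the triplets that start at a given position of aligned windows."""
    return Counter(w[pos:pos + 3] for w in windows)

def triplet_difference(sites, pseudosites, pos):
    """Per-triplet frequency difference at one position.

    Dividing by the set sizes scales both sets to equal quantities.
    """
    site_counts = triplet_counts(sites, pos)
    pseudo_counts = triplet_counts(pseudosites, pos)
    triplets = set(site_counts) | set(pseudo_counts)
    return {
        t: site_counts[t] / len(sites) - pseudo_counts[t] / len(pseudosites)
        for t in triplets
    }

def position_profile(sites, pseudosites):
    """Summarise dissimilarity at each position as a value in [0, 1]."""
    width = len(sites[0])
    return [
        sum(abs(v) for v in triplet_difference(sites, pseudosites, p).values()) / 2
        for p in range(width - 2)
    ]

# Hypothetical aligned windows with GT at positions 3-4 (toy data).
toy_sites = ["CAGGTAAG", "AAGGTGAG"]
toy_pseudosites = ["CCTGTCCA", "TTAGTTGC", "GACGTACC", "CAGGTTTT"]
profile = position_profile(toy_sites, toy_pseudosites)
```

Positions with a profile value near zero carry little discriminating information, while values near one mark positions where sites and pseudosites use almost disjoint sets of triplets.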
Figure 3.6. Difference of the triplet composition within donor and GT-containing non-donor sequences (left); around acceptor and AG-containing non-acceptor sequences (right) in 692 human genes. Each column presents the difference of specific triplet numbers between sites and pseudosites in a specific position. For comparison the numbers were calculated for equal quantities of sites and pseudosites.
Splice site prediction methods using a linear function that combines triplet preferences around the splice junction with the preferences of the adjacent regions to be coding or intron have been developed [45, 61]. 692 sequences with 2037 donor splice sites and 2054 acceptor splice sites having the conserved GT and AG dinucleotides in flanking intron positions were extracted from GenBank. Also, 89417 pseudodonor and 134150 pseudoacceptor sites that contain either a GT or an AG dinucleotide (and are not annotated as splice sites) were extracted from these sequences. The characteristics of the sequences around splice sites and pseudosites were used for developing and testing a human splice site recognition function to distinguish them. The training set includes 2/3 of all sequences, and the test set contains the remaining ones. The data set for computing octanucleotide preferences in coding and intron regions included 4074593 bp of coding regions and 1797572 bp of intron sequences.
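The size of the data set and the 2/3 to 1/3 division can be illustrated with a short sketch. The counts are taken from the text; the function name, the sequence identifiers, the random shuffling and the seed are assumptions, since the text does not say how the sequences were assigned to the training and test sets.

```python
import random

def split_sequences(seq_ids, train_fraction=2 / 3, seed=0):
    """Shuffle sequence identifiers and split them into training and test sets.

    Splitting whole sequences (rather than individual sites) keeps all
    sites and pseudosites of one gene in the same set.
    """
    ids = list(seq_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]

# Hypothetical identifiers standing in for the 692 GenBank sequences.
sequence_ids = [f"seq{i:03d}" for i in range(692)]
train_ids, test_ids = split_sequences(sequence_ids)

# Counts from the text: (authentic sites, pseudosites).
site_counts = {"donor": (2037, 89417), "acceptor": (2054, 134150)}
pseudo_per_site = {
    kind: pseudo / authentic
    for kind, (authentic, pseudo) in site_counts.items()
}
```

The ratios in `pseudo_per_site` show that pseudosites outnumber authentic sites by well over an order of magnitude, so even a small false positive rate per candidate yields many false predictions per gene.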