Analysis of individual sequences

This chapter focuses on analyses that can be performed based solely on the primary sequence of a protein. Several rationales can be applied. The physico-chemical characteristics of individual amino acids are one basis for predicting gross structural features. For example, particular repetitive patterns may suggest a coiled-coil structure, while secondary structure in general can be predicted from a statistical analysis of the primary sequence. The definition of signals recognized by the cellular transport machinery allows the prediction of subcellular location. Although somewhat unsystematic, such observations can provide valuable hints as to the structure and/or function of a protein.

Amino acid side chains differ in their physico-chemical features. For example, some are hydrophilic, i.e., they tend to be exposed to water, while hydrophobic amino acids tend to avoid exposure to water. Charge, size, and backbone flexibility are further examples of amino acid parameters. These parameters are usually measured on a numerical scale, so that for every parameter there exists a table assigning a number to each amino acid. For hydrophobicity, two such scales have become famous: the first is due to Hopp and Woods [1], the other to Kyte and Doolittle [2]. A large collection of amino acid parameters has been compiled by Argos [3], who found that subgroups of them are correlated with each other, so that the real information content of this large number of parameters is in fact lower than it seems. He selected the following parameters as a non-redundant set: hydrophobicity, turn preference [4], residue bulk [5], refractivity index [6], and antiparallel strand preference [7].

Figure: Hydrophobicity plot for the transmembrane protein opsin (ProtScale output for OPSD_HUMAN, Kyte & Doolittle scale, plotted against sequence position). Note the peaks in hydrophobicity, which roughly denote the putative membrane-spanning regions.
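The redundancy among parameter scales noted by Argos can be illustrated with a minimal sketch that computes the Pearson correlation between two per-residue tables. The sketch uses the Kyte-Doolittle hydropathy values and the Hopp-Woods hydrophilicity values as they are commonly tabulated. The function name `pearson` and the choice of these two scales are assumptions made for illustration; this is not the analysis Argos actually performed.

```python
from math import sqrt

# Kyte-Doolittle hydropathy (higher = more hydrophobic), as commonly tabulated.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Hopp-Woods hydrophilicity (higher = more hydrophilic), as commonly tabulated.
HOPP_WOODS = {
    "A": -0.5, "R": 3.0, "N": 0.2, "D": 3.0, "C": -1.0,
    "Q": 0.2, "E": 3.0, "G": 0.0, "H": -0.5, "I": -1.8,
    "L": -1.8, "K": 3.0, "M": -1.3, "F": -2.5, "P": 0.0,
    "S": 0.3, "T": -0.4, "W": -3.4, "Y": -2.3, "V": -1.5,
}


def pearson(scale_a, scale_b):
    """Pearson correlation of two scales over the amino acids they share."""
    residues = sorted(set(scale_a) & set(scale_b))
    xs = [scale_a[r] for r in residues]
    ys = [scale_b[r] for r in residues]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r = pearson(KYTE_DOOLITTLE, HOPP_WOODS)
print(f"Pearson r between the two scales: {r:.2f}")
```

Because one scale measures hydrophobicity and the other hydrophilicity, the coefficient comes out clearly negative. In other words, the second scale adds little information that the first does not already carry.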

Standard sequence analysis software today offers programs that plot various parameters for a given protein (Figure 2.1). Serious software packages tend to provide the user with a selection of informative and non-redundant parameters similar to the one given above. Some other packages purport to offer new insights by plotting large numbers of parameters. In practice, a parameter is plotted as a curve along the amino acid chain, with values averaged within a sliding window to smooth the curve. The choice of window width is, of course, arbitrary, but values between 7 and 15 generally seem appropriate. Within a window, a weighting scheme may be applied, which typically assigns more importance to residues in the middle of the window than to those at the edges.
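The sliding-window averaging described above can be sketched as follows. This is a minimal illustration, not the code of any particular package. The function names, the linear weighting model, and the `edge_weight` parameter are assumptions, and the Kyte-Doolittle table serves only as an example scale.

```python
# Kyte-Doolittle hydropathy values, used here only as an example scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_weights(width, edge_weight=1.0):
    """Weights rising linearly from edge_weight at the edges to 1.0 at the centre."""
    half = width // 2
    return [
        edge_weight + (1.0 - edge_weight) * (1.0 - abs(i - half) / half)
        for i in range(width)
    ]


def smoothed_profile(sequence, scale, width=9, edge_weight=1.0):
    """Return (centre_position, weighted_mean) pairs; positions are 1-based."""
    if width < 3 or width % 2 == 0:
        raise ValueError("width must be an odd number >= 3")
    weights = window_weights(width, edge_weight)
    total = sum(weights)
    values = [scale[aa] for aa in sequence.upper()]
    half = width // 2
    profile = []
    for start in range(len(values) - width + 1):
        window = values[start:start + width]
        score = sum(w * v for w, v in zip(weights, window)) / total
        profile.append((start + half + 1, score))
    return profile


# A single hydrophobic residue (I) flanked by alanines, one window wide.
flat = smoothed_profile("AAAAIAAAA", KYTE_DOOLITTLE, width=9)
peaked = smoothed_profile("AAAAIAAAA", KYTE_DOOLITTLE, width=9, edge_weight=0.1)
print(flat, peaked)
```

With `edge_weight=1.0` every residue in the window counts equally. Lowering it lets the central residue dominate, so the isolated isoleucine raises the weighted score more than the plain average.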

Features of the individual amino acids also play a key role in protein secondary structure formation. Based on experimental data, scales have been assigned to the amino acids describing their preference for assuming, e.g., a helical structure [8]. Consequently, early secondary structure prediction methods assigned preferences to the amino acids according to which secondary structure they tend to assume. For example, glutamate is frequently found in alpha helices, while valine has a preference for beta strands and proline is known to be strongly avoided in helices. Modern secondary structure prediction methods are more involved, though (see below).

The functional features of proteins that are captured by such parameters are manifold. Hydrophobic amino acids tend to occur in the interior of globular proteins, while hydrophilic residues are found preferentially at the surface. One application of the latter observation is the prediction of antigenic epitopes, because these are assumed to coincide with patches of hydrophilic residues. In transmembrane proteins, the regions of the chain that span the membrane tend to be strongly hydrophobic, and the recognition of several such regions is an integral part of methods that aim to recognize transmembrane proteins from their sequence alone [9, 10]. Recognition of transmembrane regions has proved remarkably successful, to the point that the output of these programs is accepted as annotation in sequence databases.
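As a rough illustration of how strongly hydrophobic stretches can be flagged, the sketch below averages Kyte-Doolittle values over a window and merges overlapping windows whose mean exceeds a threshold. The window width of 19 and the threshold of 1.6 follow the rule of thumb usually attributed to Kyte and Doolittle. The function names and the toy sequence are hypothetical, and the methods cited in [9, 10] are considerably more elaborate.

```python
# Kyte-Doolittle hydropathy values (higher = more hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_means(sequence, width=19):
    """Unweighted mean hydropathy of every full window along the sequence."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(values[i:i + width]) / width
        for i in range(len(values) - width + 1)
    ]


def candidate_segments(sequence, width=19, threshold=1.6):
    """Merge windows whose mean exceeds the threshold into residue ranges.

    Returns a list of (start, end) tuples, 1-based and inclusive.
    """
    segments = []
    current = None
    for i, mean in enumerate(window_means(sequence, width)):
        if mean <= threshold:
            continue
        start, end = i + 1, i + width
        if current is not None and start <= current[1] + 1:
            current = (current[0], end)  # overlaps or abuts the open segment
        else:
            if current is not None:
                segments.append(current)
            current = (start, end)
    if current is not None:
        segments.append(current)
    return segments


# Hypothetical toy sequence: 22 leucines flanked by charged residues.
toy = "DEKRDEKRDE" + "L" * 22 + "DEKRDEKRDE"
print(candidate_segments(toy))
```

The reported range extends somewhat beyond the hydrophobic core, because windows that only partly cover the flanking residues can still exceed the threshold. Real predictors refine the segment boundaries in a separate step.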

Certain periodicities in the occurrence of hydrophilic and hydrophobic residues may indicate particular secondary structures. Exposed helices tend to display a hydrophilic residue every 3-4 positions, while the other residues are rather hydrophobic. This is clearly visible in a "helical wheel representation", where the amino acid sequence is printed on a circle with 3.6 amino acids per full turn of 360°, i.e., 100° per residue. Likewise, a strand with one exposed side may display a hydrophilic residue at every other position. Based on these observations, Eisenberg devised the method of hydrophobic moments [11]. Many structural proteins interact with each other through intertwined helices forming a so-called coiled-coil structure. The physical constraints on the interface between these helices are reflected in a certain periodic arrangement of hydrophilic and hydrophobic amino acids. Generally, the sequence of an alpha helix that participates in a coiled-coil region will display a periodicity with a repeated unit of 7 amino acids. If those 7 positions are denoted a through g, then positions e and g tend to be charged or polar, while a and d are hydrophobic [12]. Prediction methods for coiled-coil regions make use of these preferences [13] and even attempt to distinguish between two-stranded and three-stranded coiled-coils [14].
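The periodicity argument can be made quantitative with the hydrophobic moment. In its usual form, each residue n contributes a vector of length H_n (its hydrophobicity) pointing at angle n·δ around the axis, and the moment is the length of the vector sum. The sketch below takes δ = 100° for an alpha helix, which follows from 3.6 residues per turn. Using Kyte-Doolittle values in place of Eisenberg's own scale is an assumption made for illustration, and the function name is hypothetical.

```python
import math

# Kyte-Doolittle hydropathy values, standing in for Eisenberg's own scale.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobic_moment(segment, angle_deg=100.0, scale=KYTE_DOOLITTLE):
    """Length of the vector sum of residue hydrophobicities.

    Residue n (0-based) is placed at angle n * angle_deg around the axis.
    Divide the result by len(segment) to obtain a per-residue value.
    """
    delta = math.radians(angle_deg)
    sum_sin = 0.0
    sum_cos = 0.0
    for n, aa in enumerate(segment.upper()):
        h = scale[aa]
        sum_sin += h * math.sin(n * delta)
        sum_cos += h * math.cos(n * delta)
    return math.hypot(sum_sin, sum_cos)


# 18 residues make exactly five helical turns, so a uniform segment cancels out.
uniform = "L" * 18

# Hypothetical amphipathic helix: leucine on one face, lysine on the other.
amphipathic = "".join(
    "L" if math.cos(math.radians(100 * n)) > 0 else "K" for n in range(18)
)

print(amphipathic)
print(hydrophobic_moment(uniform), hydrophobic_moment(amphipathic))
```

A uniformly hydrophobic segment yields a moment near zero, whereas the segment with one hydrophobic and one hydrophilic face yields a large one. Calling the same function with an angle of roughly 160° to 180° probes the alternating pattern expected for a strand with one exposed side.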

The primary sequence of a protein also contains the information on whether a protein is secreted or which cellular compartment it is destined for. In particular, secretory proteins contain an N-terminal signal peptide, for the recognition of which programs have been developed [15]. Other signals, such as nuclear localization signals, are notoriously hard to describe. Many posttranslational modifications, on the other hand, are linked to particular amino acid patterns as described in the PROSITE database.
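Pattern-based sites of this kind lend themselves to simple regular-expression searches. The sketch below translates a subset of the PROSITE pattern syntax into a Python regular expression and scans a sequence with the N-glycosylation pattern N-{P}-[ST]-{P}. The helper names and the test peptide are hypothetical, and special cases such as anchors inside brackets are not handled.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simple PROSITE pattern into a Python regular expression.

    Handles x (any residue), [..] (allowed), {..} (forbidden),
    (n) or (n,m) repetition, and the < and > sequence-end anchors.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("<", "^").replace(">", "$")
        element = element.replace("{", "[^").replace("}", "]")
        element = element.replace("(", "{").replace(")", "}")
        element = element.replace("x", ".")
        parts.append(element)
    return "".join(parts)


def find_sites(sequence, pattern):
    """Return (position, matched_text) pairs, 1-based, overlaps included."""
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [(m.start() + 1, m.group(1)) for m in regex.finditer(sequence)]


N_GLYCOSYLATION = "N-{P}-[ST]-{P}"

# Hypothetical test peptide; the asparagine followed by proline must not match.
peptide = "MNGTANPSNKSA"
print(prosite_to_regex(N_GLYCOSYLATION))
print(find_sites(peptide, N_GLYCOSYLATION))
```

The lookahead wrapper lets overlapping occurrences be reported, which matters for short patterns. A match only indicates that the sequence is compatible with the modification, not that the modification actually occurs.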
