studies on various protein families

Several protein families of different nature were studied for genetic relationship, correct alignment at non-homologous fragments, optimal sequence consensus construction, and confirmation of their actual relevance. A comparison of the genetic semihomology approach with statistical approaches indicates the high accuracy and cognitive significance of the former. This is particularly pronounced in the study of related proteins that show a low degree of homology. The multiple sequence alignments were verified and corrected with respect to the questionable, non-homologous fragments. The verified alignments were the basis for consensus sequence formation. The frequency of occurrence of six-codon amino acids versus position variability was studied, and their possible role in amino acid mutational exchange at variable positions is discussed.

Theoretical comparative studies on proteins and nucleic acids have become powerful and advanced research tools commonly used in biochemistry, molecular biology, genetics, protein modeling and structure/function prediction. The informative and predictive value of such studies has been recognized in both protein and nucleic acid research. Over 400 amino acid indices and at least 42 mutation matrices have been described so far (Tomii & Kanehisa, 1996). In fact, they are based on far fewer algorithms, most of which are modifications of several original ones. Most current tools are based on principles derived from the Dayhoff matrix (Dayhoff & Eck, 1968; Dayhoff et al., 1979). The indices used for comparative sequence analysis are mainly of the BLOSUM or PAM type, with different parameters according to the kind of protein sequences analyzed.
Most algorithms and programs use statistical matrices of amino acid replacement. They consider the probability of replacement from the statistical point of view, but do not refer to the biological mechanisms underlying that probability. The matrix indices (e.g., PAM250 used in the Mutation Data Matrix) that reflect similarity and/or relationship, as well as most methods used for multiple sequence alignment, are focused entirely on statistical calculations of the observed changes. The probability of amino acid replacement based on the genetic code is often not considered at all. Nor is there any reference to the possible mutation mechanism or type. Many examples of such an approach are available within the tools accompanying the protein and/or genomic databases, such as the Swiss-Prot Expert Protein Analysis System (ExPASy), the European Molecular Biology Laboratory (EMBL) database, the National Center for Biotechnology Information (NCBI), the European Bioinformatics Institute (EBI) database and GenBank.
The statistical scoring matrices are very useful for studies on similarity and variability within protein families. However, they depend on the database used. The same algorithm can lead to different scoring matrices depending on the type and number of protein sequences used as a database (Tomii & Kanehisa, 1996). Applying the same algorithm with identical parameters but in different programs may also give different results (Leluk, 2000a). The statistical algorithms cannot give detailed information about the biological mechanism of protein variability. For example, their ability to predict a new possible sequence within a protein family is quite limited, even if the probability of amino acid replacement among homologous sequences is well described.
The accuracy of the scoring matrix of amino acid replacement depends on the proper alignment of the sequences being compared. Therefore the most efficient programs combine both steps: a replacement scoring matrix and the alignment procedure. They are provided with several matrices, which makes it possible to choose the best one for the specified purposes and proteins. However, all of them may be defined as statistical ones. The alignment strategy also varies between programs (residue-to-residue, segment-to-segment, motif search, etc.). Usually the alignment tools are well defined from the mathematical and statistical point of view, but they seldom refer to the biological principles of mutational mechanisms among proteins and/or to the genetic code. Typical examples of such a theoretical approach are the MAST sequence homology search algorithm (Bailey & Gribskov, 1998a) and the program MEME (Bailey & Gribskov, 1997; Bailey & Elkan, 1995; Bailey & Gribskov, 1998b; Grundy et al., 1997). In this program the part concerning match scores, error parameters and cross-validation estimators is highly developed, but the basic constituents of the biological molecule are treated as just one-letter symbols of a string (the biological characteristics and genetic relationship are ignored).
The algorithm of genetic semihomology (Leluk, 1998; 2000a, b) assumes a close relation between amino acids and their codons for the analysis of various relationships between proteins belonging to a given family and to different protein families. It differs from the others in its non-statistical approach and the lack of a scoring scale (amino acid replacement within related proteins is not represented by numerical index values or replacement probability factors, as it is in PAM or BLOSUM matrices). Instead of a scoring matrix it is supported by a three-dimensional diagram including all theoretically possible amino acid replacements by one nucleotide exchange in their codons. More details concerning the algorithm of genetic semihomology, its construction and basic assumptions are described in the article devoted to the algorithm itself (Leluk, 1998). Besides the standard assignments, it can be used for the study of cryptic mutations, the mechanisms of variability, and the prediction of the gene nucleotide sequence and of new possible protein sequences within the same family. It was successfully applied to the study of long-distance mutation correlations and their effect on protein conformation (Leluk, 2000c). A special advantage concerns the confirmation of a true relationship for proteins revealing low homology. The results of applying the genetic semihomology approach to various protein families are presented in this article.
The frequency of occurrence of six-codon amino acids (Ser, Arg and Leu) as a function of position variability was calculated and analyzed according to the principles of the genetic semihomology algorithm (Leluk, 1998; Leluk, 2000a). The same algorithm supported the study of the role, occurrence and mechanism of cryptic mutations within different protein families.

RESULTS AND DISCUSSION
The alignment strategy for proteins revealing high and low homology
The initial step of alignment is the same with the algorithm of genetic semihomology as with most other algorithms. First, the sequences are checked for the best alignment that gives the maximum number of identities (without gaps). This alignment is the starting point for further adjustment of particular fragments with the use of gaps. For high homologies (60% and more) and low gap contribution the results are concordant regardless of the method used. Thus the multiple alignments for proteinase inhibitors from squash or Bowman-Birk inhibitors (Fig. 1) look almost identical when they are performed with ClustalW (Thompson et al., 1994), MultAlin (Corpet, 1988) or the genetic semihomology algorithm (Leluk, 1998). The differences appear where gaps occur (especially in non-conservative regions) and when proteins reveal low homology to each other (30% or less). For such proteins the application of the same statistical algorithm with the same analysis parameters, but in different programs, may bring different results (Leluk, 2000a). In order to set the actual alignment in those regions, the genetic semihomology algorithm considers the genetic relationship between particular residue pairs. This relationship assumes a single point mutation of the codon as the most likely mechanism of change. This makes the algorithm different from the statistical approaches, where mainly the statistical frequency of amino acid replacement is considered. The genetic approach to amino acid exchange not only enables the proper alignment of non-identical (but related) fragments, but also gives some information about the evolutionary mechanism. It also allows prediction of other hypothetical residues that may occur at a certain position.
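The gap-free starting step described in this section can be illustrated with a minimal Python sketch. It slides one sequence along the other and keeps the offset that gives the most identical pairs. The function name `best_ungapped_offset` and the example sequences are hypothetical and serve only as an illustration; this is not the implementation used in the study.

```python
def best_ungapped_offset(seq_a, seq_b):
    """Return (offset, identities) for the gap-free overlay with most identities.

    An offset k means that seq_b[j] is placed against seq_a[j + k].
    Ties are resolved in favour of the smallest offset (an arbitrary choice).
    """
    best = (0, -1)
    for offset in range(-(len(seq_b) - 1), len(seq_a)):
        identities = sum(
            1
            for j, res_b in enumerate(seq_b)
            if 0 <= j + offset < len(seq_a) and seq_a[j + offset] == res_b
        )
        if identities > best[1]:
            best = (offset, identities)
    return best


# Hypothetical sequences, for illustration only.
offset, identities = best_ungapped_offset("MKRICPRILMECK", "RICPRIL")
print(offset, identities)
```

The overlay found in this way would then be the starting point for placing gaps, which this sketch does not attempt.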
It is obvious that for cysteine-rich proteins the cysteine distribution along the chain serves as the initial data for correct alignment (Fig. 1). Of course, this concerns the cysteines involved in the formation of disulfide bridges. This is especially useful for studies on proteinase inhibitors, which are divided into several families according to the topology of cysteines (and disulfide bridges). But the cysteine parameter cannot be used for proteins such as the cysteine-free eglin-like proteins (Fig. 2) or membrane-bound spectrins (Fig. 3). In such cases it is necessary to recognize and localize the consensus positions which are sufficiently conservative. Depending on the protein group these may be the positions occupied by Trp, Pro, or long-chain hydrophobic residues. The next step is to estimate the distance between the consensus residues for all proteins being aligned and to consider the occurrence of gaps. If gaps (deletions) are present, they must be located properly. For that purpose all pairs occurring between consensus residues (the consensus loop) are checked for genetic relationship. Each pair is checked for the possibility of exchanging one residue for the other by a single replacement of one nucleotide of its codon (actual or hypothetical). The residues that do not correlate genetically with any residue of the compared fragment of the other protein are considered as inserted, and the gaps are located at the positions corresponding to them. If the fragments being compared are the same in length and do not show genetic semihomology, they can be interpreted as the result of a process different from single point mutation. The genetic semihomology analysis of proteins revealing low similarity makes it possible to establish whether the similarity is real or accidental. Figure 4 presents the alignments of sequences homologous to the 10th segment of human α-spectrin, obtained with ClustalW and MultAlin, and the alignment consistent with the genetic semihomology algorithm. Depending on the algorithm chosen, the contribution of conservative consensus positions (white on black) and of related positions (grey) is different. The gap contribution is very limited and the identity is up to 60%, which proves the significant relationship between these sequences. However, the results are different for each analysis. ClustalW gives the least amount of information. The results obtained with MultAlin show higher similarity and the consensus is more complete (although this program uses the same scoring matrix as ClustalW). Significantly more information is obtained with the genetic semihomology approach. The contribution of conservative positions is the fullest: almost all residues at corresponding positions show a possible genetic relationship. The concentration of non-related residues (white background) suggests a variability mechanism different from single point mutation, or more intensive mutational changes at these spots. The gap setting is also different in several cases. The occurrence of specified consensus positions relative to "blank" positions is the highest. A meaningful consensus can also be obtained with the DIALIGN program, which is a segment-to-segment approach to multiple sequence alignment (Morgenstern et al., 1996; Morgenstern, 1999). However, the informative value of this result is not very high. Not all positions are considered to be aligned, there is one additional gap close to the C-terminus, and the gap distribution conforms to the other alignments only in general.
Similar comparative studies were done for human erythrocyte α-spectrin repeats (23 domains) (Fig. 3). The homology among them does not exceed 30% (for most of them it is less than 25%). The high incidence of possible genetic relationship may be overestimated in this case because of the very high position variability. The very variable positions (the ones that accept more than 8 residues) should not be analyzed for semihomologous relationship alone, because there are too many possible ways of codon transformation. However, the genetic semihomology approach identifies many more consensus positions than the other methods do (see the chapter "Construction of a sequence consensus for related proteins").

Legend to Fig. 1: Proteinase inhibitors from squash seeds; the Bowman-Birk inhibitor family. The essential conservative residues are marked as white characters on black background. The shadowed characters indicate the semihomology relationship between the residues.

Figure legend: The results of genetic semihomology analysis. The labels' meaning is the same as in Fig. 1. See text for details.

Figure legend: The essential conservative residues are marked as white characters on black background. The shadowed characters stand for typical residues for aligned positions (significant conservativity) in the MultAlin alignment and for the semihomology relationship in the genetic semihomology approach. The segment consensus sequence calculated by each method is presented below each multiple alignment.

J. Leluk and others, 2001. Vol. 48. Genetic semihomology studies on proteins.
The eglin-like proteins were aligned in the same manner as the spectrin repeats (Fig. 2). Eglin itself is cysteine-free, therefore cysteines contribute almost nothing to the structure alignment, but the homology between the sequences is higher than for the spectrin repeats (up to 50%). A thorough analysis and verification with the genetic semihomology approach revealed a considerable number of consensus residues.
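The pairwise check applied within a consensus loop (can one residue be exchanged for the other by a single nucleotide replacement in some codon?) can be sketched in Python as follows. This is a simplified stand-in for the three-dimensional diagram of Leluk (1998): it uses the standard genetic code only, and the names `is_semihomologous` and `classify_pairs` are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def codons_for(residue):
    """All codons that encode the given one-letter residue."""
    return [codon for codon, aa in CODON_TABLE.items() if aa == residue]


def is_semihomologous(res_a, res_b):
    """True if some codon of res_a differs from some codon of res_b at exactly one base."""
    return any(
        sum(x != y for x, y in zip(codon_a, codon_b)) == 1
        for codon_a in codons_for(res_a)
        for codon_b in codons_for(res_b)
    )


def classify_pairs(frag_a, frag_b):
    """Label each aligned pair of two equal-length fragments."""
    labels = []
    for a, b in zip(frag_a, frag_b):
        if a == b:
            labels.append("identical")
        elif is_semihomologous(a, b):
            labels.append("semihomologous")
        else:
            labels.append("unrelated")
    return labels


# Hypothetical fragments, for illustration only.
print(classify_pairs("LKW", "LRD"))
```

Residues labelled "unrelated" against every residue of the compared fragment would be the candidates for insertions, and hence for gap placement, in the procedure described in this section.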

Construction of a sequence consensus for related proteins
The aligned sequences of homologous proteins are used to construct a consensus sequence specifying the most conservative residues at the positions significant for a protein family. The important parameter in the consensus construction is the ratio (r) of the occurrence of a residue (n) at the aligned position to the number of aligned sequences (N):

r = n/N

Usually the value of this ratio is proportional to the degree of homology of the aligned proteins. For very coherent, high homologies the consensus residues reveal an r value of 0.7 to 1.0. For lower homologies (30% or less) the accepted r value may be lower, but it should not be less than 0.4. To get a good consensus the ratio should not be lower than 0.5 if fewer than 20 sequences are aligned. In questionable cases the accompanying residues at the aligned position may be considered as well. For example, if there is a leucine with an r value of about 0.4 and that position is also occupied by isoleucine (physicochemically similar and genetically semihomologous to leucine), then leucine may be assumed as the consensus residue if the r value for Leu and Ile together is significantly high (e.g., 0.7). However, it is better to use a more restrictive design strategy for more correct recognition of the actual homologies.
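The ratio r = n/N and its threshold can be illustrated with a short Python sketch. It assumes that the alignment is given as equal-length strings with "-" for gaps and that positions below the threshold are written as "x"; the function name `consensus`, both symbols and the example alignment are hypothetical. The Leu/Ile refinement mentioned above is not implemented.

```python
from collections import Counter


def consensus(aligned, r_min=0.5, gap="-", blank="x"):
    """Build a consensus string from equal-length aligned sequences.

    A residue is accepted at a position when r = n / N >= r_min, where n is
    the count of the most frequent residue and N the number of sequences.
    """
    n_seqs = len(aligned)
    result = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            result.append(blank)
            continue
        residue, n = counts.most_common(1)[0]
        result.append(residue if n / n_seqs >= r_min else blank)
    return "".join(result)


# Hypothetical four-sequence alignment, for illustration only.
alignment = ["ACDK", "ACEK", "ACDR", "GC-Q"]
print(consensus(alignment, r_min=0.5))
print(consensus(alignment, r_min=0.7))
```

Raising `r_min` corresponds to the more restrictive design strategy recommended in the text: fewer positions are specified, but each of them is better supported.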
The sequence similarity search results for the α-spectrin consensus constructed by different methods are shown in Fig. 5. The consensus refers to the general 106-amino-acid repeat of human erythrocyte α-spectrin (the repeats reveal less than 30% homology to each other; Fig. 3). The consensus designed by Sahr and co-workers (Sahr et al., 1990)

Figure legend: The results of two BLOSUM approaches (ClustalW and MultAlin) and the genetic semihomology analysis. The essential conservative residues are marked as white characters on black background. The shadowed characters indicate the positions of significant relationship in the BLOSUM alignments and the semihomology relationship in the genetic semihomology approach. MultAlin consensus symbols: "%" is any of FY; "#" is any of NDQEBZ. ClustalW consensus symbols: "*" stands for identical or conserved residues in all sequences in the alignment; ":" indicates conserved substitutions; "." indicates semi-conserved substitutions. See text for details.

Dot-matrix presentation of low but significant homologies
In this study 23 human erythrocyte α-spectrin repeats were taken as an example of low-homology sequences of common origin. The homology between the segments is usually less than 25% (regarding identities). It is evident that all α-spectrin repeats have evolved from one ancestral gene encoding the initial 106-residue segment by several contiguous duplications (Speicher & Marchesi, 1984; Wasenius et al., 1989). The sequence homology is hardly visible (Fig. 3); much more similarity is found in the secondary and tertiary structure of the spectrin segments, each of which has a triple-helical character (Speicher & Marchesi, 1984; Yan et al., 1993). The dot plot reveals repeated internal homology along the α-spectrin chain when an appropriate frame setting and identity threshold are used (Fig. 6). The common features of most repeats can be confirmed, and more details can be inferred, when the dot plot is run for α-spectrin versus the consensus segment sequence of Sahr and co-workers (Sahr et al., 1990). This plot shows even more details when the genetic semihomology consensus is used and the plot is run in the semihomology mode (visualization of identical and semihomologous pairs) (Fig. 7). The repeating structure of the spectrin chain is then clearer, and the conservative spots can be localized more easily. The dot-plot analysis of the α-spectrin chain against the consensus also gives some information about the evolutionary distance among the segments. Segments 10 and 21-23 are not visible on these plots at all, which indicates their high divergence in comparison with the other repeats. These conclusions are concordant with earlier reports (Speicher & Marchesi, 1984; Wasenius et al., 1989). Additionally, this approach is a useful tool for detailed analysis of structurally and functionally essential fragments as well as of the differentiation mechanism of each segment.
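A windowed dot plot of the kind discussed here can be sketched in Python. The sketch assumes that the "frame size" is the length of a diagonal window and that the "identity threshold" is the minimum number of matching pairs within that window; the exact definitions used by the plotting program are not given in the text, and the names `dot_plot` and `match` are hypothetical.

```python
def dot_plot(seq_x, seq_y, frame=40, threshold=8, match=lambda a, b: a == b):
    """Return the set of (i, j) window starts that deserve a dot.

    A dot is placed at (i, j) when the diagonal window of length `frame`
    starting at seq_x[i] and seq_y[j] contains at least `threshold` matches.
    `match` decides whether two residues count as a match; identity by default.
    """
    dots = set()
    for i in range(len(seq_x) - frame + 1):
        for j in range(len(seq_y) - frame + 1):
            hits = sum(match(seq_x[i + k], seq_y[j + k]) for k in range(frame))
            if hits >= threshold:
                dots.add((i, j))
    return dots


# Hypothetical toy sequence with an internal repeat, compared with itself.
dots = dot_plot("ACDACD", "ACDACD", frame=3, threshold=3)
print(sorted(dots))
```

Off-diagonal dots such as (0, 3) mark an internal repeat. A semihomology mode could be approximated by passing a `match` function that also accepts residue pairs related by a single nucleotide change. The triple loop is deliberately naive and would be slow for a chain as long as α-spectrin.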

Application of the genetic semihomology algorithm to the study of the six-codon amino acid distribution as a function of position variability. A possible role of cryptic mutations in protein differentiation
Individual amino acids are encoded by different numbers of codons (1 to 6). Three of them have the maximum number of codons (6): arginine, leucine and serine. The commonly used algorithms and programs do not consider the number of possible codons in the study of mutational replacement of a particular residue. The well-developed alignment procedures do not take this parameter into account either. In this chapter the theoretical significance of multiple-codon residues and cryptic mutations for the increased frequency of amino acid replacement is discussed. The contribution and possible role of these residues at variable positions are also presented.
The three six-codon amino acids differ in the diversity of their codon composition. The least diversity is found among leucine codons (CTX and TTR), the greatest among the codons of serine (AGY and TCX). Arginine codons (AGR and CGX) are described as more diverse than leucine codons, since the latter always have a pyrimidine at the first position. For these amino acids there are possible mutations, other than replacement of the third nucleotide, that do not change the encoded amino acid; even multiple mutations may leave the residue unchanged. Such mutations are defined as cryptic. The genetic code matrix (GCM) scores (George et al., 1990) for these specific cryptic mutations are within the range +1 to +2 for Arg and Leu and 0 to +2 for Ser. That means that for Arg and Leu a replacement at two codon positions can leave the amino acid unchanged, and serine may remain even after replacement at all three positions of its codon (e.g., AGT → TCC). It is obvious that cryptic mutations should not have any evolutionary consequences at the protein level, since the protein remains identical. Therefore these mutations are not limited by the structural or functional requirements of the protein. On the other hand, it is evident that among the mutations affecting structure and function, the most common and most likely are single-base replacements. Thus the cryptic mutations may serve as a "passage" to increase the number of residues at a variable position. Theoretically a single point mutation can transform leucine to 10 amino acids. Arginine and serine may be changed to 12 amino acids each. For comparison, the four-codon amino acids may be transformed to 7-8 other residues by a single point mutation.

Figure legend: For the best visualization of the repeats, the identity threshold is set as 15, and frame size as 75. Details in the Results and Discussion.
The genetic semihomology planar diagram (Fig. 8) shows the cryptic passages for Leu, Arg and Ser. If this mechanism operates, then a higher frequency of these three amino acids should be observed at the very variable positions, where eight or more types of amino acids occur (Leluk, 1998; Leluk, 2000a). Moreover, serine should dominate over arginine and arginine over leucine at those positions.
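The codon arithmetic behind the cryptic-passage argument can be checked with a short Python sketch based on the standard genetic code. It counts the amino acids reachable from a residue by one nucleotide change in any of its codons (stop codons excluded) and the largest number of positions at which two synonymous codons differ. The function names are hypothetical, and the sketch does not reproduce the GCM scores.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks stop codons).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def point_mutation_neighbours(residue):
    """Amino acids reachable from `residue` by a single nucleotide change."""
    reachable = set()
    for codon, aa in CODON_TABLE.items():
        if aa != residue:
            continue
        for pos, base in product(range(3), BASES):
            mutant = codon[:pos] + base + codon[pos + 1:]
            reachable.add(CODON_TABLE[mutant])
    # Drop the residue itself (silent changes) and stop codons.
    return reachable - {residue, "*"}


def max_cryptic_distance(residue):
    """Largest number of differing positions between two codons of `residue`."""
    codons = [c for c, aa in CODON_TABLE.items() if aa == residue]
    return max(sum(x != y for x, y in zip(a, b)) for a in codons for b in codons)


for aa in "LRS":
    print(aa, len(point_mutation_neighbours(aa)), max_cryptic_distance(aa))
```

Under these assumptions the sketch gives 10 neighbours for leucine and 12 each for arginine and serine, with synonymous codons differing at up to two positions for Leu and Arg and at all three for Ser. For the four-codon amino acids (Val, Pro, Thr, Ala, Gly) it gives 7 or 8 neighbours.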
Several protein families were subjected to Ser-Arg-Leu occurrence analysis as a function of the degree of position variability (Fig. 9). Generally the theoretical prediction of their frequency was confirmed. However, for some protein families (e.g., eglin-like proteins) the results were different than expected. The analysis of all sequences from all families combined showed a distinct domination of serine frequency at the positions occupied by seven and more residues (Fig. 10). The leucine contribution is the lowest, and the rise in frequency as the number of residues increases is not as clear for leucine as for serine and arginine. In conclusion, it may be assumed that the mechanism of cryptic passages of six-codon amino acids plays a special role in increasing the position variability.
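One way to tabulate the occurrence of Ser, Arg and Leu against position variability is sketched below in Python. The counting convention is an assumption, since the text does not spell it out: variability is taken as the number of distinct residue types in an alignment column, and the frequency of a residue is the fraction of columns with that variability in which it occurs. The function name `six_codon_frequency` and the toy alignment are hypothetical.

```python
from collections import defaultdict


def six_codon_frequency(aligned, gap="-"):
    """Map variability -> {"S": f, "R": f, "L": f} over alignment columns.

    Variability is the number of distinct residue types in a column; f is the
    fraction of columns of that variability in which the residue occurs.
    """
    columns_seen = defaultdict(int)
    occupied = defaultdict(lambda: dict.fromkeys("SRL", 0))
    for column in zip(*aligned):
        kinds = set(column) - {gap}
        if not kinds:
            continue
        variability = len(kinds)
        columns_seen[variability] += 1
        for aa in "SRL":
            if aa in kinds:
                occupied[variability][aa] += 1
    return {
        v: {aa: occupied[v][aa] / columns_seen[v] for aa in "SRL"}
        for v in sorted(columns_seen)
    }


# Hypothetical three-sequence alignment, for illustration only.
freq = six_codon_frequency(["ASLK", "ASRK", "ATSK"])
for variability, values in freq.items():
    print(variability, values)
```

With real family alignments, the cryptic-passage hypothesis would predict that these fractions rise with variability, and more steeply for serine and arginine than for leucine.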
The authors wish to thank Miss Monika Grabiec and Miss Monika Sobczyk for their contribution to the part of the work concerning multiple alignment verification and consensus formation.

Figure legend: Note that the leucine frequency increase is not as regular as for arginine and serine. Except for the eglin-like proteins, the serine and arginine occurrence is generally more significant at the most variable positions than the leucine occurrence. The calculations concern 2686 residues at 606 corresponding positions.

Figure 1. Two examples of multiple alignment of highly homologous, cysteine-rich proteins, verified by the algorithm of genetic semihomology.

Figure 2. Multiple alignment of proteins homologous to eglin C from Hirudo medicinalis.

Figure 3. Multiple alignment of the 23 segments of human erythrocyte α-spectrin achieved with the MultAlin (BLOSUM62) method and the genetic semihomology approach.

Figure 4. Multiple alignment of proteins homologous to the 10th segment of human erythrocyte α-spectrin.
CHAIN, BRAIN (SPECTRIN, NON-ERYTHROID ALPHA CHAIN)
P14085 TYROSINE-PROTEIN KINASE TRANSFORMING PROTEIN SRC FROM AVIAN SARCOMA VIRUS (P60-SRC)
P39688 MOUSE PROTO-ONCOGENE TYROSINE-PROTEIN KINASE FYN (P59-FYN)
P00526 TYROSINE-PROTEIN KINASE TRANSFORMING PROTEIN SRC FROM ROUS SARCOMA VIRUS (P60-SRC)
P09769 HUMAN PROTO-ONCOGENE TYROSINE-PROTEIN KINASE FGR (P55-FGR) (C-FGR)
P07948 HUMAN TYROSINE-PROTEIN KINASE LYN
P05433 P47(GAG-CRK) PROTEIN FROM AVIAN SARCOMA VIRUS
P07947 HUMAN PROTO-ONCOGENE TYROSINE-PROTEIN KINASE YES (P61-YES) (C-YES)
P08631 HUMAN TYROSINE-PROTEIN KINASE HCK (P59-HCK AND P60-HCK) (HEMOPOIETIC CELL KINASE)
P06240 MOUSE PROTO-ONCOGENE TYROSINE-PROTEIN KINASE LCK (P56-LCK) (LSK)
P42683 CHICKEN PROTO-ONCOGENE TYROSINE-PROTEIN KINASE LCK (PROTEIN-TYROSINE KINASE C-TKL)
P08487 BOVINE 1-PHOSPHATIDYLINOSITOL-4,5-BISPHOSPHATE PHOSPHODIESTERASE GAMMA 1 (PLC-GAMMA-1) (PHOSPHOLIPASE C-GAMMA-1) (PLC-II) (PLC-148)

and specifically considers features typical for α-spectrin. It works much better than the consensus obtained with the MultAlin (BLOSUM62) approach. MultAlin consensus results could be obtained only when the expect threshold (the statistical significance threshold in BLAST similarity searches for reporting matches against database sequences) is set extremely high. The genetic semihomology consensus does not separate α- and β-spectrins from each other as well as the consensus of Sahr and co-workers does, but the expect threshold values are much lower. This means that the related proteins are easier to find in the database as sequences of true homology.

Figure 5. Use of the α-spectrin consensus achieved with different algorithms for sequence similarity search (BLAST).

Figure 6. Dot matrix comparison of human erythrocyte α-spectrin with itself.

Figure 7. Dot matrix comparison of human erythrocyte α-spectrin with the consensus of 106-residue repeats. The consensus is (A) as described by Sahr et al. (1990) or (B) achieved with the genetic semihomology approach (Leluk, 1998). The identity threshold and frame size are set as 8 and 40, respectively. In the genetic semihomology approach (B) the identity and genetic semihomology of the compared residues are visualized. Details in the text.

Figure 8.

Figure 9. Frequency of six-codon amino acids as a function of position variability in different protein families.

Figure 10. Frequency of six-codon amino acids as a function of position variability in randomly selected proteins of different origin and nature (see text for details).