Regular paper, on-line at: www.actabp.pl
Tandemly repeated trinucleotides: comparative analysis

Characteristics of the 64 possible tandem trinucleotide repeats (TSSR) from the Homo sapiens (hs), Mus musculus (mm) and Rattus norvegicus (rn) genomes are presented. A comparative analysis of TSSR frequency depending on repetitiveness, and of the similarity of the TSSR length distributions, is shown. A comparative analysis of TSSR sequence motifs and of the association between the type of motif and its length (n), using the ρ-coefficient method (which quantitatively measures the association between variables in contingency tables), is presented. These analyses were carried out in the context of neurodegenerative diseases based on trinucleotide tandems. The length of these tandems and their relation to other TSSR is estimated. It was found that the higher the repetitiveness (n), the lower the frequency of trinucleotide tandems. Differences between the genomes under consideration, especially for TSSR longer than n = 9, are discussed. A significantly higher frequency of A- and T-rich tandems is observed in the human genome (as well as in human mRNA). This observation also applies to mm and rn, although these tandems are less abundant there than in the human genome. The origin of elongation (or shortening) of TSSR seems to be neither frequency nor length dependent. The results of the TSSR analysis presented in this work suggest that neurodegenerative disease-related microsatellites do not differ from the other TSSR except in their lower frequency. CAG occurs with relatively high frequency in human mRNA, although there are other TSSR with higher frequency that do not cause comparable disorders. This suggests that the mechanism of TSSR instability is not the only origin of neurodegenerative diseases.


INTRODUCTION
Mammalian genomes are scattered with simple sequence repeats (SSRs; minisatellites and microsatellites), which comprise about 3% of the human genome.
SSRs are liable to fluctuate, introducing deletion or insertion of one or more repeat units, probably because of the so-called slipped strand mispairing, which predisposes to pathogenic deletions and frameshifting insertions.
Among SSRs, microsatellites are the most abundant and have a more uniform distribution than minisatellites, with the greatest single contribution originating from dinucleotide repeats, the ones most widely used in genetic analyses, which are now central to medicine, agriculture, evolutionary biology, and forensic science (Waterston et al., 2002).
Trinucleotide microsatellites (TSSR) currently attract the attention of researchers because of their high polymorphism, which is very useful in genetic studies. They produce a very large number of alleles because of their high variability. This generates a very high degree of heterozygosity or polymorphic information content at each locus, making them some of the most informative genomic markers for genetic analyses (Bickeboller & Clerget-Drapoux, 1995). This variability influences their expandability as well as contractibility. Although both expansions and contractions can occur, a bias towards expansion is mostly observed. It is characteristic that repeats below a certain length are stable in mitosis and meiosis, while above a certain threshold length the repeats become extremely unstable (Strachan & Read, 1999). The amplitude between expandability and contraction may differ significantly between particular sequence motifs (Jurka & Pethiyagoda, 1995).
Most SSRs (including TSSR) are currently known to be useful in linkage studies of mouse and human because of their polymorphism in populations. Such SSRs, arising through replication errors, might be expected to be largely equivalent between mouse and human, but striking differences between these two species are observed (Beckman & Weber, 1992).
On the other hand, the ability of TSSR to change size is also responsible for some genetic diseases, the origin of which is mainly the elongation of trinucleotide repetitive sequences. This abnormality in TSSR distribution is registered frequently in neurodegenerative diseases (see Table 1), the background of which draws the attention of many researchers. The trinucleotide repeat disorders are a growing list of genetic neurodegenerative diseases characterized by the expansion of normally polymorphic repeated triplet nucleotides. They are dangerous because some kinds of tandems (located in genes) may be responsible for modulation of protein-protein interactions. The length of tandem repeats is essential to the interaction between proteins. For example, proteins containing polyglutamine tracts cause neurological disorders because different lengths of polyglutamine repeats are associated with different affinities to transcription factors (Ashley & Warren, 1995; Margolis et al., 1997). The neurodegenerative disorders caused by expansion of tandem CAG trinucleotides result from a toxic gain of function of the mutant expanded proteins. The occurrence of neuronal intranuclear inclusions (NIIs) is characteristic. Protein misfolding, interference with DNA transcription and RNA processing, activation of apoptosis and dysfunction of cytoplasmic elements have all been invoked in the toxic process (Everett & Wood, 2004). Of particular interest is the androgen receptor gene, mutation of which causes spinal and bulbar muscular atrophy. The disease is caused by expansion of the trinucleotide (CAG) repeat that codes for a polyglutamine tract in the transactivation domain of the receptor. Infertility is also associated with CAG expansion in the androgen receptor: infertile men are more likely than fertile men to have longer than normal CAG repeats in the androgen-receptor gene (Dowsing et al., 1999).
Lower numbers of CAG repeats in the androgen-receptor gene have been associated with higher incidence of prostate cancer (Gsur et al., 2002;Strom et al., 2004).
Two models have been proposed to account for variability in the number of repeat units. The first one is the initial co-mobilization of SSRs with dispersed repeats as a result of transposition. The length and composition of repeat units would be the result of unequal strand exchange followed by nucleotide divergence. In the second model the majority of length variants would result from mistakes (they are thought to arise by slippage) during DNA replication or during sister chromatid exchange (Toth et al., 1987;Levinson & Gutman, 1987;Schlotterer & Tautz, 1992;Kruglyak et al., 1998). Some data support a model in which expansion in the germ cells arises by gap repair and depends on a complex containing Msh2. Expansion occurs during gap-filling synthesis when DNA loops comprising the CAG trinucleotide repeats are incorporated into the DNA strand (Pearson et al., 1997;Kovtun & McMurray, 2001).
TSSR instability may result in the creation of non-standard structures of DNA, particularly in the non-coding regions (Sinden et al., 2002), which disturbs the natural functioning of genetic processes, or it can result in perturbation of gene expression, causing synthesis of defective proteins (when the TSSR occurs in the gene or in the close vicinity of the gene).
Despite the large number of publications on this subject, the mechanism of trinucleotide microsatellite expansion is not completely identified. That is why analysis of repetitive sequence instability is of such interest.
The characteristics of tandemly repeated homogeneous trinucleotides in the genomes of selected organisms are presented in this paper. The relation between the sequence motifs of TSSR and their lengths was the main object of this research. This relation was compared in different organisms in the context of disease-related TSSR expandability (some of the repeats recognized as disease-related are listed in Table 1). The distribution of TSSR in the organisms under consideration and a comparative analysis of the quantitative estimation of the association between the sequence motif and its repetitiveness are presented. The distribution was approximated by a function, the parameters of which allowed a qualitative comparative analysis. A scale quantitatively measuring the association between two parameters (sequence motif and repetitiveness) was introduced based on the ρ-coefficient. The method of ρ-coefficient calculation allows ranking according to the strength of the mutual dependence. If a dependence is found, the results can show which forms of the pair of variables are mostly responsible for this association and which ones play a negligible role (Goodman & Kruskal, 1959; 1963; 1972). Among the few methods for calculating dependency in contingency tables (Goodman & Kruskal, 1954; 1959; 1963; 1972; Björnstad, 1979), the ρ-coefficient analysis not only gives information about the presence of the association but also allows the form of the particular pairs of variables to be validated. This method allows dependency to be assessed between both grouped and subdivided variables.

MATERIALS AND METHODS
Genomic DNA and cDNA (mRNA) data were used for the computational analysis of TSSR. The sequences of the following model organisms were studied for the presence of poly-trinucleotides: Homo sapiens, Rattus norvegicus and Mus musculus. The data (from 14.04.2003) were obtained from the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/). The sequence of human mRNA was also taken from NCBI.

Computational analysis
TSSR distribution comparison. TSSR were detected by tracing the (XXX)n sequence repeats in genomes, where n = 4 to 14 and XXX is the sequence motif under consideration. The range n = 4 to 14 used in the TSSR frequency calculation was selected based on the data of Margolis et al. (1997). PERL scripts were used for counting.
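The counting step can be illustrated with a minimal Python sketch. This is not the authors' PERL code; it is an illustration under stated assumptions. The function name `count_tssr` is hypothetical, each of the 64 motifs is scanned independently (so one tract may be reported under several reading frames, such as CAG, AGC and GCA), only maximal uninterrupted runs are counted, and runs longer than n = 14 are ignored.

```python
import re
from collections import Counter
from itertools import product

# All 64 possible trinucleotide motifs (AAA, AAC, ..., TTT).
MOTIFS = ["".join(p) for p in product("ACGT", repeat=3)]

def count_tssr(sequence, n_min=4, n_max=14):
    """Count maximal tandem runs (XXX)n per motif and repeat number n.

    Returns a Counter keyed by (motif, n). Assumption: runs longer than
    n_max are skipped rather than truncated.
    """
    counts = Counter()
    seq = sequence.upper()
    for motif in MOTIFS:
        # Greedy match of n_min or more consecutive copies of the motif.
        pattern = re.compile("(?:%s){%d,}" % (motif, n_min))
        for match in pattern.finditer(seq):
            n = len(match.group(0)) // 3
            if n <= n_max:
                counts[(motif, n)] += 1
    return counts

# Toy sequence (hypothetical data): one (CAG)5 tract and one (AAA)4 tract.
example = "TT" + "CAG" * 5 + "TT" + "AAA" * 4 + "C"
tssr_counts = count_tssr(example)
```

In the toy sequence the (CAG)5 tract is also seen as (AGC)4 and (GCA)4, which reflects the choice to treat the 64 motifs separately, as the text does when it lists motifs such as (TAT)n and (ATA)n side by side.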
The function (1) (found with MATLAB to give the best fit) was applied to approximate the mammalian (hs, mm, rn) TSSR distribution as a function of length. The three parameters (k1, k2, k3) found for each distribution were compared, allowing comparison of the genomes under consideration.
The exponential function parameters (k1, k2, k3) turned out to be simple and easy to interpret, showing how the frequencies of trinucleotide sequences differ between genomes.
The meanings of the parameters are as follows:
k1: directional coefficient.
k2: an increase in the value of k2 causes a slight decrease of the function values with increasing x-value (size of tandem). In our case, it indicates a slight decrease in the frequency of repeated sequences from short to long tandems, or a sudden increase of long tandems.
k3: a high value of k3 gives a high function value for a low x-value. With the increase of the x-value (size of tandem) the value of the function decreases drastically. The higher the value of k3, the shorter the length of TSSR.
The method allows the assessment of differences between TSSR within one genome and also permits the inter-genome comparative analysis of TSSR.
Dependence in contingency table measurements. The ρ-coefficient was applied to measure the association between two qualitative variables (size of tandem A = {A_i}, columns; sequence motif B = {B_j}, rows). The ρ-coefficient was defined for a 2 × 2 table (Eqn. 2) and was used to evaluate the mutual dependence between sequences consisting of a particular sequence motif and their tandem size (repetitiveness).
Briefly, the method is as follows. Assume that the contingency table (Table 2) represents the observed (empirical) probabilities for c different realizations of variable A (qualitative) and for r different realizations of variable B (qualitative). For the problem presented in this paper, A represents the repetitiveness (size of tandem) and B the sequence motif.
To estimate whether p_ij expresses a relatively high or low probability (high or low coupling of the i-th realization of A with the j-th realization of B), its value is compared with all possibilities for solutions of other A (excluding the i-th) and other B (excluding the j-th). Each pair of i-th and j-th realizations of A and B can be represented using a 2 × 2 contingency table (see Table 3).
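The reduction of a single cell to a 2 × 2 table can be sketched as follows. The defining formula of the ρ-coefficient is not reproduced in this text, so the sketch substitutes the standard phi coefficient for a 2 × 2 table as a stand-in; it should not be read as the authors' ρ. The function names and the toy counts are hypothetical.

```python
import math

def collapse_to_2x2(table, i, j):
    """Collapse an r x c table (rows B_j, columns A_i) around cell (i, j).

    Returns probabilities laid out as
        [[P(A_i, B_j),     P(not A_i, B_j)],
         [P(A_i, not B_j), P(not A_i, not B_j)]].
    """
    total = float(sum(sum(row) for row in table))
    p_ij = table[j][i] / total
    p_col = sum(row[i] for row in table) / total  # marginal P(A = A_i)
    p_row = sum(table[j]) / total                 # marginal P(B = B_j)
    return [[p_ij, p_row - p_ij],
            [p_col - p_ij, 1.0 - p_row - p_col + p_ij]]

def phi_coefficient(t):
    """Phi coefficient of a 2 x 2 table (stand-in association measure)."""
    (a, b), (c, d) = t
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom > 0 else 0.0

# Toy counts (hypothetical): two motifs (rows) by two tandem sizes (columns).
associated = [[30, 10], [10, 50]]
independent = [[10, 20], [30, 60]]
phi_assoc = phi_coefficient(collapse_to_2x2(associated, 0, 0))
phi_indep = phi_coefficient(collapse_to_2x2(independent, 0, 0))
```

Whatever measure is used in place of the stand-in, the collapse step is the same: the cell of interest is compared with the rest of its row, the rest of its column, and everything else.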
The value of the ρ-coefficient can be calculated for each i-th and j-th realization of A and B. The ranking order permits comparisons between particular pairs over the whole contingency table, allowing selection of those that play an important role in the general dependence of A and B (sequence-to-length). High ρ values distinguish pairs whose participation in the general dependence is high. Others with lower ρ values are not necessarily responsible for the dependence (relation) under consideration.
The method based on the ρ-coefficient enables the assessment of the strength of the pair-wise mutual dependence. In other words, we can find the preferences for particular sequences to occur with a particular length. Thus the ρ-coefficient based method allows for distinguishing the sequences with tendency to low and high dispersion all over the genome. The value of the ρ-coefficient can be interpreted as a measure of the strength of association between a particular length and a particular kind of motif.

RESULTS
The analysis was focused on the search for similarities and differences of TSSR between the species studied, taking the length (size of tandems) and sequence motifs into account.

Comparative analysis of TSSR frequency depending on their repetitiveness
Polynucleotide repeats are overrepresented in the genomes of most eukaryotes. The amount of TSSR in sc, at, oi and at is significantly lower than in mammals such as hs, mm and rn. Among the bacterial genomes present in our analysis (not published), the oi genome contains the highest number of TSSR. There are repetitive trinucleotide sequences only for n ≤ 5. The general tendency (on the basis of the selected organisms) found for TSSR frequencies is that the higher the value of n, the lower the number of tandemly repeated trinucleotide fragments. The most frequent are the tandems of n = 4 (especially in the hs, mm and rn genomes; see Fig. 1A). The decrease of frequency as a function of tandem size has a hyperbolic shape, although the functions differ.
Long tandems (n above 10) are more frequent in mm and rn than in hs (see Fig. 1B).
The number of TSSR in the mm genome was found to be higher than in the rn genome.

Similarity of the TSSR length distributions
The frequency distribution was approximated by the function presented in Methods. Only hs, mm and rn were included in this analysis. Three parameters were obtained for each approximated function calculated for the distribution of tandems (n = 4 to 14) (see Fig. 2). Mean standard errors of approximation in particular species are shown in Table 4. Parameter k2 in hs is rather stable. Its value increases slightly with increasing content of C, G and T (see Fig. 2A). An exception was (TCG)n, for which parameter k2 is equal to 139.936, due to the relatively low frequency of this TSSR, the presence of longer repeats with higher frequency and the absence of some short tandems.
Approximation of parameter k3 allowed estimation of the "preference" of particular TSSR to be present in tandems of low size. A high value of k3 suggests a tendency to lower n values (see Fig. 2B). Interpretation of the k3 value for all trinucleotides in hs shows that tandems (AAA)n and (TTT)n appear mostly for low n (n = 4). An increase of n for those trinucleotides causes a decrease of their frequency. The lowest k3 was obtained for (TCG)n. The lowest and the highest values of k3 are shown in Table 5.

Comparative analysis of TSSR sequence motifs
The analysis reveals that the TSSR of highest frequency are the trinucleotides (AAA) and (TTT), especially in the human genome (Fig. 3A). These sequences are also abundant in the mm and rn genomes. Tandems (AAA) in the human genome account for a share of 0.343366 of all TSSR under consideration (n = 4 to 14). Their shares in the other analyzed organisms are mm = 0.19435 and rn = 0.211723. TSSR (AAA)n and (TTT)n appeared with significantly high frequency in the at and sc genomes.
The characteristics of (AAA)n and (TTT)n seem similar for all organisms, but other sequence motifs rich in "A" and "T" appeared to differ significantly between organisms. It is surprising that TSSR very abundant in the mm and rn genomes are not frequent in the hs genome (see Fig. 3B). These motifs are (TAT)n, (ATA)n, (TTA)n, (TAA)n, (ATT)n, (AAT)n (combinations of nucleotides "A" and "T"), (GGC)n, (GCC)n, (GCG)n, (CGC)n, and also (CGG)n and (CCG)n. On the other hand, (TTG)n, (CAA)n,

Table 4. Mean standard error (Mse) of approximation. The distribution of (TAC)n in rn was the worst approximation, with 46% error.

Species    hs     mm     rn     rn (TAC)n
Mse (%)    3.6    3      3.4    46

The abundance of TSSR (CCC)n and (GGG)n in the mm and rn genomes is also remarkable, while in hs these sequences occur with low frequency.
Disease-related trinucleotides are present in the analyzed genomes with relatively low frequency versus the frequency of other TSSR (see Table 6).
The proportions of disease-related tandems in coding sequence mRNA differ. Thus, tandems (CAG)n (whose expansion is responsible for most neurodegenerative diseases) are only 0.25% of the total amount of TSSR in the genome; in mRNA they account for 2.05% of TSSR (n = 4 to 14) (see Fig. 4). Their share in mRNA can be as high as 5% when (AAA)n and (TTT)n are excluded (see Table 7).
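The effect of excluding the dominant (AAA)n and (TTT)n tandems on the share of a given motif is simple arithmetic, sketched below in Python. The counts are made-up illustration values, not data from this study, and the function name `motif_shares` is hypothetical.

```python
def motif_shares(counts, exclude=()):
    """Return each motif's share of the total count, ignoring excluded motifs."""
    kept = {motif: c for motif, c in counts.items() if motif not in exclude}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {motif: c / total for motif, c in kept.items()}

# Hypothetical per-motif totals summed over n = 4 to 14 (not real data).
toy_counts = {"AAA": 500, "TTT": 400, "CAG": 20, "CAA": 80}

share_all = motif_shares(toy_counts)["CAG"]                           # 20 / 1000
share_reduced = motif_shares(toy_counts, exclude=("AAA", "TTT"))["CAG"]  # 20 / 100
```

Because (AAA)n and (TTT)n dominate the total, removing them from the denominator raises the relative share of every remaining motif, which is the kind of shift reported for (CAG)n in mRNA.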
The distribution of the remaining disease-related trinucleotides in mRNA is also shown in Fig. 4. It reveals that TSSR like (CAA)n are scarce in mRNA, while in the whole genome they have a higher share.

Association between type of motif (sequence motif) and its length (size of tandem)
The association between the size of tandems and the trinucleotide sequence was examined using the ρ-coefficient. The contingency table was created with columns expressing the length of the tandem (A_i) (4 to 14) and rows expressing the sequence of the trinucleotide (B_j) (64 combinations of four different nucleotides in the trinucleotide sequence).
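The layout of this table (64 motif rows by 11 length columns) can be sketched in Python as follows. The input format, a mapping from (motif, n) to a count, and the function name `build_contingency_table` are assumptions made for illustration.

```python
from itertools import product

LENGTHS = list(range(4, 15))  # columns A_i: tandem sizes n = 4 to 14
MOTIFS = ["".join(p) for p in product("ACGT", repeat=3)]  # rows B_j: 64 motifs

def build_contingency_table(counts):
    """Arrange {(motif, n): count} into a 64 x 11 table; missing cells are 0."""
    return [[counts.get((motif, n), 0) for n in LENGTHS] for motif in MOTIFS]

# Hypothetical counts, for illustration only.
toy = {("CAG", 4): 3, ("AAA", 14): 1}
table = build_contingency_table(toy)
```

Each cell of such a table can then be scored separately for association, and the scores ranked across the whole table.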
The ρ-coefficient was calculated for hs, mm and rn. ρ-Coefficient calculation reveals that associations can be found for particular sequences and their lengths. Moreover, the procedure of ρ-coefficient calculation for each cell of the contingency table can validate a particular dependence quantitatively.
The contingency table with ρ-coefficients calculated for each cell is presented in Fig. 5 using a color scale (legend included). The highest ρ-coefficient values are explained precisely.
Trinucleotides TCT and TCG appeared to be highly associated with a repeat number of n = 4 in the hs genome (see Fig. 5a). It may be inferred that these trinucleotides have a low tendency to polymorphism. The interpretation is as follows: if a TSSR of the sequence (TCT)n or (TCG)n is found in the human genome, one can predict that its size is n = 4. There is a low probability that longer fragments will be found, and one may be sure that it does not have a length of n = 10. Although the calculated ρ-coefficient value of 0.075 is very low, it should be interpreted relative to the contingency table under consideration. This value is the highest of all those obtained for this contingency table, and reveals the association between the presented trinucleotides and the tendency to occur with a repeat number of 4.
Trinucleotides CTT and GGA (see Fig. 5b, c) of mm and rn appeared to represent the highest association with the length of tandems n = 4.
The disease-related trinucleotides do not reveal a high association with a particular length of TSSR, which can be interpreted as a highly polymorphic character of these sequences. No association was found in mm and rn between disease-related trinucleotides and their length, suggesting the absence of evolution-dependent processes.
The high association found in the sequence-to-length analysis of hs mRNA for (TCG)n and n = 4 (see Fig. 6) supports the findings from the complete hs genome. The low polymorphic character of this sequence thus seems more reliable.
Trinucleotides of high expandability, such as (CAG) in human mRNA, do not exhibit any association with a particular length.

DISCUSSION
The characteristics of TSSR allow localization of polymorphic sequences. It is also important to distinguish dominant and marginal fractions in the genome. Estimation of the association between the sequence of the repetitive unit and its length may allow analysis in the context of evolution, particularly when the genomes of different species are compared.
Numbers and proportions of TSSR are different in every species. The comparison of genomes of non-mammalian organisms (e.g. bacteria, fungi, plants, etc.) with genomes of mammals (e.g. human, rat, mouse, etc.) reveals a significant increase of tandem sequence accumulation. The higher the genome organization, the higher the number of TSSR. On the other hand, for comparably developed organisms (for example mammals) both the number and the kind of repetitions are also differentiated, which indicates that those sequences can be involved in many biochemical processes that are typical for the organisms in whose genomes they occur.
Some TSSR are important for proper protein interactions in transcriptional complexes (Ashley & Warren, 1995; Margolis et al., 1997). One may assume that some TSSR, when expressed, can also influence many special biochemical interactions at the protein level. In this way they may be responsible for proper interactions between proteins occurring in biochemical processes in cells. On the other hand, tandem trinucleotide repeated sequences occurring in noncoding fragments can probably play an important role in the organization of genetic material.
It can be concluded that highly organized genomes have developed a system of repeated sequences for a better organization of genetic material and a better precision in a process of expressing information stored in genomes.
Particularly important is comparative analysis of neurodegenerative disease-related microsatellites and other repetitive sequences without a disease-related phenotype.

Figure caption: columns, size of tandems (4 to 14); rows, trinucleotides (color legend included). High ρ-coefficient cells are distinguished and described.