Prediction of the structure of the common perimitochondrial

Many nuclear genes encoding mitochondrial proteins require specific localization of their mRNAs to the vicinity of mitochondria for proper expression. Studies in Saccharomyces cerevisiae have shown that the cis-acting signal responsible for subcellular localization of mRNAs is localized in the 3' UTR of the transcript. In this paper we present an in silico approach for prediction of a common perimitochondrial localization signal of nuclear transcripts encoding mitochondrial proteins. We computed a consensus structure for this signal by comparison of 3' UTR models for about 3000 yeast transcripts with known localization. Our studies show a short stem-loop structure which appears in most mRNAs localized to the vicinity of mitochondria. The degree of similarity of a given 3' UTR to our consensus structure strongly correlates with experimentally determined perimitochondrial localization of the mRNA; we therefore believe that the structure we predicted acts as a subcellular localization signal. Since our algorithm operates on structures, it seems to be more reliable than sequence-based algorithms. The good predictive value of our model is supported by statistical analysis.


INTRODUCTION
Subcellular localization of a given mRNA may play a crucial role in correct functioning of the respective protein in the cell. Some proteins require mRNA localization for their expression in specific subcellular compartments, for example those involved in embryo development, neural activity and in mitochondrial biogenesis. Many of the about 1000 mitochondrial proteins encoded by the nuclear genome are synthesized in a process that strictly requires localization of their transcripts in the vicinity of subcellular structures, such as mitochondria (Jansen et al., 2001).
Studies in Saccharomyces cerevisiae have proved that the cis-acting signal responsible for mRNA localization to the vicinity of mitochondria is localized in the 5' UTR and/or 3' UTR of the ATM1 gene. Perturbation of this signal can lead to incorrect localization, which results in respiratory dysfunction, despite the fact that the ORF has not been changed (Corral-Debrinski et al., 2000). In addition, results of a genome-wide microarray assay experiment have shown that a signal in the 3' UTR is responsible for mitochondrial localization of dozens of nuclear mRNAs encoding mitochondrial proteins. It has also been reported that most proteins translated from mRNAs with perimitochondrial localization are of a prokaryotic origin (Marc et al., 2002). Following subsequent studies, a 3' UTR stem-loop structure responsible for sorting the ATP2 mRNA to the vicinity of mitochondria has been predicted (Margeot et al., 2002). The role of the 3' UTR signal responsible for mRNA distribution between free and mitochondria-bound polysomes has also been shown to be conserved in the eukaryotic world (Sylvestre et al., 2003).
Those experiments suggest that a 3' UTR signal, probably a stem-loop structure, could be a universal tag marking mRNAs for transport to the vicinity of mitochondria. In this paper we present an approach to predict in silico a structure that is common to most transcripts with a high mitochondrial localization ratio (MLR), by analyzing data from the experiment performed by Marc et al. (2002).

2007 R. K. Ejsmont and others

MATERIALS AND METHODS
Sequences and databases.
All yeast genome sequences were downloaded from the Saccharomyces cerevisiae Genome Database (SGD) (Cherry et al., 1997). The database containing fungal 3' UTR sequences (UTRdb) was downloaded from Internet Resources for UTR analysis (Pesole et al., 2002). Localization data for nuclear transcripts encoding mitochondrial proteins were downloaded from the LGM Mitochondria Microarray Project Web site (Marc et al., 2002).
We downloaded sequences of 3050 genes out of the 3106 analyzed by Marc et al. (2002). Sequences for 56 genes were unavailable and were therefore not analyzed.
RNA sequence and structure prediction.
The 3' UTR sequences were predicted as described in the Results section and in Fig. 1. We used ClustalW (Higgins et al., 1994) to make a global alignment and WUstl-BLAST (Gish et al., 1996) for local alignment and sequence screening. Structure prediction was performed with mFold (Zuker et al., 2003). Structural alignments were done using RNAdistance from the ViennaRNA package (Hofacker, 2003).
Data analysis.
We tested the quality of both the sequences and the structures we predicted. Since the -log(E-value) indicates the likelihood that the predicted sequence is aligned correctly and Gibbs' free energy (ΔG) reflects the stability of the predicted structures, these values were used to measure the quality. We also compared the length and the GC-content of the 3' UTR sequences in our database.
All of the predicted structures had some common elements. We applied statistical methods to decide whether the differences between groups of structures with different MLR are large enough to consider our template a common element of mRNAs that are transported to the vicinity of mitochondria. Thus we computed the average ScoreD for groups of structures with similar MLR values and evaluated the correlation between ScoreD and MLR by linear regression. Moreover, to make sure that a statistically significant difference exists between groups with different MLRs, we applied a test comparing ScoreD for subsets with a high (> 90) and low (< 10) MLR value.
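The grouping and regression step described above can be sketched as follows. This is a minimal illustration, not the software used in the study: the function names, the bin width of 10 MLR units (groups 1-10, 11-20, ..., 91-100) and the use of ordinary least squares are assumptions.

```python
import math
from collections import defaultdict


def mean_score_by_mlr_bin(records, bin_width=10):
    """Average ScoreD within MLR groups (assumed bins: 1-10, 11-20, ...).

    records: iterable of (mlr, score_d) pairs, one per transcript.
    Returns {bin_index: mean ScoreD}, bin 0 covering MLR 0-10.
    """
    sums = defaultdict(lambda: [0.0, 0])
    for mlr, score_d in records:
        b = max(0, math.ceil(mlr / bin_width) - 1)
        sums[b][0] += score_d
        sums[b][1] += 1
    return {b: total / n for b, (total, n) in sorted(sums.items())}


def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept, r_squared)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy ** 2 / (sxx * syy)
```

The group means returned by `mean_score_by_mlr_bin` would then be passed to `linear_fit`, with the bin index as x and the mean ScoreD as y.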

RESULTS
Identification of the 3' UTR sequences
Since yeast EST databases (see Sequences and databases) do not contain the majority of transcripts, we had to predict most of the 3' UTR sequences from genomic sequences. For each analyzed gene, a sequence of 2000 nucleotides downstream from the STOP codon of the respective ORF was prepared. We called it the 2 kb tail. It was used as a BLAST query against the UTRdb (Pesole et al., 2002). We extracted from the UTRdb the sequences with the highest homology to the query and globally realigned them to the query sequence using ClustalW. The existence of poly-A tails in the query sequences could disturb the alignment; therefore, they were truncated. Alignment files were parsed by our software, and 3' UTR sequences, beginning at the first nucleotide of the 2 kb tail sequence and ending at the last matched nucleotide of the alignment, were extracted. Both the 2 kb tail and the 3' UTR sequences for each analyzed gene were placed together with the genomic and coding sequences in an SQL database. The whole sequence extraction procedure is illustrated in Fig. 1.
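The extraction steps can be illustrated with a short sketch. It is not the parser used in the study: the function names, the minimum poly-A run length and the interpretation of "last matched nucleotide" as the last alignment column in which both sequences carry a residue are assumptions, and the sequence is assumed to be given in transcript orientation.

```python
def downstream_tail(oriented_seq, stop_codon_end, tail_len=2000):
    """Return up to tail_len nucleotides following the STOP codon.

    stop_codon_end is the 0-based index of the first nucleotide after
    the STOP codon in a sequence oriented like the transcript.
    """
    return oriented_seq[stop_codon_end:stop_codon_end + tail_len]


def trim_poly_a(seq, min_run=10):
    """Remove a trailing run of A's if it is at least min_run long.

    min_run is a hypothetical threshold, not a value from the paper.
    """
    stripped = seq.rstrip("Aa")
    return stripped if len(seq) - len(stripped) >= min_run else seq


def utr_from_alignment(aligned_tail, aligned_hit, gap="-"):
    """Cut the tail at the last column where both sequences have a residue."""
    last = -1
    for i, (t, h) in enumerate(zip(aligned_tail, aligned_hit)):
        if t != gap and h != gap:
            last = i
    # Keep the tail from its first nucleotide up to that column, gaps removed.
    return aligned_tail[:last + 1].replace(gap, "")
```

Here `aligned_tail` and `aligned_hit` stand for the two gapped rows of a pairwise global alignment between the 2 kb tail and the best UTRdb hit.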
From the 3050 sequences downloaded from SGD, we were able to predict 3' UTR sequences for 2953 genes. The shortest predicted sequence was 45 bp long, whereas the longest was 2 kb. The average sequence length was 1103 bp. The distribution of sequence lengths across the database was similar for subsets with different MLRs and very close to the average for the whole database. We observed a similar pattern with respect to the GC-content of the predicted sequences, which was approx. 34.25%. The average -log(E-value), representing the probability of a successful alignment, was 33.54 for our database, indicating that the alignments were of high quality.

Perimitochondrial localization signal of nuclear transcripts in yeast
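The two sequence-quality measures mentioned above are simple to compute. The sketch below is illustrative only; the base-10 logarithm and the floor applied to E-values reported as zero are assumptions, since the text does not specify them.

```python
import math


def gc_content(seq):
    """Percentage of G and C nucleotides in a sequence."""
    s = seq.upper()
    if not s:
        return 0.0
    return 100.0 * (s.count("G") + s.count("C")) / len(s)


def neg_log_evalue(e_value, floor=1e-300):
    """-log10 of a BLAST E-value; the floor avoids log(0) (assumption)."""
    return -math.log10(max(e_value, floor))
```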

Identification of the template for transport signal prediction
In order to predict a common signal for perimitochondrial localization of mRNA we decided to find one model 3' UTR structure (a template) based on the criteria listed below, and to look for its occurrence in other 3' UTR structures. If the structure we have chosen was a real mitochondrial localization signal, there should be a correlation between the experimental MLR values from Marc et al. (2002) and the similarity of a given mRNA structure to our template.
We used the following criteria in our search for the template:
- The structure should be small, less than 200 nucleotides long, since such small structures were reported previously as controlling mRNA localization (Margeot et al., 2002);
- The MLR value for the mRNA containing this structure should be high, not less than 95;
- The protein encoded by the mRNA containing the mitochondrial localization signal should have prokaryotic homologs (Marc et al., 2002);
- It would be of advantage if the gene containing our template 3' UTR had a well-proven mitochondrial function or were phylogenetically connected with the mitochondrial genome.
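The three mandatory criteria amount to a simple filter over candidate 3' UTRs. The sketch below assumes a hypothetical `Candidate` record; only the thresholds (fewer than 200 nucleotides, MLR not less than 95, presence of prokaryotic homologs) come from the text, and the last, optional criterion is left to manual inspection.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    gene: str
    utr_length: int            # length of the predicted 3' UTR, nt
    mlr: float                 # mitochondrial localization ratio
    has_prokaryotic_homolog: bool


def template_candidates(candidates, max_len=200, min_mlr=95):
    """Keep candidates satisfying the mandatory template criteria."""
    kept = [
        c for c in candidates
        if c.utr_length < max_len
        and c.mlr >= min_mlr
        and c.has_prokaryotic_homolog
    ]
    # Highest MLR first, so the strongest candidates are inspected first.
    return sorted(kept, key=lambda c: c.mlr, reverse=True)
```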
The sequence that satisfied the majority of these conditions was the 3' UTR of the YJL225C gene, with an MLR of 99. This gene is localized on yeast chromosome X and encodes a protein with a helicase activity (Yamada et al., 1998). It has a CDS of 5277 bp and a short intron of 388 bp. The 3' UTR of YJL225C is very short and consists of only 72 bp.
The sequence of YJL225C is almost identical to the 3SCE000226 sequence from UTRdb and presents partial homology with RecG helicase from Chlorobium tepidum and BH1607 helicase from Bacillus halodurans. It is very likely that YJL225C is of mitochondrial origin, since the sequence downstream from the CDS contains a fragment identical to a part of the bI4-intron of the cytochrome b gene. No experimental data on the localization of the gene product have been reported in the literature. On the other hand, the MitoProtII prediction showed a rather low probability (13.70%) that this protein is localized in mitochondria.
We performed structure prediction for the 3' UTR sequence of our template using mFold and obtained a 72 bp stem-loop structure with five stems and five loops (Fig. 2). The ΔG of the predicted structure was -8.98 kcal/mol and the average pairing energy was 125 cal/mol.

RNA structure prediction
The next step was to predict the structure of the template, as well as the structures of the other 3' UTR sequences in our database, using mFold (Zuker, 2003). The predicted 2D structures were then converted to bracket notation and placed in our database. All of the predicted structures were aligned with the template using RNAdistance from the ViennaRNA software package (Hofacker et al., 2003) and assigned scores describing the distance between the analyzed structure and the template. The following scoring was used:
- ScoreA: the number of positions in the analyzed structure differing from those in the template, divided by the analyzed structure's length;
- ScoreB: the number of positions in the template absent in the analyzed structure, divided by 0.05% of the template's length (factor based on normalization to the average value for the whole dataset);
- ScoreC: the number of inner positions (i.e., those flanked by already aligned fragments) in the analyzed structure absent in the template, divided by the analyzed structure's length;
- ScoreD: the sum of ScoreA, ScoreB and ScoreC. The negative of this value (-ScoreD) represents the similarity of the analyzed structure to that of the template.
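One possible reading of the scoring scheme is sketched below. It is an interpretation under stated assumptions, not the original implementation: the input is assumed to be a pairwise alignment given as two equal-length gapped strings in bracket notation with "-" as the gap symbol (the actual RNAdistance output format is not reproduced here), "inner" positions are taken to be insertions lying between the first and last aligned columns, and the 0.05% factor is exposed as the hypothetical parameter `b_fraction`.

```python
from typing import NamedTuple


class Scores(NamedTuple):
    a: float
    b: float
    c: float
    d: float


def structure_scores(aligned_template, aligned_structure,
                     gap="-", b_fraction=0.0005):
    """Compute ScoreA-ScoreD from a gapped pairwise structure alignment."""
    if len(aligned_template) != len(aligned_structure):
        raise ValueError("aligned strings must have equal length")
    cols = list(zip(aligned_template, aligned_structure))
    template_len = sum(t != gap for t, _ in cols)
    structure_len = sum(s != gap for _, s in cols)
    # Columns in which both the template and the structure have a symbol.
    paired = [i for i, (t, s) in enumerate(cols) if t != gap and s != gap]
    if not paired:
        raise ValueError("alignment has no aligned positions")
    first, last = paired[0], paired[-1]

    differing = sum(1 for i in paired if cols[i][0] != cols[i][1])
    missing = sum(1 for t, s in cols if t != gap and s == gap)
    inner_extra = sum(
        1 for i, (t, s) in enumerate(cols)
        if t == gap and s != gap and first < i < last
    )

    a = differing / structure_len
    b = missing / (b_fraction * template_len)   # 0.05% of template length
    c = inner_extra / structure_len
    return Scores(a, b, c, a + b + c)
```

Under this reading, a structure identical to the template scores 0 in all components, and every template position missing from the analyzed structure is penalized far more heavily than a substitution or an inner insertion.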
The computed data were put into the database for further analysis.
Structure prediction for the 2953 tested yeast 3' UTR sequences in our database produced 78705 structures, approx. 27 structures per 3' UTR sequence. Each structure had ScoreA, ScoreB, ScoreC, ScoreD and ΔG values assigned, based on the results of our computations.
We chose to work on structures with the lowest ΔG, one ΔG value per gene (5961 structures, about two structures per sequence), since the others showed a random ScoreD distribution with respect to MLR (results not shown). The average energy of the selected subset was -228 kcal/mol and the average pairing energy was 192 cal/mol.
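Selecting the lowest-energy structures can be sketched as below. Because about two structures per sequence were retained, the sketch assumes that all structures sharing the minimal ΔG of a gene are kept; the tuple layout and function name are hypothetical.

```python
def lowest_energy_structures(structures):
    """Keep, for each gene, every structure with that gene's minimal ΔG.

    structures: iterable of (gene, dot_bracket, delta_g) tuples.
    Returns {gene: [dot_bracket, ...]}.
    """
    best = {}
    for gene, dot_bracket, delta_g in structures:
        if gene not in best or delta_g < best[gene][0]:
            best[gene] = (delta_g, [dot_bracket])
        elif delta_g == best[gene][0]:
            # Tie on the minimal energy: keep both structures.
            best[gene][1].append(dot_bracket)
    return {gene: folds for gene, (_, folds) in best.items()}
```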
We computed average ScoreA, ScoreB, ScoreC and ScoreD for groups with different MLR values and found a linear correlation between ScoreD and MLR values. A high ScoreD value corresponds to a low similarity to the template and correlates with low MLR values, whereas a low ScoreD characterizes structures with high MLR. The correlation was determined using linear regression, with r² = 0.77 and a regression error of 0.66. See Fig. 3 for details. In total, 2953 yeast 3' UTRs were analyzed. Structures for 2302 of them contained some fragments of the YJL225C structure; we therefore assumed that this structure is as close to the consensus as possible.

Figure 4. [...] (Huh et al., 2003)). Bars show the percentage of genes encoding proteins with known mitochondrial localization in each group.
Table 2. Genes with extremely high (> 50) ScoreD and MLR from 81 to 90. Localization data based on GFP assays (Huh et al., 2003).

Template verification
Since the biological criteria used for the template selection were arbitrary, we had to check whether the template selected on the basis of those criteria was indeed the best template possible. Therefore, we selected 192 possible templates with MLR values greater than 88 and calculated correlation coefficients between the ScoreD values computed with each of those templates and the MLRs for the tested dataset of 2953 yeast 3' UTRs. The results of these computations are plotted in Fig. 5. The strongest negative correlation between ScoreD and MLR exists for YJL225C, the template which we preselected.
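The template comparison can be sketched as follows. The type of correlation coefficient is not stated in the text; Pearson's r is assumed here, and the function names and data layout are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def best_template(score_d_by_template, mlr):
    """Return the template with the most negative ScoreD/MLR correlation.

    score_d_by_template: {template: [ScoreD per transcript]}, each list
    ordered like mlr, the list of MLR values of the same transcripts.
    """
    rs = {name: pearson_r(scores, mlr)
          for name, scores in score_d_by_template.items()}
    return min(rs, key=rs.get), rs
```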

Statistical analysis of computed data
In order to verify if the structure found was good enough to be considered as a consensus for all transcripts that follow the perimitochondrial localization pathway, we had to apply statistical methods that would show whether the determined correlation was statistically significant. The statistical test showed that the suggested dependency was statistically significant.
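The text does not name the statistical test that was applied. As one way such a comparison of ScoreD between the high-MLR (> 90) and low-MLR (< 10) subsets could be carried out, the sketch below uses a two-sided permutation test on the difference of means; the choice of test, the number of permutations and the fixed seed are all assumptions.

```python
import random
from statistics import fmean


def permutation_p_value(high, low, n_perm=2000, seed=0):
    """Two-sided permutation test on the difference of group means.

    high, low: ScoreD values of transcripts with MLR > 90 and MLR < 10.
    Returns an estimated p-value (with the usual +1 correction).
    """
    rng = random.Random(seed)
    observed = abs(fmean(high) - fmean(low))
    pooled = list(high) + list(low)
    k = len(high)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[:k]) - fmean(pooled[k:]))
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)
```

A small p-value would indicate that a difference in mean ScoreD as large as the observed one is unlikely to arise if group labels were assigned at random.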

DISCUSSION
The obtained data suggest that the YJL225C 3' UTR is a good consensus structure for the perimitochondrial localization signal. It is short (probably due to subtelomeric localization of the gene), has a high MLR value (Marc et al., 2002), a high AU content, has bacterial homologs and, due to homology with mitochondrial DNA (see "Identification of the template for transport signal prediction"), is related to the mitochondrial genome. The protein encoded by YJL225C does not contain a mitochondrial import signal; however, there are other examples of proteins that do localize in mitochondria but do not contain an import sequence. Moreover, a mitochondrial function is not strictly required for this protein: it could have been lost during evolution. The strong perimitochondrial localization signal in the YJL225C 3' UTR could also have been acquired by recombination.
The modeled structure for the YJL225C 3' UTR sequence is similar to the previously predicted structure of the ATP2 3' UTR, a transcript proved to localize to the vicinity of mitochondria (Margeot et al., 2002). The low energy of the modeled structure suggests that it is stable in the cellular environment. Fragments of the YJL225C 3' UTR appear in most of the analyzed structures. Furthermore, the structures with high (> 90) MLR values contain nearly the entire YJL225C structure, whereas in those with low (< 10) MLR large fragments of this structure are missing.
Our analysis shows that the 3' UTR length has no influence on perimitochondrial localization. We observed that the 3' UTR regions of many genes overlap with 5' UTR regions or even the CDS of downstream genes, which is normal in the highly compact yeast genome as well as in genomes of other primitive Eukaryota. The great length of many sequences suggested that the assumed threshold of 2 kb could miss some sequences. In fact, we found 115 sequences with a length of 2 kb, but since they were a minor (less than 4%) part of the database, we ignored them.
We observed a high AU content in the predicted 3' UTR sequences, which is typical for untranslated regions of mRNA. Most regulatory regions are described as AU-rich sequences, especially those connected with transcription termination and polyadenylation (Grafi et al., 1993; Graber et al., 1999; Legendre et al., 2003; Caballero et al., 2004). Our analysis has shown that the GC content does not influence the perimitochondrial localization pathway.
Our analysis revealed that neither the average pairing energy nor the total energy affects perimitochondrial localization. This suggests that in all cases only a small fragment of the 3' UTR is responsible for targeting the mRNAs to the vicinity of mitochondria. In addition, it appears that other signals in the long 3' UTR do not interfere with the perimitochondrial localization signal.

CONCLUSIONS
In this paper we predicted in silico a 3' UTR structure responsible for perimitochondrial localization of cytoplasmic yeast mRNAs. We have analyzed almost half of the yeast transcriptome for which the MLR values were determined.
Our method is based on structure analysis that, compared with sequence-based algorithms (such as the one used by Jacobs Anderson and Parker (Jacobs et al., 2000)), should give more reliable results. To give an example, the ATP2 mRNA, known to localize in the vicinity of mitochondria, does not contain the CYTGTAAATA element described by Jacobs et al. (2000), but does contain a structure similar to that of the YJL225C 3' UTR. Of course, there is a group of 3' UTRs that contain both the CYTGTAAATA element and a structure similar to that of the YJL225C 3' UTR (four genes with ScoreD < 1, 34 genes with ScoreD < 1.5).
The model we developed has a potentially high predictive value for perimitochondrial localization of transcripts with unknown MLR, owing to the strong correlation between ScoreD and MLR, estimated by linear regression at r² = 0.77 (Fig. 3).
To summarize, we have shown that mRNAs following the perimitochondrial localization pathway in yeast contain a common structural signal, similar to the one found in the YJL225C 3' UTR. The data acquired from our computation strongly correlate with the empirical results of Marc et al. (2002). Once our results are confirmed in vivo, we will be able to create an algorithm for fast in silico prediction of perimitochondrial localization of any yeast mRNA sequence.

Figure 1. 3' UTR sequence extraction procedure. Genomic sequence (genomic) is aligned using ClustalW with the genomic sequence containing the 2 kb downstream region (genomic-2kb-tail) and the downstream region is extracted (A). The extracted sequence (2kb-tail) is used as a BLAST query against Fungi UTRdb and the highest scoring sequence is extracted (B). The sequence from Fungi UTRdb is aligned using ClustalW with the 2kb-tail sequence (C). The 3' UTR sequence is extracted from the alignment. It begins with the first nucleotide of the 2kb-tail sequence and ends at the last matched nucleotide in the alignment (D).

Figure 5. A comparison of correlation coefficients for different 3' UTRs used as templates for the common perimitochondrial localization signal. YJL225C presents the strongest negative correlation between ScoreD and MLR.

Figure 3. Correlation between MLR value and ScoreD.
The light grey parts of the bars represent ScoreA, the dark grey parts ScoreB and the white ones ScoreC. The total height of a bar represents ScoreD. The number of transcripts in each group is shown in Fig. 4. Two bars are shown for ScoreD of transcripts with MLR from 81 to 90, because a few transcripts in this group had surprisingly high ScoreD values. The list of these transcripts is presented in Table 2. The narrower bar shows ScoreD calculated excluding these values. The line represents the trend calculated by linear regression (excluding the surprisingly high values in the MLR 81-90 group), with a regression error of 0.77 and r² of 0.79.