An Approach for the Identification of Microrna with an Application to Anopheles Gambiae

MicroRNAs (miRNAs) are an abundant class of 20–27 nt long noncoding RNAs, involved in post-transcriptional regulation of genes in eukaryotes. These miRNAs are usually highly conserved between the genomes of related organisms and their pre-miRNA transcript, about 60–120 nt long, forms extended stem-loop structure. Keeping these facts in mind miRsearch is developed which relies on searching the homologues of all known miRNAs of one organism in the genome of a related organism allowing few mismatches depending on the phylogenetic distance between them, followed by assessing for the capability of formation of stem-loop structure. The precursor sequences so obtained were then screened through the RNA folding program MFOLD selecting the cutoff values on the basis of known Drosophila melanogaster pre-miRNAs. With this approach, about 91 probable candidate miRNAs along with pre-miRNAs were identified in Anopheles gambiae using known D. melanogaster miRNAs. Out of these, 41 probable miRNAs have 100% similarity with already known D. melanogaster miRNAs and others were found to be at least 85% similar to the miRNAs of various other organisms.

The unique characteristics of miRNA arefirst, all miRNAs are present in noncoding regions of the genome; second, when genomic sequences surrounding the identified 22 nt RNAs were examined, computer analysis predicted miRNA precursors capable of forming stem-loop structure, a single miR-NA molecule ultimately accumulates from one arm of each miRNA hairpin precursor molecule; third, miRNA sequences are nearly always conserved in related organisms.
To identify novel miRNAs, several approaches have been used involving biochemical approach based on purification of RNAs after size fractionation (Lau et al., 2001) or bioinformatics approach centering on the conservation of intergenic regions of DNA between two clearly related Caenorhabditis species (Lim et al., 2003b) or between Drosophila melanogaster and Drosophila pseudoobscura species (Lai et al., 2003).Both miRscan and MiRseeker extracted conserved intergenic regions between two closely related species.MiRseeker subjects conserved intronic and intergenic sequences to an RNA folding and evaluation procedure to identify evolutionarily constrained hairpin structures with features charac-R.Chatterjee and K. Chaudhuri teristic of known miRNAs (Lai et al., 2003).On the other hand, miRscan evaluates conserved stem-loops as miRNA precursors by passing a 21-nt window along each conserved stem-loop, assigning a loglikelihood score to each window that measures how well its attributes resemble those of the first experimentally verified C. elegans miRNAs with C. briggsae homologs (Lim et al., 2003b).
For the detection of novel miRNAs in specific animals and plants, comparative genomics was used in several reports (Lim et al., 2003a;2003b;Lai et al., 2003;Bonnet et al., 2004;Jones-Rhoades & Bartel, 2004;Ohler et al., 2004;Wang et al., 2004) and for the detection of orthologs and paralogs of known miR-NAs, homology searching was also used (Pasquinelli et al., 2000;Lagos-Quintana et al., 2001;Lau et al., 2001;Lee & Ambros, 2001;Weber, 2005).Using the current sequence alignment tools like blast (Altschul et al., 1990), short sequences (mature miRNAs are of about 22 nt) as a query sequence will produce large number of irrelevant hits.Pre-miRNA sequences are also used for homologue searching.But due to the non-conservation of the other parts compared with miRNA and miRNA* (the fragments on the opposite arm of the hairpin) (Lau et al., 2001), it is expected that, the homology searching based on pre-miRNA sequences may produce many false positives.So, the more sensitive approach will be to consider both sequence and structure conservation.Using this strategy, ERPIN used pro-files to capture both sequence and structure information of animal miRNA precursors (Gautheret & Lambert, 2001;Legendre et al., 2005).Another study have been proposed starting with the BLAST searching with known pre-miR-NA sequences followed by assessment of structure information (Wang et al., 2005).Since in many cases though the miRNA sequences may be conserved, the precursor sequences are much less conserved (Lau et al., 2001), so, searching with pre-miRNAs might lead to under estimation of the actual miRNAs present in an organism.
The present strategy uses sequence alignment of mature miRNAs, the structure conservation and assessing the position of the mature miRNAs in the pre-miRNA.Starting with the known mature miR-NAs from an organism as query, homologues were searched in a related organism allowing few mismatches depending on their phylogenetic distance.The pre-miRNA and their potentiality to form stem loop structure were assessed further and finally the stem-loop secondary structure was confirmed using MFOLD (Zuker, 2003), followed by assessing the preferable position of the miRNA in the pre-miR-NA secondary structure.As an application, about 91 probable miRNA sequences have been identified in Anopheles gambiae genome starting from D. melanogaster miRNAs.
The complete genome sequence of A. gambiae are arranged in 8987 (Accession No. AAAB01000001 to AAAB01008987) scaffolds for downloading at FTP site at NCBI (ftp://ftp.ncbi.nih.gov/genbank/genomes/Anopheles_Gambiae/Assembly_scaffolds/)(Holt et al., 2002).Out of these 93 large scaffolds, covering 82% of the total genome, were selected.Remaining 8894 scaffolds, covering 18% of the genome were not taken into the analysis because of their small size and large number and also to minimize the miRsearch screening time.
Strategy of miRsearch.The computational screening of miRNA was executed through the program written in Perl scripts (Fig. 1), followed by the miRNA characteristics based screening algorithm (Fig. 2), the entire algorithm being named as miRsearch.Using Drosophila miRNA as query sequence, genome of A. gambiae, which belongs to the same order diptera, was searched with a user defined score (S).The score S is based on the number of mismatches and defined as: The number of mismatches in case of organisms belonging to same genus but different species, (for example, C. elegans and C. briggsae pair) was chosen to be zero, but we have relaxed it to allow for 3 mismatches for D. melanogaster and A. gambiae pair as they belong to the same order diptera.
Next, the searching and selection of pre-miR-NAs were done using the following algorithm.The query sequence (q[i]) of size n nucleotide was placed along the column and the input sequence of same size of the query (target t[j]) was passed along the row, so as to form a n × n matrix (M).For i = j = 0 to n-1, q[i] was compared with t[j] for perfect matching and assigned a score of +2 and otherwise -2, the scores were placed along M[i][j] for i = j = 0 to n-1.If the trace of the matrix is greater than the user given score S, the reverse match was searched for from -80 to -1 from the first base of the target sequence, to find whether other arm of the hairpin loop precursor miRNA is in the upstream of the target sequence else from +1 to +80 from the last base of the target sequence to find whether other arm of the hairpin loop pre-miRNA is in the downstream of the target sequence.For reverse matching, the reverse complement of q[i] was searched with a different scoring system.As the pre-miRNAs are known to form a stable hairpin loop structure, so for A-T base pairing a reward of +2 and for G-C pairing a reward +3 was given otherwise the reward was taken as 0. If the reverse matching score is greater or equal to R, then the precursor sequence was reported with the lower and upper co-ordinates.The assigned score R of the reverse matching was determined by training the program to find the all known D. melanogaster miRNAs.
Both the forward and reverse complement sequences of the scaffolds were searched for the analysis of miRNAs.The selected sequences representing probable candidates were then examined through NCBI map viewer for their possible location (http://www.ncbi.nlm.nih.gov/mapview/map_search. cgi?taxid=7165) and those located in the exonic regions were eliminated.
The candidate pre-miRNAs were then filtered through the RNA folding program MFOLD (Zuker, 2003) (http://www.bioinfo.rpi.edu/applications/mfold/old/rna/form3.cgi) selecting the cut-off values on the basis of known D. melanogaster pre-miR-NAs (Fig. 2).The characteristics observed from the MFOLD output of D. melanogaster pre-miRNAs are: (i) the predicted mature miRNAs are within the long helical arm of the secondary structure of pre-miR-NA; (ii) ΔG ≤ -21.0 kcal/mole; (iii) largest helical arm contained at least 23 bp sequence.Considering these observations, the MFOLD output of each A. gambiae pre-miRNA was examined.A structure is accepted as probable miRNA if (a) ΔG ≤ -21.0 kcal/mole, (b) the longest helical arm contains at least (20-29) bp depending on miRNA sequence length and (c) the predicted miRNAs are within the long helical arm.
The search program will be available from the authors on request.

Computational prediction of miRNAs by miRsearch.
Observations have suggested that mature miRNA sequences are phylogenetically conserved and have characteristic stem-loop secondary structure.Based on this miRsearch used a homologous sequence searching strategy to identify the primary sequence which was simultaneously examined for the capability of formation of stem-loop secondary structure and subsequently MFOLD was used for final prediction of miRNA as described in detail in methods.Using D. melanogaster (Dme) miRNA as input, it searches homologous sequences with a maximum of 3 mismatches in the scaffolds of A. gambiae sequences.Homologous sequences, with 3 mismatches, may be present in many places in the genome, all of which may not have the capability of forming stem-loop precursor structure characteristics of pre-miRNAs.To eliminate those sequences, which do not form typical pre-miRNA structures, reverse complement of the homologue of miRNA sequences (reverse match) were searched at a position -80 to +80 from the matched sequence.To assign a proper score value to the reverse matching sequence, the program was trained with all Dme miRNA sequences and we empirically set the minimum score value obtained from Dme sequences as the cut off score for A. gambiae miRNA (Fig. 1).As miRNA genes can be located on either strand, we searched each sequence in both the forward strand as well as in its reverse complement.Using a total of 79 mature miRNA sequences in the miRbase sequence database and 93 scaffolds of A. gambiae, we have detected 489 sequences, which had homologous miRNA sequences and their pre-miRNAs are capable of forming of stem-loop secondary structures.This total set was viewed through Mapviewer to identify their location in the A. gambiae genome.Only 13 of these were found to present in the exonic region of genes and were excluded from the set as miRNAs are not supposed to be present in exonic region.Further evaluation of the quality of stem-loop formation of the remaining 476 pre-miRNA sequences was assessed through the RNA folding program MFOLD and Out of 91 A. gambiae miRNAs, 41 have 100% similarity to the Dme miRNA (Table 1).One miRNA, which was 85.7% similar to Dme miRNA (dme-mir-33), was 100% similar to the hsa-mir-33 miRNA (Homo sapiens).Two of the Dme miR-NAs viz.miR-9a and miR-2a were conserved both in sequence and in their location in chromosome 2L.Other predicted miRNAs were not conserved in their chromosomal location (not shown).The miRNA gene cluster of miR-276a found in chromosome 2L of D. melanogaster was detected in chromosome 3L in A. gambiae.The locations of the predicted miRNA were also recorded (Table 1, 2).Remaining 50 miRNAs are probable newly identified A. gambiae miRNA having potential for the formation of hairpin secondary structure with high degree of MFE (∆G) and more than 85% sim- ilar with the already predicted miRNAs (Table 2).
In some cases, same predicted miRNA was obtained using different Dme miRNAs as the query sequence (Table 1 and 2).
Interestingly, although the miRNA sequences of Drosophila and Anopheles were 100% similar (Table 1), the sequences and structures of corresponding pre-miRNAs from Drosophila and Anopheles were not conserved to that extent (Fig. 3).This holds good for each of the 30 miRNA sequences.This analysis supports our strategy for the miRNA detection, which is based on homology search with the mature miRNA sequences rather than precursor miRNA sequences.
Out of the total 83 A. gambiae miRNA reported so far (Griffiths-Jones et al., 2004), we have detected 31 miRNAs.A set of 11 Anopheles miRNAs were found to be 100% similar to Dme miRNAs, was missed in our study, which is due to exclusion of scaffolds covering 18% of the genome in our study.Moreover, we have identified 10 new miRNAs having 100% similarity with the other species miRNAs.This data suggest that, use of only Dme miRNAs as query data set would have detected at least 42 of the predicted miRNAs, new 10 miRNAs and 50 putative novel miRNAs.However, use of all the known miRNA sequences, proper choice of mismatches for each species and considering the whole genome sequences will enhance the efficiency of miRsearch for the identification of new miRNAs.

DISCUSSION
Informatics approach used so far to identify miRNA involves alignment of genomes of two closely related species to find conserved regions followed by identification of stem-loop precursor transcripts capable of processing and forming about 22 nt mature RNA (Lai et al., 2003;Lim et al., 2003a).In our approach we have eliminated the whole genome alignment step and instead have used the following steps: (i) searching for homologues of known mature miRNA in one organism (Dme) to the genome of another related organism (Aga), after allowing some mismatches, depending on phylogenetic distance between them, (ii) assessing the capability to form stem-loop precursors or structures and finally (iii) observing the preferable position of the mature miRNA in the secondary structure of pre-miRNA.Such an approach is most useful when the complete set of miRNA in one organism is available along

Figure 1 .
Figure 1.Flowchart for the identification of probable miRNA candidates with precursor sequences.

Figure 2 .
Figure 2. Flowchart for prediction of putative miRNAs based on the miRNA characteristics.