A Common Cis-element in Promoters of Protein Synthesis and Cell Cycle Genes

Gene promoters contain several classes of functional sequence elements (cis elements) recognized by protein agents, e.g. transcription factors and essential components of the transcription machinery. Here we describe a common DNA regulatory element (tandem TCTCGCGAGA motif) of human TATA-less promoters. A combination of bioinformatic and experimental methodology suggests that the element can be critical for expression of genes involved in enhanced protein synthesis and the G1/S transition in the cell cycle. The motif was identified in a substantial fraction and previously reported 18 ribosomal protein genes. Since the motif can define a subset of promoters with a distinct mechanism of activation involved in regulation of expression of about 5% of human genes, further investigation of this regulatory element is an emerging task.


INTRODUCTION
The regulation of transcription is the major process modulating expression of genes on both qualitative and quantitative levels. Regulatory elements concentrated in gene promoters include several classes of functional DNA sequence motifs (cis elements) recognized by protein agents (trans elements), i.e. essential components of the DNA-directed RNA polymerase transcription machinery (GTFs, general transcription factors) and complementary transcription factors (TFs). The efficiency of transcription is enhanced by specific interactions between DNA-binding proteins and sequence elements present in promoters (TFBSs, transcription factor binding sites). Apart from the cis-trans cooperation, other regulating mechanisms include variations in chromatin composition via histone modifications (Barrera & Ren, 2006).
The regulation of gene expression is a complex process resulting in enhanced activity of the encoded gene products (proteins). Previously, groups of coregulated genes (so-called gene expression modules) have been identified by comparative measurements of gene expression in various tissues (Segal et al., 2004). For the yeast model, gene-expression clusters were effectively translated into regulatory networks defining a molecular background of co-expression (Segal et al., 2003). Specific functions have been assigned to several cis elements, and the presence (or activation) of the related trans agents (TFs) was shown to activate specific molecular switches triggering the expression of the respective genes (Hughes et al., 2000). Since the regulation of gene expression in higher eukaryotes is more complex (Jura et al., 2006), progress in the field is less advanced. So far, specific regulatory mechanisms have been assigned to a limited number of functional groups of genes (Hardison et al., 1997; Frech et al., 1998; Yoshihama et al., 2002), cellular processes (Kel et al., 2001) and tissue-specific expression patterns (Wasserman & Fickett, 1998).
The interplay between the activity of TFs in a given cell and the presence of TFBSs in promoters is one of the most important mechanisms responsible for inducible or tissue-specific transcription. Therefore, identifying functional elements of a gene promoter allows prediction of the gene's expression in various tissues and different environmental conditions (Tronche et al., 1997). The description of the full repertoire of transcription factors (trans) and their binding specificities (cis elements) is one of the most important tasks of bioinformatics in the analysis of gene expression (Pennacchio & Rubin, 2001).
Here we present a common DNA element of human promoters involved in regulation of genes associated with protein translation. Using genome-wide scanning we suggest that the element can take part in regulation of expression of nearly 5% of human genes, mostly those transcribed from TATA-less promoters.

METHODS
Sequence neighborhood. Whole-genome human-mouse alignments (genome builds: "hg17", May 2004; "mm5", May 2004) were obtained from the Genome Browser (Kent et al., 2002). Promoter sequences, defined as 1000 base pairs (bp) upstream and 100 bp downstream of the transcription start site (TSS), of 16 749 human genes (non-redundant set from the Reference Sequence project) (Kent et al., 2002) were retrieved from the sequence alignments.
The promoter alignments were scanned for occurrences and evolutionary preservation of all k-mers ranging from 6 to 8 bases. A k-mer was recognized as conserved only when it occurred in both genomes at corresponding (homologous) locations with no differences in sequence. The observed conservation ratio (c) of a motif was determined as the proportion of human occurrences that were present in conserved (non-mutated) form in a homologous locus of the mouse genome (k) to all the motif's occurrences in human promoters (n; c = k/n).
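The conservation ratio defined above can be illustrated with a minimal Python sketch. It is not the authors' original script: the function name `conservation_ratio` and the input format (a list of gapped human/mouse alignment rows of equal length) are assumptions made for illustration, and human occurrences interrupted by alignment gaps are simply ignored.

```python
def conservation_ratio(motif, aligned_pairs):
    """Return (k, n, c) for a motif over pairwise promoter alignments.

    aligned_pairs: iterable of (human_row, mouse_row) strings of equal
    length, with '-' marking alignment gaps (hypothetical input format).
    n counts human occurrences, k counts those that are identical in the
    mouse row at the same alignment columns, and c = k / n.
    """
    w = len(motif)
    n = k = 0
    for human, mouse in aligned_pairs:
        if len(human) != len(mouse):
            raise ValueError("alignment rows must have equal length")
        for i in range(len(human) - w + 1):
            if human[i:i + w] == motif:
                n += 1
                # Conserved only if the homologous mouse window is identical.
                if mouse[i:i + w] == motif:
                    k += 1
    c = k / n if n else 0.0
    return k, n, c


# Toy alignments (invented for illustration): one conserved and one
# mutated occurrence of the core motif CTCGCGAG.
pairs = [
    ("AACTCGCGAGTT", "AACTCGCGAGTT"),
    ("GGCTCGCGAGCC", "GGCTCGTGAGCC"),
]
k, n, c = conservation_ratio("CTCGCGAG", pairs)
```

In this toy input the motif occurs twice in the human rows and is preserved once in mouse, so c = 1/2.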
To analyze the degree of conservation of a k-mer we tested each motif against its "sequence neighborhood" (SN; Table 1), which was defined as all k-mers differing from it by exactly one nucleotide (e.g. the SN of AAAAA consists of: CAAAA, GAAAA, TAAAA, ACAAA, ..., AAAAT). This approach was introduced to avoid the problem of unequal conservation ratios of motifs of different nucleotide content.
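As an illustration of this definition, the sequence neighborhood of a k-mer can be enumerated in a few lines of Python. The function name `sequence_neighborhood` is hypothetical; the sketch only assumes a four-letter DNA alphabet.

```python
def sequence_neighborhood(kmer, alphabet="ACGT"):
    """Return all k-mers differing from `kmer` at exactly one position."""
    return [
        kmer[:i] + base + kmer[i + 1:]
        for i in range(len(kmer))
        for base in alphabet
        if base != kmer[i]
    ]


neighbors = sequence_neighborhood("AAAAA")
```

Each position can be replaced by three alternative bases, so a k-mer of length k has 3k neighbors (15 for AAAAA, starting with CAAAA and ending with AAAAT).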
The conservation ratio of each sequence motif was assessed against the average conservation ratio of the other sequences from its sequence neighborhood (C) using a binomial distribution model (probability of k conserved instances out of a total of n instances, given the probability C of conservation for any one instance). The Z score was calculated using the normal approximation to the binomial distribution (Feller, 1968). Motifs with a Z score above 4.0 were selected.
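The text does not spell out the formula, but under the standard normal approximation to the binomial distribution the score would take the form Z = (k - nC) / sqrt(nC(1 - C)). The sketch below assumes exactly this form; the function names and the 4.0 threshold argument are illustrative, not taken from the authors' scripts.

```python
import math


def binomial_z(k, n, C):
    """Z score of k conserved instances out of n, given the neighborhood
    conservation probability C (normal approximation to the binomial)."""
    if n <= 0 or not 0.0 < C < 1.0:
        raise ValueError("need n > 0 and 0 < C < 1")
    expected = n * C
    sd = math.sqrt(n * C * (1.0 - C))
    return (k - expected) / sd


def is_selected(k, n, C, threshold=4.0):
    """Apply the selection rule described in the text (Z above 4.0)."""
    return binomial_z(k, n, C) > threshold


# Illustrative numbers only: 60 conserved out of 100 with C = 0.5.
z = binomial_z(60, 100, 0.5)
```

With these illustrative numbers the expected count is 50 and the standard deviation is 5, giving Z = 2.0, which would fall below the selection threshold.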
The dataset of regulatory motifs. The collected motifs were grouped before compilation into a database of potential regulatory signals. The rules for clustering were as follows: sequences could differ by a maximum of one nucleotide and could be shifted by a maximum of one position, and no gaps were allowed in the alignment. The distribution of clustered motifs was evaluated by Student's t-test for paired data, and clustering was allowed only for motifs of consistent distribution (P < 0.05; motif occurrences were counted in a 20 bp window along the promoter sequences). The dataset is available for browsing at URL: http://promoter.bioinfo.pl (Wyrwicz L.S., Rychlewski L., Ostrowski J., manuscript in preparation).
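One possible reading of the sequence-based clustering rule is sketched below in Python. Several details are assumptions, since the text does not specify them: mismatches are counted only over the overlapping columns, both criteria (at most one mismatch, at most one position of shift) are applied together, and the paired t-test on positional distributions is not reproduced. The function name `can_cluster` is hypothetical.

```python
def can_cluster(a, b, max_mismatch=1, max_shift=1):
    """Return True if motifs a and b align without gaps, shifted by at
    most `max_shift` positions, with at most `max_mismatch` mismatches
    in the overlapping columns (assumed interpretation of the rule)."""
    for shift in range(-max_shift, max_shift + 1):
        # Position i of `a` is compared with position i - shift of `b`.
        start = max(0, shift)
        end = min(len(a), len(b) + shift)
        if end <= start:
            continue
        mismatches = sum(1 for i in range(start, end) if a[i] != b[i - shift])
        if mismatches <= max_mismatch:
            return True
    return False
```

Under this reading, CTCGCGAG clusters with the single-substitution variant CTCGTGAG and with the shifted 8-mer TCTCGCGA, while an unrelated sequence such as AAAAAAAA is rejected.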
The impact of the motif presence on promoter activity was assessed for gene expression profiles obtained by serial analysis of gene expression (SAGE) (Velculescu et al., 1995) deposited in the GEO database (http://www.ncbi.nlm.nih.gov/geo) (Edgar et al., 2002). The SAGE method was preferred over microarrays or other platforms for estimation of gene expression as it has previously been shown to exhibit more precise discrimination between high and low abundance transcripts (van Ruissen et al., 2005). A total of 164 gene expression libraries of 10 bp tags associated with NlaIII restriction sites, representing various tissues and cell lines derived from human normal and cancerous cells, were selected (Edgar et al., 2002). The previously published algorithm (Klimek-Tomczak et al., 2004) was used to match the expression data to the set of Reference Sequence project genes. Genes from each SAGE experiment corresponding to tags found in the SAGE library were sorted by the number of tag counts and grouped into "high expression" (HE; top 40% of expressed genes) and "low expression" (LE; 40% of genes with the lowest tag count). The chi-square test was used to compare the number of conserved motif occurrences in both groups of promoters.
Annotation of human promoters. We tested the presence of the motif of interest in human promoters retrieved from the Eukaryotic Promoter Database (EPD) (Schmid et al., 2006) and the UCSC Genome Browser (Kent et al., 2002) databases ("upstream1000" data set) using in-house scripts written in the Perl programming language. The functional annotation of genes was performed with the Gene Ontology (http://geneontology.org) (Harris et al., 2004) and UniProt (http://www.uniprot.org) resources (Bairoch et al., 2005). The scripts, datasets and search results are available as Supplementary materials (URL: http://lucjan.bioinfo.pl/supplemental/cellcycle).
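The high/low expression comparison described for the SAGE libraries can be sketched as follows. This is a minimal illustration rather than the published pipeline: the function names, the dictionary input format and the 2x2 layout (promoters with and without a conserved motif in the HE and LE groups) are assumptions, and no continuity correction is applied.

```python
def split_by_expression(tag_counts, fraction=0.4):
    """Split genes into high- and low-expression groups.

    tag_counts: dict mapping gene -> SAGE tag count (hypothetical format).
    Returns (HE, LE): the top and bottom `fraction` of genes by tag count.
    """
    ranked = sorted(tag_counts, key=tag_counts.get, reverse=True)
    size = int(len(ranked) * fraction)
    return ranked[:size], ranked[len(ranked) - size:]


def chi_square_2x2(he_with, he_without, le_with, le_without):
    """Pearson chi-square statistic (1 degree of freedom) for a 2x2 table
    of promoters with/without the motif in the HE and LE groups."""
    a, b, c, d = he_with, he_without, le_with, le_without
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("table has an empty row or column")
    return n * (a * d - b * c) ** 2 / denom


# Illustrative counts only: 30 of 100 HE promoters and 10 of 100 LE
# promoters carry the motif.
chi2 = chi_square_2x2(30, 70, 10, 90)
```

For one degree of freedom a statistic above 3.841 corresponds to P < 0.05; the illustrative table above gives 12.5.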

RESULTS AND DISCUSSION
The applied algorithm allowed the identification of a subset of the k-mers present in human promoters as potential regulatory motifs. The summary of the motif selection is shown in Table 2. Since palindromes constitute one of the most important groups of regulatory elements, the dataset was tested for the presence of such motifs. The top scoring palindromic motifs identified are summarized in Table 3.
An uncharacterized palindromic motif, TCTCGCGAGA, was identified among the most conserved motifs in a genome-wide human-mouse assessment of 6-8 nucleotide segments and is deposited in PromoSignalDB under the accession number H-26.1 (http://promoter.bioinfo.pl/data.pl?acc=H-26.1). The core part of the motif (CTCGCGAG) was conserved in 151 cases out of 283 occurrences in the analyzed human promoters (53%). The conservation ratio increased to 66% for motifs located between base -180 and +40 in relation to the TSS.
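In the DNA sense, a palindrome is a sequence identical to its own reverse complement. The short check below illustrates this for the motif and its core; the helper names are hypothetical.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def is_palindrome(seq):
    """True if the sequence equals its own reverse complement."""
    return seq == reverse_complement(seq)


motif_is_palindrome = is_palindrome("TCTCGCGAGA")
core_is_palindrome = is_palindrome("CTCGCGAG")
```

Both the 10 bp motif and its 8 bp core satisfy the check, so the element reads identically on both strands.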
The motif distribution in human promoters is shown in Fig. 1A. Detailed analysis of the motif distribution within human promoters suggested that the motif tended to be present in more than one copy. In the "upstream1000" dataset the consensus element was present in a duplicated form 12.62 times more often than expected. The motifs within a pair were usually separated by up to 200 nucleotides (Fig. 1B). Selective conservation of the two copies in homologous genomic loci of related species and accumulation of mutations in the spacer sequence were observed (an example motif is shown in Fig. 1C). The preference of the motif to occur in more than one copy is unusual. To assess whether the motif is recognized in the single or the double configuration, an electrophoretic mobility shift assay (EMSA) of oligonucleotides containing the TCTCGCGAGA motif was performed. For testing we selected native oligonucleotides containing a motif nearly identical to the consensus sequence. The sequences were obtained from the proximal promoters of XPC (xeroderma pigmentosum, complementation group C) and COX11 (cytochrome c oxidase assembly protein 11). In the selected promoters two copies of the element were present in close proximity, spaced by 18 and 9 nucleotides, respectively. Both elements were conserved in homologous loci of different vertebrate species (mouse, rat, dog).
The binding of nuclear proteins to the double-stranded oligonucleotides (dsDNA) was assessed by EMSA. To induce transcription, HeLa cells were first starved for 48 h, then stimulated with fetal calf serum for 0, 1, 6 or 24 h. We observed a specific mobility shift for dsDNA probes containing two copies of the motif (Fig. 2B, C), while no shift was observed for an oligonucleotide containing a single copy of the element (XPC-single; Fig. 2A) or for a single nearly identical motif retrieved from the promoter of HNRPK (not shown). The specific shift was present only when nuclear protein extract from induced cells was used, and the amount of shifted probe increased with the duration of serum induction.
To investigate the function of the presence of two copies of the motif, deletion mutants of the native COX11 element were assayed (Fig. 3). No shift was observed for probes with the central six nucleotides deleted in either one (lanes 3, 4) or both copies (lane 5).
The described regulatory motif was previously identified in other genome-wide studies, but no details on its activity were provided. FitzGerald and coworkers (2004) identified this element as a common motif clustering in the human genome in close proximity to transcription start sites. Xie et al. (2005) identified the element as a conserved motif in several mammalian genomes. Haun and coworkers (1993) investigated the role of the TCTCGCGAGA element in the promoter of ARF3 and concluded that mutation of a single copy of the element diminished the transcriptional activity of the ARF3 promoter in vivo.
Notably, the ARF3 promoter also contains a second, imperfect copy of the motif 23 nucleotides apart, which was present in the analyzed gene construct but not reported by Haun and coworkers (TCT CGC GAG AAC TGC CGC TAG CTA CCG CGC AGC TCT CGC GCG A). The effect of mutation or deletion of the latter site was not investigated. The presence of similar motifs was postulated by Roepcke and coworkers (2006) (motif M4; AGTCTCGCGAGATCT) and Perry (2005) in their studies on sequence elements overrepresented in promoters of human ribosomal genes. None of the presented studies suggested the tandem composition of the active element.
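The tandem arrangement in the ARF3 fragment quoted above can be located with a simple scan. The sketch below is illustrative only: the tolerance of one mismatch per copy is an assumption chosen so that the imperfect second copy is detected, the 200-nucleotide spacer limit follows the spacing reported for motif pairs in this study, and the function names are hypothetical.

```python
def find_hits(seq, motif, max_mismatches=1):
    """Return start offsets where `motif` matches `seq` with at most
    `max_mismatches` substitutions (no gaps)."""
    w = len(motif)
    hits = []
    for i in range(len(seq) - w + 1):
        mismatches = sum(1 for x, y in zip(seq[i:i + w], motif) if x != y)
        if mismatches <= max_mismatches:
            hits.append(i)
    return hits


def find_tandems(seq, motif, max_mismatches=1, max_spacer=200):
    """Return (first_start, second_start, spacer) for every pair of
    non-overlapping hits separated by at most `max_spacer` nucleotides."""
    w = len(motif)
    hits = find_hits(seq, motif, max_mismatches)
    pairs = []
    for idx, first in enumerate(hits):
        for second in hits[idx + 1:]:
            spacer = second - (first + w)
            if 0 <= spacer <= max_spacer:
                pairs.append((first, second, spacer))
    return pairs


# ARF3 promoter fragment as quoted in the text (spaces removed).
ARF3 = "TCTCGCGAGAACTGCCGCTAGCTACCGCGCAGCTCTCGCGCGA"
arf3_pairs = find_tandems(ARF3, "TCTCGCGAGA")
```

On this fragment the scan pairs the perfect copy at offset 0 with the imperfect copy at offset 33, separated by a 23-nucleotide spacer, in agreement with the spacing noted above.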
We performed a search of human promoters containing the composite tandem motif in the Eukaryotic Promoter Database. Other genes containing the motif in their promoters were functionally related to enhanced protein synthesis and included translation initiation factors (EIF5, EIF2S1, EIF4G2, EIF3S8, EIF4), cell cycle genes active in G1/S phase (CDK8, CDC25A, CUL1), cyclins (CCNC, CCNG1), genes linking gene expression and cell cycle regulation (TAF7), transcription regulators (TAF13, PROX1, KLF7, NCOA2) and chromatin structure modulators (HDAC2, TAF6L). The motifs identified in promoters of the mentioned genes are shown in Table 4.
Since the role of the motif in gene expression had not been investigated before, we tested whether the motif occurs in promoters of tissue-specific genes. Analysis of gene expression profiles obtained with the SAGE method revealed that the motif was overrepresented in promoters of highly expressed genes when compared with the low expression subset in 121 of 164 tested tissues and cell lines. Similar results (association in > 50% of tested gene expression profiles) were obtained for general or very common regulatory elements which do not exhibit tissue selectivity, i.e. the TATA-box, CAAT enhancer, Oct1 and Ets motifs, as well as the Kozak sequence (a motif associated with highly efficient translation) (Kozak, 1987). The results of the analysis are available as Supplementary materials (URL: http://lucjan.bioinfo.pl/supplemental/cellcycle).
Although the consensus motif has been determined, analysis of reference human promoters (EPD) and comparative genomics analysis suggest that a certain degree of variation is accepted within the site, as can be shown for the tandem motif of the COX11 promoter (Fig. 1C; human: TCTCGCGAGA N9 CCTCGCGAGA, mouse: TACCGCGAGA N9 TCTCGCGAGA). Moreover, in several promoters of ribosomal protein genes we identified two imperfect copies of the motif located in close range, where neither the proximal nor the distal copy matched the 10 bp consensus (Table 4).
An analysis of human promoters retrieved from the UCSC Genome Browser dataset matching the tandem element was performed, and ribosomal protein genes were among the most abundant classes of genes observed (Table 5). Since the bioinformatic identification of the full repertoire of genes associated with a motif relies on an assumption of a minimal degree of similarity to its consensus sequence (Hoh et al., 2002), the list of genes presented here is only an approximation. The structure of human ribosomal genes has previously been studied and distinct features of their promoters were identified, including an oligopyrimidine tract around the TSS and GC-rich promoters with TATA-like sequences, but usually lacking a typical TATA-box (Yoshihama et al., 2002). We assessed whether the mentioned features are present in promoters containing the motif. Since the exact position of the transcription start site for the "upstream1000" dataset remains uncertain (Makalowski, 2001), we were not able to analyze the neighborhood of the TSS in these data. However, for the promoters retrieved from the EPD database we confirmed the presence of the mentioned features characteristic of ribosomal promoters and of selected other genes (example entries are shown in Table 6). Since we observed similar characteristics in the gene promoters presented here, we can assume that the described tandem motif complements the previous observations of the functional elements of ribosomal protein gene promoters.
Based on our genome-wide promoter analysis, experimental work and previously published studies we conclude that the tandem TCTCGCGAGA motif is a common regulator interacting with unknown protein(s) induced during enhanced protein synthesis and/or cell proliferation. The motif consists of a palindromic sequence and is active in a tandem arrangement. Due to the close proximity between the studied element and the transcription start site we also suggest that it may play a central role in expression of a significant fraction of human genes transcribed from a distinct class of TATA-less promoters previously described as ribosomal protein gene-specific promoters (Yoshihama et al., 2002). The identification of the motif in functionally associated sets of genes involved in translation and the cell cycle suggests the existence of a common process-specific mechanism of gene expression. Since the previously described specific features of ribosomal gene promoters have a low information content (oligopyrimidine tract around the TSS, GC-rich promoters with TATA-like sequences, but usually lacking a typical TATA-box), their usefulness in the identification of co-regulated genes is limited. The identification of the described motif enables the identification of the full repertoire of genes regulated in this manner. The analysis of gene expression profiles suggests that the motif is involved in a general mechanism of regulation of gene expression rather than being a tissue-specific cis element. Although the detailed mechanism of its action remains undiscovered, we assume that it may either serve as the recognition site of a central GTF, alternative to the TATA-binding protein (TBP), or act as a highly active enhancer element recruiting the assembly of the polymerase complex in the neighborhood of the TSS. Elucidating this mechanism may result in the development of novel therapeutic strategies (Gniazdowski & Czyz, 1999).

Figure 2. Electrophoretic mobility shift assay of in vitro binding to the novel cis element TCTCGCGAGA. Starved HeLa cells were stimulated with 15% fetal calf serum for 0 h (lane 1), 1 h (lane 2), 6 h (lane 3) and 24 h (lane 4). A, XPC-single; B, COX11; C, XPC. The locations of the shifted probe, likely corresponding to a specific interaction of nuclear extract protein(s) with the tandem motif oligonucleotide, are marked with arrowheads.

Table 4. List of selected genes involved in protein translation and cell cycle regulation with the duplicated motif within their promoters.
Promoter sequences retrieved from the genome assembly (not deposited in EPD) are marked with an asterisk.