Regular paper
Fold recognition insights into function of herpes ICP4 protein
L.S. Wyrwicz and Leszek Rychlewski

ICP4 is an important factor regulating the life cycle of HSV1. This conserved protein has several molecular functions, including activation of expression of viral late gene transcripts and inhibition of immediate early genes. Although ICP4 and its Alphaherpesvirinae homologs (e.g. IE62 of VZV) have been the subject of various molecular studies, a complete view of their molecular function is lacking. Here we present the results of fold recognition and molecular modelling of ICP4 functional domains. State-of-the-art bioinformatic fold recognition analysis identified a dual helix-turn-helix motif as the DNA-binding module responsible for repressor activity (the so-called region 2 domain). Mapping of distant homology indicated that the segment responsible for activation of late gene promoters (region 4) adopts the fold of uracil DNA glycosylase (UDG) but appears to be a non-functional homolog of UDG. Potential implications of the results are discussed.


INTRODUCTION
During productive infection by herpes simplex virus type 1 (HSV1), nearly 80 genes are transcribed by DNA-dependent RNA polymerase II in three phases named immediate early (IE), early (E), and late (L). ICP4 (IE175) is the major regulatory protein of HSV1 and is one of the IE genes expressed at the earliest stages of virus infection (Wagner et al., 1995). The gene is crucial for infection, since its inactivation prevents progression beyond the IE phase. Its product is required for the activation of transcription from the majority of viral promoters (Watson et al., 1980), but knowledge of its mechanism of action is still incomplete. The central role of ICP4 in the progression of the viral life cycle is reinforced by the fact that the ICP4 protein represses transcription of three viral genes (early regulator, ICP4; latency phase product, LAT; and ORF-P) via interaction with high affinity binding sites at their transcription initiation sites (Faber et al., 1988; Batchelor et al., 1994; Gu et al., 1995).
The gene, with an open reading frame of nearly 1300 amino acids, is present in all genomes of Alphaherpesvirinae, including two other human pathogens: herpes simplex virus type 2 (HSV2) and Varicella-Zoster Virus (VZV). The native protein forms homodimers (Metzler et al., 1985) and this interaction has a fundamental role in ICP4 functionality. The ICP4 protein has several molecular functions, such as DNA binding associated with repression of gene expression, nuclear localization and activation of expression of late transcripts. The activities previously mapped on the ICP4 open reading frame are distributed throughout it (DeLuca & Schaffer, 1988).
The DNA binding activity is associated with the N-terminal part of the protein (Beard et al., 1986; DeLuca et al., 1988; Wy et al., 1990). The bipartite consensus binding site, the so-called "A site", is ATCGTCnnnnYCGRC, where n is any nucleotide, Y is a pyrimidine (cytosine or thymine), and R is a purine (adenine or guanine) (Faber & Wilcox, 1986). Additionally, unrelated motifs with no obvious common sequence similarity pattern have been observed (B sites) (Faber & Wilcox, 1986). The binding specificity of ICP4 and its homologs from various Alphaherpesvirinae is similar (Wu & Wilcox, 1991) and multiple binding sites have been described throughout the genome (Michael & Roizman, 1989). The protein-DNA interaction is augmented by the proper spatial distribution of the 'A site' and general transcription factor binding sites (DiDonato & Muller, 1989). A bend of the DNA within A sites is observed during the formation of the protein-DNA complex (Everett et al., 1992).
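The A-site consensus quoted above can be turned into a pattern for scanning nucleotide sequences. The following is a minimal sketch, assuming plain-text DNA input on a single strand; the function names are illustrative and are not part of the original analysis.

```python
import re

# IUPAC codes appearing in the A-site consensus ATCGTCnnnnYCGRC.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "Y": "[CT]",    # pyrimidine: cytosine or thymine
    "R": "[AG]",    # purine: adenine or guanine
    "N": "[ACGT]",  # any nucleotide
}


def consensus_to_regex(consensus):
    """Translate an IUPAC consensus string into a regular expression string."""
    return "".join(IUPAC[base] for base in consensus.upper())


A_SITE_REGEX = consensus_to_regex("ATCGTCnnnnYCGRC")


def find_a_sites(sequence):
    """Return 0-based start positions of A-site matches on the given strand."""
    # A zero-width lookahead is used so that overlapping matches are reported too.
    pattern = re.compile("(?=" + A_SITE_REGEX + ")")
    return [match.start() for match in pattern.finditer(sequence.upper())]
```

To cover both strands, the same scan would have to be repeated on the reverse complement of the input; that step is omitted here for brevity.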
The regulation of expression of late genes is associated with the C-terminal segment of ICP4. Earlier studies suggested an involvement of protein-protein interactions between ICP4 and cellular factors of the polymerase complex. Analysis of ICP4-related activation of expression of viral genes indicates that ICP4 acts in cooperation with host proteins of the preinitiation complex, such as TATA-binding protein (TBP), TAFII250 and TFIIB (Smith et al., 1993; Carrozza & DeLuca, 1996).
Since the protein is essential for all Alphaherpesvirinae, and its function is preserved, we may assume that the most critical activities are performed by the protein fragments exhibiting the highest sequence similarity throughout the protein family. McGeoch et al. (1986) pointed out five different functional regions of ICP4 based on patterns of conservation. In general, two conserved fragments called regions 2 and 4 map to independent functional regions, responsible for DNA binding at 'A sites' and associated with the activation of late genes, respectively. The internal fragment of nearly 200 amino acids (region 3) located in the center of ICP4 is much more divergent. Apart from carrying a nuclear localization signal, it may function as a spacer between functional domains of different activity (McGeoch et al., 1986). The N-terminal fragment (region 1) is the least conserved one. It contains multiple phosphorylation sites (Xia et al., 1996) and a protein-protein interaction interface (Grondin & DeLuca, 2000). The role of region 5, located in the C-terminal fragment, is associated with late regulatory activities. The removal of its terminal 56 residues, as well as the introduction of some point mutations in this region, abolished its activity. It has also been suggested that region 5 acts as an enhancer of the ICP4 N-terminal transactivation domain (Bruce et al., 2002).
In our previous reports, we applied methods for identifying distant homology of protein families, with subsequent molecular modelling of protein structure, to analyse the functions of divergent proteins of Herpesviridae. This approach successfully identified the herpes UL24 gene product as a potential endonuclease (Knizewski et al., 2006). Similar methodology allowed the identification of glycoprotein L (gL) as a protein resembling chemokine receptor ligands (Wyrwicz & Rychlewski, 2007a) and EBV BcRF1 as a late regulatory TATT-binding protein (Wyrwicz & Rychlewski, 2007b). Here we present the results of fold recognition and molecular modelling of the functional domains of the ICP4 protein.

MATERIALS AND METHODS
Fold recognition and assembly of structural alignments. Sequences of ICP4 (HSV1, HSV2) and IE62 were retrieved from the corresponding genome sequences (GenBank) and aligned using ClustalW (Thompson et al., 1994) with subsequent manual corrections. The annotation of globular regions was performed using GlobPlot (Linding et al., 2003). The protein sequences were divided into overlapping fragments 300 amino acids long and submitted to the Structure Prediction Meta Server (http://bioinfo.pl/meta) (Bujnicki et al., 2001), which assembles various secondary structure prediction and top-of-the-line fold recognition (FR) methods. Regions with a high propensity to form non-globular regions according to GlobPlot (Linding et al., 2003) and segments without consistent and confident predictions of secondary structure using PsiPred (Jones, 1999) and ProfSec (Rost et al., 2004) were marked as non-globular (Fig. 1).
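The fragmentation step can be sketched as follows. This is only an illustration: the fragment length of 300 residues comes from the text, whereas the step between fragment starts (150 residues here) is an assumption, since the amount of overlap is not stated.

```python
def overlapping_fragments(sequence, length=300, step=150):
    """Split a protein sequence into overlapping fragments.

    Returns a list of (start, fragment) pairs, where start is the 1-based
    position of the fragment in the full sequence. The default step of 150
    is a hypothetical choice; only the fragment length is given in the text.
    """
    if length <= 0 or step <= 0:
        raise ValueError("length and step must be positive")
    fragments = []
    start = 0
    while True:
        fragments.append((start + 1, sequence[start:start + length]))
        # Stop once the current fragment reaches the end of the sequence.
        if start + length >= len(sequence):
            break
        start += step
    return fragments
```

With these defaults, a sequence of about 1300 residues yields eight fragments, the last of which is shorter than 300 residues.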
Since the protein structure prediction methods had been optimized for processing of globular domains, the potential globular regions were divided further into single domains according to secondary structure predictions and preliminary results of fold recognition searches. The domains with corrected boundaries were resubmitted to the Structure Prediction Meta Server (Bujnicki et al., 2001) and MetaBasic (Ginalski et al., 2004). Collected models were screened with 3D-Jury, a consensus fold recognition prediction method (Ginalski et al., 2003).
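The screening of collected models can be illustrated with a simple post-processing sketch. It does not reimplement 3D-Jury itself; it only assumes that each hit already carries a 3D-Jury score and applies the cutoff of 50.0 quoted in the legend of Table 2. The `Hit` record and its field names are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

JURY_CUTOFF = 50.0  # cutoff quoted in the legend of Table 2


@dataclass
class Hit:
    method: str        # fold recognition method that produced the hit
    template: str      # template identifier (hypothetical values in examples)
    fold: str          # fold or superfamily label assigned to the template
    jury_score: float  # 3D-Jury score, assumed to be precomputed


def confident_hits(hits, cutoff=JURY_CUTOFF):
    """Keep hits scoring at or above the cutoff, best first."""
    kept = [hit for hit in hits if hit.jury_score >= cutoff]
    return sorted(kept, key=lambda hit: hit.jury_score, reverse=True)


def fold_support(hits, cutoff=JURY_CUTOFF):
    """Count how many confident hits support each fold label."""
    return dict(Counter(hit.fold for hit in confident_hits(hits, cutoff)))
```

A fold supported by confident hits from several independent methods is the kind of consistent assignment the screening is meant to reveal.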
The protein structure prediction methods were used to identify similarity between ICP4 domains and known protein families. For both target and template sequences, close homologs were collected with PSI-BLAST and aligned by using ClustalW (Thompson et al., 1994) and PCMA (Pei et al., 2003) with final manual adjustments according to secondary structures (observed and predicted) and critical residues of the fold identified by literature browsing.
Identification and distant fold mapping of DNA-binding domain. Since the initial procedure of fold recognition failed to provide a confident assignment for the repressor domain (region 2), an additional procedure was applied. All corresponding sequences of the DNA binding domain were extracted from ICP4 homologs present in the "nr" database (GenBank) clustered at 90% sequence identity (Li & Godzik, 2006). The sequences were submitted again to the Protein Structure Prediction Meta Server (Bujnicki et al., 2001). The results of the template selection procedure were screened for proteins involved in regulation of transcription according to functional assignments provided by the Gene Ontology consortium (Gene Ontology terms: GO:0003700, transcription factor activity; GO:0045449, regulation of transcription; GO:0006355, regulation of transcription, DNA-dependent; GO:0003677, DNA binding (Harris et al., 2004)). Since consistent predictions (DNA/RNA-binding three-helical bundle fold superfamily, SCOP: a.4 (Wintjens & Rooman, 1996; Aravind et al., 2005)) were observed as results of several fold recognition methods (INUB (http://inub.cse.buffalo.edu/), mGenThreader (McGuffin & Jones, 2003)), further analysis was performed on the proteins from this superfamily as described in the Results section. The query sequences were aligned (ClustalW; Thompson et al., 1994) to template sequences from the three-helical bundle superfamily with subsequent manual correction of the alignment according to secondary structure (predicted and observed). The identification of homology was confirmed by enumeration of critical residues forming the potential protein-DNA interaction interface.
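The screening of template hits by Gene Ontology terms, followed by a tally of SCOP superfamilies, might look like the sketch below. The four GO identifiers are those listed in the text; the hit records (dictionaries with `template`, `scop` and `go_terms` keys) are a hypothetical data layout chosen for illustration.

```python
from collections import Counter

# GO terms named in the text as markers of transcription-related proteins.
TRANSCRIPTION_GO_TERMS = {
    "GO:0003700": "transcription factor activity",
    "GO:0045449": "regulation of transcription",
    "GO:0006355": "regulation of transcription, DNA-dependent",
    "GO:0003677": "DNA binding",
}


def transcription_related(hits):
    """Keep hits annotated with at least one of the selected GO terms."""
    selected = set(TRANSCRIPTION_GO_TERMS)
    return [hit for hit in hits if selected & set(hit["go_terms"])]


def tally_superfamilies(hits):
    """Count transcription-related hits per SCOP superfamily, most common first."""
    counts = Counter(hit["scop"] for hit in transcription_related(hits))
    return counts.most_common()
```

A dominant superfamily in such a tally is what motivates restricting further analysis to a single fold class.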

RESULTS
Fold recognition for ICP4 regions
The initial screening for potential globular regions suggested that regions 1 and 3 likely represent non-globular domains and are therefore not suitable for application of fold recognition methodology. The sequence analysis mapped the potential folding to domains from regions 2 and 4 of Alphaherpesvirinae ICP4 (McGeoch et al., 1986). The domains and their location in the ICP4 sequences are shown in Fig. 1.

Fold recognition of repressor DNA-binding domain
The identified hits to proteins related to transcription, gene regulation and DNA binding are listed in Table 1. Among the results, we identified 102 hits to proteins associated with regulation of transcription, and 83 of them represented the DNA/RNA-binding three-helical bundle fold (SCOP superfamily: a.4). Among the remaining 19 hits none of the identified folds was reported for either of the tested ICP4 homologs; the second top scoring hit (antitermination factor NusB, SCOP a.74 superfamily) was identified seven times and the remaining superfamilies were each represented by only about one hit. The SCOP a.4 superfamily utilizes a major structural motif capable of binding DNA, the so-called helix-turn-helix (HTH) motif. It is composed of two α helices joined by a short strand of amino acids and is found in many proteins that regulate gene expression. The C-terminal helix is involved in DNA binding via sequence-specific interaction with the major groove of double-stranded DNA (Wintjens & Rooman, 1996; Aravind et al., 2005). Due to the very low degree of sequence similarity, no single HTH protein of known structure was preferentially identified as a likely homolog of the ICP4 domain. Further selection of modelling templates among the three-helical bundle superfamily was supported by the following observations: 1) the ICP4 DNA binding site consists of a bipartite motif of 11 nucleotides spaced by an additional four bases (ATCGTCnnnnYCGRC); 2) the potential globular segment of the ICP4 DNA-binding region (the so-called region 2) has six α helices (as predicted by PsiPred (Jones, 1999) and ProfSec (Rost et al., 2004), accessed via the Protein Structure Prediction Meta Server (Bujnicki et al., 2001)); 3) the least conserved segment (a potential inter-domain linker) divides the secondary structure elements into two blocks of three helices; 4) the third and sixth helices (the last helix in each 3-helix segment) contain conserved basic residues (arginine, lysine, histidine) potentially involved in interaction with DNA elements.
Based on the above observations, we concluded that for the purposes of the modelling study we should select a protein of known structure containing two HTH domains involved in cooperative binding of DNA. We arbitrarily selected the paired box domain (PAX) (Underhill, 2000) as a domain containing two HTH domains independently interacting with closely spaced DNA motifs (Xu et al., 1999). PAX proteins represent a distinct conserved class in Eukaryota and are critical for gene expression during development (Lang et al., 2007). PAX proteins contain three functional regions involved in the protein-DNA interaction: two helix-turn-helix motifs utilizing a DNA-binding module typical for this fold, divided by a linker region which acts as an additional component of the DNA-binding interface (Lang et al., 2007).
The structural templates of PAX domains were collected from the PDB database (6paxA, 1pdnC); additionally, sequences of human PAX proteins were aligned by using ClustalW (Thompson et al., 1994). A structural alignment of the ICP4 and PAX families was created manually according to the secondary structure (PsiPred predictions for ICP4 and human PAX proteins and the observed structure of PAX6 from PDB entry 6paxA). The alignment of the ICP4 domain and the PAX protein family is shown in Fig. 2. The basic amino acids involved in binding to phosphate moieties of double-stranded DNA were identified by inspection of the PAX6 structure and marked in Fig. 2 (1-9, A-E).

Figure 2. The corresponding sequences were extracted from the GenBank entries (coded with GenBank identifier gi; PRV, Pseudorabies virus; CHV, Canine herpesvirus; BHV5, Bovine herpesvirus 5; FHV, Feline herpesvirus type 1; EHV1, Equine herpesvirus type 1; MaHV1, Macropodid herpesvirus 1; EHV4, Equine herpesvirus 4; MeHV1, Meleagrid herpesvirus 1; CHV7, Cercopithecine herpesvirus 7; CHV1, Cercopithecine herpesvirus 1; GHV2, Gallid herpesvirus 2; GHV3, Gallid herpesvirus 3). Numbers in brackets refer to positions in the GenBank entries. The sequences of crystallographically solved proteins from the PAX family, PAX6 and PAX5, are shown (6paxA, 1pdnC, respectively). The observed (PDB: 6paxA) and predicted (PsiPred predictions for VZV ICP4 and PAX1) secondary structure elements are coded with letters: H, α-helix; E, β-strand (extended). The distinct functional regions identified previously for PAX6 are marked below (two helix-turn-helix DNA binding domains and a linker binding to the minor groove). Basic residues involved in interaction with DNA in PAX6 (PDB: 6paxA) are marked below (1-9, A-E).

Fold recognition of glycosylase-like domain
The fold recognition of region 4 was performed according to the standard procedure of fold recognition as described previously (von Grotthuss et al., 2003). Among the hits collected by the Protein Structure Prediction Meta Server (Bujnicki et al., 2001), the uracil DNA glycosylase fold was identified by several homology modelling and threading methods (Table 2). Uracil-DNA glycosylase is a DNA repair enzyme catalyzing the cleavage of the RNA-specific base (uracil) from DNA (reviewed by Krwawicz et al., 2007). As summarized in Table 3, this prediction was consistently assigned by the 3D-Jury prediction assessment system for several Alphaherpesvirinae proteins (ICP4 of HSV1 and HSV2, IE62 of VZV). The structural alignment of this domain is shown in Fig. 3.
In order to test whether the ICP4 region 4 domain, apart from assuming the uracil DNA glycosylase fold, also retains its function, an analysis of the residues forming the active site was performed. The conserved residues concentrate in the internal part of the domain, while sequence conservation on the surface is weak and the residues forming the active site are not preserved in ICP4.
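A check of this kind can be expressed as a lookup in a pairwise alignment: for each catalytic position of the template, find the residue aligned to it in the query. The sketch below assumes two gapped sequences of equal length and 1-based positions in the ungapped template; the function name and all inputs are hypothetical and do not reproduce the actual UDG active-site residues.

```python
def catalytic_residue_status(template_aln, query_aln, catalytic_positions):
    """Report which template catalytic residues are preserved in the query.

    template_aln, query_aln: aligned sequences of equal length, '-' for gaps.
    catalytic_positions: 1-based positions in the ungapped template sequence.
    Returns {position: (template_residue, query_residue, is_identical)}.
    """
    if len(template_aln) != len(query_aln):
        raise ValueError("aligned sequences must have the same length")
    wanted = set(catalytic_positions)
    status = {}
    position = 0
    for template_res, query_res in zip(template_aln, query_aln):
        if template_res == "-":
            continue  # insertion in the query; template numbering does not advance
        position += 1
        if position in wanted:
            status[position] = (template_res, query_res, template_res == query_res)
    return status
```

Identity is a deliberately strict criterion here; a more permissive version could accept conservative substitutions, which would be a further assumption beyond what the text describes.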

DISCUSSION
The applied methodology provides an additional view into the molecular function of the ICP4 protein, a main regulator of early/late gene expression in Alphaherpesvirinae. Previous functional studies mapped the observed activities to defined subregions of ICP4 (Fig. 1). Here we show that such regions represent distinct structural domains which can be characterized with the bioinformatic methodology of fold recognition and homology modelling.
The DNA-binding domain (region 2), responsible for repression of early genes of HSV1, encodes a common DNA/RNA binding fold of the three-helical bundle superfamily. The consistent presence of six α helices and the preserved pattern of basic amino acids potentially involved in binding of DNA phosphate moieties strongly support the distant mapping of the fold. For the modelling purposes we selected a similar tandem domain from the helix-turn-helix family (PAX domain). Out of 14 basic residues located on the potential interface of the ICP4 DNA-binding domain, seven are preserved between the PAX and ICP4 families (positions 1, 3, 5 in the first HTH domain, 9 in the inter-domain linker and C, D, E in the second HTH domain, compare Fig. 2). Since the conservation pattern is not strict, we may assume some variation of the binding mode among the species of Alphaherpesvirinae. The highest degree of conservation was observed for the residues critical for the interaction of HTH with DNA in the PAX domain, located at positions 3 and 5 in the first HTH domain, as well as C and E in the second domain.
The fold recognition methodology applied to region 4 suggested that the globular domain located between residues 837 and 1104 of HSV1 ICP4 (Fig. 3) encodes a uracil-DNA glycosylase-like fold. Given the consistent predictions of various distant similarity and threading methods (Table 2), the mapping of the fold is highly likely to be correct. The match in the pattern of secondary structure (predicted for ICP4 and observed for UDG) also strongly supports the confident scores of the fold recognition protocols used. Since the residues forming the active site are not preserved (not shown), we conclude that this protein is unable to perform the enzymatic function of uracil-DNA glycosylase. Although retention of a fold without preservation of its enzymatic function is an important mechanism of evolution of protein families, such situations are observed relatively infrequently. We may expect that ICP4 and its homologs utilize the uracil-DNA glycosylase fold to interact with DNA of herpes gene promoters and either specifically recognize modified nucleotides (e.g. methylated cytosines) or perform a chemical modification of promoter DNA. Further experimental work is needed to answer this question and to investigate the possibility of using potential inhibitors of this domain to repress ICP4 function (Speina et al., 2005).
Notably, the glycosylase-like fold is smaller than region 4 described by McGeoch et al. (1986). Potentially, this region of ICP4 may form an additional structurally independent domain. With the presented methodology we were unable to get an insight into the structure of the C-terminal segment of region 4 and the whole region 5, but other protein modelling algorithms may prove to be successful with those regions (Kolinski et al., 2004; Ekonomiuk et al., 2005).
The fold recognition approach provides an additional view into the function of the complex Alphaherpesvirinae gene expression regulator by providing evidence for a distant homology between ICP4 and proteins of established function. Although the presented structural assignments should be treated with rather limited confidence, this analysis clearly shows that bioinformatics has an important role in the annotation of divergent genomes, like those of Herpesviridae. Application of profile-profile methodology in the detection of distant similarity, together with various methods of fold recognition supported by homology modelling, allows the identification of unexpected structural assignments. Our analysis, apart from providing an insight into the action of the ICP4 protein, allows further exploitation of the presented data in a rational design of experimental studies (e.g. mutation studies or binding assays).

Figure 3. Structural alignment of ICP4 glycosylase-like domain. The corresponding Herpesviridae regions were extracted from the GenBank entries: HSV1, gi|9629441; HSV2, gi|9629330; and VZV, gi|9625936; numbers in brackets refer to positions in the sequences and the length of removed non-conserved fragments. The sequences of crystallographically solved uracil-DNA glycosylases from HSV1, human and Atlantic cod are shown (PDB entries: 1lauE, 1akz_, 1okbA, respectively). The observed (below; 1lauE) and predicted (above; PsiPred, ProfSec predictions for ICP4 of HSV2) secondary structure elements are coded with letter codes (compare Fig. 2).

Table 2. Summary of fold recognition analysis for VZV ICP4 (gi|9625936:744-1033; region 4).
The top scoring hits for each method are shown in bold font. Hits below the 3D-Jury cutoff of 50.0 (corresponding to less than 5% of prediction error) are shown in italics.