Functional analyses of a low-penetrance risk variant rs6702619/1p21.2 associating with colorectal cancer in Polish population

Several studies employed the genome-wide association (GWA) analysis of single-nucleotide polymorphisms (SNPs) to identify susceptibility regions in colorectal cancer (CRC). However, the functional studies exploring the role of associating SNPs with cancer biology are limited. Herein, using chromatin immunoprecipitation assay (ChIP), reporter assay and chromosome conformation capture sequencing (3C-Seq) augmented with publically available genomic and epigenomic databases we aimed to define the function of rs6702619/1p21.2 region associated with CRC in the Polish population. Using ChIP we confirmed that rs6702619 region is occupied by a CTCF, a master regulator of long-range genomic interactions, and is decorated with enhancerlike histone modifications. The enhancer blocking assay revealed that rs6702619 region acts as an insulator with activity dependent on the SNP genotype. Finally, a 3C-Seq survey indicated more than a hundred loci in the rs6702619 locus interactome, including GNAS gene that is frequently amplified in CRC. Taken together, we showed that the CRC-associated rs6702619 region has in vitro and in vivo properties of an insulator that demonstrates long-range physical interactions with CRC-relevant loci.


INTRODUCTION
Cancer is a multigene disorder which develops as a consequence of a stepwise accumulation of genetic and epigenetic alterations leading to uncontrolled cell divisions. Germline mutations in DNA repair genes, proto-oncogenes or tumor suppressor genes can greatly increase the cancer risk. In case of colorectal cancer (CRC), only about 5% of cases are related to highlypenetrant mutations while twin studies estimate that 35% of all CRC cases are caused by heritable factors (Lichtenstein et al., 2000). Majority of the remaining alterations are most likely less penetrant and represent single nucleotide polymorphisms (SNPs) that have additive effects (Peters et al., 2012). The expression of the genome is a complex and multi-step process involving cis-and trans-acting regulatory elements (Pastinen et al., 2006). It is assumed that majority of genetic variants influence the abundance of transcripts level mainly by altering transcription factors' binding at the promoter, pre-mRNAs splicing or through regions containing regulatory elements that are distal to the genes. Thus, they are known as expression quantitative trait loci (eQTLs). eQTLs can be located in the proximity of a gene of interest or in a distant genomic region and affect expression by long-range interactions (Identify regulatory sequences and eQTL-causal variants, and estimate their effects on activation of transcription in a massively parallel reporter assay | Critical Assessment of Genome Interpretation, 2015). Genome-wide association studies (GWAS) showed that over 85% of genotype-phenotype associations are non-coding single nucleotide polymorphisms (SNPs) (Buroker et al., 2013). SNPs overlapping regulatory regions may play a significant role in the phenotypic variability and disease susceptibility mainly due to their effect on transcription. Functional analyses of regions indicated by those SNPs are prerequisite for understanding the molecular background of the observed association (Brown et al., 2013;Fareed & Afzal, 2013).
Recently, we performed GWAS analysis identifying SNPs associating with CRC in the Polish population (Gaj et al., 2012). One SNP pointed out in our study, namely rs6702619/1p21.2, seemed particularly interesting for further investigation due to the epigenetic features at its location including histone modification suggesting a presence of an enhancer and CTCF binding sites. CTCF plays an important role in the regulation of gene expression and higher-order organization of the genome (Kim et al., 2015). The main role of CTCF is considered to be its contribution to the chromatin architecture via processes like nucleosome positioning, organization of chromatin modifications, demarcation of the boundaries of independently regulated domains, as well as of active and repressive chromatin domains (Phillips & Corces, 2009). Over 20,000 CTCF target sites (CTSs) have been reported across the genome in different cell types. They are located in introns, exons, promoters, 3' and 5'UTRs, however, almost a half correlates with intergenic regions. About 40-60% CTSs were shown to be constitutive and invariant between cell types while remaining are considered to be involved in tissue-specific gene expression (Cuddapah et al., (Ohlsson et al., 2010).
To describe the influence of the rs6702619 SNP on the local chromatin structure and expression of nearby genes we analyzed gene expression of adjacent genes, histone modifications makeup and CTCF binding, as well as in vitro cis-regulatory activity of CTSs. Finally, we performed chromosome conformation capture (3C) sequencing to define local and long-range interactions of the regulatory element contacting the rs6702619 SNP.
USA) was used as a backbone for preparation of the gene constructs. Site-directed mutagenesis to introduce restriction endonuclease site (AgeI) was performed using the KOD Hot Start DNA Polymerase (Merck Millipore, USA) according to the manufacturer's instructions. β-Globin insulator (495 bp) and sequence chr1:100,045,980-100,046,473 (rs6702619, genotype -GG and -TT, amplified from HCT 116 and Colo205, respectively) were amplified from pBT268 expression vector (Addgene, Cambridge, USA) and the DNA isolated from cell lines, respectively, using primers containing overlapping ends. The sequences were inserted into pGL3-AgeI vector by digestion with a restriction enzyme (AgeI, NEB, USA) and ligation (T4 DNA ligase, NEB, USA). The mammalian expression vector pGL4.74 (Promega, USA) was used as an internal control for normalization.
HeLa cells were seeded into 96-well plate in DMEM with 10% FBS. The next day cells were co-transfected with the obtained constructs and pGL4.74 (ratio 100:1). After 24 h measurement of luminescence intensities was performed using the Dual-Luciferase Reporter Assay System (Promega, USA) according to the manufacturer's instructions. The differences in the expression levels were evaluated using the Student's t-test in GraphPad Prism 5 (GraphPad Software, Inc., USA). P<0.05 was considered to indicate a statistically significant difference.
Chromatin conformation capture (3C-seq) assay. 3C-seq library was obtained from HCT 116 cell line according to the protocol by Stadhouders et al. (Stadhouders et al., 2013). EcoRI and HaeIII were used as the first and second restriction enzyme, respectively. The region surrounding chosen SNP was used as a bait for inverse PCR (Table S1 at https://ojs.ptbioch.edu. pl/index.php/abp/). Identification of DNA sequences interacting with the bait was performed by sequencing on Ion Torrent PGM (Thermo, USA). Two biological replicates were analyzed.
Analysis of 3C-seq assay results. The reads generated by sequencing were aligned to hg19 genome assembly with the TMAP. Peaks were defined as the regions with coverage greater than the median value for the chromosome. The intersection of peaks from biological replicates was performed with bedTools. Only intersections with length 100 or more were subjected to further analysis. The peaks' distance to the closest gene was assessed with bedTools. The frequency of mutations and copy number alterations was assessed with cBioPortal (Gao et al., 2013), using TCGA Colorectoral Adenocarcinoma provisional dataset.

Description of rs6702619 SNP epigenetic landscape with publically available databases
The rs6702619 SNP is located within 335 kb long intergenic region at chromosome 1, between LPPR4 (lipid phosphate phosphatase-related protein type 4) and PALMD (palmdelphin) genes located 270 kb upstream and 65 kb downstream, respectively. LPPR4 belongs to the lipid phosphate phosphatase family which catalyzes the dephosphorylation of lipid mediators while PALMD is a cytosolic homolog of PALM (paralemmin) which is implicated to influence plasma membrane dynamics (Pruitt et al., 2012). rs6702619 lies within 110 kb long linkage disequilibrium block, confined by two sites with Functional analyses of rs6702619 variant recombination rate values of 30-40 cM/Mb ( Fig. 1) (Johnson et al., 2008). The sequence within which rs6702619 is located within a highly evolutionarily conserved DNA stretch. The histone modifications at this area, H3K4Me1 and H3K27Ac, identified in several cell lines within the frames of the ENCODE project are typical for the enhancer regulatory elements. Furthermore, CTCF binding was observed at this region by ChIP-Seq in normal human epidermal keratinocytes (NHEK) and human mammary epithelial cells (HMEC), as well as in several other cell lines. The observed signal is wide and covers about 350 bp (Rosenbloom et al., 2012) (Fig. 2A). Additionally, an in silico analysis with InsulatorDB tool indicated that SNP is located between two predicted CTCF binding sites (CTSs) and thus may influence CTCF binding specificity (Bao et al., 2008) (Fig. 2B). The evidence on the presence of both enhancer element and CTCF protein suggests that the latter one under a specific lineage context may act as an enhancer-blocking insulator.
Gene expression analyses of neighboring genes and survey of histone modifications and CTCF binding at rs6702619/1p21.2 in colorectal cancer (CRC) cell lines.
We performed genotyping of rs6702619 in five human CRC cell lines using TaqMan SNP Genotyping Assays. SW480, Caco2 and HT-29 cell lines are heterozygous for rs6702619 (GT), while HCT116 is a homozygote (GG), same as Colo205 (TT). The examination of expression levels of genes adjacent to rs6702619 (LPPR4, PALMD) showed no correlation with the genotype (Fig. 3).
Next, we determined the epigenetic makeup at the SNP site and control regions by measuring CTCF binding and histone modifications associated with enhancer regions (H3K4Me1, H3K27Ac), repressive (H3K27Me3) and permissive (H3K4Me3) chromatin using chromatin immunoprecipitation (ChIP) assay and three cell lines, namely HTC116, SW480, Colo205, representing all possible SNP genotypes. Control regions included a beta-globin (HBB) and glyceraldehyde 3-phosphate dehydrogenase (GAPDH) promoters, representing silenced and active promoters, respectively, as well as a representative enhancer region (EGR1 -811 bp) and CTCF binding site (EGR1 -15 kb) both located upstream of the EGR1 gene. ChIP assay confirmed CTCF binding at the control EGR1 -15 kb region and along the SNP locus probed at three sites (A, B, C) with the highest recruitment at the SNP site (B probe) in the three cell lines tested. However, the level of its recruitment did not depend on the genotype present in a given cell line (Fig. 4). At the same time, CTCF binding remained low at the promoters of HBB and GAPDH as well as at the control enhancer region EGR1 -811 bp. The H3K4me1 histone mark designates active and primed enhancers. These can be distinguished on the basis of the presence and absence of H3K27Ac mark, respectively (Creyghton et al., 2010). ChIP analyses at the SNP site revealed no marked presence of H3K27Ac in HCT116 and SW480 cell lines, while in Colo205 this mark was significantly elevated (p-value <0.05) when compared to silenced HBB promoter. The H3K27Ac levels corresponded with H3K4me1 in Colo205 where this mark was relatively, but not significantly, enriched in comparison to inactive HBB promoter and the control enhancer region EGR1 -811 bp. The H3K4me3 histone modification is mostly found to be associated with promoter regions (Heintzman et al., 2007), however, this mark may also decorate enhancer regions as a consequence of local RNAP2 presence (Pekowska et al., 2011). ChIP assay indicated no H3K4me3 presence in HCT116 and SW480 cell lines, while in Colo205 this modification at SNP site was significantly elevated (pvalue <0.05) at A site when compared to silenced HBB promoter. As expected, the highest level of H3K4me3 mark was present at active GAPDH promoter in all cell lines. In sum, ChIP measurements confirmed CTCF binding to the SNP site in tested cell lines and suggested that this cis-regulatory element could act as an enhancer-blocking insulator. This possibility was subsequently tested using a reporter system. On the left axis, pair-wise r 2 values between rs6702619 and SNPs within the region are shown. Plot generated with SNAP tool (Johnson et al., 2008). M. Statkiewicz and others Figure 3. Expression of LPPR4 and PALMD genes adjacent to the rs6702619 SNP was measured in human colon cancer cell lines with qPCR using SYBR Green chemistry. Cells were harvested, RNA extracted with Trizol, DNAse-treated and subjected to RT-qPCR measurements. RNA expression was normalized to YWHAZ, ALAS1, ACTB, TUBA1B and HPRT1 mRNA (n=3; mean ±SD).

The region harboring rs6702619 SNP acts as a genotype-dependent insulator in an enhancer-blocking assay
The sequence-specific insulator activity of the predicted CTCF binding site in the proximity of rs6702619 was measured with the enhancer-blocking assay. The DNA sequences containing β-globin insulator as well as the rs6702619 derived from HCT116 (G/G) and Colo205 (T/T) which differ in their sequence at the SNP site only, were inserted between the SV40 enhancer and promoter elements of the pGL3-control vector. As expected, the chicken β-globin insulator, used as a positive control, significantly inhibited the luciferase expression by 62% (Fig. 5). For the rs6702619 site with G/G and T/T genotype, the luciferase promoter activity was sig-nificantly decreased by 50% and 17%, respectively. Additionally, there was a significant difference in luciferase expression between the rs6702619 constructs containing G/G and T/T genotype. These data confirm that the rs6702619 site in vitro acts as an insulator and that its activity is genotype-dependent.

Portraying genomic interactions of rs6702619/1p21.2 locus with 3C-Seq
In order to define long-range chromatin interactions for rs6702619 locus, we performed 3C-Seq analyses using two biological replicates of HCT-116 cell line. Two replicates yielded 3,927,047 and 3,291,544 reads mapped to human genome version hg19, respectively. Overall, 122 interacting common loci for both Cross-linked chromatin was sheared and ChIP assay was performed using a Matrix-ChIP platform. ChIP DNA was analyzed in qPCR with primers amplifying rs6702619 site at region A, B, C (see Fig. 2 for amplicons location), beta-globin (HBB) , glyceraldehyde 3-phosphate dehydrogenase (GAPDH) promoters, as well as a control enhancer region (EGR1 -811 bp) and CTCF binding site (EGR1 -15kb) both located upstream of the EGR1 gene. Data are presented as the means ± S.D. (n=3), expressed either as the percentage (%) of input DNA for CTCF or as the ratio of modified histone to total H3 histone. Differences in binding levels between rs6702619 A, B, C regions and HBB promoter were evaluated using the Student's t-test -*P<0.05.
replicates were identified, of which 65 were located at chromosome 1 as the rs6702619 bait (Fig. 6A, Table S2 at https://ojs.ptbioch.edu.pl/index.php/abp/). The majority of 3C-seq sequences originating from the bait chromosome is expected as this technical bias is common for 3C methods (Raviram et al., 2014). 49 peaks were found within the gene bodies and 20 of them were not located on chromosome 1 (Table S2 at https://ojs.ptbioch.edu.pl/index.php/abp/). To further portray the connection of rs6702619 locus and 20 genes inter-chromosomally interacting with CRC we annotated them using TCGA Colorectal Adenocarcinoma dataset and cBioPortal (Gao et al., 2013). Interestingly, 14 of these genes exhibited a genetic lesion, either mutation or copy number alteration (CNA), in 26% of 640 samples from this dataset (Fig. 6B). Based on the data from 1000 random draws of 20 genes, the probability of obtaining this result is as small as 6.9e-17. The most frequently altered gene was GNAS (20q13.32) for which the amplification event was present in 8.44% of patients in TCGA dataset. GNAS activating mutations were identified in multiple cancers including CRC (Wilson et al., 2010). Inspection of the sequencing data at GNAS locus in UCSC browser revealed 3C-Seq peak presence in the intron sequence of the first and the third GNAS isoform as well as in the intron of GNAS isoform coding its antisense transcript (GNAS-As) ( Figure 6C). Additionally, the 3C-Seq peak overlapped with CTCF binding sites as measured for K562 and HeLa cell lines in the ENCODE project, suggesting that this long-range interaction could be mediated by CTCF. We next measured GNAS and GNAS-As expression in five CRC cell lines using qRT-PCR. We observed no association between rs6702619 genotype and GNAS mRNA abundance. However, such a connection was visible for GNAS-As mRNA (Fig. 7) where its transcript level was decreased and elevated in HCT116 and Colo205, respectively, which is in line with the activity of the G/G and T/T allele in the reporter system. Overall, the 3C-seq data and functional analyses showed that the rs6702619 region's interchromosomal gene network is enriched with loci that are relevant for CRC tumorigenesis.

DISCUSSION
To date, at least 38 GWASs studies reported 310 SNPs associating with CRC for different populations (MacArthur et al., 2017), however, the biological understanding and their functional contribution to CRC development remain unexplained for most of them. GWAS yields information about the statistical association between observed phenotype and tagging SNPs. To uncover the biological reason for observed association functional analyses of the whole LD block indicated by tagging SNP have to be performed. Most of the tagging SNPs are located in the intergenic or intronic regions which makes the task daunting and introduces the need for multistage analysis where the sequence and type of experiments depend on the structure of the analyzed region (Freedman et al., 2011). The biochemical characterization of chromatin within the frames of the EN-CODE project defined a plethora of cell-type-specific distal regulatory regions including enhancers and insulators (ENCODE Project Consortium, 2012). Further collation of GWAS SNPs to these regulatory regions has revealed that many SNPs fall within these regions having specific chromatin's biochemical makeup, including distinct histone modifications, open chromatin signatures and binding sites for transcription factors such as CTCF (Farnham, 2012). The ENCODE datasets, therefore, expanded the possibilities of rational SNP choosing for functional studies prioritizing SNP in high LD with GWAS SNPs that are located in a regulatory element. This approach combined with CRISPR/Cas technology allows studying the influence of CRC risk-associated SNPs and regulatory regions on gene expression (Yao et al., 2014).
In CRC, functional analyses of SNPs were described for several loci, however only for one of them, rs6983267 in 8q24 region, the functional variant and the mechanism of its influence was confirmed by several independent studies (Pomerantz et al., 2009;Tuupanen et al., 2009;Sotelo et al., 2010;Wright et al., 2010). The rs6983267 is located in a 1.5 Mb long gene-desert and molecular analyses showed that SNP lies within a region highly conserved in mammals, in a consensus sequence for transcription factor TCF4. ChIP, electrophoretic mobility shift (EMSA) and reporter assays confirmed the activity of computationally identified enhancer and demonstrated TCF4 differential binding depending on SNP allele (Pomerantz et al., 2009;Tuupanen et al., 2009;Sotelo et al., 2010). Spatial chromatin structure analyses, like 3C allowed to identify the interaction between the abovementioned enhancer and MYC promoter, located 335 kb apart (Wright et al., 2010). Recently, through the deletion of rs6983267 bearing enhancer with CRISPR/Cas followed by gene expression analyses a set of down-regulated transcripts, including MYC, was identified in HCT116 CRC cell line (Yao et al., 2014).
In this study, we used publically available epigenetic and genomic datasets together with ChIP, reporter assay and 3C-Seq experiments to characterize the structure and function of the genomic region tagged by rs6702619 SNP associated with adenoma and CRC in the Polish population, with odds ratio 0.71 and 0.73, respectively (Gaj et al., 2012). However, to date, the SNP rs6702619 has not been confirmed to be associated with CRC in other populations. Interestingly, the rs6702619 was recently found to associate with cardiac structure, specifically with aortic root diameter (Wild et al., 2017). Furthermore, in a study utilizing UK biobank specimens the rs6702619 significantly associated with calcific aortic Relative luciferase activity of promoter constructs in pGL3 plasmid was measured when cotransfected with pGL4.74 plasmid into HeLa cells with an empty vector, chicken β-globin insulator and the rs6702619 site with G/G or T/T genotype containing construct. All bars show the average of four independent experiments, error bars indicate standard deviation. Differences in expression levels were evaluated using the Student's t-test -*P<0.05, ***P<0.001 ****P<0.0001 valve stenosis (CAVS) and the G risk allele correlated with lower expression of PALMD (Thériault et al., 2018). Chromatin features from the ENCODE project at SNP rs6702619 suggested the presence of a CTCF-occupied insulator with a chromatin makeup of enhancer regulatory element. Analysis of LPPR4, PALMD gene expression in the vicinity of this regulatory element in several CRC cell lines showed no correlation between rs6702619 genotype and transcripts abundances in the studied cell lines. Furthermore, rs6702619 loci characterization with ChIP assay showed CTCF recruitment to this regulatory element in three CRC cell lines tested and indicated the presence of enhancer-like chromatin features in one of them at this site. The enhancer blocking assay with rs6702619 region confirmed that this genomic stretch could act as an insulator with activity dependent on the rs6702619 genotype. Finally, a 3C-Seq survey indicated more than a hundred loci in the rs6702619 locus interactome. Of them, 20 were inter-chromosomal connections to the genes. Importantly, one of the interacting genes was GNAS which is regarded as an oncogene contributing to the development of several neoplasms including pituitary, thyroid glands, pancreas and colon tumors (O'Hayre et al., 2013). Of note, the mRNA expression analyses in CRC indicated a putative connection of the rs6702619 with GNAS-As mRNA abundance, specifically for the homozygotic genotype. However, this result should be interpreted with caution since it was obtained for only two cell lines and has no statistical support. In CRC, the GNAS locus is frequently amplified which correlates with its increased mRNA expression (Ptashkin et al., 2017). Importantly, we previously observed a sig- Presence of 3C-Seq reads confirms the interaction of GNAS locus with rs6702619 region. Genome Segmentations and Integrated Regulation ENCODE annotation tracks were included to portray the biochemical chromatin makeup in the vicinity of the peak. Bright Red, predicted promoter region; Light Red, predicted promoter flanking region; Orange, predicted enhancer; Yellow, predicted weak enhancer or open chromatin; Blue, CTCF enriched element; Dark Green, predicted transcribed region; Gray, predicted repressed or low activity region nificant 2.4 fold GNAS overexpression in CRC when compared to normal mucosa in our transcriptome study (Skrzypczak et al., 2010), this result was also confirmed in a microarray data from a study by Hong et al. (Hong et al., 2010). GNAS encodes α-subunit of the Gs stimulatory protein and it was proposed that gain of function mutations in GNAS, which result in constitutive activity by reducing the rate of GTP hydrolysis, or its overexpression may influence pro-inflammatory gene expression and fuel tumor development (O'Hayre et al., 2013). Interestingly, the long-range interaction with GNAS locus could be CTCF-mediated since its ChIP-Seq binding signal, detected for several cell lines in the ENCODE project, is adjacent to 3C peaks from rs6702619 bait.
A limitation of our study is that we did not test the outcome of a given SNP variant in the same genetic background following CRISPR/Cas rs6702619 edition. Further experiments utilizing this approach are warranted as they could uncover bona fide molecular changes driven by the status of rs6702619. Nevertheless, we demonstrated that the CRC-associating locus containing rs6702619 has the properties of an insulator that exhibits multiple long-range physical interactions with CRC-relevant loci including GNAS.

Data Availability
Sequencing data are available for view and download in UCSC genome browser: http://genome.ucsc. edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_ otherUserName=prutten&hgS_otherUserSessionName=r s6702619_3C%2DSeq_2replicates Author contributions MM and NM conceived and designed the study, NM, MS, UK and MK acquired the data, MM and MK analyzed and interpreted the data, MS, MM and JO wrote the paper.