Smooth muscle contamination analysis in clinical oncology gene expression research

Gene expression profiling is one of the most widely used methods for studying cancers, and microarray data repositories have become a rich and important resource. The most common human cancers develop in organs that are walled by smooth muscle. The only method of sample extraction free of unintentional contamination with surrounding tissue is microdissection, yet this approach is used infrequently. Consequently, a large portion of publicly available data may be affected by smooth muscle contamination. In this study, 2292 publicly available microarrays were analysed to develop a simple screening method for detecting smooth muscle contamination. Microarray Inspector software was used to perform the tests since it has the unique ability to use many selected genes and probesets in a single group as a tissue definition. Furthermore, the test was dataset-independent. Two strategies of tissue definition were explored and compared. The first relied on the Tissue Specific Genes Database (TiSGeD) and BioGPS web resources, which are themselves based on meta-analysis of thousands of microarrays. The second was based on a differential gene expression analysis of a few hundred preselected arrays. The comparison showed the latter to be superior. Among the tested samples of undefined contamination status, nearly half were identified as possibly containing significant smooth muscle traces. The results equip researchers with a simple method of examining microarray data for smooth muscle contamination. The presented work also serves as an example of how to create definitions when searching for other possible contaminations.


INTRODUCTION
According to the World Health Organization, cancer was the cause of death of 7.6 million people in 2008, representing nearly 13% of reported deaths in the world. Unfortunately, the mechanisms of tumour pathogenesis and the processes leading to metastasis are still not well understood. In oncology research, microarray technology is one of the most commonly used techniques for transcription profiling. Although genomic technology has advanced, it is not yet a fully developed field. One clear finding so far is that differing data processing strategies lead to discrepancies in sample handling and raise concerns about the comparability of results between laboratories. Be that as it may, there has been a noticeable growth in the number of published practice guides designed to improve consistency and reliability across platforms. Nevertheless, contamination of surgical samples by other tissues or cells remains a problem in a number of sample extraction processes, and it is rarely considered in the relevant literature. Most samples used in this kind of study contain a mixture of cell or tissue types (Lähdesmäki et al., 2005; Wang et al., 2006). Although the laser capture microdissection (LCM) method makes it possible to avoid interference from unintended tissue components (Paweletz et al., 2001), LCM is still not widely used, and little data based on this technology is available for analysis. The impact of different tissue components on cancer gene expression profiling has been discussed in the literature. For example, the proportions of cancer and stroma cells in surgical samples have recently been analysed (Roepman et al., 2005; Wang et al., 2010) and shown to play an important role in the prediction of tumour invasion and metastasis. Meanwhile, non-cancerous material can significantly affect cancer expression profile results. Until now, smooth muscle tissue contamination in cancer samples has not been considered in the literature. Smooth muscles (SM) are found within the walls of cavernous organs, their mucous membranes, in the gastrointestinal, respiratory and urogenital tracts, as well as in the walls of blood vessels. The most common cancers in the human population develop in the lungs, colon, stomach, cervix, prostate, pancreas, or bladder. Surgical samples of these organs are all likely to contain smooth muscle tissue.
Recently, a new software tool for microarrays called Microarray Inspector has been developed (Stępniak et al., 2013). It enables the analysis of raw microarray data files and the detection of tissue cross-contamination. Single biomarkers of tissues providing comparable absolute expression levels are hard, if not impossible, to find. Instead, Microarray Inspector uses the whole group of selected genes and probesets to compare against the reference set in a single trial. The reference set is also a group of genes and probesets of the same array. Usually, it is the whole group of present probesets, but its scope and sensitivity can be adjusted. The Mann-Whitney-Wilcoxon U test is used to estimate the probability of a biomarker group being significantly expressed. An additional advantage of the software is that it tests a single array at a time, which makes it independent of the group of examined microarrays.
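The idea of testing a marker group against a reference set within one array can be illustrated with a minimal sketch. It is not the Microarray Inspector implementation: the function names, the one-sided alternative, the normal approximation with tie and continuity corrections, and the toy intensity values are all assumptions made for illustration.

```python
import math
from statistics import NormalDist


def rank_with_ties(values):
    """Return 1-based average ranks and the tie-correction sum of (t^3 - t)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    tie_term = 0.0
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    return ranks, tie_term


def mann_whitney_greater(marker, reference):
    """One-sided Mann-Whitney U test using the normal approximation.

    Returns (U, p). A small p suggests that marker intensities tend to be
    higher than reference intensities on the same array.
    """
    n1, n2 = len(marker), len(reference)
    ranks, tie_term = rank_with_ties(list(marker) + list(reference))
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2.0
    n = n1 + n2
    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all values identical: no evidence either way
    z = (u1 - mean_u - 0.5) / math.sqrt(var_u)  # continuity correction
    return u1, 1.0 - NormalDist().cdf(z)


# Hypothetical log2 intensities from a single array.
marker_probesets = [10.0, 11.0, 12.0, 13.0, 14.0]
reference_probesets = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
u, p = mann_whitney_greater(marker_probesets, reference_probesets)
```

Because both groups come from the same array, the test needs no other samples, which is what makes a per-array screen independent of the dataset it belongs to.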
In this study, we present the development process of a biomarker set for smooth muscle in Microarray Inspector. The proposed tissue biomarker definition set is intended for samples originating from cancers that develop in the lungs, colon, stomach, cervix, prostate, pancreas, or bladder.

Study design.
In this study, the microarray data were collected from Affymetrix HG-U133Plus2, HG-U133A, or HG-U133Av2 platforms. Only Affymetrix arrays were used because Microarray Inspector currently supports only this kind of microarray file. Unfortunately, no experiments containing known smooth muscle tissue mixtures were found on Affymetrix platforms. Five types of sample data were collected: 1) fully contaminated, 2) not contaminated, 3) negative control, 4) suspected of contamination, and 5) microdissection experiments. Expression analyses of smooth muscle cells were included in the first group, as these are the main cell type present in SM tissue (Chi et al., 2007). The advantage of using cell cultures is the homogeneity of the material as well as the absence of other cell types interfering with the measured expression. For this reason, transcription profiles of both cancer and normal cell lines derived from the lungs, colon, stomach, cervix, prostate, pancreas and bladder were assigned as not contaminated experiments. Expression data from either cancerous or normal material, theoretically not contaminated with smooth muscle tissue and isolated from other body locations, such as brain, blood, liver or lung (endothelium), were chosen as a negative control. Microarray data of tissue material derived from lung, colon, stomach, cervix, prostate, pancreas or bladder cancers were suspected of being contaminated, and were therefore checked for purity. The last group includes experiments on cancer cells originating in the pancreas, lung, cervix, and colon collected by laser microdissection. These experiments were checked for contamination as well.
A total of 2292 assays from 67 different experiments obtained from the ArrayExpress repository (Parkinson et al., 2007; http://www.ebi.ac.uk/arrayexpress) were used in the study: 192 were fully contaminated samples, 631 were assays of not contaminated material, 291 were negative control assays, 1021 were suspected samples and 157 were assays from microdissection research. Table 1 shows a short description of the experiments from groups 1, 2 and 3, which were crucial in this research. Reference numbers of experiments from groups 1-5 can be found in Table 5.
TiSGeD and BioGPS. Tissue-specific smooth muscle genes were selected using TiSGeD (Tissue-Specific Genes Database by Xiao et al., 2010; http://bioinf.xmu.edu.cn:8080/databases/TiSGeD/index.html) with a specificity measure (SPM) greater than 0.9. SPM ranges from 0 to 1, with a high value corresponding to strong tissue specificity. In addition, the results were compared with information from the BioGPS portal (Chunlei et al., 2009; http://biogps.org/). Based on the collected data, several different smooth muscle definitions were created. Table 2 shows four examples. All biomarkers included in the four definitions, as well as their corresponding probesets on the HG-U133A, HG-U133Av2 and HG-U133Plus2 Affymetrix platforms, are listed in Table 3. The first definition (def. 1; refer to Table 2) consists of seven genes randomly selected from the tissue-specific genes. The second definition (def. 2) was built using six genes which do not have a cytokine annotation, and the third (def. 3) was composed of five cytokine/chemokine-encoding genes. The fourth and last smooth muscle definition (def. 4) was created by trial and error, applying the conclusions drawn from testing SM definitions 1-3 on experiment groups 1-3 (fully contaminated, not contaminated, control). Additionally, one new gene, LRRC17, was included in smooth muscle definition 4.
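The SPM threshold can be made concrete with a small sketch. It assumes the cosine-similarity form of SPM, in which a gene's expression in one tissue is divided by the Euclidean norm of its expression profile across all tissues. This formula is not given in the text above and should be checked against Xiao et al. (2010). The function names, the gene labels and the expression values are hypothetical.

```python
import math


def spm(profile, tissue):
    """Specificity measure of one gene for one tissue (assumed cosine form).

    profile maps tissue name -> non-negative expression level of the gene.
    The result lies between 0 and 1; values near 1 mean the gene is
    expressed almost exclusively in the given tissue.
    """
    norm = math.sqrt(sum(v * v for v in profile.values()))
    if norm == 0:
        return 0.0  # gene not expressed anywhere
    return profile[tissue] / norm


def select_specific_genes(profiles, tissue, threshold=0.9):
    """Return genes whose SPM for the tissue exceeds the threshold."""
    return [gene for gene, profile in profiles.items()
            if spm(profile, tissue) > threshold]


# Hypothetical expression profiles for two made-up genes.
profiles = {
    "GENE_A": {"smooth_muscle": 9.0, "liver": 1.0, "brain": 1.0},
    "GENE_B": {"smooth_muscle": 3.0, "liver": 4.0, "brain": 0.0},
}
specific = select_specific_genes(profiles, "smooth_muscle")
```

Under this assumed form, a gene passes the 0.9 cut-off only when its expression in smooth muscle dominates the combined expression in all other profiled tissues.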
Differential expression analysis using R/Bioconductor. An alternative method used to design a tissue definition is differential expression analysis. The standard approach of comparing two groups with a t-test was applied in the R environment (R Team 2012) using the Bioconductor (Gentleman et al., 2004) package genefilter (Gentleman et al., 2012). Experiments representing noncontaminated samples were: E-GEOD-13309 (2 lung cancer cell lines), E-GEOD-21654 (22 pancreas cancer cell lines), E-MTAB-37 (10 bladder cancer cell lines), E-GEOD-17482 (2 prostate cancer cell lines), E-GEOD-22183 (37 gastric cancer cell lines), E-MTAB-37 (7 cervix cancer cell lines), E-GEOD-30292 (8 colon cancer cell lines and laser dissected: tumor cells, normal colonocytes, and enterocytes of ileum and jejunum). They were compared against a fully contaminated group composed of: E-GEOD-11917 (coronary artery smooth muscle cells), E-GEOD-12261 (aortic smooth muscle cells) and E-GEOD-19672 (corporal smooth muscle cells). All the experiments are based on the HG-U133 Plus 2 Affymetrix platform, which contains more gene probesets than any of the older available platforms, that is HG-U133A, B or Av2. The experiments were also balanced so that the number of arrays in each group did not exceed the 1:2 ratio for contaminated vs. not contaminated, as suggested in the t-test procedure. The total number of arrays in the groups was 124 for contaminated (64.6% of the whole group), and 194 for not contaminated (30.7% of the whole group).
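The per-probeset group comparison can be sketched as follows. This is an illustration in Python rather than the R/Bioconductor code used in the study. The text does not state which genefilter function or variance assumption was applied, so the equal-variance (Student) form, the function names and the toy values are assumptions. The p-value uses a normal approximation, which is only reasonable for large group sizes such as the 124 and 194 arrays reported above.

```python
import math
from statistics import NormalDist, mean, variance


def student_t(group_a, group_b):
    """Two-sample t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Each group holds the log2 expression
    values of one probeset across the arrays of that group.
    """
    na, nb = len(group_a), len(group_b)
    pooled = ((na - 1) * variance(group_a)
              + (nb - 1) * variance(group_b)) / (na + nb - 2)
    se = math.sqrt(pooled * (1.0 / na + 1.0 / nb))
    if se == 0:
        raise ValueError("both groups are constant; t is undefined")
    return (mean(group_a) - mean(group_b)) / se, na + nb - 2


def approx_two_sided_p(t):
    """Two-sided p-value from the normal approximation to the t distribution."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


# Hypothetical values for one probeset in each group.
contaminated = [8.0, 9.0, 10.0]
not_contaminated = [4.0, 5.0, 6.0]
t, df = student_t(contaminated, not_contaminated)
```

Applying this to every probeset that survives filtering gives one statistic per probeset, which can then be ranked and corrected for multiple testing.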
All arrays were normalized together using the GCRMA algorithm from the package gcrma (Wu et al., 2012). Next, all probesets with generally low intensity and low variability across samples were discarded. This was achieved by filtering out probesets that did not present a log2-based expression value higher than 6.64 (intensity higher than 100) in at least 25% of all the arrays. Probesets showing an interquartile range lower than 0.5 were also discarded. Following the filtration, the t-test was applied. It yielded p-values describing how likely the corresponding probesets are to emerge as differentially expressed by chance. P-values were next adjusted using the Benjamini & Hochberg method (Benjamini & Hochberg, 1995). The top 100 probesets with the best (lowest) p-values were annotated using the Bioconductor packages annotate (Gentleman, 2012), KEGG (Carlson, 2012), GO (Carlson, 2012), annaffy (Smith, 2010) and XML (Lang, 2012). The list of results was investigated to select probesets for the tissue definition. Genes with low expression in the contaminant group or genes related to cancer were omitted. The exception was the FGF5 gene, which was included in definition 5 (def. 5) despite having been previously reported as overexpressed in human pancreatic cancer (Kornmann et al., 1997). The examined data confirmed that FGF5 expression in a healthy pancreas is very low, unlike that of a cancerous pancreas. However, the risk of falsely marking cancer as smooth muscle is minimized by the other probesets in the tissue definition. Furthermore, without the FGF5 probeset, definition 5 was more prone to false positive results in experiments from the control group. The probesets selected by this method corresponding to smooth muscle biomarkers in Microarray Inspector are presented in Table 2.
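The filtering thresholds and the p-value adjustment described above can be sketched as follows. The study itself used R/Bioconductor, so this Python version is only an illustration: the function names and toy values are hypothetical, and the inclusive quartile method is an assumption about how the interquartile range was computed.

```python
from statistics import quantiles


def passes_filter(values, min_log2=6.64, min_fraction=0.25, min_iqr=0.5):
    """Decide whether a probeset is kept for testing.

    values holds the log2 expression of one probeset across all arrays.
    The probeset is kept if it exceeds min_log2 in at least min_fraction
    of the arrays and its interquartile range is at least min_iqr.
    """
    above = sum(1 for v in values if v > min_log2)
    if above < min_fraction * len(values):
        return False
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return (q3 - q1) >= min_iqr


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Walk from the largest p-value down, enforcing monotonicity.
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for steps_from_top, idx in enumerate(order):
        rank = m - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four probesets.
adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

The cut-off of 6.64 corresponds to log2(100), so the intensity filter and the stated raw intensity threshold of 100 describe the same condition.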
Evaluation of created smooth muscle definitions. Using the TiSGeD database and BioGPS portal information, four different definitions were built: SM definitions 1 to 4, corresponding to fifteen genes. Based on differential expression analysis, eighteen probesets corresponding to eleven genes were selected and included in SM definition 5 (Table 2). All five definitions were checked for quality and smooth muscle tissue specificity by verifying the purity of fully contaminated, not contaminated, and control experiments (groups 1-3 from Study design). All results, including contamination analysis of suspected and microdissection experiments, are presented in Table 4. More detailed information is available in Table 5.

DISCUSSION
In this study, two strategies of designing smooth muscle definitions were shown. The tissue definition itself was not just a simple set of biomarkers. Instead, it was a collection of probesets that, only when taken together, enable identification of smooth muscle presence in the sample. Single probesets composing the definitions were not sufficient to perform the test. It was the expression of the tissue definition as a whole that mattered in the detection of contamination.
The first way to create a smooth muscle tissue definition was based on the TiSGeD database and the BioGPS portal. Surprisingly, none of the four definitions created that way was optimal for verifying the purity of the sample. Smooth muscle definition 4, although not perfect, seemed to be the closest to the expected results. Definitions 1 and 3 were not good enough either: they reported a high percentage of contamination in the not contaminated experiments, and also gave the lowest value of contamination in the fully contaminated group (below 90%). SM definition 2 demonstrated the highest average percentage of reported contamination in both not contaminated and control experiments. Some of the smooth muscle tissue-specific genes selected from public databases are either expressed at high levels in other tissues, or at too low levels in smooth muscle tissue. The authors of the TiSGeD database noted that, at times, the assignment of gene expression tissue specificity was inconsistent. According to them, one of the reasons for this was the difference in the tissue scale between experiments (Xiao et al., 2010). This means that the material used in transcription profiling analysis was in some cases contaminated with other tissues or cells to different degrees, which affected the gene expression results. This might explain why many genes selected with the TiSGeD-BioGPS strategy were associated with inflammatory response (CCL7, CCL8, CSF3, CXCL1, CXCL3, CXCL6, IL6, THBS2) and some were implicated in various cancers (FGF2, IL6, THBS2). We should also take into account that the results from both databases were based on integration studies of independent microarray data.
Smooth muscle definition 5, created by differential expression analysis, proved to be the optimal definition. All the results were in line with expectations, including the analysis of sample purity in group 5, where the material was obtained by laser microdissection (assuming a 5% error). The differential expression analysis method used in designing the tissue definition appeared to work better. The reason for this is that the method was based on microarray data from only one platform, HG-U133Plus2. It was previously reported that the most reliable quantitative results of integrated analysis were obtained from the same platform (Shi et al., 2006). Another advantage of this method is the utilization of 88 different cancer cell lines from seven human organs and several different smooth muscle cells. Smooth muscle cells from different anatomical locations have many common morphological and molecular features. Nevertheless, they also have individual properties and functions. For instance, colon smooth muscle cells are responsible for moving food through the digestive system, whereas vascular SM regulate the flow of blood through the blood vessels. As was reported, SMs have distinct expression patterns associated with the anatomical location (vascular system, visceral organs, or bronchi) (Chi et al., 2006). For this reason, several different smooth muscle cells derived from various body locations, such as the coronary artery, aorta, and penis, were included in the differential gene expression analysis. Genes composing SM definition 5 covered a wider variety of functions. Several of them were reported to be involved in collagen formation (BGN, COL1A1) and muscle functioning (PRRX1, PAMR1). COL1A1 was also implicated in skin cancers and ITGA4 was reported to be involved in regulation of the immune response, which was similar to the functions of most of the genes in definitions 1-4. The function of ELTD1 and C1orf54 is undetermined and, in the light of this work, could be an interesting target for future studies.
The results obtained in this research indicated that about 50% of cancer tissue samples derived from the seven listed organs were likely contaminated with smooth muscle tissue. Possible smooth muscle contamination was already detected during expression profiling of human bladder (E-GEOD-7476) and gastric (E-GEOD-22377) cancer, as mentioned by the authors (Mengual et al., 2009; Förster et al., 2011). The contamination in the first experiment was identified mostly in control samples, complicating the interpretation of the results. Interestingly, smooth muscle contamination was detected in the gene expression analysis of muscle-invasive bladder cancers (microdissected cancer cells) (E-GEOD-31684 by Riester et al., 2012). In this case, the detected smooth muscle signal could indicate contamination, or cancer progression and invasion of smooth muscle tissue. Perhaps when equipped with SM definition 5, Microarray Inspector could be useful in the prediction of muscle-invasive cancers.

CONCLUSIONS
The heterogeneity of sample composition in microarray analysis causes differences in the results obtained by different laboratories. We propose methods that prove useful for verifying the purity of test material with a high probability of smooth muscle contamination, namely material derived from the most common cancers in the human population. With the information provided in this paper, users of Microarray Inspector will be able to use our definition or to design other tissue definitions tailored to their own needs. We believe that proper verification of tissue sample contamination helps avoid drawing incorrect conclusions from the obtained results.

Table 1. Description of experiments used in the study (fully contaminated, not contaminated and control), obtained from the ArrayExpress repository at http://www.ebi.ac.uk/arrayexpress.
prostate Normoxia- and hypoxia-treated prostate tumor cell lines and primary prostate epithelial cells - global gene expression analysis