Analyzing Gene Expression Data

Click to learn more about author Jeany Prinz.

Express Yourself!

All individuals are unique, and so are our data needs. From simple CSV files to REST APIs to Google’s BigQuery or using customized shared components, Data Science Analytics Platforms offer many ways to access and analyze your data. Today, we will demonstrate how to access all of these aforementioned data sources through the use case of analyzing and annotating gene expression data.

Gene Expression Analysis

Gene expression analysis is widely used in bioinformatics because it enables researchers to find gene products with increased or decreased synthesis in individuals with particular diseases. Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product.

As we learned in the recent blog titled Motifs and Mutations — The Logic of Sequence Logos, DNA mutations yield different effects of varying impact. Some have no noticeable effect at all, and some can lead to severe diseases. As we also saw in that blog post, mutations that change gene expression are often associated with harmful effects. Hence, analyzing gene expression data directly is a straightforward way to find connections between genes and diseases.

Transcription

The first step in gene expression is called transcription, during which DNA is transcribed to RNA. Advances in massively parallel sequencing enable the rapid sequencing of this RNA (RNA-Seq) in a genome-wide manner in order to quantify the amount of synthesized gene product.

In our use case today, we:

Analyze RNA-Seq data from tumors and matched normal tissue from three patients with oral squamous cell carcinomas [1]
Investigate all statistically significant over/under-expressed genes and select interesting ones by looking into their functional annotations
Using hierarchical clustering, we select a cluster of similarly expressed genes and investigate their pathway enrichment
Search for compounds that target the gene products we picked

As illustrated in Figure 1, the overall workflow, Gene Expression Analysis, which can be downloaded from the Hub here, consists of the following steps:

Input data
Find differentially expressed genes
View results of differential gene expression analysis
Clustering
Pathway enrichment
Display compounds targeting gene product of interest

**Figure 1:** Overview of workflow — differentially expressed genes are discovered using R and then displayed in an interactive view. Subsequently, genes are hierarchically clustered based on their expression pattern, and the results are shown via a dendrogram alongside a heatmap. We then perform a pathway enrichment analysis and look for compounds targeting the gene product of interest.

The user can select the files containing RNA-Seq data for samples with and without a disease of interest (positive and control, respectively). This data then gets used in the R Snippet to find differentially expressed genes. The user can investigate those genes and select genes of interest based on statistics from the gene expression analysis. We then cluster the genes based on similar expression profiles and investigate their biological pathways. In the last step, we search for compounds targeting the selected gene products.

Input Data

As mentioned in the previous section, today’s example uses RNA-Seq data from normal and tumor cells from patients with oral squamous cell carcinomas. [2] The standard procedure to generate that data consists of the following steps: The RNA of the cells is reversely transcribed to cDNA and then sequenced using massively parallel sequencing, resulting in short sequenced reads. Subsequently, these reads are mapped back to the reference genome to identify the genes from which they originated. This results in a count for each position in a gene representing the amount of gene product. In our data set, read counts for 10,542 genes were collected.

Find Differentially Expressed Genes

One of the strengths of this type of Analytics Platform lies in its openness for other tools. This allows you to easily harness the power of those tools, such as R, with all of its libraries. In our case today, we want to utilize a commonly used R library for differential expression analysis of RNA-seq expression profiles: edgeR. [3] EdgeR implements a range of statistical methods, including likelihood tests based on generalized linear models (GLM).

GLMs are most commonly used to model binary or count data, which makes them perfectly suited to model the aforementioned read counts. GLMs model a response by a linear function of explanatory variables and allow for constraints such as a restriction on the range of the response Y or the variance of Y, depending on the mean. Hence, a generalized linear model is made up of three components: (1) a linear predictor, (2) a link function that describes how the mean depends on the linear predictor, and (3) a variance function that describes how the variance depends on the mean [var(Yi) =φV(μ), with φ being the dispersion parameter].

Therefore, in our R snippet, we use the read count data from tumor vs. normal tissue and estimate the dispersion parameter. In the next step, we fit the generalized linear model and apply a likelihood-ratio test. This results in a log fold-change (logFC) and a p-value for each gene. The logFC describes how much a quantity changes between an original and a subsequent measurement. In our case, that means how much the read counts per gene differ between tumor and normal cells. As we do this simultaneously for all 10,542 genes, we have to make sure to apply a multiple testing correction. [4] We use the default method provided by edgeR, Benjamini-Hochberg, which has the false discovery rate (FDR) as output.

As we are interested in the relative changes in expression levels between conditions, we do not have to account for factors such as varying gene length. However, we have to account for differing sequencing depth and RNA composition. Sequencing depth is adjusted in edgeR as part of the basic modeling procedure. To adjust for RNA composition effects, where highly expressed genes can cause the remaining genes to be under-sampled in that sample, we use the function calcNormFactors. In addition to the fold-change and the FDR for each gene, we extract the depth-normalized read counts (counts-per-million) for each gene in our analysis.

View of Differentially Expressed Genes

We now have statistics for 10,542 genes, so in the following steps, we want to narrow down the results to genes of interest. For that, we create an interactive composite view in which the user can select genes for further analysis. As can be seen in Figure 2, we display the fold-change vs. the FDR on a logarithmic scale in a scatter plot. The colors indicate if the FDR exceeds 0.01 or not. In addition, we show an interactive table with details of the results (FDR, -log10(FDR) , p-value, gene name), and a range slider that allows the user to interactively filter for the logFC. We can now, for example, extract only genes that are significantly upregulated in tumor cells by box selecting the genes in the upper right corner of the scatter plot.

Clustering

Having narrowed down our search space, let’s have a closer look at the biological processes associated with these genes. Similar expression patterns of genes often point to a common function. [5] To find those genes with similar expression patterns, we perform a hierarchical clustering on the normalized read counts and display the results in a hierarchical cluster tree. For this, we can use a shared component that can be found on the Hub, the component Hierarchical Clustering and Heatmap. This component allows you to perform a hierarchical clustering on numerical columns of your choice and to display a heatmap sorted according to the clustering results.

We combine this component with a table containing more detailed information about the genes (see Figure 3), allowing us to interactively identify and pick a cluster of interest. In our case, we select the one showing high (orange) values in tumor cells and low (blue) values in the matched normal tissue. As we can learn from the details in the table of our composite view, this cluster includes MMP11 (matrix metallopeptidase 11), which may play an important role in the progression of epithelial malignancies, and COL4A6 (collagen type IV alpha 6 chain), which is the major structural component of glomerular basement membranes.

To further investigate the shared function of the selected genes, we perform a pathway enrichment analysis in the next step.

***Figure 3:*** View of heatmap with normalized read counts and dendrogram showing the hierarchical clustering of the counts —the heatmap is sorted according to the clustering. This combination of the heatmap with the dendrogram can be easily achieved using the shared component “Hierarchical Clustering and Heatmap.” Additionally, a table with more detailed information is shown.

Pathway Enrichment

A pathway consists of a set of genes related to a specific biological function. As genes are often annotated to a lot of pathways, a pathway enrichment analysis allows us to find those pathways that are enriched in the input set of genes more than would be expected by chance. Pathway enrichment analysis is, therefore, a widely used tool for gaining insight into the underlying biology of differentially expressed genes, as it reduces complexity and has increased explanatory power. Again, we can use the Hub to easily drag and drop a shared component called Pathway Enrichment Analysis.

This component makes use of the Reactome Pathway Database, a resource that is open source, curated, and peer-reviewed. It provides a pathway enrichment web service that can be easily accessed by the Analytics Platform. The component takes as input a set of Ensembl gene IDs and automatically performs the pathway enrichment analysis. The results can be seen in Figure 4. The pathways with the most significant enrichment are “Collagen chain trimerization” and “Degradation of the extracellular matrix.” Both MMP11 and COL4A6 are part of these two pathways. Indeed, collagen is an essential part of the extracellular matrix, and extracellular matrix interactions are known to be involved in the process of tumor invasion and metastasis in oral squamous carcinoma. [6] This further corroborates our hypothesis that these genes play an important role in oral squamous cell carcinomas from our patients’ tumor cells.

***Figure 4:*** *Pathway enrichment view — the pathways with the highest enrichment are “Collagen chain trimerization” and “Degradation of the extracellular matrix.”*

View Compounds Targeting Gene Product of Interest

In this final step, we want to check if we can possibly interfere with the disease for which we did our expression data analysis. For that, we look for compounds that target the selected gene products. As we have seen, Google BigQuery offers effortless access to public life sciences data. For this, you first need to set up a BigQuery account. In particular, we can easily query bioactivity data from the database ChEMBL using the KNIME Google BigQuery Connector in combination with the Database nodes. For our query, we gather all human synonyms for the gene products of choice and extract compounds known to target those. This allows us to retrieve information for all those compounds, including the name, the assay ID, the type of measurement (e.g., IC50 or Ki), and the structure as SMILES string.

From the SMILES string, we create images of the molecules and display them in a tile view. As can be seen in Figure 5, we found only results for MMP11, Matrix metalloproteinase-11, also known as Stromelysin-3. MMP11 is known to be involved in extracellular matrix breakdown in normal physiological processes and has been implicated in promoting cancer development by inhibiting apoptosis as well as enhancing migration and invasion of cancer cells. [7] Additionally, it has been revealed that MMP-11 expression in oral squamous cell carcinoma samples can predict the progression and survival of oral squamous cell carcinoma patients. [8] Moreover, MMP11 has recently been identified as a potential therapeutic target in lung adenocarcinoma. [9]

**Figure 5:** *Tile view with compounds targeting the gene product of interest — the ChEMBL ID for the ligand, the name of the target, the assay ID, the action, the type of measurement, and its value are shown.*

Summary

Today we learned how to perform a classic task in bioinformatics: differential gene expression analysis for a disease of interest. We created an interactive view that allowed the user to select significantly under/overexpressed genes. From there, we further narrowed that set of genes down to genes with similar expression patterns and common functions. In the last step, we searched for compounds targeting the discovered gene products, thereby offering the possibility to interfere with the disease under investigation.

We applied our workflow to data from normal and tumor cells from patients with oral squamous cell carcinoma. Through our analysis, we were able to identify a gene that has been independently implicated as a therapeutic target for carcinomas. Moreover, it has been shown that the expression of that gene can predict disease progression as well as the survival of oral squamous cell carcinoma patients. All this was facilitated by our product’s openness for other tools, which enabled us to use our favorite R library, extract data from Google’s BigQuery, and use shared components to customize our analysis.

All steps of the analysis can also be performed on the WebPortal through the interactive views of the components.

References

[1] Tuch, B., Laborde, R., Xu, X., Gu, J., Chung, C., & Monighetti, C. et al. (2010). Tumor Transcriptome Sequencing Reveals Allelic Expression Imbalances Associated with Copy Number Alterations. Plos ONE, 5(2), e9317. doi: 10.1371/journal.pone.0009317

[2] Tuch, B., Laborde, R., Xu, X., Gu, J., Chung, C., & Monighetti, C. et al. (2010). Tumor Transcriptome Sequencing Reveals Allelic Expression Imbalances Associated with Copy Number Alterations. Plos ONE, 5(2), e9317. doi: 10.1371/journal.pone.0009317

[3] Robinson, M., McCarthy, D., & Smyth, G. (2009). edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics, 26(1), 139-140. doi: 10.1093/bioinformatics/btp616

[4] Noble, W. (2009). How does multiple testing correction work?. Nature Biotechnology, 27(12), 1135-1137. doi: 10.1038/nbt1209-1135

[5] Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., & Levine, A. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings Of The National Academy Of Sciences, 96(12), 6745-6750. doi: 10.1073/pnas.96.12.6745

[6] Lyons, A., & Jones, J. (2007). Cell adhesion molecules, the extracellular matrix and oral squamous carcinoma. International Journal Of Oral And Maxillofacial Surgery, 36(8), 671-679. doi: 10.1016/j.ijom.2007.04.002

[7] ZHANG, X., HUANG, S., GUO, J., ZHOU, L., YOU, L., ZHANG, T., & ZHAO, Y. (2016). Insights into the distinct roles of MMP-11 in tumor biology and future therapeutics (Review). International Journal Of Oncology, 48(5), 1783-1793. doi: 10.3892/ijo.2016.3400

[8] Hsin, C., Chou, Y., Yang, S., Su, S., Chuang, Y., Lin, S., & Lin, C. (2017). MMP-11 promoted the oral cancer migration and FAK/Src activation. Oncotarget, 8(20). doi: 10.18632/oncotarget.15824

LISTEN NOW: MY CAREER IN DATA PODCAST

Data Topics