Understanding the dark matter of the human genome

16 December 2022 by Larissa Lily (NCMM)

Since the human genome was sequenced, we have been able to look at its different parts. If you consider what defines your phenotype, say the colour of your eyes, that is defined by a specific protein.

We have 20 000 or so proteins encoded in our genome, Sahu tells us, but that covers only 2% of the whole human genome sequence. So what does the remaining 98% of the genome do? After all, as a species we have carried it with us over hundreds of millions of years of evolution.

This 98% of the genome was previously treated as “junk DNA” – non-coding DNA that was considered evolutionary detritus. However, over the past few decades researchers have come to understand that it serves an important regulatory function – while it doesn’t have a coding function itself, its function is in regulating gene action. Since the human DNA was sequenced, a lot of this regulatory function has also been studied. We have gained a better understanding of how these 20 000 proteins are expressed and how they are controlled.

If you think of a disease state, there are certain proteins that control the mutations. We need to understand how these proteins themselves are regulated. The key lies in the genome. In order to understand complex diseases such as cancer, where there are multiple things going on, we need to understand the mechanisms that lead to them, Sahu explains.

Sahu’s primary interest is to understand this 98%, which he calls the dark matter of the human genome. This non-coding regulatory DNA holds the keys to understanding not only normal physiology, but also the different malfunctions. One such malfunction, and a very common one, is cancer. Sahu’s research sets out to understand more about how the cancer genome is regulated, what its defining factors are, which molecules interact with each other and how all of these factors work at the level of the genome.

Understanding cancer with respect to cell lineage

Cancer is not simply the result of uncontrolled growth. Sahu explains that in a cancer state, there are multiple signals coming in, and those signals are carried by certain factors. One of them is the deregulation of gene expression – in other words, deregulation of the proteins that control certain pathways.

Can we know what other factors control these pathways in an abnormal manner in a disease state like cancer? An important aspect of this is where these factors bind in our genome. Is the pattern in a disease state like cancer similar to what happens in normal cells? If we can find those differences, that can help us to develop better biomarkers and more effective ways of screening.

This is a broad field of study that many researchers are investigating in different ways. One of the unique things Sahu is doing is trying to establish whether these factors work in a defined lineage. He explains:

By lineage I mean that there are different organs, such as the intestines, which include the colon (colorectal cancer, in a disease state), the pancreas (pancreatic cancer), and the liver (liver cancer) - if you look across development, all these organs have a link, they are in the same lineage, compared to something like brain cancer, which is of a completely different lineage. In terms of their development, there is fine-tuning between a lot of these factors which control whether something becomes a pancreatic cell type, intestinal cell type, or liver cell type.

Many of these factors have a lot in common, but the fine-tuning means that if the ratio of particular factors is high, the cell will develop into a specific cell type, say a pancreatic cell. If the ratio of those particular factors is lower, it will become something else, say a liver cell.

We have to understand cancer and the way cancer initiates with respect to the lineage. That is my primary interest, Sahu states.

Why is this important?

Now that many of the cancer genomes have been sequenced, we know the landscape of all the major somatic mutations and oncogenic drivers. However, we still don’t know how they operate in the context of a certain cell type or tissue. Cancer mutations are often cancer specific. So how a mutation acts in a given cell type, in a given context, has to be understood in terms of the cell of origin and in terms of the right lineage.

Developing more defined models

The exact cell of origin in cancer is still unknown. That is where Sahu’s research comes into play. His goal is to have a more defined model, where it is possible to identify the exact set of factors that control cell lineage.

Since we know all the major oncogenic drivers of human cancer, we can start asking questions such as: if this is oncogene A, with which particular factors does it interact to give a phenotype that makes a cancer cell?

This is what Sahu’s group is trying to develop and understand at a more molecular level and in a very defined manner. It is an in vitro approach, at first done in a petri dish, where findings at the most basic level help to develop a more thorough understanding of the mechanisms at play. Once it is understood how oncogenes interact with a certain set of factors under particular conditions, and how another set of oncogenes interacts under a different set of conditions, it is possible to see whether there is any interplay or whether they are very specific processes. That understanding makes it possible to define a formula, such as: this oncogene needs to operate under these kinds of signalling pathways.

The group’s approach is built on stem cell based transdifferentiation. This is a direct cell fate conversion method: if you know the defined factors that are specific to a given cell type, you can apply those factors to convert a cell type A into a cell type B. If during that course, say of making a liver cell, you introduce cancer specific mutations that are supposed to be specific to liver cancer, you can then see how they function - are they able to produce transformed cells, with properties similar to those of tumor cells?

This is an approach where we combine two fields: one is stem cell based transdifferentiation, and the other is functional genomics. By combining these two fields, we aim to get a better overview of how these defined factors co-operate with oncogenes, Sahu says.

Genome wide perspective

Sahu aims to look at the non-coding regulatory regions where these factors bind. These factors are highly sequence specific; they are known as transcription factors and they are a class of proteins very different from the rest of the proteins. If they bind specific sequences and the genome is unstable, as is the case in cancer, how does that instability, not only in the form of somatic mutations but also in terms of the specific binding of these particular factors, affect the downstream result? In other words, how is the code different in a normal cell versus a cancer cell?

We employ a plethora of genome wide techniques, so that we can measure the downstream effect of these factor binding sequences. This is to see what particular proteins are deregulated, what is the structure of the genome in the cancer cells as opposed to normal cells, and also, when you find these sequences, whether all of them are active.

Sahu’s interest started when he was doing his doctoral training. Back then, he was looking at one gene at a time, focusing on protein to protein interactions, to see what happens to a chosen protein in the presence of another protein. When next generation sequencing arrived, it allowed the genome to be sequenced in a high throughput and easily accessible manner. That changed the whole landscape of genomics research.

It was so accessible and so much easier to do. I could jump into a new field. Rather than looking at the role of every single cell or protein one by one, we had the opportunity to look in a genome wide manner. Say we have a protein that binds DNA, we could now look at how this protein binds across the whole genome, and then code it, Sahu explains.

The genome is vast: it has 3 billion base pairs. Rather than looking at specific sites, sequencing the whole genome also allows sequencing the parts of the genome where a protein binds in a very specific manner, and then correlating that. If a protein binds in a specific place, one can see which nearby gene it might regulate.

This is one of the ways functional genomics works - because you are trying to interpret the meaning of the whole genome using these kinds of defined assays. That is what got me interested in the field.

If you have a good antibody against your protein of interest, you can go on a fishing expedition: you can try to fish out all the sequences in the human genome where this protein binds. Then you get the sequences out, remove the proteins and sequence what remains.

This gives complete information, including where the sequence comes from, what part of the genome, and what the nearby gene is. It allows identifying which gene is regulated by this particular sequence bound by this particular factor. Then, by comparing the normal and disease states, it is possible to see if the sequences are the same, and if the factors are the same.
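The idea of linking a bound sequence to its nearby gene, and then comparing the normal and disease states, can be sketched in a few lines of Python. This is only an illustrative sketch, not the group's actual pipeline: the gene names, coordinates and binding sites below are invented, and the function names (build_index, nearest_gene, compare_binding) are hypothetical. It assumes each binding site is reduced to a single position and that the "nearby gene" is simply the one with the closest start site on the same chromosome.

```python
from bisect import bisect_left

# Hypothetical annotation: (chromosome, gene start position, gene name).
GENES = [
    ("chr1", 1_000, "GENE_A"),
    ("chr1", 50_000, "GENE_B"),
    ("chr2", 20_000, "GENE_C"),
]


def build_index(genes):
    """Group gene start sites by chromosome, sorted by position."""
    index = {}
    for chrom, start, name in sorted(genes):
        index.setdefault(chrom, []).append((start, name))
    return index


def nearest_gene(index, chrom, position):
    """Return the gene whose start site is closest to a binding site."""
    sites = index.get(chrom)
    if not sites:
        return None  # no annotated gene on this chromosome
    i = bisect_left(sites, (position,))
    # Only the neighbours just before and just after can be the closest.
    candidates = sites[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda site: abs(site[0] - position))[1]


def compare_binding(index, normal_sites, disease_sites):
    """Compare which genes sit next to binding sites in each state."""
    normal = {nearest_gene(index, c, p) for c, p in normal_sites}
    disease = {nearest_gene(index, c, p) for c, p in disease_sites}
    return {
        "shared": normal & disease,
        "normal_only": normal - disease,
        "disease_only": disease - normal,
    }


# Invented binding sites (chromosome, position) for one factor.
index = build_index(GENES)
normal_sites = [("chr1", 1_200), ("chr2", 19_000)]
disease_sites = [("chr1", 1_500), ("chr1", 48_000)]
result = compare_binding(index, normal_sites, disease_sites)
print(result)
```

Real analyses work with binding regions rather than single positions and use far richer annotation, but the logic is the same: map each bound sequence to a candidate target gene, then ask which targets differ between the two states.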

Data driven approach

These factors are often overexpressed in cancer. One of the most common such factors is Myc. It is an oncogene and binds to many genes. It is very seldom mutated. However, it can be amplified or overexpressed, and it drives a program that is very important for the survival of cancer cells.

Sahu used to work with prostate cancer. There, the master transcription factor - transcription factors being the proteins that bind specific sequences in the genome - is the androgen receptor. In breast cancer, it is the oestrogen receptor. These have completely normal functions in human physiology, as they develop the secondary sexual features, and a lot of reproduction and basic biology is a direct result of that. However, in the case of these two hormonal cancers, these proteins get deregulated.

Androgen receptor overexpression correlates with the worst prognosis in prostate cancer. People started using androgen deprivation therapy because they thought that, for a hormone dependent cancer, depriving the tissues of the hormone would stop the androgen receptor from working. However, relapses still happened, because cancer cells had figured out how to overcome this and the androgen receptor was still active.

Making a basic assumption about how the transcription factors behave proved unsuccessful in this case. It is clear that we need more complete information on how these factors work. Transcription factors are essential to normal physiology and normal functions, but they are also very difficult to target. They don’t work on their own: they act in combination with multiple different co-regulators, and those are critical. Once you know the defined co-regulators, and if they are going haywire in a disease state, then you can target them.

It gets more interesting when you start looking at things in a more genome wide manner. It not only helps you to understand, but in my opinion it is also a better way of doing things - because you don’t make a lot of a priori assumptions, Sahu says.

Genomics is data driven. Researchers generate a lot of genome wide data sets, which allow them to work with certain bigger goals in mind, and to ask more defined questions after looking at the data.

We don’t start from a preconceived notion that this is what is going to happen. We ask the questions from the data we see, Sahu clarifies.

Career highlights so far

Sahu looks back at his PhD years as one of his career highlights. He received a distinction from the Faculty of Medicine at the University of Helsinki, and published papers in high impact journals. One of them was a Faculty of 1000 paper in the field of prostate cancer signalling, and it is still heavily cited today. Sahu’s PhD gave him his first window into the field of transcription factors.

At the time, he was only working on one or two factors, when according to the most recent catalogue, there are close to 2000 of these proteins. Sahu was driven by the idea of extending his research and he began looking for a group that studies all of them, at least to some extent. He soon found his new scientific home in Professor Taipale’s lab at the Karolinska Institute and the University of Helsinki. The Taipale group takes a genome wide approach to understand all the different transcription factors that are in the human genome and to identify what their exact sequences are.

Since Sahu’s goal is to understand cancer signalling, he needs to have his feet planted in two different fields.

On one side I was trying to understand the sequence specificities and sequence determinants in a more functional manner for all the transcription factors. On the other I also started working on how to proceed in the molecular approach so we can understand the role of oncogenes, oncogenic drivers, which are very cancer specific, together with these cell type specific and lineage defining transcription factors.

Future vision

What keeps Sahu motivated every day is not only his research in the context of cancer, but also imparting the knowledge he has gained in a wide variety of functional genomic methods to the next generation of researchers. Sahu’s projects have been successful, and have resulted in publications in high impact journals.

If we can look at the gene regulatory programmes, maybe we will have a more defined approach, more defined ways to tackle these questions - both in preventing and treating cancer. But the primary goal is finding the molecular mechanisms.

Sahu started as a group leader at NCMM in September 2022. For the next 10 years his vision is to employ the molecular approach to study human cancer in a very defined manner, and to understand how the different transcription factors operate in different cancers. Using a genome wide perspective, his main approach is to identify, understand, and validate the regulatory logic employed in a diseased state, and to establish how it differs from that of a normal state.

- This is something we’d like to understand in the next 10 years. Once we have a better understanding of that, then we can start asking even more complex questions, Sahu concludes.