You might think the coolest thing about the Next Generation DNA Sequencing technologies is that we can use them to sequence long-dead mammoths, entire populations of microbes, or bits of bone from Neanderthals.
But you would be wrong.
Sure, those are all cool things to do, but Next Generation DNA sequencing (or NGS for short) can give us answers to questions that are far, far more interesting.
With NGS, we can look at entire transcriptomes (!!) together with the proteins that make them and the DNA modifications that help regulate them. If we compare a cell to music, a genome sequence would let us read the notes. Next Gen transcriptome data would give us the majesty and impact of the entire orchestra and combining multiple NGS experiments, well, that would join the measures into a symphony. We would hear the volume changes of gene expression together with movements of the protein conductors, punctuated by methyl groups acting as rests.
A transcriptome is the entire collection of RNA molecules in a cell: mRNA, ribosomal RNA, tRNAs, microRNAs, tiny RNAs, you name it. There's a lot of RNA, and it's involved in way more things than we ever would have thought.
Next Generation DNA Sequencing (or NGS) is a powerful method for asking questions about the transcriptome.
Describing it, though, is not always easy. Many of my conversations these days go like this:
Me describing Next Gen
I'm confused. Didn't you say you were sequencing RNA?
Then why are you sequencing DNA?
We sequence DNA because it's more stable than RNA.
But I thought you were interested in RNA, why are you sequencing DNA?
Because DNA sequences help us identify RNA.
I'm confused. How does that work?
The wet lab portion begins with isolating RNA.
So you do use RNA somewhere!
Right. We isolate RNA, make a DNA copy of the RNA, and then determine the sequence of the DNA and use those sequences to count molecules of RNA.
Okay, back to the lab.
Maybe I can make sense of this all by writing about it a bit, so feel free to add some questions.
The process of generating the data in a transcriptome experiment may sound a bit convoluted, but consider this: in many kinds of gene expression experiments, like microarray experiments, we don't ever know the exact sequence of the RNA that we're measuring. All that we know for certain is that something stuck to a spot on a slide and hopefully it stuck to that spot because it had the right sequence of bases.
In contrast, when we use NGS and determine the sequence of RNA, we know the identity of the sequence and we can use it to identify the RNA molecule and the gene that got transcribed.
Let's look at some NGS data
Okay, so what do Next Gen data look like? What kinds of things do you see from a Next Gen transcriptome analysis?
Here's an example. I got some publicly available Next Gen data sets from the NCBI (9 samples, about 250 Mb each) and uploaded them into GeneSifter® analysis edition (GSAE from Geospiza, Inc.).
Then, I used one of the GSAE Next Gen Analysis pipelines to align the data to a reference database. The data, in this case, consist of millions of short DNA sequences that we call "reads." The reference data consist of the RefSeq data set for mouse mRNAs. RefSeq data come from the Reference Sequence database at the NCBI. Each RefSeq comes from a single, naturally occurring molecule from one organism.
In my example, the alignment program found reads that aligned to at least 38,000 RefSeq mRNAs. (I didn't look at microRNAs or other wild things in this experiment.)
The most highly expressed gene in my sample was alpha 1 actin, a protein made in skeletal muscle. I found that 265,972 reads aligned to the mRNA for alpha 1 actin. After correcting for the length of the mRNA, that works out to 73,388 molecules of mRNA counted from the alpha 1 actin gene. Pretty impressive!
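Why correct for length at all? A long mRNA gives the sequencer more places to produce a read than a short one, so raw read counts favor long transcripts. I don't know the exact formula GSAE applies, so the snippet below is only a sketch of one common correction, reads per kilobase, and it is not meant to reproduce the 73,388 figure. The function name and the transcript lengths are made up for illustration.

```python
def reads_per_kilobase(read_count, transcript_length_bases):
    """Scale a raw read count by transcript length, expressed in kilobases.

    This is one common length correction; it is an assumption here,
    not necessarily what any particular pipeline uses.
    """
    if transcript_length_bases <= 0:
        raise ValueError("transcript length must be positive")
    return read_count / (transcript_length_bases / 1000)


# Hypothetical transcripts: same raw read count, different lengths.
short_transcript = reads_per_kilobase(read_count=1000, transcript_length_bases=500)
long_transcript = reads_per_kilobase(read_count=1000, transcript_length_bases=2000)

print(short_transcript)  # 2000.0
print(long_transcript)   # 500.0
```

With the same 1,000 reads, the short transcript scores four times higher than the long one, which is the point of the correction.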
An alignment, in this experiment, means a 25 base sequence matched the mRNA sequence exactly (I can adjust the program to allow for some mismatched bases, but I didn't do that here).
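To make "matched exactly" concrete, here is a toy sketch of exact-match alignment and read counting. It is not how GSAE or any real aligner is implemented (real tools use far more efficient indexes); it only illustrates the idea. The sequences, names, and the 5-base read length are invented so the example stays readable; the real experiment used 25-base reads.

```python
from collections import Counter


def build_index(references, k):
    """Map every k-base substring of each reference to where it occurs."""
    index = {}
    for name, sequence in references.items():
        for start in range(len(sequence) - k + 1):
            kmer = sequence[start:start + k]
            index.setdefault(kmer, []).append((name, start))
    return index


def count_exact_matches(reads, index):
    """Count reads per reference; a read counts only if it matches exactly."""
    counts = Counter()
    for read in reads:
        # A read is counted once per reference, even if it matches twice.
        matched_names = {name for name, _ in index.get(read, [])}
        for name in matched_names:
            counts[name] += 1
    return counts


# Hypothetical reference mRNAs and reads (toy data, read length 5).
references = {
    "mRNA_A": "ACGTACGGTCAATGC",
    "mRNA_B": "TTGACCGGATAGCTA",
}
reads = ["ACGGT", "CAATG", "GGATA", "ACGGA"]

index = build_index(references, k=5)
counts = count_exact_matches(reads, index)
print(counts["mRNA_A"])  # 2
print(counts["mRNA_B"])  # 1
```

The last read, ACGGA, differs from ACGGT by one base, so under the exact-match rule it aligns to nothing and is not counted.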
I also found some interesting patterns of alignments. In the image below, you can see the alignment pattern from almost 80,000 reads that aligned to tropomyosin 1, alpha.
The curious thing is the shape of the graph on the right hand side. Where are the reads for this half of the mRNA?
If I look at a more detailed map, I can see that some reads do map to this region, but the numbers are quite low. I see regions with 1-2 reads aligning instead of denser regions where as many as 5500 reads align.
Notice in this picture on the left, the numbers of aligning reads drop off around base 856. At base 855, over 100 reads map to this sequence. At base 856, only 5 reads match, and farther on down the sequence only 4-6 reads align. The low number of aligning reads (from zero to 3) continues through the rest of the sequence.
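A drop-off like this can also be found without squinting at a picture. The sketch below computes how many reads cover each base and then reports where the count falls most sharply between neighboring positions. The function names and all of the numbers are hypothetical, positions are 0-based, and this is only an illustration, not the method GSAE uses to draw its maps.

```python
def coverage_from_alignments(starts, read_length, reference_length):
    """Count how many reads cover each base, given 0-based start positions."""
    coverage = [0] * reference_length
    for start in starts:
        end = min(start + read_length, reference_length)
        for position in range(start, end):
            coverage[position] += 1
    return coverage


def find_sharpest_drop(counts):
    """Return (position, count_before, count_at) for the largest fall
    in counts between one position and the next."""
    if len(counts) < 2:
        raise ValueError("need at least two positions to compare")
    best_drop = None
    best_result = None
    for position in range(1, len(counts)):
        drop = counts[position - 1] - counts[position]
        if best_drop is None or drop > best_drop:
            best_drop = drop
            best_result = (position, counts[position - 1], counts[position])
    return best_result


# Toy alignment: three 3-base reads on a 6-base reference.
print(coverage_from_alignments([0, 0, 2], read_length=3, reference_length=6))
# [2, 2, 3, 1, 1, 0]

# Hypothetical per-position counts with an abrupt fall, as in the map above.
counts = [120, 115, 110, 104, 5, 4, 6, 3, 0, 2]
print(find_sharpest_drop(counts))  # (4, 104, 5)
```

A sudden fall like the one at position 4 in the toy data is the kind of signal that sends you looking for an explanation in the biology.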
So, what does this mean? Why don't our reads map throughout the entire sequence of our RefSeq mRNA?
One explanation is that the tropomyosin 1 gene might be alternatively spliced. If there are multiple splice forms, my tissue sample might contain a splice form that differs from the RefSeq mRNA.
Let's find out.
First, I looked up the mouse gene reference for Tropomyosin 1.
But I found only one mRNA listed for the mouse Tpm1 gene (below).
We're not done yet, though. There's a human version of this gene and it might have more information.
If we search the NCBI Gene database with Tpm1 and pick the entry for Homo sapiens (human), we can see if the human gene has one mRNA or multiple isoforms. (An isoform of an mRNA is an alternatively spliced version.)
You can see the human version of Tpm1 alpha has seven different isoforms! This makes me think that the mouse gene probably has multiple isoforms as well and my data set contains one, or more, of them.
Just because the other isoforms aren't in the mouse RefSeq database yet doesn't mean they don't exist.
In the picture below, I matched up the gene map for the mouse RefSeq mRNA with the maps for the human isoforms. The green bars represent the genes. The mRNAs are put together from the blue and red boxes (exons). The blue parts are the untranslated regions of exons and the red boxes are the parts that get translated to make protein. The lines are the parts that get spliced out of the pre-mRNA.
You can see that the human isoforms would make mRNAs of differing lengths. It would be really helpful to know if different isoforms are made in different tissues, but that information is more difficult to find.
We'll look at some more Next Gen data in future installments and some microarray data, too. I've spent many years listening to solos. It's fun listening to the orchestra.