Monday, November 24, 2008 - 00:30
Ebola virus has impressed me as creepy ever since I read "The Hot Zone: A Terrifying True Story" by Richard Preston some years back. (I guess he has a new book, too, "Panic in Level 4: Cannibals, Killer Viruses, and Other Journeys to the Edge of Science," but I haven't been in an airport for the past couple of weeks, so I haven't read it yet.)
Infectious agents that cause diseases with gruesome symptoms really excite those of us with an interest in microbiology. Tara has written about this paper, too, and summarized the details.
I thought I'd show you how to have fun re-analyzing the data, demonstrate a new and unexpected feature that I happened to find in NCBI BLAST, and see if we can reproduce the phylogenetic tree from the paper by using the tree algorithms at the NCBI. Making phylogenetic trees is often kind of painful in the classroom, for various reasons, and I wanted to see if we could find a more user-friendly method.
Bone picking
First, I want to point out that the authors of this paper (1) were a bit negligent concerning their materials and methods. This is irksome, although not uncommon where the bioinformatics methods are concerned. (You know, we did computer stuff, it's all magic anyway.)
If you notice here, in this figure from the paper, there are 14 different genome sequences listed in the tree. I would expect to find the accession numbers for all 14 sequences in the paper.
Did I find the accession numbers in the paper?
No, I found six out of the 14, less than half. I only found two of the Marburg sequences in GenBank, and I'm not positive that those were the same ones in the paper. I think the reviewers were sloppy in this regard. How can an experiment be repeated if the materials aren't described, or even available? I would have thought the reviewers would at least look to see if the accession numbers were in the paper (they're not).
The paper gives the impression that complete genomes were used to create the tree in their figure. If that's true, it's hard to see where the data lives.
Still, I found the six most important genomes and a couple from Marburg virus (2), so I had some material to work with.
Learning from our mistakes
The reason I wanted these genomes was that I wanted to see if I could reproduce the published tree by using the tree analysis algorithms at the NCBI.
My first attempt failed miserably. I would enter my query in the usual place and then enter the accession numbers for the other viruses as an Entrez query.
The problem was that none of the BLAST databases I queried contained the full set of viral sequences that I wanted to see.
Actually, as it turned out, one database did contain some of the sequences, but the others were in a different database. Since NCBI BLAST only allows me to query one database at a time, I couldn't search the data set that I wanted to see.
(At least that's what I thought.)
This difficulty with comparing sequences from different databases in one BLAST experiment has long been a source of frustration for me. If I'm doing this for work, I just make my own database and use our Finch Software for running BLAST or I run BLAST on my Mac.
When I'm teaching a class, though, I don't want to have to make students learn UNIX and install BLAST on their laptops, and we haven't put BLAST on the student version of our software.
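For readers who do want to go the make-your-own-database route, here is a minimal sketch of the first step: collecting the sequences into one FASTA file. It assumes the NCBI E-utilities `efetch` service with the `nuccore` database; the function names and the output file name are made up for illustration, and this is not necessarily how I built my own databases.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities efetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accessions):
    """Build an efetch URL that returns the given nucleotide records as FASTA."""
    params = {
        "db": "nuccore",              # nucleotide database
        "id": ",".join(accessions),   # several records in one request
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


def download_fasta(accessions, path):
    """Fetch the records and save them to one FASTA file (needs network access)."""
    with urlopen(efetch_fasta_url(accessions)) as response:
        data = response.read()
    with open(path, "wb") as handle:
        handle.write(data)


if __name__ == "__main__":
    # Only prints the URL; call download_fasta(...) to actually fetch.
    # "filoviruses.fasta" would be a hypothetical output file name.
    print(efetch_fasta_url(["NC_002549", "AY354458"]))
```

The resulting FASTA file could then be formatted as a local BLAST database with whatever BLAST installation you have (for example, `makeblastdb -in filoviruses.fasta -dbtype nucl` in the current BLAST+ tools).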
New tricks with BLAST
Luckily, I noticed that BLAST has something new.
I had ignored this new checkbox because I didn't want to compare two sequences.
However, that was a mistake. That checkbox is useful!
When I clicked it, another window opened up.
Now, I could enter the accession numbers for my new sequences!
Notice, too, the format. I tried some different ways for entering the numbers.
This method: Accession1, Accession2, Accession3..... did not work.
This method: Accession1 Accession2 Accession3..... did not work.
I could only get BLAST to work if I entered the sequences like this:
Accession1
Accession2
Accession3
I don't know why that is, but it really did work. I used these sequences:
NC_002549
AY354458
NC_006432
FJ217162
NC_004161
NC_001608
DQ447653
with the newly discovered virus, FJ217161, as the query.
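If your accession numbers start out in a comma- or space-separated list, a few lines of Python can put them into the one-per-line layout that worked for me. This is only a sketch; the function name is hypothetical, and it assumes accessions never contain commas or whitespace themselves.

```python
import re


def one_accession_per_line(text):
    """Split on commas and/or whitespace and return one accession per line."""
    tokens = [t for t in re.split(r"[,\s]+", text.strip()) if t]
    return "\n".join(tokens)


if __name__ == "__main__":
    messy = "NC_002549, AY354458 NC_006432,FJ217162"
    # Paste the printed result into the BLAST subject box.
    print(one_accession_per_line(messy))
```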
Here are my results:
And when I click the "Distance tree of results" link in BLAST and use the default tree settings, I get a tree with the same shape and arrangement as the one in the paper (1).
Formatting the tree
Nothing is perfect, of course. It's impossible to make the font size from the NCBI tree large enough to read.
Luckily, you can download the tree in the Newick format and make a pretty picture by combining NJplot with a graphics program like Adobe Illustrator.
Here it is after some formatting. I highlighted the new virus.
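Before opening a downloaded Newick file in a drawing program, it can help to check that all the sequences you expected are actually in the tree. The sketch below pulls the leaf labels out of a Newick string. The function name and the toy tree are invented for illustration (this is not the tree from BLAST), and the parser assumes simple, unquoted labels.

```python
import re

DELIMITERS = {"(", ")", ",", ";"}


def newick_leaves(newick):
    """Return the leaf labels of a Newick tree in left-to-right order."""
    leaves = []
    previous = None
    # Tokens are either single delimiters or "label:branch_length" chunks.
    for token in re.findall(r"[(),;]|[^(),;]+", newick):
        # Text right after ")" belongs to an internal node, not a leaf.
        if token not in DELIMITERS and previous != ")":
            label = token.split(":")[0].strip()
            if label:
                leaves.append(label)
        previous = token
    return leaves


if __name__ == "__main__":
    # Toy tree with made-up labels and branch lengths.
    toy = "((A:0.1,B:0.2):0.05,(C:0.3,D:0.4):0.1,E:0.5);"
    print(newick_leaves(toy))
```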
Conclusions:
1. This new BLAST feature, where you can BLAST against your own set of sequences, is really helpful.
2. You can make the correct trees with the Minimum Evolution algorithm at the NCBI, but you will need to format your trees (i.e., make them pretty) somewhere else if you need pretty pictures.
References:






1. Jonathan S. Towner, Tara K. Sealy, Marina L. Khristova, César G. Albariño, Sean Conlan, Serena A. Reeder, Phenix-Lan Quan, W. Ian Lipkin, Robert Downing, Jordan W. Tappero, Samuel Okware, Julius Lutwama, Barnabas Bakamutumaho, John Kayiwa, James A. Comer, Pierre E. Rollin, Thomas G. Ksiazek, Stuart T. Nichol (2008). Newly Discovered Ebola Virus Associated with Hemorrhagic Fever Outbreak in Uganda. PLoS Pathogens, 4 (11). DOI: 10.1371/journal.ppat.1000212
2. J. S. Towner (2006). Marburgvirus Genomics and Association with a Large Hemorrhagic Fever Outbreak in Angola. Journal of Virology, 80 (13), 6497-6516. DOI: 10.1128/JVI.00069-06