Bioinformatics is full of unexpected adventures. Some are related to data discoveries, but many more are related to navigating the maze of software needed to implement a solution.
In previous blogs, I’ve shared our work in developing an immuno-bioinformatics class to be initially taught at Shoreline Community College. In part of that class we want to give students an authentic hands-on experience with the underlying tools that are used in immunoprofiling. Specifically, we want them to log in to a Unix-based workstation, run basic commands (programs), work with a sequence file from an immunoprofiling experiment, and perform an analysis to identify the parts of the immune receptor sequences contained therein.
The basic requirements:
- Have access to a Unix-based (Linux) computer
- Have an appropriate sequence file
- Have access to the appropriate analysis tool(s)
- Complete exercises in a 2-3 hour lab
- Be free. Budgets are tight.
Through this process, the students will experience working as a bioinformatician. They will learn how to use a Unix (Linux) command line interface (CLI), learn about the differences between the common fastq and fasta DNA sequence formats, run a bioinformatics program, and work with its output.
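To make the fastq/fasta difference concrete, here is a toy example (the read ID, bases, and quality string are invented for illustration):

```shell
# The same read in fastq (4 lines: ID, bases, '+', qualities)
# and in fasta (2 lines: ID, bases).
cat <<'EOF' > read.fq
@read1
ACGTACGTACGT
+
IIIIIIIIIIII
EOF
cat <<'EOF' > read.fa
>read1
ACGTACGTACGT
EOF
wc -l read.fq read.fa   # 4 lines per fastq record, 2 per fasta record
```

The extra two fastq lines carry per-base quality scores, which is why quality trimming needs fastq input while many downstream tools are content with fasta.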
For the class exercise, we’ll use the CyVerse (CyVerse.org) cyberinfrastructure. CyVerse, via its Atmosphere cloud computing platform, allows individuals to create and share virtual machines (VMs). VMs are computers within a computer and form the core of cloud computing. They provide a way to package and store a computer operating system, programs, and data into a file structure called an image. When an image is instantiated (set to running), it behaves like a computer.
As images can be copied, a single VM can be replicated so that in a class each student can have their own computer, each one with a common environment. In this way all the parts of the exercise can be set up ahead of time, and, if a student "breaks" their environment, they simply delete the working copy and start over. This approach also has the additional benefit that students gain cloud computing experience.
Now, all that is needed is some data and appropriate analysis programs. As all forms of cloud computing have costs, it is best to develop prototypes on a local computer, in this case the Unix environment on my Mac. Once the parts are understood and operational, the system can be replicated in a CyVerse VM.
The adventure begins ...
Data and programs
Adventure one: getting data
Finding Data. The goal of this part of the class is to provide an authentic hands-on experience with immunoprofiling, so we need a dataset and a way to analyze those data. This leads to our first adventure, finding a publicly available dataset. NCBI (National Center for Biotechnology Information) is a good place to start. Searching all of NCBI with “immune repertoire” yields hits (01/11/19) in many databases including 145 BioProjects and 969 SRA (short read archive) datasets. After some experience one learns that SRA data are part of BioProjects, so BioProjects is the place to start.
Clicking on the 145 in BioProjects returns a list in the BioProjects page. This page allows one to filter a search with different parameters that include data types, organisms, and others to narrow the results. To make a long story short, filtering on SRA, Transcriptome, and Human, I winnowed the list of 145 candidates to nine. I selected “Dynamics of the human antibody repertoire after influenza vaccination” because influenza vaccine research is a good immunology topic and the experimental protocol produced antibody sequences that contain complete binding domains (FRs and CDRs) with both heavy and light chains represented in the data. Eventually I settled on run SRR4431764 for no particular reason other than it was at the bottom of a list.
Getting the data is hard. With NCBI you can search and browse, and go in circles with the links, but you cannot get data from the web UI; it’s too big. Instead, data can be obtained from NCBI’s FTP site which requires a lot of effort to navigate. And, the data are stored in NCBI’s SRA format, which requires the SRA toolkit to work with the data, which requires more work. Further, the data for samples may be in many files corresponding to the details of how those data were collected. Did I say too much work?
EBI offers an easier way. Like NCBI, the European Bioinformatics Institute (EBI) is an international data repository. EBI and NCBI share data. An advantage with EBI is that researchers can get publicly available data via simple download interfaces. In my case I entered the SRR ID (SRR4431764) that I identified at NCBI in EBI's front page search field. When the search results were returned, I clicked the matching link to navigate to a page with the needed datasets. On this page data are conveniently packaged by sample and run instance in several ways including compressed (gzip) fastq files. Data can be downloaded to local computers by clicking the appropriate links, or URLs can be copied to transfer data into other systems (see VDJServer below). Seems EBI did not get the memo that the data are too big.
As a side note, in my opinion, while EBI serves data better than NCBI, NCBI provides better ways to find data. For example, at EBI, I could not easily get to the desired dataset using “immune repertoire” as a search term. It is great we have both resources and a strong international collaboration between the organizations to share data and keep it in sync.
Adventure two: preparing data for class use
Class time is limited, and, as noted above, cloud computing has a cost. In CyVerse, cost is counted as allocations, which are free but limited. In other platforms cost is money. Thus, we want our analyses to run super fast. As immunoprofiling data files commonly contain millions of reads (sequences), simple manipulations of the file can take minutes and an analysis of the data file can take hours. Since the goal is to learn how to run an analysis and look at some results, we can provide an authentic experience with a relatively small number (order 1000) of sequences. In this way manipulations take seconds (or less) and analyses take minutes.
Merging Reads. The data that we will use (from above) are from an Illumina instrument and are contained in two files; one representing forward reads and the other representing reverse reads. As the experimental protocol amplified cDNA from an Ig (antibody) RNA-Seq experiment using constant region and 5' adaptor primers, the forward and reverse reads must first be merged (assembled) into complete sequences. Because the order of the data in each file is identical, that is, the first read in each file comes from the same DNA molecule in the sample, merging the data is relatively simple and many programs are available to do this. Rather than download and install such programs, I used a program within the VDJServer at the University of Texas, Austin.
Briefly, the VDJServer is a web-based system for immunoprofiling. It runs on some of the same supercomputers that CyVerse uses and can perform very powerful analyses. I used the VDJServer to improve my understanding of the immunoprofiling steps and to develop a plan for the class exercise.
Cleaning Data. Getting a meaningful subset of the data requires that the merged data file be further processed. Specifically, low quality bases and very short reads need to be removed. While these reads are part of the data, they will distract from the mission of the exercise, which is to use bioinformatics to explore the biology of V(D)J rearrangements.
I used SeqPurge to clean the dataset. Deciding on a quality filter program and obtaining it is its own sub-adventure. There are many choices, with Trimmomatic being one of the most popular. Trimmomatic is written in Java, and thus requires that a Java runtime environment (JRE) be available on the computer running it. With Java there is no such thing as a prebuilt binary. Turns out Java is not on my computer, and I did not want to install a JRE. SeqPurge came up as an alternative. A paper on SeqPurge appeared in 2016, and according to the authors, SeqPurge compared favorably to six other sequence trimmers including Trimmomatic. So, why not try it?
Getting SeqPurge turned out to be an adventure in its own right. It is included in ngs-bits, a package that contains many bioinformatics programs, which can be useful. The ngs-bits package has prebuilt binaries for Mac, Linux and Windows, which is also good. It should be easy to install the programs, right? Sort of. First you need to install the bioinformatics package manager Bioconda. Bioconda is installed from Miniconda, which has installers for Windows, Mac OS X, and Linux. Just download and run the installer and Bioconda will be installed. Once installed, the Bioconda channels need to be added via the commands:
> conda config --add channels defaults
> conda config --add channels conda-forge
> conda config --add channels bioconda
> conda install ngs-bits
After a while, because many programs (232) are being installed, SeqPurge was added to my home directory in the miniconda2/bin directory. To run SeqPurge from a command line as “> SeqPurge” requires that the directory path “~/miniconda2/bin” be added to my .bash_profile, or that I move the program to a directory that is already in my PATH environment variable.
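As a sketch, the PATH fix looks like this (the install location is an assumption based on a default Miniconda2 setup; check where your installer actually put it):

```shell
# Put miniconda's bin directory on the command search path for this session.
export PATH="$HOME/miniconda2/bin:$PATH"
# To make the change permanent, append the same export line to ~/.bash_profile,
# which the shell reads at login.
```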
With SeqPurge I can now accomplish the task I set out to, which was to remove the short and low quality sequences (before and after tables to the right). As a note on SeqPurge, it has two quirks. First, it assumes all data are paired. That is, two input and two output files (data to be trimmed [in] and trimmed data [out]) are required, one each for the forward and reverse reads, respectively. As my data were already merged, I included the same filename for both input files without any issues. Second, the files must be gzipped. Using SeqPurge, and many of the other ngs-bits programs, means you will likely run gzip and gunzip a bunch, oh my. That’s bioinformatics: many programs are developed for the authors’ use case. In this case the authors like to keep their data compressed. While this is not unreasonable, other programs (especially Unix programs: less, more, head, tail, wc, and others), not in ngs-bits, will not read gzip compressed data.
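The gzip round-trip can be sketched like this (the file name and toy record are invented):

```shell
# A toy fastq record, compressed the way ngs-bits expects its inputs.
printf '@read1\nACGT\n+\nIIII\n' > reads.fq
gzip -f reads.fq                    # creates reads.fq.gz and removes reads.fq
# Plain head/wc cannot read the .gz directly; stream through gunzip -c instead.
gunzip -c reads.fq.gz | head -n 2   # peek at the first record's ID and bases
gunzip -c reads.fq.gz | wc -l       # count lines without unpacking on disk
```

The `gunzip -c` (or equivalently `zcat`) trick avoids repeatedly uncompressing and recompressing the file on disk.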
At this point I have my data file, it’s been cleaned up to a degree and is a gzipped fastq file. As I will ultimately want a fasta formatted file (next adventure), I need a fastq to fasta converter. Turns out there is one in the collection of 232 programs that I downloaded with SeqPurge, but I did not start there. Earlier, I had googled (something you do a lot in bioinformatics) “convert fastq to fasta” and the first hit, from ECSEQ Bioinformatics, was a way to do this with the Unix command:
> paste - - - - < file.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > file.fa
Lured by the Siren’s call I had to play with each command ... then get back to work.
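For the record, here is what each stage of that one-liner does, run on a toy two-read file (read names and bases invented):

```shell
printf '@r1\nACGT\n+\nIIII\n@r2\nTTTT\n+\nJJJJ\n' > file.fq
paste - - - - < file.fq   # fold each 4-line fastq record onto one tab-separated line
paste - - - - < file.fq | cut -f 1,2                  # keep only the ID and sequence fields
paste - - - - < file.fq | cut -f 1,2 | sed 's/^@/>/'  # turn fastq's leading @ into fasta's >
paste - - - - < file.fq | cut -f 1,2 | sed 's/^@/>/' | tr "\t" "\n" > file.fa  # unfold to 2 lines per record
cat file.fa   # >r1 / ACGT / >r2 / TTTT
```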
Getting a subset. For the very last step, I wanted to get a subset of the data. I could use the Unix commands head or tail to get a portion of the file from the top or bottom, but I also wanted to get random sequences. Through some more googling, I learned that there’s a Linux program called shuf. I also learned that I can install it on OS X using homebrew (yes, another package installer) via "> brew install coreutils." Coreutils is a collection of Gnu software, and since it is Gnu and Apple has issues with Gnu (about the licensing model), it will be named gshuf on my computer to remind me that Apple has issues with Gnu. Nothing Gnu here.
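A minimal sketch of the shuf approach, assuming a two-line-per-record fasta file (the file names, record count, and subset size here are illustrative, not the class values):

```shell
# shuf is installed as gshuf by homebrew's coreutils on a Mac; use whichever exists.
SHUF=$(command -v gshuf || command -v shuf)
printf '>r1\nAAAA\n>r2\nCCCC\n>r3\nGGGG\n>r4\nTTTT\n' > FILE.fa
# Fold each record onto one line, shuffle, keep 2, unfold back to fasta.
paste - - < FILE.fa | "$SHUF" -n 2 | tr "\t" "\n" > subset.fa
grep -c '^>' subset.fa   # 2 records survive, picked at random
```

Folding with paste first is essential; shuffling the raw fasta lines would separate headers from their sequences.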
As an aside, and another rabbit hole, while I was at it I also learned another way:
> jot -r "$(wc -l < FILE)" 1 | paste - FILE | sort -n | cut -f 2- | head -n 10
But a variation is needed to work with our data, which makes the gshuf adventure somewhat futile. The variations:
> paste - - < FASTAFILE > tmp.fasta; jot -r "$(wc -l < tmp.fasta)" 1 | paste - tmp.fasta | sort -n | cut -f 2- | head -n 1000 | tr "\t" "\n" > RANDOM_SUBSET_FILE; rm tmp.fasta
> paste - - < FASTAFILE > tmp.fasta; sort -R tmp.fasta | head -n 1000 | tr "\t" "\n" > RANDOM_SUBSET_FILE; rm tmp.fasta (also works)
Unix, and the Internet, are so cool. As an exercise for the student, describe what each command is doing.
Adventure three: IgBLAST
To remind ourselves, the goals of the command line bioinformatics exercise are to: 1) learn a few basic Unix commands and 2) run a bioinformatics program on some data. So, we will need a bioinformatics program. This is an entry-level class, so the program should be installed on the computer ahead of time along with any dependencies needed to operate it. An advanced class would include software installation.
For this exercise we’ll use IgBLAST because it is used in other systems such as the VDJServer. It was also developed from BLAST by the team at NCBI, has been published in a peer-reviewed journal, and has good multi-platform support - Mac OS X, Linux, Windows - with prebuilt binaries. Additionally, BLAST has been a standard bioinformatics tool since 1990.
Getting and installing IgBLAST is straightforward ... not.
Starting with a google search “igblast,” we quickly get to the IgBLAST search page at NCBI. On this page, at the top in tiny type, is a link “Stand-alone IgBLAST." Clicking that takes us to a FAQ (Frequently Asked Questions) page where we find another link “IgBLAST FTP site." The FTP site has several directories and some files. Some directories hold older versions of IgBLAST, others contain data for running IgBLAST, but which are required? For that you need documentation.
A general practice is to put documentation in a README file. The FTP site has one, so clicking on the README file should tell us what is needed. Lo and behold, right there in the file, it says, in brilliant ASCII monospaced font with double spaced lines ...
Documents to set up and run IgBLAST can be found at: https://ncbi.github.io/igblast/
That’s right folks, you have to go to another website, GitHub (now owned by Microsoft). And to get there, you need to copy the URL and paste it into your browser's URL field.
In the documentation, in the section “how to set up”, under “Other notes on setting up IgBlast,” we learn that we need the internal_data directory and probably the optional_file directory. These will need to be in the directory where IgBLAST is run or specified in the IGDATA environment variable; examples are provided. Environment variables are part of the advanced Unix class too (remember SeqPurge and the 232 programs in miniconda2/bin?).
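A sketch of the IGDATA setup (the $HOME/igblast location is my assumption; use wherever you actually unpacked the directories, and the mkdir here only stands in for the downloaded data):

```shell
# IgBLAST looks for its support data either in the current working directory
# or in the directory named by the IGDATA environment variable.
mkdir -p "$HOME/igblast/internal_data" "$HOME/igblast/optional_file"
export IGDATA="$HOME/igblast"
# An IgBLAST run started from any directory can now find internal_data
# and optional_file under $IGDATA.
```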
Reference Data (aka Databases). Finally, IgBLAST compares data in a fasta formatted sequence file to "databases" of antigen receptor (AR) sequences (Ig (antibody) or T-cell receptors). Getting AR structure from the data requires separate databases, one each for the V, D, and J genes corresponding to either Ig heavy chain/TCR-beta, delta (V, D, and J) or Ig light chain [lambda, kappa]/TCR-alpha, gamma (V and J). The databases also need to be in blast format. The FTP site does have a database directory, with files in blast format, but these are only for mouse and rhesus monkey. To get human data, one needs to go to IMGT (the international ImMunoGeneTics information system). We’ll cover IMGT in another post.
The next step is to get IgBLAST and the required directories of data. The program is easy. On the FTP site simply click the "latest" file, which navigates you to the most recent version of the program, and then click the desired prebuilt binary to initiate a download. After it downloads, install the program via its installer (Mac, Windows) or manually by moving it to an appropriate directory (Linux).
Program Dependencies. Getting the required data is harder. Clicking a directory via the FTP web interface simply opens the directory. You cannot download an entire directory by FTP via a web browser. What about FTP from the command line? First I had to install FTP, because Apple has decided not to include FTP in its latest operating systems. I've been learning this is also true for some Linux distributions. After installing FTP via “> brew install inetutils” - and logging into NCBI’s FTP site - I was reminded that FTP stands for FILE transfer protocol. FTP transfers files, not directories. So a variant of FTP is needed, namely ncftp.
Back to google, back to my installers, this time the installer being homebrew. A quick “>brew install ncftp” and I was back at it. Using the commands:
> ncftpget -R ftp.ncbi.nih.gov . /blast/executables/igblast/release/internal_data
> ncftpget -R ftp.ncbi.nih.gov . /blast/executables/igblast/release/optional_file
> ncftpget -R ftp.ncbi.nih.gov . /blast/executables/igblast/release/database
> ncftpget ftp.ncbi.nih.gov . /blast/executables/igblast/release/edit_imgt_file.pl
Reference Data Formats. I was able to get the needed IgBLAST files. Now on to the Ig reference data. Recall, these are at IMGT and can be found on their reference directory page. As there are not too many files to retrieve, I simply right clicked each needed file link to open the fasta formatted text data in a new tab in my browser. Next, I clicked on each tab and saved the data to a file on my computer.
Next, as these data are in fasta format with IMGT’s fasta headers, they need to be converted: 1) to NCBI’s fasta headers and then 2) into BLAST databases, albeit small ones. The first step is to run edit_imgt_file.pl on each fasta formatted file from IMGT, then format the file as a BLAST database:
> ./edit_imgt_file.pl IMGTFASTA_FILE > REFSEQ.fasta
> makeblastdb -parse_seqids -dbtype nucl -in REFSEQ.fasta
Once the commands are run for each file, the needed databases are created. The files ending in .nhr, .nin, .nog, .nsd, .nsi, and .nsq are moved into the database directory. As this exercise will focus on an Ig dataset we only need the Ig files from IMGT.
As an exercise for the student: how many files are needed to identify V(D)J genes in Ig heavy and light chains?
We now have a working prototype. IgBLAST is installed. The needed data have been identified, retrieved, merged, and cleaned up. A random subset has been selected from the larger data file to ensure fast analyses. The test runs work, so the next step is to put the essential parts in our VM and run some more tests. Along the way, I learned a few more Unix tricks and have a lot more software on my computer. Instead of 10 years, my adventures took a few days to complete and relied on many years of experience with bioinformatics programs, data, and Unix.
From CyVerse.org: “CyVerse provides life scientists with powerful computational infrastructure to handle huge datasets and complex analyses, thus enabling data-driven discovery. Their extensible platforms provide data storage, bioinformatics tools, image analyses, cloud services, APIs, and more.
CyVerse is funded by the National Science Foundation’s Directorate for Biological Sciences. We are a dynamic virtual organization led by the University of Arizona to fulfill a broad mission that spans our partner institutions: Texas Advanced Computing Center, Cold Spring Harbor Laboratory, and the University of North Carolina at Wilmington.”