I had a wonderful opportunity to learn about DNA sequencing and bioinformatics while attending the virtual nine-week Waksman Institute Summer Experience program. We learned how to isolate complementary DNA (cDNA) fragments from the duckweed (Landoltia punctata) plant, using techniques like reverse transcription, cloning into plasmid vectors, polymerase chain reaction (PCR), gel electrophoresis, etc. We used bioinformatics tools like BLAST to compare the sequenced DNA with known sequences stored in online repositories like GenBank. We also learned how to determine if the cDNA codes for a protein using the ORF Toolbox and how to look for similar proteins in other organisms. The duckweed plant grows on the water in lakes and wetlands. It’s being researched as a potential food source and for use in bioremediation. With the help of the vWISE team, I submitted clones W418.20 (Landoltia punctata clone W418.20, 2021) and W417.20 (Landoltia punctata clone W417.20, 2021) to GenBank. The clone W418.20 codes for a protein that is similar to acid phosphatase/vanadium-dependent haloperoxidase-related protein.
Part one of this series showed the steps to create a cDNA library of random DNA fragments isolated from an organism. Part two discussed the steps to prepare a cDNA sample taken from the cDNA library for sequencing. Part three (this post) talks about the use of bioinformatics tools to analyze the DNA sequence.
Table of Contents
- Nucleotide Databases
- Validating the electropherogram
- Analyzing cDNA sample using Bioinformatics tools
- Putting it together using a Bioinformatics Management tool
The sequences of biological molecules like DNA, RNA, and proteins from various organisms are stored in nucleotide and protein databases. These also describe the structure and function of the biological molecules. Examples of these databases are GenBank, PDB, etc. The sequence file in the database for a particular molecule contains the annotation and the sequence. The annotation contains the definition of the molecule, the accession number (a unique identifier for the record), and the CDS, the corresponding protein-coding sequence. The sequence part contains the DNA sequence. One of the common searches in these databases is to find homologs, i.e. similar sequences in other organisms. There are also Expressed Sequence Tag (EST) databases, where the DNA sequences may not be very accurate since they are typically single-pass reads of cDNA derived from mRNA.
Here are some examples of nucleotide and protein databases (Zou et al., 2015).
- GenBank: DNA sequences (collection of all publicly available DNA sequences). http://www.ncbi.nlm.nih.gov/genbank
- EMBL: DNA sequences (European Molecular Biology Laboratory, similar to GenBank). https://www.uniprot.org/database/DB-0022
- DDBJ: DNA sequences (Japan’s equivalent of GenBank). https://www.ddbj.nig.ac.jp/index-e.html
- EST: Expressed Sequence Tags (cDNA sequences reverse transcribed from mRNA). https://www.ncbi.nlm.nih.gov/genbank/dbest/
- STS: Sequence Tagged Sites (unique DNA sequences used for making maps of human chromosomes). https://www.ncbi.nlm.nih.gov/probe/docs/techsts/
- PIR: Protein Identification Resource (protein sequences). http://pir.georgetown.edu
- UniProt: Universal Protein Resource (protein sequences). http://www.uniprot.org
- PDB: Protein Data Bank (protein 3D structures). http://pdb.org
- SGD: Saccharomyces cerevisiae (yeast) genome database. https://www.yeastgenome.org/
- WormBase: C. elegans (worm) genome database. https://wormbase.org/species/c_elegans#401–10
- FlyBase: Drosophila (fly) sequence and genetic database. https://flybase.org/
- TAIR: The Arabidopsis (a flowering plant) Information Resource. https://www.arabidopsis.org/
Validating the electropherogram (waveform)
Sequencers output the DNA sequence in the form of an electropherogram and a text file of base letters. The electropherogram contains waveforms for the four nucleotides identified during the sequencing process. Software in the sequencer performs “base calling”, i.e. it calls one of the four nucleotides based on the peaks of the waveforms. However, these calls may not be accurate for various reasons, e.g. poor quality of the supplied DNA sample or accuracy issues in the sequencing machines themselves.
The chromatogram file, usually with extension *.ab1, can be opened using tools like FinchTV (for Windows) or 4Peaks (macOS). The electropherogram is analyzed for quality. A good-quality reading will have distinct peaks with no overlaps. Base-calling software will mark a peak with N if the nucleotide cannot be identified clearly. We also check that the sequence is longer than 300 base pairs. We identify where the sequence starts by looking for the unique Forward Sequencing Primer. We crop out the Forward Sequencing Primer and the pTriplEX2 plasmid DNA cloning sites. We crop any entries marked N, as these are poor-quality reads. We then identify the end of the sequence by looking for a long poly-A run (a sequence of 15 or more As). What is left is the sequence of the cDNA sample. While we validated the base calling visually and manually, there are sophisticated software applications like the open-source DeepNano that use neural networks to provide accurate base calls for long reads (Boža et al., 2017). Here are my sequences of the Landoltia punctata clones in FASTA format, JZ984547.1 and JZ984546.1, with grateful assistance from Professor Andrew Vershon and Laboratory leader John Brick.
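The manual trimming steps above can be sketched in a few lines of Python. This is only an illustration, not the tool we used in the program: the function names are made up, the primer passed in the usage example is a hypothetical placeholder rather than the real pTriplEX2 sequencing primer, and cropping the plasmid cloning sites is left out because their sequences are not given in this post. The sketch also assumes that only Ns at the ends of the read are cropped.

```python
import re


def trim_read(raw_read, forward_primer, min_poly_a=15):
    """Return the cDNA insert from a raw base-called read (illustrative sketch)."""
    read = raw_read.upper()
    primer = forward_primer.upper()

    # The insert starts right after the forward sequencing primer.
    primer_start = read.find(primer)
    if primer_start == -1:
        raise ValueError("forward sequencing primer not found in read")
    insert = read[primer_start + len(primer):]

    # The insert ends where the first long poly-A run (15 or more As) begins.
    poly_a = re.search("A{%d,}" % min_poly_a, insert)
    if poly_a:
        insert = insert[:poly_a.start()]

    # Assumption: only unidentified bases (N) at the ends are cropped.
    return insert.strip("N")


def is_long_enough(insert, min_length=300):
    """Check the 'longer than 300 base pairs' rule from the text."""
    return len(insert) > min_length
```

For example, with the placeholder primer `"ACGTTGCA"`, calling `trim_read("NNGG" + "ACGTTGCA" + "NATGCCGTAGGCTN" + "A" * 18 + "CCGG", "ACGTTGCA")` returns `"ATGCCGTAGGCT"`, which `is_long_enough` would reject as too short.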
Analyzing cDNA sample using Bioinformatics tools
Now that the DNA sequence is available, we can analyze it with various bioinformatics tools. The most common analysis is to find homologs, i.e. similar sequences in other organisms that are stored in the nucleotide databases. This helps us learn more about the DNA sample, its structure, and its function.
Looking for homologs using BLAST
BLAST, or Basic Local Alignment Search Tool, is a set of tools used to search a query sequence against sequence databases. These tools look for similar DNA or protein sequences in other organisms. I’ve used BLAST to compare protein sequences extracted from a 68-million-year-old dinosaur with those of living organisms in this post. I found out that its closest modern relatives are the humble chicken and the Norway rat!
BLASTn searches a nucleotide database for sequences similar to a nucleotide query. To get to the BLAST search page hosted on the National Center for Biotechnology Information (NCBI) website, we opened https://blast.ncbi.nlm.nih.gov/Blast.cgi and clicked on the “Nucleotide BLAST” image. We pasted the sequence in the “Enter accession number(s), gi(s) or FASTA sequence(s)” text box under the “Enter Query Sequence” section. We ensured that the “Database” dropdown in the “Choose Search Set” section was set to “Nucleotide collection (nr/nt)”. This option includes databases like GenBank, EMBL, DDBJ, PDB, etc., that contain annotated sequences. Under the “Program Selection” section, we selected the “Somewhat similar sequences (blastn)” option in the “optimize for” radio button group. This option compares shorter words (11 bases by default, with a minimum of seven, versus 28 for megablast), making it a more sensitive, though slower, search (Madden, 2013).
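To get a feel for what the word size means, here is a minimal Python sketch of word seeding: it lists every position where the query and a subject sequence share an exact word of a given length. This is a simplified illustration under my own assumptions, not NCBI’s implementation (real BLAST also extends and scores these seeds), and the function name `shared_words` is made up for this example.

```python
def shared_words(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where both share an exact word."""
    query = query.upper()
    subject = subject.upper()

    # Index every word of the subject by its starting position.
    index = {}
    for s_pos in range(len(subject) - word_size + 1):
        word = subject[s_pos:s_pos + word_size]
        index.setdefault(word, []).append(s_pos)

    # Look up every word of the query in that index.
    hits = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in index.get(word, []):
            hits.append((q_pos, s_pos))
    return hits
```

With a smaller word size, shorter stretches of identity are enough to seed a match, which is why the search becomes more sensitive but has more candidates to examine. For example, `shared_words("ACGTACGTTT", "GGACGTACGA", word_size=7)` returns `[(0, 2)]`, while the default word size of 11 finds no seed in these short sequences.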
The search results page shows matching sequences with the name of the sequence, the organism associated with the matched sequence, and the gene (if identified). The list shows that the sequence is closely related to sequences found in plants like African oil palm (Elaeis guineensis), wild Malaysian banana (Musa acuminata), Guinea yam (Dioscorea cayenensis), etc., all of which are from the plant kingdom. The “Max Score” column value is based on the number of matches between the sequences; better matches mean higher scores. The “Query Coverage” column shows the percentage of the length of the query sequence that was matched. This is also shown graphically under the “Graphic Summary” tab (Figure 6). The “E Value” (Expect value) column is the number of matches with a similar score that would be expected to occur by chance in a database of this size. A value close to zero means that the match is unlikely to be due to background noise and shows a higher likelihood of a true match. The E value is used as a threshold of statistical significance through the “Expect threshold” search parameter. The results are sorted in ascending order of E value.
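For intuition about how the E value depends on the score and the database size, BLAST statistics relate them as E = m × n × 2^(−S′), where S′ is the bit score, m is the query length, and n is the total database length. The Python sketch below applies that relation directly. It is a simplification: the function name is made up, and the real BLAST uses corrected “effective” lengths, so its reported values will differ somewhat.

```python
def expect_value(bit_score, query_length, database_length):
    """Approximate E value: E = m * n * 2 ** (-bit_score)."""
    search_space = query_length * database_length
    return search_space * 2.0 ** (-bit_score)
```

Each extra bit of score halves the E value, and searching a database twice as large doubles it. This is why the same alignment can look more or less significant depending on which database it was found in.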
The “Alignments” tab (Figure 7) visually shows the alignment, i.e. each matching base pair between the query (my sequence, for which I’m searching for homologs) and the subject (the matching result). The “Identities” field shows the number of exact matches between the two sequences. The “Gaps” field shows the insertions made in either sequence in order to optimize the alignment. The vertical lines between the nucleotides show a match. Generally, the matches with the lowest E value are potentially the best. We looked at the “Graphic Summary” tab to ensure that the match was not occurring only at the end of the sequence, where it could be due to the poly-A run.
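The “Identities” and “Gaps” fields can be reproduced from any pair of aligned sequences. The following Python sketch assumes the two sequences are already aligned to the same length with `-` marking gaps; the function name `alignment_summary` is my own and this is not how BLAST itself computes the alignment.

```python
def alignment_summary(query_aln, subject_aln):
    """Return (midline, identities, gaps) for two equal-length aligned strings."""
    if len(query_aln) != len(subject_aln):
        raise ValueError("aligned sequences must have the same length")

    midline = []
    gaps = 0
    for q, s in zip(query_aln.upper(), subject_aln.upper()):
        if q == "-" or s == "-":
            gaps += 1            # a gap inserted in either sequence
            midline.append(" ")
        elif q == s:
            midline.append("|")  # vertical line marks an exact match
        else:
            midline.append(" ")  # mismatch
    midline = "".join(midline)
    return midline, midline.count("|"), gaps
```

For the aligned pair `"ACGT-ACG"` and `"ACGTTAGG"`, this returns the midline `"|||| | |"` with 6 identities and 1 gap out of 8 aligned positions.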
I also repeated the same search against the Expressed Sequence Tags (EST) database. This returned hits against other similar sequences from the same plant, Landoltia punctata.
BLASTp searches a protein sequence against a protein database. You can access it from the same NCBI site by clicking the “Protein BLAST” image.
BLASTx searches a nucleotide sequence against a protein database after translating the nucleotide sequence into protein sequences for each of the six reading frames. It looks for similar proteins coded by the DNA sequence. This is useful because different DNA sequences can code for similar proteins: many amino acids are coded by multiple codons, due to the degeneracy of the genetic code. From the NCBI BLAST website, click the “blastx translated nucleotide > protein” image. Just like the BLASTn search, we pasted the DNA sequence in the text box. In the “Database” dropdown, we ensured that “Non-redundant protein sequences (nr)” was selected.
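The six-frame translation that BLASTx performs before searching can be sketched in Python as follows. This is a minimal illustration using the standard genetic code, not BLASTx’s own code: the function names are my own, codons containing an unidentified base are translated as `X`, and stop codons are shown as `*`.

```python
# Standard genetic code, listed in the conventional TCAG codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the opposite strand, read 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons become X."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(0, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translation(dna):
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    forward = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames
```

For the short sequence `"ATGGCCTAA"`, frame +1 translates to `"MA*"` and frame -1 (read from the reverse complement `"TTAGGCCAT"`) translates to `"LGH"`.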
The results are displayed in a similar manner to a BLASTn search. The “Descriptions” tab shows protein sequences that closely match the proteins translated from the cDNA sample. The BLASTx search shows better matches compared to the BLASTn search, as it has identified proteins from the mung bean (Vigna radiata), adzuki bean (Vigna angularis), and rose gum (Eucalyptus grandis), among others. The “Graphic Summary” tab (Figure 10) shows that the matching proteins all align between base pairs 50 and 550, which indicates a good match. It also suggests that this cDNA sequence is highly conserved, meaning that the sequence has remained largely unchanged through evolution and is therefore likely to be important for the organism.
The “Alignments” tab shows that the protein sequences have 91% identities and zero gaps (in the example shown in Figure 11). The “+” signs indicate that the compared amino acids are not identical but have similar chemical properties. The lowercase letters mark low-complexity regions that were masked by the filter, so they are excluded from the E-value and identities calculations.
Finding the Open Reading Frame
From the BLASTx search, we found matching protein sequences from the six reading frames. The Open Reading Frame, or ORF, is the region within the sequence that codes for a protein. The non-coding regions on either side of it are the Untranslated Regions, or UTRs. While we used a custom ORF tool at the Waksman Institute, the NCBI has an online ORFfinder tool at https://www.ncbi.nlm.nih.gov/orffinder/ (Wheeler et al., 2003). The results for each of the six reading frames are shown in Figure 13. The optimal frame will have the longest protein sequence and will begin with the amino acid methionine (M). In this example, frame 1 has the longest sequence. We can compare this with our BLASTx search results to validate the open reading frame. We then copy the protein sequence identified in ORFfinder and paste it into the BLASTp tool. If the BLASTp results match those of BLASTx, then the predicted protein sequence is the correct translation of the cDNA sequence. For the BLASTp search, we selected the “Low complexity regions” filter option under the “Algorithm Parameters” section.
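The rule of thumb above (pick the frame with the longest protein that starts with methionine) can be sketched in Python. This is not the ORF Toolbox or ORFfinder, just an illustration with a made-up function name. It assumes the six frames have already been translated into protein strings that use `*` for stop codons, and it accepts a protein that runs off the end of the read without a stop codon, since a cDNA clone may be incomplete.

```python
import re


def longest_orf(frame_translations):
    """Return (frame, protein) for the longest M-initiated stretch in any frame."""
    best_frame, best_protein = None, ""
    for frame, protein in frame_translations.items():
        # Each match runs from a methionine up to, but not including, the next stop.
        for match in re.finditer(r"M[^*]*", protein):
            candidate = match.group()
            if len(candidate) > len(best_protein):
                best_frame, best_protein = frame, candidate
    return best_frame, best_protein
```

For example, given the made-up translations `{1: "LKMAAGT*PMK*", 2: "MS*RRMPLLVWQ", -1: "QQ*"}`, the function returns `(2, "MPLLVWQ")`. The predicted protein would still need to be checked against the BLASTx and BLASTp results as described above.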
Putting it together using a Bioinformatics Management tool
Since many tools are used in the analysis of the cDNA sequence, an application is needed to manage and store their results. This makes it possible to reproduce the results and helps ensure that the data has not been tampered with. Applications like Galaxy, GenePattern, KNIME, etc. are used as Bioinformatics Workflow Management tools. We used an internal application called DNA Sequence Analysis Program (DSAP) at the Waksman Institute.
The final part of my post talks about the analysis of the closest homolog of the cDNA sequence that codes for a protein that is well studied so that I can perform literature searches and look for similar functions, structure, protein interactions, etc.
Landoltia punctata clone W418.20 acid phosphatase/vanadium-dependent haloperoxidase-related protein-like, mRNA sequence. Jaison, C., Vershon, A. and Mead, J., NCBI, May 18, 2021. Accession# JZ984547 https://www.ncbi.nlm.nih.gov/nuccore/JZ984547.1
Landoltia punctata clone W417.20, mRNA sequence. Jaison, C., Vershon, A. and Mead, J., NCBI, May 18, 2021. Accession# JZ984546 https://www.ncbi.nlm.nih.gov/nuccore/JZ984546.1
Zou, D., Ma, L., Yu, J., Zhang, Z. (2015, February 1). Biological Databases for Human Research. ScienceDirect. https://linkinghub.elsevier.com/retrieve/pii/S1672022915000078
Boža, V., Brejová, B., & Vinař, T. (2017). DeepNano: Deep recurrent neural networks for base calling in MinION nanopore reads. PloS one, 12(6), e0178751. https://doi.org/10.1371/journal.pone.0178751
Madden T. (2013 Mar 15) The BLAST Sequence Analysis Tool: The NCBI Handbook [Internet]. 2nd edition. Bethesda (MD): National Center for Biotechnology Information (US); 2013-. Available from: https://www.ncbi.nlm.nih.gov/books/NBK153387/
Wheeler, D. L., Church, D. M., Federhen, S., Lash, A. E., Madden, T. L., Pontius, J. U., Schuler, G. D., Schriml, L. M., Sequeira, E., Tatusova, T. A., & Wagner, L. (2003). Database resources of the National Center for Biotechnology. Nucleic acids research, 31(1), 28–33. https://doi.org/10.1093/nar/gkg033