Getting WISEr with Bioinformatics -4/4

I had a wonderful opportunity to learn about DNA sequencing and bioinformatics while attending the virtual nine-week Waksman Institute Summer Experience program. We learned how to isolate complementary DNA (cDNA) fragments from the duckweed plant (Landoltia punctata) using techniques such as reverse transcription, cloning into plasmid vectors, polymerase chain reaction (PCR), and gel electrophoresis. We used bioinformatics tools such as BLAST to compare the sequenced DNA with known sequences stored in online repositories like GenBank. We also learned how to determine whether a cDNA codes for a protein using the ORF Toolbox and how to look for similar proteins in other organisms. Duckweed grows on water in lakes and wetlands. It is being researched as a potential food source and for use in bioremediation. With the help of the vWISE team, I submitted clones W418.20 (Landoltia punctata clone W418.20, 2021) and W417.20 (Landoltia punctata clone W417.20, 2021) to GenBank. Clone W418.20 codes for a protein that is similar to acid phosphatase/vanadium-dependent haloperoxidase-related protein.

Part four describes the additional analysis that I performed with bioinformatics tools on a well-studied protein that is a homolog of the protein encoded by my cDNA sample. It also describes future analyses that I plan to perform on this protein using different tools.

BLASTing for homologs

In part 3, I used the National Center for Biotechnology Information (NCBI) Basic Local Alignment Search Tool, or BLAST (Madden, 2013), to look for homologs of my cDNA sample, using the protein sequence from the optimal open reading frame identified by the ORF Finder (Wheeler et al., 2003). The closest results were hypothetical or predicted proteins. Hypothetical proteins have not been experimentally isolated, so they have not been studied well enough to be informative. To learn more about the function and structure of the protein translated from my cDNA sample, I therefore had to look further down in my BLAST results. I found two useful homologs. Both are acid phosphatase/vanadium-dependent haloperoxidase-related proteins: one from Salix suchowensis (willow shrub, Figure 1) with an E value of 4e-72, and the other from the well-studied Arabidopsis thaliana (thale cress, Figure 2) with an E value of 1e-7. Both were found in frame 1.

Figure 1: BLAST search result for Salix suchowensis
Figure 2: BLAST search result for Arabidopsis thaliana
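
To make the idea of an "optimal open reading frame" more concrete, here is a minimal Python sketch of a six-frame ORF scan. It is only an illustration of the general technique and is not how NCBI's ORF Finder is actually implemented. The function names and the toy sequence are my own assumptions; the sequence is made up and is not the W418.20 clone.

```python
# Toy six-frame ORF scan (a sketch, not NCBI ORF Finder's algorithm).
COMPLEMENT = str.maketrans("ACGT", "TGCA")
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, offset):
    """Yield (start, end) for each ATG...stop stretch in one reading frame."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                 # first start codon opens an ORF
        elif start is not None and codon in STOP_CODONS:
            yield start, i + 3        # stop codon closes it
            start = None


def longest_orf(seq):
    """Return (strand, frame, orf_sequence) for the longest ORF in six frames."""
    seq = seq.upper()
    best = ("", 0, "")
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            for start, end in orfs_in_frame(s, offset):
                if end - start > len(best[2]):
                    best = (strand, offset + 1, s[start:end])
    return best


# Hypothetical toy sequence, used only to exercise the function.
toy = "CCATGGCTTGTAAATGAAACCCGGGTTTTAGCC"
print(longest_orf(toy))
```

A real tool also handles ambiguous bases, alternative start codons, and minimum length cutoffs, which this sketch leaves out.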

Looking up protein info in the Protein Data Bank

The Protein Data Bank (PDB) stores the structures of isolated proteins. These structures have been determined using methods such as X-ray crystallography, nuclear magnetic resonance (NMR), and electron microscopy. We performed an advanced search for protein information on the site, using either the protein name from the BLASTp search or the protein sequence from the ORF Finder. We opened the advanced search page, entered either the protein name in the “Full Text” text box or the protein sequence in the sequence text box, and clicked the search button (Berman et al., 2000). Unfortunately, nothing was returned. As an example, we then searched for part of the protein name, “Vanadium”, with the “Source Organism” set to Arabidopsis thaliana (Figure 1). We received one result, as shown below (Figure 2).

Figure 1: RCSB Protein Data Bank advanced search screen
Figure 2: Protein Data Bank search results

Information relating to the protein is shown when the title is clicked. This includes 3D view, annotations, and associated literature (Figure 3).

Figure 3: Structure summary for a cyclic nucleotide phosphodiesterase from RCSB PDB search

Searching Transcriptome Shotgun Assembly (TSA) database for matching transcripts

The transcriptome of an organism contains all the mRNA transcripts that are expressed in its cells. By analyzing the transcriptome, we can find out when genes are turned on or turned off in a cell. In addition, counting the number of transcripts indicates the amount of gene activity, or gene expression (NHGRI, 2019). We searched the TSA database for matching sequences to see whether my cDNA could be part of a larger coding sequence. We performed a BLASTn search with the database set to Transcriptome Shotgun Assembly and the organism set to Landoltia punctata (Figure 4).

Figure 4: BLASTn search against Transcriptome Shotgun Assembly database

The search returned an entry with an E-value of 0 and a percent identity of 99.14% (Figure 5). The matching entry comes from a study published in the journal “Biotechnology for Biofuels” (Tao et al., 2013). After looking for the ORF using the ORF Finder, we also performed BLASTx searches.

Figure 5: Transcriptome Shotgun Assembly BLASTn result
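
As a rough illustration of what a percent identity value means, the Python sketch below counts the alignment columns in which two aligned sequences carry the same base and divides by the alignment length, gaps included. This is my understanding of how the figure is usually defined, not code taken from BLAST. The function name and the short toy alignments are assumptions for illustration and are unrelated to the actual duckweed hit.

```python
def percent_identity(aligned_query, aligned_subject):
    """Percent of alignment columns where both sequences have the same base.

    Both inputs must already be aligned (same length, '-' marks a gap).
    Gap columns count toward the alignment length but never as matches.
    """
    if len(aligned_query) != len(aligned_subject):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(
        q == s and q != "-"
        for q, s in zip(aligned_query.upper(), aligned_subject.upper())
    )
    return 100.0 * matches / len(aligned_query)


# Hypothetical toy alignments: one mismatch in ten columns, then one gap in six.
print(percent_identity("ACGTACGTAC", "ACGTTCGTAC"))
print(round(percent_identity("ACG-TA", "ACGTTA"), 2))
```

A value close to 100% over a long alignment, together with an E-value of 0, is what suggests that the cDNA and the TSA entry come from the same transcript.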

Inspecting the interactome in the TAIR database

The interactome describes the interactions of proteins within a cell. Most proteins work in conjunction with other proteins, so it is important to study these interactions to understand a protein’s function. Arabidopsis thaliana is a well-studied model plant whose entire genome has been sequenced. Its protein interactions are stored in The Arabidopsis Information Resource (TAIR) database, hosted by Phoenix Bioinformatics (Rhee et al., 2003).
The protein interactions are identified using the yeast two-hybrid system (Singer Instruments, 2021), in which the two halves of a transcriptional activator (a DNA-binding domain and an activation domain) are fused separately to the two proteins under study. When the yeast cells express both fusion proteins and the two proteins interact, the activator is reassembled and switches on a reporter gene such as lacZ, which shows up as blue colonies of yeast. A video by the iGEM 2014 Team Goettingen demonstrates the yeast two-hybrid system.
Since we have identified a similar protein in Arabidopsis thaliana for my cDNA, we can search the database for its protein interactions. The BLAST search can be accessed on the TAIR website by going to Tools > BLAST (Figure 6). We searched by the nucleotide sequence and also by the protein sequence identified by the ORF Finder.

Figure 6: BLASTn search in TAIR database

The results are displayed in a graphical view, along with the E values and identity percentages (Figure 7).

Figure 7: TAIR BLASTn search result

One result, with an E value of 1e-50, was identified as acid phosphatase/vanadium-dependent haloperoxidase-related protein, AT3G21610.1. Clicking on the identifier opens a details page (Figure 8). It shows the developmental stages at which the gene is expressed as well as the processes it is involved in.

Figure 8: AT3G21610.1 Details Page
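
The identifier AT3G21610.1 follows the Arabidopsis Genome Initiative (AGI) naming pattern: "AT" for Arabidopsis thaliana, a chromosome code (1 to 5, or C and M for the chloroplast and mitochondrial genomes), "G" for gene, a five-digit number, and an optional ".n" suffix for a specific gene model. The Python sketch below splits an identifier into those parts. The function name and the returned dictionary layout are my own choices for illustration and do not come from TAIR.

```python
import re

# AGI locus pattern: AT + chromosome + G + five digits + optional .model
AGI_PATTERN = re.compile(r"AT([1-5CM])G(\d{5})(?:\.(\d+))?", re.IGNORECASE)


def parse_agi(identifier):
    """Split an AGI identifier such as 'AT3G21610.1' into its parts."""
    match = AGI_PATTERN.fullmatch(identifier.strip())
    if match is None:
        raise ValueError(f"not an AGI identifier: {identifier!r}")
    chromosome, number, model = match.groups()
    chromosome = chromosome.upper()
    return {
        "locus": f"AT{chromosome}G{number}",            # gene-level ID
        "chromosome": chromosome,                        # '1'-'5', 'C' or 'M'
        "gene_model": int(model) if model else None,     # None if no suffix
    }


print(parse_agi("AT3G21610.1"))
print(parse_agi("AT3G21610"))
```

This is why some tools show the gene as AT3G21610 while others show AT3G21610.1: the first names the locus on chromosome 3, and the second names one gene model at that locus.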

The identified protein-coding gene homolog is then searched for in the Arabidopsis interactome database stored at the Salk Institute Genomic Analysis Lab. On the website, click the Arabidopsis Interactome (AI-1) link under Plant, and then click AI-1. A direct link is presented here (we used Microsoft Edge to view it).

Exploring the Expressions using eFP

The Arabidopsis eFP browser, hosted at the University of Toronto, shows the level of expression of genes at different stages of the plant’s development (Winter et al., 2007). The browser displays an electronic fluorescent pictograph (eFP) of a gene’s expression levels in the Arabidopsis thaliana plant. We entered AT3G21610.1 as the primary gene ID and clicked the Go button.

Figure 9: Arabidopsis eFP browser for gene AT3G21610

The color map shows the intensity of the gene’s expression and where in the plant the gene is expressed. Red marks the highest level of expression, which is in the petals and stamens of the flower at stage 15 (Figure 9).
The Cell eFP browser, also hosted by the University of Toronto, can be used to show where in a cell the gene product is located. We entered AT3G21610.1 in the primary gene ID text box and clicked the Lookup button. The color shows the level of confidence in the location.

Figure 10: Cell eFP browser for gene AT3G21610

Probing the periodicals for information

We performed a literature search in the NCBI Online Bookshelf for information on Acid phosphatase/vanadium-dependent haloperoxidase-related protein. We searched for the terms “Acid phosphatase” and “vanadium-dependent haloperoxidase”. Here are the search results for the term “Acid phosphatase”.

We also searched PubMed for the protein. One result was returned. We will be further researching the papers returned in the search result.
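
The same kind of literature search can also be scripted through NCBI's E-utilities. The Python sketch below only builds the search URL for the ESearch endpoint and does not send any request. The combined search term and the default parameter values are assumptions for illustration; they are not necessarily the exact query I typed into the website.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, db="pubmed", retmax=20):
    """Build an ESearch URL for the given Entrez database and search term."""
    params = {
        "db": db,            # e.g. 'pubmed' or 'books' (NCBI Bookshelf)
        "term": term,        # the search expression
        "retmode": "json",   # ask for a JSON response
        "retmax": retmax,    # maximum number of IDs to return
    }
    return f"{EUTILS_ESEARCH}?{urlencode(params)}"


# Hypothetical combined query based on the two terms searched above.
query = '"acid phosphatase" AND "vanadium-dependent haloperoxidase"'
print(build_esearch_url(query))
print(build_esearch_url("acid phosphatase", db="books"))
```

Opening the printed URL in a browser, or fetching it from a script, should return the matching record IDs, which can then be looked up one by one.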

Planning further analyses of the gene

Future Steps
Experiments to investigate gene expression
In vivo protein analysis using Western blotting or FISH
Experiments to inactivate a gene
Experiments to biochemically analyze protein
Purify protein needed for X-ray crystallography

The virtual Waksman Institute Summer Experience program has opened up a lot of learning opportunities for me as I researched more about the genetic analysis performed and about using bioinformatics to find patterns among a large amount of genetic data. My goal is to utilize these approaches to look for genetic causes of different types of cancer and autism.


Landoltia punctata clone W418.20 acid phosphatase/vanadium-dependent haloperoxidase-related protein-like, mRNA sequence. Jaison,C., Vershon,A. and Mead,J., NCBI, May 18, 2021. Accession# JZ984547

Landoltia punctata clone W417.20, mRNA sequence. Jaison,C., Vershon,A. and Mead,J., NCBI, May 18, 2021. Accession# JZ984546

Madden T. (2013 Mar 15) The BLAST Sequence Analysis Tool: The NCBI Handbook [Internet]. 2nd edition. Bethesda (MD): National Center for Biotechnology Information (US); 2013-. Available from:

Wheeler, D. L., Church, D. M., Federhen, S., Lash, A. E., Madden, T. L., Pontius, J. U., Schuler, G. D., Schriml, L. M., Sequeira, E., Tatusova, T. A., & Wagner, L. (2003). Database resources of the National Center for Biotechnology. Nucleic acids research, 31(1), 28–33.

Berman, H. M., Westbrook, J., Feng, Z., Gilliland, G., Bhat, T. N., Weissig, H., Shindyalov, I. N., & Bourne, P. E. (2000). The Protein Data Bank. Nucleic acids research, 28(1), 235–242.

NHGRI. (2019, March 9). Transcriptome Fact Sheet. Genome.Gov.

Tao, X., Fang, Y., Xiao, Y., Jin, Y. L., Ma, X. R., Zhao, Y., He, K. Z., Zhao, H., & Wang, H. Y. (2013). Comparative transcriptome analysis to investigate the high starch accumulation of duckweed (Landoltia punctata) under nutrient starvation. Biotechnology for biofuels, 6(1), 72.

Rhee, S. Y., Beavis, W., Berardini, T. Z., Chen, G., Dixon, D., Doyle, A., Garcia-Hernandez, M., Huala, E., Lander, G., Montoya, M., Miller, N., Mueller, L. A., Mundodi, S., Reiser, L., Tacklind, J., Weems, D. C., Wu, Y., Xu, I., Yoo, D., Yoon, J., … Zhang, P. (2003). The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic acids research, 31(1), 224–228.

Singer Instruments (2021, September 29). Yeast 2-Hybrid.

Winter D, Vinegar B, Nahal H, Ammar R, Wilson GV, et al. (2007) An “Electronic Fluorescent Pictograph” Browser for Exploring and Analyzing Large-Scale Biological Data Sets. PLOS ONE 2(8): e718.

Image designed by pikisuperstar / Freepik

