Coronavirus Genome Sequence Similarity and Protein Sequence Classification

Partha Mukherjee, Youakim Badr, Srushti Karvekar, Shanmugapriya Viswanathan

Pennsylvania State University, Great Valley, Malvern, PA-19335, USA

Cite: Mukherjee P., et al. Coronavirus Genome Sequence Similarity and Protein Sequence Classification. J. Digit. Sci. 3(2), 3–18 (2021).

Abstract. The world is currently going through a serious pandemic caused by the coronavirus disease (COVID-19). In this study, we investigate the gene structure similarity of coronavirus genomes isolated from COVID-19 patients, Severe Acute Respiratory Syndrome (SARS) patients, and bats. We also explore the extent of similarity between their genome structures to determine whether the new coronavirus resembles either of the other two. Our experimental results show 82.42% similarity between the CoV-2 genome structure and the bat genome structure. Moreover, we use a bidirectional Gated Recurrent Unit (GRU) model as the deep learning technique, alongside an improved variant of recurrent neural networks (the Bidirectional Long Short-Term Memory, or Bi-LSTM, model), to classify the protein families of these genomes and isolate the prominent protein family accession. The GRU model reaches 98% accuracy on labeled protein sequences against the protein families. Comparing the two models, we found that the GRU model is 1.6% more accurate than the Bi-LSTM model for our multiclass protein classification problem. Our experimental results could further support medical research in targeting protein family similarity to better understand the coronavirus genomic structure.
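
To make the notion of a similarity percentage between two sequences concrete, the following is a minimal sketch of percent identity computed over a global alignment. It is an illustration only: the study relies on BLAST, whereas this sketch substitutes a plain Needleman-Wunsch alignment, and the function names and scoring values (match, mismatch, gap) are assumptions chosen for the example. It is not the procedure that produced the 82.42% figure.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment (illustrative scoring values).

    Returns the two input strings padded with '-' so they have equal length.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap in b
                score[i][j - 1] + gap,      # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a, b):
    """Percentage of alignment columns where both sequences carry the same residue."""
    aligned_a, aligned_b = global_align(a, b)
    if not aligned_a:
        return 0.0
    matches = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * matches / len(aligned_a)


if __name__ == "__main__":
    # Toy nucleotide strings, not real genome data.
    print(global_align("ACGT", "AGT"))
    print(percent_identity("ACGT", "ACGA"))
```

The quadratic table used here is only practical for short sequences; for full genomes of roughly 30,000 bases, a heuristic local-alignment tool such as BLAST is the usual choice, and its reported identity is computed over the aligned regions rather than over a single end-to-end alignment.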

Keywords: Coronavirus Disease of 2019 (COVID-19), Severe Acute Respiratory Syndrome (SARS), Genome Structure, Basic Local Alignment Search Tool (BLAST), Gated Recurrent Unit (GRU), Protein Family Accession.

Acknowledgments. We thank Ramya Chimata Venkatakrishnan and Venkat Nihaal Akula for their help with this project. Any remaining errors are the authors’ responsibility.


Published online 28.12.2021