Please use this identifier to cite or link to this item: http://prr.hec.gov.pk/jspui/handle/123456789/21780
Title: Generative Adversarial Learning from Protein Sequence Data
Authors: Mansoor, Musadaq
Keywords: Physical Sciences; Computer Sciences
Issue Date: 2022
Publisher: National University of Computer & Emerging Sciences, Islamabad
Abstract: Carbohydrates, fats and proteins are the three basic macronutrients. Proteins, which are made up of amino acids, are the most common molecules found in cells and are at the core of all functions of living organisms. Although more than 115 million protein sequences have been gathered to date, only 0.15 percent of them have been associated with experimental Gene Ontology (GO) annotations. Accurate annotation of protein function is one of the most critical requirements for a thorough understanding of molecular biology. Most proteins, across all species, lack adequate additional information and therefore remain uncharacterized. The amino acid sequence, on the other hand, is the most important piece of information available for every known protein. Researchers are therefore encouraged to build computational tools that analyze proteins from their amino acid sequences, so that the same algorithms apply across diverse types of proteins. However, deep-learning-based computational techniques require a large amount of labelled data to give good results, and data labelling is time- and resource-intensive, which makes labelled data scarce. This dissertation presents GOGAN, a model that predicts a protein's functions from its amino acid sequence and addresses both difficulties described above: uncharacterized proteins and the labelled-data demands of typical deep learning methods. GOGAN does not require handcrafted features; instead, it automatically extracts all essential information from the input sequence using state-of-the-art unsupervised machine learning models. It learns these features from very large unlabelled protein datasets, where “unlabelled data” means data that has not been given labels describing its features or properties. Gene variation analysis, gene expression analysis, and gene regulation network discovery can all benefit from the features extracted by GOGAN.
The proposed model is based on proteins of the species Homo sapiens. Compared with previous methodologies, the experimental findings show significant gains on many evaluation metrics. Using only the amino acid sequences of proteins, GOGAN achieves an F1 score of 72.1% with a Hamming loss of 9.5%. Building on this model, the thesis also explores an improved version of GOGAN, named GOCAPGAN. Convolutional Neural Networks (CNNs) have become pivotal in predicting protein functions from protein sequences. However, the computational cost and translational invariance associated with CNNs make it impossible for them to detect spatial hierarchies between complex and simpler objects. This research therefore uses capsule networks instead of CNNs to capture spatial information. Because capsule networks focus on hierarchical relationships, they have considerable potential for solving structural biology challenges. Compared with standard CNNs, the proposed model shows improved accuracy: GOCAPGAN achieves an F1 score of 82.6%, a precision of 90.4% and a recall of 76.1%.
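As a reading aid for the metrics quoted in the abstract, the following minimal Python sketch shows how an F1 score is derived from precision and recall, and how Hamming loss is computed for multi-label predictions. It is not code from the thesis: the function names and the toy label matrices are hypothetical, and it assumes the standard definitions of both metrics. Under those definitions, the reported GOCAPGAN precision (90.4%) and recall (76.1%) are consistent with the reported F1 score of 82.6%.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def hamming_loss(y_true, y_pred):
    """Fraction of individual label slots predicted incorrectly.

    y_true and y_pred are equally shaped lists of 0/1 rows, one row per
    protein and one column per candidate function label.
    """
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


# Precision and recall as reported for GOCAPGAN in the abstract.
gocapgan_f1 = f1_score(0.904, 0.761)
print(f"F1 from reported precision and recall: {gocapgan_f1:.3f}")

# Hypothetical toy labels (not data from the thesis): 2 proteins, 4 labels.
toy_true = [[1, 0, 1, 0], [0, 1, 0, 0]]
toy_pred = [[1, 0, 0, 0], [0, 1, 1, 0]]
print(f"Hamming loss on toy labels: {hamming_loss(toy_true, toy_pred):.2f}")
```

A lower Hamming loss is better, since it counts the share of wrongly assigned labels, whereas a higher F1 score is better.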
Gov't Doc #: 27195
URI: http://prr.hec.gov.pk/jspui/handle/123456789/21780
Appears in Collections: PhD Thesis of All Public / Private Sector Universities / DAIs.

Files in This Item:
File: Musadaq Mansoor Computer Science 2022 nu fast isb.pdf 11.10.22.pdf
Description: Ph.D thesis
Size: 4.74 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.