From Protein Sequence to Protein Function via Multi-Label Linear Discriminant Analysis

Hua Wang, Lin Yan, Heng Huang, Chris Ding

TCBB - 2017

Sequence describes the primary structure of a protein. It carries important structural, characteristic, and genetic information, which motivates many sequence-based computational approaches to inferring protein function. Among them, feature-based approaches have attracted increasing attention because they make predictions from a set of transformed, more biologically meaningful sequence features. However, the original features extracted from a sequence are usually high-dimensional and often compromised by irrelevant patterns, so dimension reduction is necessary prior to classification for efficient and effective protein function prediction. A protein usually performs several different functions within an organism, which makes protein function prediction a multi-label classification problem. In machine learning, multi-label classification deals with problems where each object may belong to more than one class. Linear discriminant analysis (LDA) is a well-known feature reduction method that has been applied successfully in many practical applications. By nature, however, it is designed for single-label classification, in which each object belongs to exactly one class. Because directly applying LDA to multi-label classification causes ambiguity when computing the scatter matrices, we apply a new Multi-label Linear Discriminant Analysis (MLDA) approach that addresses this problem while preserving the powerful classification capability inherited from classical LDA. We further extend MLDA by ℓ1-normalization to overcome the problem of over-counting data points with multiple labels. In addition, we incorporate biological network data into our method using Laplacian embedding, and assess the reliability of the predicted putative functions. Extensive empirical evaluations demonstrate promising results for our methods.
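The abstract gives no formulas, so the following is only a minimal sketch of one plausible reading of the two ideas it names: computing the scatter matrices class by class so that a point with several labels is handled unambiguously, and ℓ1-normalizing each point's label vector so that such a point is not over-counted. The function names `l1_normalize_rows` and `multilabel_scatter`, the weighting scheme, and the toy data are assumptions made for illustration, not the authors' implementation.

```python
def l1_normalize_rows(Y):
    """Scale each label row to sum to 1, so a point with several labels
    contributes a total weight of 1 instead of one unit per label."""
    out = []
    for row in Y:
        s = sum(abs(v) for v in row)
        out.append([v / s for v in row] if s > 0 else list(row))
    return out


def multilabel_scatter(X, Y):
    """Class-wise between-class (Sb) and within-class (Sw) scatter matrices.

    X: n points of dimension d, as a list of lists.
    Y: n x K label weights; Y[i][k] > 0 means point i belongs to class k.
    Assumed formulation: every class contributes its own weighted scatter.
    """
    n, d, K = len(X), len(X[0]), len(Y[0])
    total = sum(sum(row) for row in Y)
    # Global mean, weighting each point by its total label mass.
    m = [sum(sum(Y[i]) * X[i][j] for i in range(n)) / total for j in range(d)]
    Sb = [[0.0] * d for _ in range(d)]
    Sw = [[0.0] * d for _ in range(d)]
    for k in range(K):
        nk = sum(Y[i][k] for i in range(n))  # weighted size of class k
        if nk == 0:
            continue
        # Weighted mean of class k.
        mk = [sum(Y[i][k] * X[i][j] for i in range(n)) / nk for j in range(d)]
        for a in range(d):
            for b in range(d):
                Sb[a][b] += nk * (mk[a] - m[a]) * (mk[b] - m[b])
                Sw[a][b] += sum(
                    Y[i][k] * (X[i][a] - mk[a]) * (X[i][b] - mk[b])
                    for i in range(n)
                )
    return Sb, Sw


# Toy data (hypothetical): three points, two classes; the third point has both labels.
X = [[0.0, 0.0], [2.0, 0.0], [1.0, 3.0]]
Y = [[1, 0], [0, 1], [1, 1]]

Sb_raw, Sw_raw = multilabel_scatter(X, Y)
Sb_l1, Sw_l1 = multilabel_scatter(X, l1_normalize_rows(Y))
```

With the raw binary labels the total label weight is 4 for only 3 points, because the doubly labeled point is counted twice; after row-wise ℓ1-normalization the total weight equals the number of points. A projection would then be obtained from the leading eigenvectors of the generalized problem on `Sb` and `Sw`, as in classical LDA.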


Cite this paper

MLA
Wang, Hua, et al. "From protein sequence to protein function via multi-label linear discriminant analysis." IEEE/ACM transactions on computational biology and bioinformatics 14.3 (2016): 503-513.
BibTeX
  title={From protein sequence to protein function via multi-label linear discriminant analysis},
  author={Wang, Hua and Yan, Lin and Huang, Heng and Ding, Chris},
  journal={IEEE/ACM transactions on computational biology and bioinformatics},