Large-Scale Cross-Language Web Page Classification via Dual Knowledge Transfer Using Fast Nonnegative Matrix Tri-Factorization

Hua Wang, Feiping Nie, Heng Huang

TKDD - 2015

With the rapid growth of modern technologies, the Internet has reached almost every corner of the world. As a result, it has become increasingly important to manage and mine the information contained in Web pages written in different languages. Traditional supervised learning methods usually require a large amount of training data to obtain accurate and robust classification models. However, the number of labeled Web pages has not grown as fast as the Internet itself. The lack of sufficient training Web pages in many languages, especially less commonly used ones, makes it challenging for traditional classification algorithms to achieve satisfactory performance. To address this, we observe that Web pages on the same topic in different languages usually share common semantic patterns, though in different representation forms. We also observe that the associations between word clusters and Web page classes are another reliable carrier for transferring knowledge across languages. Based on these observations, in this article we propose a novel joint nonnegative matrix tri-factorization (NMTF) based Dual Knowledge Transfer (DKT) approach for cross-language Web page classification. Our approach transfers knowledge from the auxiliary language, in which abundant labeled Web pages are available, to the target languages, in which we want to classify Web pages, through two different paths: word cluster approximation and the associations between word clusters and Web page classes. Because these two knowledge transfer paths reinforce each other, our approach can achieve better classification accuracy. To deal with large-scale real-world data, we further develop the proposed DKT approach by constraining the factor matrices of NMTF to be cluster indicator matrices. Owing to the nature of cluster indicator matrices, we can decouple the proposed optimization objective; the resulting subproblems are much smaller and involve far fewer matrix multiplications, which makes our new approach much more computationally efficient. We evaluate the proposed approach in extensive experiments on a real-world cross-language Web page data set. The promising results demonstrate the effectiveness of our approach and are consistent with our theoretical analyses.
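
To illustrate why constraining the NMTF factors to be cluster indicator matrices decouples the problem, here is a minimal NumPy sketch of a generic indicator-constrained tri-factorization X ≈ F S Gᵀ. It is not the DKT method described above: it has no auxiliary language and no knowledge transfer, and it is only an assumed, simplified single-matrix setting. The function name `fast_nmtf_sketch` and all of its parameters are hypothetical. With F and G restricted to 0/1 indicators, updating S reduces to block averaging, and updating F or G reduces to independent nearest-prototype assignments per row or column.

```python
import numpy as np

def fast_nmtf_sketch(X, k_rows, k_cols, n_iter=20, seed=0):
    """Approximate X (n x m) by F @ S @ G.T, where F (n x k_rows) and
    G (m x k_cols) are 0/1 cluster indicator matrices.
    Illustrative sketch only; names and defaults are assumptions."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    row_labels = rng.integers(k_rows, size=n)  # random initial row clusters
    col_labels = rng.integers(k_cols, size=m)  # random initial column clusters

    def block_means(row_labels, col_labels):
        F = np.eye(k_rows)[row_labels]  # one-hot row indicators
        G = np.eye(k_cols)[col_labels]  # one-hot column indicators
        # Least-squares S for fixed F, G is the mean of each co-cluster block.
        # Empty clusters get a count of 1 so that their block mean is 0.
        rc = np.maximum(F.sum(axis=0), 1)
        cc = np.maximum(G.sum(axis=0), 1)
        S = (F.T @ X @ G) / np.outer(rc, cc)
        return F, G, S

    for _ in range(n_iter):
        F, G, S = block_means(row_labels, col_labels)

        # Column step: each column of X picks the closest column of F @ S.
        # The squared distance is expanded; the ||x_j||^2 term is constant
        # per column and can be dropped from the argmin.
        FS = F @ S                                        # n x k_cols
        d_cols = -2.0 * (X.T @ FS) + (FS ** 2).sum(axis=0)
        col_labels = d_cols.argmin(axis=1)
        G = np.eye(k_cols)[col_labels]

        # Row step: each row of X picks the closest row of S @ G.T.
        SGt = S @ G.T                                     # k_rows x m
        d_rows = -2.0 * (X @ SGt.T) + (SGt ** 2).sum(axis=1)
        row_labels = d_rows.argmin(axis=1)

    # Recompute S so it matches the final labels.
    _, _, S = block_means(row_labels, col_labels)
    return row_labels, col_labels, S
```

Each assignment step touches only one row or column at a time and needs a single matrix product, which is the kind of saving the abstract attributes to the indicator constraint. The paper's actual objective and update rules should be taken from the article itself.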

Cite this paper

MLA
Wang, Hua, Feiping Nie, and Heng Huang. "Large-scale cross-language web page classification via dual knowledge transfer using fast nonnegative matrix trifactorization." ACM Transactions on Knowledge Discovery from Data (TKDD) 10.1 (2015): 1.
BibTeX
@article{wang2015large,
  title={Large-scale cross-language web page classification via dual knowledge transfer using fast nonnegative matrix trifactorization},
  author={Wang, Hua and Nie, Feiping and Huang, Heng},
  journal={ACM Transactions on Knowledge Discovery from Data (TKDD)},
  volume={10},
  number={1},
  pages={1},
  year={2015},
  publisher={ACM}
}