Large-Scale Cross-Language Web Page Classification via Dual Knowledge Transfer Using Fast Nonnegative Matrix Tri-Factorization
Hua Wang, Feiping Nie, Heng Huang
TKDD - 2015
With the rapid growth of modern technologies, the Internet has reached almost every corner of the world. As a result, it has become increasingly important to manage and mine the information contained in Web pages written in different languages. Traditional supervised learning methods usually require a large amount of training data to obtain accurate and robust classification models. However, the number of labeled Web pages has not grown as fast as the Internet itself. The lack of sufficient training Web pages in many languages, especially uncommonly used ones, makes it challenging for traditional classification algorithms to achieve satisfactory performance. To address this, we observe that Web pages on the same topic in different languages usually share common semantic patterns, though in different representation forms. We also observe that the associations between word clusters and Web page classes are another reliable carrier for transferring knowledge across languages. Based on these observations, in this article we propose a novel joint nonnegative matrix tri-factorization (NMTF) based Dual Knowledge Transfer (DKT) approach for cross-language Web page classification. Our approach transfers knowledge from the auxiliary language, in which abundant labeled Web pages are available, to the target languages, in which we want to classify Web pages, through two different paths: word cluster approximation and the associations between word clusters and Web page classes. Because these two knowledge transfer paths reinforce each other, our approach can achieve better classification accuracy. To deal with large-scale real-world data, we further develop the proposed DKT approach by constraining the factor matrices of NMTF to be cluster indicator matrices.
Due to the nature of cluster indicator matrices, we can decouple the proposed optimization objective; the resulting subproblems are much smaller and involve far fewer matrix multiplications, which makes our new approach much more computationally efficient. We evaluate the proposed approach in extensive experiments on a real-world cross-language Web page data set. The promising results demonstrate the effectiveness of our approach and are consistent with our theoretical analyses.
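To make the idea of constraining NMTF factors to cluster indicator matrices more concrete, here is a minimal single-matrix sketch in Python. It is not the DKT method from the paper: it handles one word-by-document matrix only, with no auxiliary language and no knowledge transfer. The function names (`one_hot`, `nearest_row`, `block_means`, `fast_nmtf`) and the k-means-style alternating updates are assumptions chosen for illustration; the abstract does not spell out the update rules.

```python
import numpy as np


def one_hot(labels, n_clusters):
    """Turn integer labels into a 0/1 cluster indicator matrix."""
    indicator = np.zeros((labels.size, n_clusters))
    indicator[np.arange(labels.size), labels] = 1.0
    return indicator


def nearest_row(points, centers):
    """For every row of points, index of the closest row of centers."""
    diffs = points[:, None, :] - centers[None, :, :]
    return (diffs ** 2).sum(axis=2).argmin(axis=1)


def block_means(X, F, G):
    """Least-squares S for fixed indicators: the mean of each co-cluster block.

    Empty clusters are given a count of 1 to avoid division by zero.
    """
    word_counts = np.maximum(F.sum(axis=0), 1.0)
    doc_counts = np.maximum(G.sum(axis=0), 1.0)
    return (F.T @ X @ G) / np.outer(word_counts, doc_counts)


def fast_nmtf(X, n_word_clusters, n_doc_clusters, n_iter=20, seed=0):
    """Approximate X (words x documents) by F @ S @ G.T with indicator F, G."""
    rng = np.random.default_rng(seed)
    n_words, n_docs = X.shape
    word_labels = rng.integers(n_word_clusters, size=n_words)
    doc_labels = rng.integers(n_doc_clusters, size=n_docs)
    for _ in range(n_iter):
        F = one_hot(word_labels, n_word_clusters)
        G = one_hot(doc_labels, n_doc_clusters)
        S = block_means(X, F, G)
        # Each document picks the closest column of F @ S, independently.
        doc_labels = nearest_row(X.T, (F @ S).T)
        G = one_hot(doc_labels, n_doc_clusters)
        # Each word picks the closest row of S @ G.T, independently.
        word_labels = nearest_row(X, S @ G.T)
    F = one_hot(word_labels, n_word_clusters)
    G = one_hot(doc_labels, n_doc_clusters)
    return F, block_means(X, F, G), G


# Tiny usage example on random nonnegative data (illustrative only).
X_demo = np.random.default_rng(1).random((12, 8))
F_demo, S_demo, G_demo = fast_nmtf(X_demo, n_word_clusters=3, n_doc_clusters=2)
```

In this sketch every row or column is reassigned on its own, so the large coupled problem splits into many small nearest-center lookups, and `S` reduces to block averages instead of repeated dense matrix products. This is one plausible reading of the decoupling the abstract describes, under the assumptions stated above.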
Cite this paper
MLA
Wang, Hua, Feiping Nie, and Heng Huang. "Large-scale cross-language web page classification via dual knowledge transfer using fast nonnegative matrix trifactorization." ACM Transactions on Knowledge Discovery from Data (TKDD) 10.1 (2015): 1.
BibTeX
@article{wang2015large,
  title={Large-scale cross-language web page classification via dual knowledge transfer using fast nonnegative matrix trifactorization},
  author={Wang, Hua and Nie, Feiping and Huang, Heng},
  journal={ACM Transactions on Knowledge Discovery from Data (TKDD)},
  volume={10},
  number={1},
  pages={1},
  year={2015},
  publisher={ACM}
}