Cross-Language Web Page Classification via Dual Knowledge Transfer Using Nonnegative Matrix Tri-Factorization

Hua Wang, Heng Huang, Feiping Nie, Chris Ding.

SIGIR - 2011

The lack of sufficient labeled Web pages in many languages, especially less commonly used ones, makes it difficult for traditional supervised classification methods to achieve satisfactory Web page classification performance. To address this, we propose a novel Nonnegative Matrix Tri-factorization (NMTF) based Dual Knowledge Transfer (DKT) approach for cross-language Web page classification, which rests on two observations. First, Web pages on the same topic in different languages usually share common semantic patterns, though in different representation forms. Second, the associations between word clusters and Web page classes are a more reliable carrier than raw words for transferring knowledge across languages. Building on these observations, we transfer knowledge from the auxiliary language, in which abundant labeled Web pages are available, to the target languages, in which we want to classify Web pages, through two different paths: word cluster approximations and the associations between word clusters and Web page classes. Because these two knowledge transfer paths reinforce each other, our approach can achieve better classification accuracy. We evaluate the proposed approach in extensive experiments on a real-world cross-language Web page data set. Promising results demonstrate the effectiveness of our approach and are consistent with our theoretical analyses.
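
The abstract names the building blocks but not their mechanics, so a rough sketch may help. The Python code below factorizes a nonnegative word-by-page matrix X as F S G^T using standard multiplicative updates, where F holds word clusters, G holds page classes, and S holds the associations between them. It is not the paper's objective or algorithm: the function name `nmtf`, its parameters, the toy random data, and the choice to simply freeze S when moving to the target language are all assumptions made for illustration, and the word cluster approximation path is omitted entirely.

```python
import numpy as np


def nmtf(X, k_words, k_classes, n_iter=200, S=None, G=None,
         update_S=True, update_G=True, seed=0, eps=1e-9):
    """Approximate nonnegative X (words x pages) by F @ S @ G.T.

    F: words x word clusters, S: word clusters x classes, G: pages x classes.
    S or G can be supplied and held fixed (hypothetical options for this sketch).
    """
    rng = np.random.default_rng(seed)
    n_words, n_pages = X.shape
    F = rng.random((n_words, k_words)) + 0.1
    S = rng.random((k_words, k_classes)) + 0.1 if S is None else np.array(S, dtype=float)
    G = rng.random((n_pages, k_classes)) + 0.1 if G is None else np.array(G, dtype=float)

    # Frobenius reconstruction error, recorded before and after each sweep.
    errors = [np.linalg.norm(X - F @ S @ G.T)]
    for _ in range(n_iter):
        # Multiplicative updates keep every factor nonnegative.
        F *= (X @ G @ S.T) / (F @ S @ G.T @ G @ S.T + eps)
        if update_G:
            G *= (X.T @ F @ S) / (G @ S.T @ F.T @ F @ S + eps)
        if update_S:
            S *= (F.T @ X @ G) / (F.T @ F @ S @ G.T @ G + eps)
        errors.append(np.linalg.norm(X - F @ S @ G.T))
    return F, S, G, errors


# Toy data (random, purely illustrative): 2 classes, 3 word clusters.
rng = np.random.default_rng(42)
labels_aux = np.repeat([0, 1], 10)   # 20 labeled auxiliary-language pages
G_aux = np.eye(2)[labels_aux]        # one-hot class indicators
X_aux = rng.random((30, 20))         # 30 auxiliary-language words
X_tgt = rng.random((20, 15))         # 20 target-language words, 15 unlabeled pages

# Auxiliary language: labels are known, so G stays fixed while F and S are learned.
F_aux, S_aux, _, err_aux = nmtf(X_aux, 3, 2, G=G_aux, update_G=False)

# Target language: reuse the learned associations S and learn F and G.
F_tgt, _, G_tgt, err_tgt = nmtf(X_tgt, 3, 2, S=S_aux, update_S=False)
predicted = G_tgt.argmax(axis=1)     # class with the largest weight per page
```

Freezing S alone does not pin down which target word cluster corresponds to which auxiliary one, so the predictions from this sketch are not meaningful on their own; tying the word clusters across languages is the role the abstract assigns to the second transfer path.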

Cite this paper

MLA
Wang, Hua, et al. "Cross-language web page classification via dual knowledge transfer using nonnegative matrix tri-factorization." Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. ACM, 2011.
BibTeX
@inproceedings{wang2011cross,
  title={Cross-language web page classification via dual knowledge transfer using nonnegative matrix tri-factorization},
  author={Wang, Hua and Huang, Heng and Nie, Feiping and Ding, Chris},
  booktitle={Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval},
  pages={933--942},
  year={2011},
  organization={ACM}
}