

2000 to 2002 Publications
2000

Keyword Association Network: A Statistical Multi-term Approach for Document Categorization

K. H. Lee, J. Kay and B. Kang
Proceedings of the Fifth Australasian Document Computing Symposium




2002

KAN and RinSCut: Lazy Linear Classifier and Rank-in-Score Threshold in Similarity-Based Text Categorization

K. H. Lee, J. Kay and B. Kang
Nineteenth International Conference on Machine Learning

Two important research areas in statistical approaches to automated text categorization are similarity-based learning algorithms and their associated thresholding strategies. The combination of these techniques significantly influences the overall performance of text categorization systems. After surveying common techniques in both areas, we describe a lazy linear classifier known as the keyword association network (KAN) and a rank-in-score thresholding strategy (RinSCut) that improve categorization performance over existing techniques. Extensive experiments have been conducted on the Reuters-21578 data set. The experimental results show that KAN outperforms two linear classifiers, Rocchio and Widrow-Hoff, and that all implemented classifiers except Widrow-Hoff show performance improvements with RinSCut.
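As a rough illustration of what a thresholding strategy does in similarity-based categorization, the sketch below turns per-category similarity scores into category assignments in two commonly described ways: a rank-based cut (often called RCut) and a score-based cut with one threshold per category (often called SCut). It is not the RinSCut strategy from the paper. The function names, category labels, scores, and thresholds are all hypothetical.

```python
from typing import Dict, Set


def rank_cut(scores: Dict[str, float], k: int) -> Set[str]:
    """Rank-based cut: keep the k highest-scoring categories for one document."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return set(ranked[:k])


def score_cut(scores: Dict[str, float], thresholds: Dict[str, float]) -> Set[str]:
    """Score-based cut: keep each category whose score reaches its own threshold."""
    return {cat for cat, score in scores.items() if score >= thresholds[cat]}


# Hypothetical similarity scores of one document against three categories.
doc_scores = {"earn": 0.82, "acq": 0.40, "grain": 0.05}

# Hypothetical per-category thresholds (in practice tuned on validation data).
category_thresholds = {"earn": 0.50, "acq": 0.35, "grain": 0.10}

top1 = rank_cut(doc_scores, k=1)
by_score = score_cut(doc_scores, category_thresholds)
```

With these made-up numbers the rank-based cut assigns only the single best category, while the score-based cut also accepts the second category because its score clears that category's own threshold.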



A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization

K. H. Lee, J. Kay and B. Kang
7th PRICAI 2002: Trends in Artificial Intelligence

Two main research areas in statistical approaches to automated text categorization are similarity-based learning algorithms and thresholding strategies. The combination of techniques chosen from the two areas significantly influences the overall performance of text categorization. After investigating two similarity-based classifiers (KNN and Rocchio) and three common thresholding techniques (RCut, PCut, and SCut), we propose a new learning algorithm known as the keyword association network (KAN) and a new thresholding strategy (RinSCut) to overcome problems in the existing techniques. Extensive experiments have been conducted on the Reuters-21578 data set. The experimental results show that our new approaches are the better choices where both micro-averaged F1 and macro-averaged F1 scores are concerned. The combination of KAN and RinSCut consistently achieves the best performance over every pair of existing techniques on both F1 measures.
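The two F1 measures named in the abstract can differ noticeably when category sizes are skewed. The sketch below uses the standard definitions: macro-averaged F1 averages the per-category F1 scores, while micro-averaged F1 pools the counts across categories before computing a single F1. The category names and counts are invented for illustration and are not results from the paper.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one category
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_category: Dict[str, Counts]) -> float:
    """Average the per-category F1 scores, so every category weighs the same."""
    return sum(f1(*c) for c in per_category.values()) / len(per_category)


def micro_f1(per_category: Dict[str, Counts]) -> float:
    """Pool the counts over all categories first, so frequent categories dominate."""
    tp = sum(c[0] for c in per_category.values())
    fp = sum(c[1] for c in per_category.values())
    fn = sum(c[2] for c in per_category.values())
    return f1(tp, fp, fn)


# Hypothetical counts: one frequent category and one rare category.
counts = {"frequent": (90, 10, 10), "rare": (1, 1, 3)}

macro = macro_f1(counts)
micro = micro_f1(counts)
```

In this made-up case the micro-averaged score stays close to the frequent category's F1, while the macro-averaged score is pulled down by the rare category, which is why reporting both gives a fuller picture.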


 
Copyright © MCRDR Research Group. All Rights Reserved