| 2000 to 2002 Publications |
|
| 2000 |
|
| Keyword Association Network: A Statistical Multi-term Approach for Document Categorization |
|
| K. H. Lee, J. Kay and B. Kang
|
| Proceedings of the Fifth Australasian Document Computing Symposium |
|
|
|
|
|
| 2002 |
|
| KAN and RinSCut: Lazy Linear Classifier and Rank-in-Score Threshold in Similarity-Based Text Categorization |
|
| K. H. Lee, J. Kay and B. Kang
|
| Nineteenth International Conference on Machine Learning |
|
| Two important research areas in statistical approaches to automated text categorization are similarity-based learning algorithms and their associated thresholding strategies. The combination of these techniques significantly influences the overall performance of text categorization systems. After surveying common techniques in both areas, we describe a lazy linear classifier known as the keyword association network (KAN) and a rank-in-score thresholding strategy (RinSCut) that improve categorization performance over existing techniques. Extensive experiments have been conducted on the Reuters-21578 data set. The experimental results show that KAN outperforms two linear classifiers, Rocchio and Widrow-Hoff, and that all implemented classifiers except Widrow-Hoff show performance improvements with RinSCut. |
|
|
|
| A Comparative Study on Statistical Machine Learning Algorithms and Thresholding Strategies for Automatic Text Categorization |
|
| K. H. Lee, J. Kay and B. Kang
|
| 7th PRICAI 2002: Trends in Artificial Intelligence |
|
| Two main research areas in statistical approaches to automated text categorization are similarity-based learning algorithms and thresholding strategies. The combination of techniques chosen from the two areas significantly influences the overall performance of text categorization. After investigating two similarity-based classifiers (KNN and Rocchio) and three common thresholding techniques (RCut, PCut, and SCut), we propose a new learning algorithm known as the keyword association network (KAN) and a new thresholding strategy (RinSCut) to overcome problems in the existing techniques. Extensive experiments have been conducted on the Reuters-21578 collection. The experimental results show that our new approaches are the better choices in terms of both micro-averaged and macro-averaged F1 scores. The combination of KAN and RinSCut consistently achieves the best performance over every pairing of the existing techniques on both F1 measures. |
|
|
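The last abstract reports both micro-averaged and macro-averaged F1. As a rough illustration of how the two averages differ, here is a minimal sketch using the standard definitions. It is not taken from the papers: the function names, category names, and counts are hypothetical and chosen only to show the arithmetic.

```python
from typing import Dict, Tuple

# (true positives, false positives, false negatives) for one category
Counts = Tuple[int, int, int]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(counts: Dict[str, Counts]) -> Tuple[float, float]:
    """Return (micro-averaged F1, macro-averaged F1) over all categories."""
    if not counts:
        raise ValueError("at least one category is required")
    # Micro-averaging pools the counts first, so frequent categories dominate.
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    micro = f1(tp, fp, fn)
    # Macro-averaging scores each category first, so every category weighs the same.
    macro = sum(f1(*c) for c in counts.values()) / len(counts)
    return micro, macro


# Hypothetical counts: one frequent category and one rare category.
example_counts = {"frequent": (90, 10, 10), "rare": (1, 0, 4)}
micro, macro = micro_macro_f1(example_counts)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

With these made-up counts the micro average stays close to the frequent category's score, while the macro average is pulled down by the rare category. This is one reason an evaluation may report both measures.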