A Comparative Evaluation of Open-Source Part-of-Speech Taggers for South African Languages (Unpublished)
[Abstract] In this paper, five open-source, Java-compatible part-of-speech taggers are compared for ten South African languages. The taggers are evaluated to determine which of the five offers the best combination of accuracy and speed for integration into a web service. The taggers are trained and evaluated using existing annotated data sets, and compared to the existing HunPos-based taggers for these languages. It is shown that the NLP4J tagging algorithm is the most efficient when both the speed and accuracy of the respective taggers are considered. Based on the initial results, a series of feature selection experiments is performed for each of the language-specific taggers in an attempt to improve tagging accuracy. The modifications that benefitted each tagger are combined to create the tagger best suited to each language.
This unpublished paper was presented as a poster at the 2017 Pattern Recognition Association of South Africa and Robotics and Mechatronics conference (PRASA-RobMech), hosted in Bloemfontein, South Africa, 29 November – 1 December 2017.