Web data analysis and classification: Automated text classification by linguistic norms, content and genre
Abstract
This dissertation set out to extend efforts to exploit the largest collection of texts available, the Internet. Its purpose was to critically examine methods for analyzing and classifying web data, and to build a deliverable system (Katigoriopoiitis) that uses the linguistic norms, content and genre of websites to improve the way this data is presented. A literature review of Web Data Mining describes the techniques for collecting information from the web. A presentation and cross-comparison of machine learning algorithms (Naïve Bayes, Decision Trees, k-Nearest Neighbors and Support Vector Machines) was carried out to find the best fit for general-purpose content classification in the implementation of the classifier. The accuracy of the different classification models was tested on the same dataset. The dissertation concludes that efficient techniques exist that can be applied to make adequate use of Internet information. Internet technologies and standards are becoming richer, which widens the options for data mining. Content classification can be achieved easily with simple model implementations. More sophisticated models are needed to achieve high accuracy in sentiment analysis or in classification based on linguistic norms.
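As an illustration of what testing different classification models on the same dataset can look like, the following is a minimal sketch in plain Python. It covers only two of the four algorithms named above (multinomial Naïve Bayes with add-one smoothing, and k-nearest neighbors with cosine similarity over word counts). The toy corpus, the labels and all function names are hypothetical and invented for this example; they are not taken from Katigoriopoiitis or from the dataset used in the dissertation.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus, invented for illustration only.
TRAIN = [
    ("stock market shares fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("investors buy shares bank", "finance"),
    ("team wins football match", "sport"),
    ("player scores goal match", "sport"),
    ("coach praises team player", "sport"),
]
TEST = [
    ("bank shares rise", "finance"),
    ("football player injured", "sport"),
]


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_naive_bayes(docs):
    """Collect per-class word counts, per-class document counts and the vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for text, label in docs:
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        class_counts[label] += 1
        vocab.update(tokens)
    return word_counts, class_counts, vocab


def predict_naive_bayes(model, text):
    """Return the class with the highest log posterior (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_counts):
        score = math.log(class_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for tok in tokenize(text):
            if tok not in vocab:
                continue  # words never seen in training are ignored
            score += math.log(
                (word_counts[label][tok] + 1) / (total_words + len(vocab))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict_knn(docs, text, k=3):
    """Majority vote among the k training documents most similar to the query."""
    query = Counter(tokenize(text))
    ranked = sorted(
        docs,
        key=lambda d: cosine(query, Counter(tokenize(d[0]))),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def accuracy(predict, test_docs):
    """Fraction of test documents whose predicted label matches the true label."""
    correct = sum(1 for text, label in test_docs if predict(text) == label)
    return correct / len(test_docs)


# Both models are evaluated on the same held-out documents.
nb_model = train_naive_bayes(TRAIN)
results = {
    "naive_bayes": accuracy(lambda t: predict_naive_bayes(nb_model, t), TEST),
    "knn": accuracy(lambda t: predict_knn(TRAIN, t), TEST),
}

for name, acc in results.items():
    print(f"{name}: accuracy = {acc:.2f}")
```

Evaluating every model on the same held-out split is what makes the accuracy figures comparable. A corpus this small cannot separate the models in any meaningful way, so the sketch shows only the shape of the comparison and is not evidence for any ranking of the algorithms.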