A decision-tree-based algorithm is proposed for the sequential binary segmentation of an n-dimensional feature space of text styles into 2^n non-overlapping n-dimensional intervals, with the splits optimized by information criteria. The intervals form a table of text styles that represents the “style profile” of a text corpus. The feature space is assumed to be a frequency domain, i.e. it consists of the frequencies of occurrence of function words, word combinations, bigrams, etc. The algorithm is implemented in the “StyleAnalyzer” system, developed for the comprehensive investigation of heterogeneous corpora. Decision trees and the tables of text styles were used to investigate text classification performance across text characteristics such as author, genre and style. The style profiles found by the algorithm can be used to identify the style of texts of unknown authorship and thus, for example, to determine their most probable author.
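The segmentation idea can be illustrated with a minimal Python sketch. It is not the StyleAnalyzer implementation: it assumes information gain (entropy reduction) as the splitting criterion, places one binary cut per feature on the whole training set rather than growing a full tree, and uses hypothetical names (`best_threshold`, `fit_thresholds`, `cell_of`, `style_table`) and made-up toy frequencies. With one cut per feature, n features give at most 2^n cells, and the table of label counts per cell plays the role of a style profile.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) for the binary cut on one feature
    that maximizes information gain; threshold is None if no cut helps."""
    base = entropy(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate cuts are midpoints between neighbouring distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= t]
        right = [y for v, y in zip(values, labels) if v > t]
        weighted = (len(left) * entropy(left)
                    + len(right) * entropy(right)) / len(labels)
        gain = base - weighted
        if gain > best[1]:
            best = (t, gain)
    return best


def fit_thresholds(X, y):
    """One cut per feature (assumption), processed feature by feature."""
    n = len(X[0])
    return [best_threshold([row[j] for row in X], y)[0] for j in range(n)]


def cell_of(row, thresholds):
    """Tuple of n bits naming the n-dimensional interval of a feature vector."""
    return tuple(int(t is not None and v > t) for v, t in zip(row, thresholds))


def style_table(X, y, thresholds):
    """Map each occupied cell to the counts of labels (e.g. authors) in it."""
    table = {}
    for row, label in zip(X, y):
        table.setdefault(cell_of(row, thresholds), Counter())[label] += 1
    return table


# Toy data (invented): relative frequencies of two function words per text.
X = [[0.10, 0.02], [0.12, 0.03], [0.30, 0.02], [0.32, 0.04]]
y = ["A", "A", "B", "B"]

thresholds = fit_thresholds(X, y)
table = style_table(X, y, thresholds)

# Attribute an unseen text to the majority label of the cell it falls into.
unknown = [0.11, 0.02]
cell = cell_of(unknown, thresholds)
guess = table[cell].most_common(1)[0][0] if cell in table else None
print(cell, guess)
```

A full decision tree would choose a separate threshold in each branch; the single cut per feature is used here only to keep the correspondence between n features and 2^n cells easy to see.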
Abstracts file: Kubarev_Poddubny_Abstracts.doc