An Evaluation of Text Classification Methods for Literary Study

This article presents an empirical evaluation of text classification methods in the literary domain. The study compared the performance of two popular algorithms, naïve Bayes and support vector machines (SVMs), in two literary text classification tasks: the eroticism classification of Dickinson's poems and the sentimentalism classification of chapters in early American novels. The algorithms were also combined with three text pre-processing tools, namely stemming, stopword removal, and statistical feature selection, to study the impact of these tools on the classifiers' performance in the literary setting. Existing studies outside the literary domain indicate that SVMs are generally better than naïve Bayes classifiers. In this study, however, SVMs were not the winners across the board. Both algorithms achieved high accuracy in sentimental chapter classification, but the naïve Bayes classifier outperformed the SVM classifier in erotic poem classification. Self-feature selection helped both algorithms improve their performance in both tasks. However, the two algorithms selected relevant features in different frequency ranges, and therefore captured different characteristics of the target classes. The evaluation results also suggest that arbitrary feature-reduction steps such as stemming and stopword removal should be taken very carefully. Some stopwords were highly discriminative features for Dickinson's erotic poem classification, and in sentimental chapter classification stemming undermined subsequent feature selection by aggressively conflating and neutralizing discriminative features.


1 Introduction
Text classification is a typical scholarly activity in literary study (Unsworth, 2000; Yu and Unsworth, 2006). Humanist scholars organize and study literary texts according to various classification criteria, such as topics, authors, styles, and genres. For decades, computational analysis tools have been used in some literary text classification tasks, such as authorship attribution (Mosteller and Wallace, 1964; Holmes, 1994) and stylistic analysis (Holmes, 1998). Recently, with the development of machine learning and natural language processing techniques, automatic text classification methods 1 provide new approaches to more literary text analysis problems (Argamon and Olsen, 2006); for example, discriminant analysis and cross-entropy classification for authorship attribution and stylistic analysis (Craig, 1999; Juola and Baayen, 2005), decision tree classification for genre analysis of Shakespeare's plays (Ramsay, 2004), SVM classification for knowledge class assignment of the Encyclopédie entries (Horton et al., 2007), naïve Bayes classification for the eroticism analysis of Dickinson's poems (Plaisant et al., 2006), and naïve Bayes classification for the sentimentalism analysis of early American novels (Horton et al., 2006).
With the availability of so many text classification methods, empirical evaluation is important to provide guidance for method selection in literary text classification applications. A number of studies have evaluated popular classification algorithms on a few benchmark data sets (Dumais et al., 1998; Joachims, 1998; Yang and Liu, 1999). However, these benchmark data sets were limited to news and web documents, which have different characteristics from the creative writings in literature. Moreover, in these evaluation studies, all methods were tested on topic classification tasks. In the setting of literary text classification, text documents are categorized by many document properties other than topics. Some target classes, such as authors and genres, are defined in an objective manner, while others, such as the sub-genres 'eroticism' and 'sentimentalism', are subjectively defined by the groups of scholars in these particular fields of study. Prediction is the common purpose of scientific classifiers; hence, classifiers are usually evaluated by the measure of classification accuracy. However, improving classification accuracy is seldom the goal for literary scholars. High classification accuracy provides evidence that some patterns have been inferred to separate the classes, and the scholars are more interested in the literary knowledge represented by these linguistic patterns. In other words, the usual purpose of literary classification is to seek suggestive evidence for further scholarly investigation of what textual features characterize the target classes (Ramsay, 2008). Sometimes scholars would also like to use classifiers as example-based retrieval tools to find more documents of a certain kind, such as ekphrastic poems 2 and historicist catalog poems 3 (Yu and Unsworth, 2006). In these cases, only a small number of training examples are available, which requires the classifiers to learn fast and accurately.
Given the unique characteristics of literary text classification applications, we must ask whether the existing conclusions on classification method comparison still hold for literary text classification tasks.
This article describes an empirical evaluation of text classification methods in the literary domain. Based on the above use scenarios, this study evaluates the classification methods from three perspectives: classification accuracy, literary knowledge discovery, and potential for example-based retrieval. Because no benchmark data is available in this domain, the methods are compared on two specific sub-genre classification tasks as case studies, both focusing on identifying certain kinds of emotion, a document property other than topic.
The first task is eroticism classification of Emily Dickinson's poems. The debate over what counts as and constitutes the erotic in Dickinson has been a primary research problem in Dickinson studies for the last half century (Plaisant et al., 2006). To study the erotic language patterns in Dickinson's poems, a group of Dickinson scholars at the University of Maryland at College Park compiled a Dickinson erotic poem collection that consists of 269 XML-encoded letters comprising nearly all the correspondence between the poet Emily Dickinson and Susan Huntington (Gilbert) Dickinson, her sister-in-law. Long letters that involve both erotic and not-erotic content were excluded from the collection. The scholars assessed each letter as either erotic or not. 4 Eventually, 102 letters were labeled as erotic (positive), and 167 as not-erotic (negative).
The second task is sentimentalism classification of chapters in early American novels. Although the academic study of sentimental fiction has become well accepted in the past few decades, disagreement persists about what constitutes textual sentimentality and how to examine sentimental texts in serious criticism (Horton et al., 2006). To explore what linguistic patterns characterize the sub-genre of sentimentalism, two literary scholars at the University of Virginia constructed a collection of five novels from the mid-nineteenth-century sentimental period, which are generally considered to exhibit sentimental features: Uncle Tom's Cabin, Incidents in the Life of a Slave Girl, Charlotte: a Tale of Truth, Charlotte's Daughter, and The Minister's Wooing. The scholars assessed the sentimentality level of each of the 184 chapters as either 'high' or 'low'. Among them, ninety-five chapters were labeled as 'high' and eighty-nine as 'low'.
Two popular text classification algorithms, naïve Bayes and support vector machines (SVMs), are chosen as the subjects of evaluation. Existing studies indicate that SVMs are among the best text classifiers to date (Dumais et al., 1998; Joachims, 1998; Yang and Liu, 1999). Naïve Bayes is a simple but effective Bayesian learning method (Domingos and Pazzani, 1997), often used as a baseline algorithm. This study compares the performance of these two algorithms on the eroticism classification and sentimentalism classification tasks.
Algorithm selection is not the only factor that affects the classification result. The choice of text representation models and text pre-processing options also influences classification performance. The simplest bag-of-words (BOW) model is often used for text representation when no prior knowledge is available for a specific classification task. In fact, a number of studies have shown that complex features did not help statistical classifiers gain significant performance improvements (Lewis, 1992; Cohen, 1995; Dumais et al., 1998; Scott and Matwin, 1999). Under the BOW model a text document is converted into a vector of word counts. Without feature reduction, a document vector is often defined in a space of thousands of dimensions, each dimension corresponding to a word feature. In such a high-dimensional space, many features are of low relevance. Feature reduction is important in order to train classifiers with good generalizability as well as to reduce the computation cost. Stemming, stopword removal, and statistical feature selection are three common feature-reduction tools in text classification. Studies have shown that in some situations these tools interact with classification methods and consequently affect the classifiers' performance (Riloff, 1995; McCallum and Nigam, 1998; Mladenic and Grobelnik, 1999; Scott and Matwin, 1999; Mladenic et al., 2004). Based on the above considerations, this study combines the naïve Bayes and SVM algorithms with different choices of feature-reduction tools, and then examines whether these choices affect the algorithms' performance in literary text classification tasks.
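As an illustration of the BOW model described above, the following sketch (with an invented toy vocabulary and document) converts a text into a vector of word counts:

```python
from collections import Counter

def bow_vector(text, vocabulary):
    """Convert a document into a vector of word counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

# Toy vocabulary and document, for illustration only
vocab = ["heart", "love", "the", "night"]
doc = "The heart wants what the heart wants"
print(bow_vector(doc, vocab))  # -> [2, 0, 2, 0]
```

In the actual experiments the vocabulary runs to thousands of words, so sparse representations would typically replace the dense list used here.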
The rest of this article is organized as follows. Section 2 describes the text classification methods, the feature-reduction tools, and the evaluation measures used in this study. Section 3 describes the design of the evaluation experiments. Section 4 and 5 report the evaluation results in the eroticism classification and the sentimentalism classification tasks, respectively. Section 6 concludes with discussions of the evaluation results across the two case studies.
2 Classification Methods, Feature-reduction Tools and Evaluation Measures

Naïve Bayes and SVM classifiers
Naïve Bayes is a highly practical Bayesian learning method. It assumes that the feature values are conditionally independent given the target value, which significantly reduces the computation cost (Mitchell, 1997). Although real-world data (e.g. text data) often violate this assumption, naïve Bayes classifiers can still be optimal under zero-one loss even when the independence assumption is violated by a wide margin (Domingos and Pazzani, 1997). As a simple but effective method, naïve Bayes is often included in comparative evaluations of text classification methods (Dumais et al., 1998; Joachims, 1998; Yang and Liu, 1999; Sebastiani, 2002).
The naïve Bayes algorithm can be implemented in various ways. Two naïve Bayes variations are widely used in text classification: the multi-variate Bernoulli model and the multinomial model (McCallum and Nigam, 1998). The multi-variate Bernoulli model (abbreviated as 'nb-bool' in this article) uses word presence or absence (one or zero) as Boolean feature values. The multinomial model (abbreviated as 'nb-tf') uses word frequencies as feature values. Previous studies on topic classification tasks showed that the multi-variate Bernoulli model is more suitable for data sets with small vocabularies, while the multinomial model is better on larger vocabularies (Lewis, 1998; McCallum and Nigam, 1998). However, more recent work demonstrated that naïve Bayes classifiers with word presence or absence values performed better in predicting the opinion polarities of movie reviews (Pang et al., 2002). In this study, both target classes (eroticism and sentimentalism) are related to emotion; therefore, both naïve Bayes variations are implemented and compared based on the description in Mitchell (1997).
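The contrast between the two variations can be sketched as follows. This is a minimal, generic naïve Bayes trainer with Laplace smoothing, not the implementation evaluated in the study; the documents and labels are invented, and the simplified 'bool' variant only caps counts at one per document rather than modeling word absence as a true Bernoulli model does:

```python
import math
from collections import Counter, defaultdict

def train_nb(docs, labels, variant="tf"):
    """Minimal naïve Bayes trainer with Laplace smoothing.

    variant="tf"   counts every occurrence of a word (multinomial, 'nb-tf');
    variant="bool" counts a word at most once per document, a simplified
    stand-in for the multi-variate Bernoulli model ('nb-bool').
    """
    vocab = sorted({w for d in docs for w in d.split()})
    word_counts = defaultdict(Counter)          # class -> word -> count
    class_counts = Counter(labels)
    for d, y in zip(docs, labels):
        words = d.split()
        if variant == "bool":
            words = set(words)                  # presence/absence only
        word_counts[y].update(words)
    model, priors = {}, {}
    for y in class_counts:
        total = sum(word_counts[y].values())
        model[y] = {w: (word_counts[y][w] + 1) / (total + len(vocab)) for w in vocab}
        priors[y] = class_counts[y] / len(docs)
    return model, priors

def predict(model, priors, doc):
    """Pick the class with the highest log-posterior for the document."""
    def score(y):
        return math.log(priors[y]) + sum(
            math.log(model[y][w]) for w in doc.split() if w in model[y])
    return max(priors, key=score)

docs = ["love love heart", "night dark cold", "love sweet", "cold dark"]
labels = ["erotic", "not-erotic", "erotic", "not-erotic"]
model, priors = train_nb(docs, labels, variant="tf")
print(predict(model, priors, "love heart"))  # -> erotic
```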
SVMs are a family of supervised learning methods developed by Vapnik and colleagues based on the Structural Risk Minimization principle from statistical learning theory (Vapnik, 1982, 1999). As linear classifiers (with a linear kernel), SVMs aim to find the hyperplane that separates the data points of the two classes with the maximal margin. By aiming to minimize the generalization error, SVMs reduce the risk of overfitting. SVMs outperform other text classification methods in a number of comparative evaluations on topic classification tasks (Dumais et al., 1998; Joachims, 1998; Yang and Liu, 1999).
The SVM algorithm also allows for various kinds of word frequency measures as feature values, which results in multiple variations. In this study, the SVM algorithm is combined with four candidate text representations. The first one is 'svm-bool', which uses word presence or absence as feature value. The second one is 'svm-tf', which uses word (term) frequency as feature value. The third one is 'svm-ntf', which uses normalized word frequency as feature value. The last one is 'svm-tfidf', which uses term frequency weighted by inverse document frequency as feature value. The SVM-light package 5 and its default parameter settings are used in this study. Table 1 summarizes the combinations of classification algorithms and text representation models. For each algorithm, the variation with the best performance in the initial evaluation experiment will be used in the following experiments.
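The four feature-value schemes can be illustrated with a small sketch; the function name and toy documents are invented, and the normalization and tf-idf formulas are common textbook choices rather than SVM-light's exact definitions:

```python
import math
from collections import Counter

def feature_vectors(docs, vocab, scheme="tf"):
    """Build one feature vector per document under a given weighting scheme:
    'bool' (svm-bool), 'tf' (svm-tf), 'ntf' (svm-ntf), 'tfidf' (svm-tfidf)."""
    counts = [Counter(d.split()) for d in docs]
    n = len(docs)
    df = {w: sum(1 for c in counts if c[w] > 0) for w in vocab}  # document frequency
    vectors = []
    for c in counts:
        length = sum(c.values()) or 1
        row = []
        for w in vocab:
            tf = c[w]
            if scheme == "bool":
                row.append(1.0 if tf else 0.0)
            elif scheme == "tf":
                row.append(float(tf))
            elif scheme == "ntf":
                row.append(tf / length)             # frequency normalized by doc length
            else:                                    # "tfidf"
                idf = math.log(n / df[w]) if df[w] else 0.0
                row.append(tf * idf)
        vectors.append(row)
    return vectors

docs = ["a a b", "b c"]
print(feature_vectors(docs, ["a", "b", "c"], "bool"))  # -> [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
```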

Stemming
In text classification, the stemming process conflates a group of inflected words with the same stem into one single feature, assuming that they bear similar meanings. However, sometimes different forms of the same word contribute to the classification in different ways. For example, distinguishing the singular and plural forms of nouns and different verb tenses improved terrorism document classification (Riloff, 1995). Hence, stemming might affect text classification in both positive and negative ways (Scott and Matwin, 1999; Sebastiani, 2002). This study uses the Porter stemmer (Porter, 1980) to stem words. Complementary look-up tables for irregular nouns and verbs 6 are also used because the Porter stemmer does not handle irregular nouns and verbs.
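A crude sketch of the idea (not the Porter algorithm itself, whose suffix rules are far more elaborate) shows how a look-up table for irregular forms complements suffix stripping; the table entries and suffix list here are illustrative only:

```python
# A tiny illustrative look-up table for irregular forms; the study used
# fuller tables alongside the Porter stemmer.
IRREGULARS = {"women": "woman", "children": "child", "wrote": "write", "ran": "run"}

def simple_stem(word):
    """Crude suffix-stripping stemmer, a stand-in for the Porter algorithm.
    Irregular forms are looked up before any suffix rule applies."""
    w = word.lower()
    if w in IRREGULARS:
        return IRREGULARS[w]
    for suffix in ("ing", "ly", "ies", "es", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            if suffix == "ies":
                return w[:-3] + "y"       # difficulties -> difficulty
            return w[:-len(suffix)]
    return w

print(simple_stem("silently"), simple_stem("women"))  # -> silent woman
```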

The role of stopwords
In information retrieval, stopwords are extremely common words, such as 'the' and 'of', which are considered useless and are therefore removed from queries and documents (Baeza-Yates and Ribeiro-Neto, 1999). Since common words are mostly function words, the concepts 'common words' and 'function words' are often treated as synonyms. But they are actually overlapping rather than equivalent concepts. 'Common words' are defined and selected based on word frequencies in a specific collection; a common word in one collection might not be common in another. Function words are 'closed-class' word groups with constant membership. They do not carry concrete meaning, but they play an important role in grammar. Function words have proved useful for some text classification tasks. For example, the pronoun 'my' is a very useful word feature for identifying student homepages (McCallum and Nigam, 1998). Prepositions help identify joint venture documents (Riloff, 1995). Function words are even the major stylistic markers in genre analysis, stylistic analysis, and authorship attribution (Biber, 1988, 1995; Holmes, 1994; Argamon et al., 2003; Saric and

Statistical feature selection
Stemming and stopword removal are 'arbitrary' feature-reduction tools applied regardless of the classification task. Statistical feature-selection methods, by contrast, weight features by their relevance to the classes and select the features with the heaviest weights. Feature-selection methods are often used as pre-processing steps before classification, because they are assumed to be independent of classification methods (Yang and Pedersen, 1997; Joachims, 1998). However, Mladenic and Grobelnik (1999) found that feature-selection methods can interact with classification methods. For example, information gain has negative effects on naïve Bayes classifiers, while odds ratio fits naïve Bayes classifiers best. Forman (2003) found that no feature-selection method can improve the performance of SVM classifiers. Because both SVMs and naïve Bayes classifiers are linear classifiers, each of their features has a weight (coefficient) in the linear decision function. Therefore, both the SVM and naïve Bayes algorithms can be used as feature-selection methods as well (Guyon et al., 2002; Mladenic et al., 2004). The feature-weighting function in the naïve Bayes algorithm is actually the same as odds ratio (Mladenic and Grobelnik, 1999). This study uses the SVM and naïve Bayes algorithms themselves as feature-selection methods.
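Because both classifiers are linear, self-feature selection reduces to ranking features by the absolute value of their coefficients and keeping the top fraction. A minimal sketch, with hypothetical coefficients:

```python
def select_top_features(weights, keep_fraction):
    """Self-feature selection for a linear classifier: rank features by the
    absolute value of their coefficients and keep the top fraction."""
    ranked = sorted(weights, key=lambda w: abs(weights[w]), reverse=True)
    k = max(1, int(len(ranked) * keep_fraction))
    return ranked[:k]

# Hypothetical coefficients from a trained linear classifier
weights = {"you": 2.1, "i": -1.8, "dream": 0.9, "the": 0.05}
print(select_top_features(weights, 0.5))  # -> ['you', 'i']
```

A negative coefficient is as informative as a positive one, which is why the ranking uses the absolute value.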

Classification evaluation methods
Cross-validation and hold-out tests are the usual methods for classification result evaluation. N-fold cross-validation splits a data set into N folds and runs the classification experiment N times. Each time, one fold of data is used as the test set and the classifier is trained on the other N − 1 folds of data (Bland and Altman, 1995).
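The N-fold procedure can be sketched as follows; `train_and_test` is a placeholder callback standing in for any classifier, and the function names are invented for illustration:

```python
import random

def cross_validate(examples, labels, train_and_test, n_folds=10, seed=0):
    """N-fold cross-validation: each fold serves once as the test set and the
    classifier is trained on the other N - 1 folds; returns the mean accuracy.
    `train_and_test` is a callback returning one accuracy value per fold."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for fold in folds:
        test = set(fold)
        train_x = [examples[i] for i in idx if i not in test]
        train_y = [labels[i] for i in idx if i not in test]
        test_x = [examples[i] for i in fold]
        test_y = [labels[i] for i in fold]
        accuracies.append(train_and_test(train_x, train_y, test_x, test_y))
    return sum(accuracies) / n_folds
```

With a trivial majority-class callback and a single-class data set, the averaged accuracy is 1.0, a quick sanity check of the fold bookkeeping.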

3 Experiment Design
A series of experiments are designed to test the performance of naïve Bayes and SVM algorithms combined with different feature-reduction tools. The following experiments are run for both eroticism classification and sentimentalism classification tasks.

Experiment 1: document representation model selection
The purpose of this experiment is to choose the best text-representation model for each algorithm to use in the following experiments. Without prior knowledge, the initial feature set for the eroticism classification is the full vocabulary excluding the words occurring only once. According to the scholars' domain knowledge, the initial feature set for the sentimentalism classification is the content words: nouns (except proper nouns), verbs, adjectives, and adverbs. Rare words (frequency < 5) are excluded from the vocabulary. The Brill part-of-speech tagger (Brill, 1995) is used to extract the content words.

Experiment 2: using stopwords as feature sets
This experiment evaluates the usefulness of stopwords in the two classification tasks. There are two ways to evaluate the contribution of stopwords to classification. The first approach compares the accuracies before and after removing stopwords from the feature set. The second approach directly uses stopwords as an independent feature set for classification. In text classification, a large number of features are usually redundant (Joachims, 1998). If some features are removed and the classification accuracy does not change, it does not necessarily mean that these features are irrelevant, because similar features may remain in the feature set and continue to contribute to the classification. Hence, the second approach is used in this experiment. Two definitions of stopwords are examined respectively: the common word list generated from the Brown Corpus and the function word groups identified by the Brill part-of-speech tagger are used as independent feature sets for classification.

Experiment 3: stemming
This experiment evaluates the effect of stemming on classification performance at both the macro and micro levels. At the macro level, it examines whether the overall classification accuracies change significantly after stemming. At the micro level, it compares the contribution of individual features toward classification before and after stemming. For example, the features 'woman' and 'women' will be merged into one feature 'woman' after stemming. If 'woman' and 'women' are relevant to the classes (e.g. the eroticism in Dickinson's poems) in a similar way, this stemming-and-merging event should not negatively affect the classification result. Otherwise, if one word indicates 'erotic' and the other indicates 'not-erotic', the conflation would neutralize two discriminative features and degrade performance. The idea of stemming and merging word features is similar to word clustering: all words with the same stem are gathered into one cluster. To group words into clusters, Baker and McCallum (1998) measured the similarity between words' class distributions using Kullback–Leibler divergence (KLD); this study uses KLD in the same spirit to assess individual conflation events.
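One plausible way to score a conflation event is the Kullback–Leibler divergence between the two word forms' class distributions: a large value suggests the forms behave differently across the classes and should not be merged. The exact formulation used in the study may differ, and the counts below are invented:

```python
import math

def kld(p, q, eps=1e-12):
    """Kullback-Leibler divergence between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def class_distribution(pos_count, neg_count):
    """Distribution of a word form over the two classes."""
    total = pos_count + neg_count
    return (pos_count / total, neg_count / total)

# Invented counts: 'woman' mostly in erotic poems, 'women' mostly in non-erotic ones
p_woman = class_distribution(9, 1)
p_women = class_distribution(2, 8)
print(round(kld(p_woman, p_women), 3))  # large divergence -> a bad conflation
```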

Experiment 4: statistical feature selection
The effectiveness of feature selection is measured from two perspectives. The classification accuracy measures the relevance of the selected features. The feature-reduction rate measures the compactness of the selected feature subset. Feature-reduction rate describes the proportion of features removed from the original feature set. The reduced feature set has to cover all documents, which means no empty document vectors should be generated after feature reduction (Yang and Pedersen, 1997).

Experiment 5: learning curve and confidence curve
The learning-curve experiment measures how fast each classifier learns as the size of the training set grows. At each size the classification algorithm runs fifty times. Each time the specified percentage of data is randomly selected from the 90% training set. The fifty classification accuracies are then averaged at each training set size. At the beginning, the whole data set is split into ten folds. The above experiment is repeated on each fold. The averaged classification accuracies are used to draw the learning curve.
A linear classifier outputs a prediction value for each test example. This value indicates the distance between the test example and the decision hyperplane: the farther the data point is from the hyperplane, the more confident the prediction. In this sense, the distance is a kind of confidence index for the prediction. The confidence-curve experiment compares the confidence of each classifier's predictions on the same test data. The data set is randomly split into a 60% training set and a 40% test set. Each classifier's predictions are sorted by confidence in decreasing order. The confidence curve plots the classifier's prediction accuracy in the top 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, and 100% of the predictions. A slowly decreasing confidence curve means that the classifier is able to maintain high confidence for most of its predictions.
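The confidence-curve computation can be sketched as follows; the prediction distances and labels are invented for illustration:

```python
def confidence_curve(predictions, gold, cutoffs=(0.5, 1.0)):
    """Accuracy among the top fraction of most confident predictions.
    Each prediction is (confidence, label), where confidence is the absolute
    distance from the decision hyperplane."""
    order = sorted(range(len(predictions)), key=lambda i: -predictions[i][0])
    curve = []
    for c in cutoffs:
        k = max(1, int(len(order) * c))
        correct = sum(predictions[i][1] == gold[i] for i in order[:k])
        curve.append(correct / k)
    return curve

# Invented predictions: (distance to hyperplane, predicted label)
preds = [(3.0, "pos"), (2.0, "neg"), (1.0, "pos"), (0.5, "neg")]
gold = ["pos", "neg", "neg", "neg"]
print(confidence_curve(preds, gold))  # -> [1.0, 0.75]
```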

4 The Dickinson Erotic Poem Classification
The Dickinson data set contains 269 poems, among which 102 were labeled as erotic (positive) and 167 as not-erotic (negative). The original vocabulary of the Dickinson collection consists of 3984 unique words. A total of 1253 words remain after excluding the words that occur only once. Table 2 lists the classification accuracies of the SVM and naïve Bayes variations. The best accuracy values for SVM and naïve Bayes are highlighted in boldface. 'Svm-ntf' is the best representation for SVM. It is significantly better than 'svm-bool' and the majority baseline (P < 0.05, see P-values with asterisks in Table 3), but its differences with 'svm-tf' and 'svm-tfidf' are not significant (Table 3). For the naïve Bayes variations, 'nb-tf' is better than 'nb-bool' and the majority baseline, but the differences are not significant. 'Svm-ntf' and 'nb-tf' will be used in the following experiments.

Stopword features
The Brown stopword list consists of the 425 most common words in the Brown corpus. Of them, 306 are found in Dickinson's poems, but most of them are no longer 'common'. Table 4 compares the classification accuracies with different stopword feature sets. The 306 Brown stopword features work as well as the total 911 word features for the eroticism classification. The pronoun group has only twenty-nine words, but the classifiers with pronoun features achieve a level of accuracy close to those with all 911 features. 'You' and 'I' are the best individual predictors for erotic poems. In summary, stopwords are highly discriminative features for Dickinson erotic poem classification.

Stemming
Table 5 lists the classification accuracies before and after stemming. At the macro level, the feature set is reduced by 13%, but there is no significant accuracy change before and after stemming. At the micro level, Table 6 lists a few conflation events with the largest and smallest KLD values. Some events are good, such as merging 'silently' with 'silent'. Some conflations are bad (with large KLD values), such as merging 'hearts' with 'heart', 'women' with 'woman', and 'thinking' with 'think'. For some nouns, the singular forms are more relevant to erotic poems while the plural forms are more relevant to non-erotic poems. A possible explanation is that singular words like 'woman' and 'heart' are more self-portraying than their plural forms 'women' and 'hearts'.

A usual pre-processing step in text classification is to convert all words into lowercase. Dickinson is known for her unconventional capitalization. Many words, especially nouns, were capitalized no matter where they occurred. A Dickinson scholar explained it as an old-fashioned emphasis borrowed from German. This study examines the case merge as a special kind of word conflation. At the macro level, no significant classification accuracy change is observed after the case merge. At the micro level, there exist both good and bad case merges.
For some words, capitalization does not change their relevance to eroticism; for example, 'Dream' versus 'dream', 'Place' versus 'place', and 'Road' versus 'road'. For other words, the capitalized forms bear different meanings; for example, 'Joy' versus 'joy', 'Royal' versus 'royal', 'Red' versus 'red', and 'Love' versus 'love'. In these cases, Dickinson used the capitalized forms to describe general concepts in abstract thinking in non-erotic poems, while she used the lowercase forms to describe personal life scenarios in erotic poems.

For both the case-merging and stemming experiments, the overall classification accuracies do not change significantly. But this does not necessarily mean that all of these conflations do not matter. In fact, both good and bad conflations occur simultaneously, although their effects neutralize each other overall.

Statistical feature selection

Table 7 shows the naïve Bayes feature-selection results. For the stemmed feature set, the classification accuracy increases from 69.2% to 81.0% after self-feature selection. The paired t-test result shows that the accuracy difference is significant (t = 6.449, P < 0.001). However, naïve Bayes self-feature selection can only reduce the feature set by up to 40% without generating empty documents. For the not-stemmed feature set, feature selection improves the accuracy even more (from 68.5% to 82.5%). However, there is no significant difference between the feature-reduction results with or without stemming. Therefore, stemming does not significantly affect the naïve Bayes feature selection in this task (t = 0.675, P = 0.517).

After feature reduction, the SVM classification accuracies increase with some fluctuations (Table 8). The accuracy changes from 70.7% to 76.2% for stemmed features and from 71.4% to 77.0% for not-stemmed features. The improvements are significant (t = 3.143, P = 0.012), although not as large as the improvements for naïve Bayes. However, SVM yields a high feature-reduction rate; in fact, SVM with the top 10% of features performs better than SVM with the entire feature set.

To compare the two feature-ranking and selection methods in more detail, Fig. 1 plots the features with their SVM weights on the X-axis and naïve Bayes weights on the Y-axis. The two methods generally agree upon which features are 'erotic' or not, because most features fall into the first and third quadrants. However, there are only twenty-seven shared features between the two top-100 feature lists. Apparently, the two methods prefer different kinds of features as the top ones. Figure 2 plots the relation between the feature ranks and their weights. The feature weights are normalized as proportions of the top feature weight. The SVM feature weights decrease quickly and smoothly from the top rank to the bottom rank. The top features with the heaviest weights have a strong influence on the classification decisions; the remaining features are not important due to their small weights. This explains the SVM's high reduction rate from one aspect. On the contrary, there are large numbers of naïve Bayes features with the same weights. The feature values decrease slowly, and most features ranked in the middle still have heavy weights. This explains why naïve Bayes cannot achieve a high feature-reduction rate.
Both 'svm-ntf' and 'nb-tf' use word frequencies as feature values, normalized or not. Figures 3 and 4 plot the relations between feature ranks and their frequencies for both classifiers. Figure 3 (SVM) shows that high-frequency words accumulate at the top SVM feature ranks. Therefore, a small feature subset is enough to cover the whole collection without generating empty documents. In contrast, Figure 4 (naïve Bayes) shows that low-frequency words dominate the top naïve Bayes feature ranks. Most high-frequency words rank in the middle, so a larger feature subset is needed to avoid generating empty documents. In consequence, naïve Bayes cannot achieve a high reduction rate. The above relations between feature ranks and frequencies can be explained by the feature-ranking functions of the two methods. 'Nb-tf' uses the log probability ratio log p(w|pos)/p(w|neg) to measure feature weights (Mladenic and Grobelnik, 1999). For example, if words A and B occur in exactly the same documents, and B's occurrences in each document are always twice A's occurrences, 'nb-tf' would assign the same weights to A and B. 'Svm-ntf' uses the function w_j = Σ_{i=1..l} α_i y_i x_ij to measure feature weights. In this function, x_ij is the normalized frequency of word w_j in the support vector i, α_i is the support vector's non-negative coefficient, and y_i is its class label (1 or −1). Therefore, in the above example, 'svm-ntf' would assign word B double the weight of word A. In Dickinson's poems, most words are not frequent, but their frequency ratios in the two classes could be high. Naïve Bayes assigns heavy weights to these words while SVM devalues them.
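The contrast between the two weighting functions can be checked numerically with a small sketch; the counts and support-vector coefficients are invented, and the naïve Bayes ratio is left unsmoothed for clarity:

```python
import math

def nb_weight(freq_pos, freq_neg, total_pos, total_neg):
    """nb-tf weight: log p(w|pos)/p(w|neg), unsmoothed for clarity."""
    return math.log((freq_pos / total_pos) / (freq_neg / total_neg))

def svm_weight(alphas, labels, values):
    """svm-ntf weight for one feature: w_j = sum_i alpha_i * y_i * x_ij."""
    return sum(a * y * x for a, y, x in zip(alphas, labels, values))

# Word B always occurs twice as often as word A in the same documents:
# the naïve Bayes weights coincide, while the SVM weight doubles.
print(nb_weight(2, 1, 100, 100), nb_weight(4, 2, 100, 100))
print(svm_weight([1.0, 1.0], [1, -1], [0.04, 0.01]),
      svm_weight([1.0, 1.0], [1, -1], [0.08, 0.02]))
```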
The difference between the two feature-selection methods is also related to feature informativeness. Naïve Bayes selects words unique to each category as top features, which are usually low-frequency words. The scholars were surprised at first sight of these words (e.g. 'write', 'mine', and 'Vinnie'), but they managed to make sense of them later. A possible explanation is that these words occur so rarely that it is not hard for the scholars to associate them with their contexts and infer their relevance to eroticism. In contrast, SVM chooses high-frequency words as top features, such as the pronouns 'you', 'I', 'my', 'me', 'your', and 'her'. It is within the scholars' prior knowledge that pronouns are necessary to construct personal conversations. Although these features do not surprise the scholars, they exhibit the common characteristics of Dickinson's erotic poems. This result is consistent with the stopword experiment result, in that pronouns are highly discriminative features for eroticism classification.

Learning curve and confidence curve
Both classifiers' learning curves and confidence curves are plotted in Figs 5 and 6. Figure 5 shows that naïve Bayes learns faster than SVM in this task. However, neither learning curve levels off, which indicates that the classifiers need more training data to reach stable performance. In Fig. 6, the classification accuracies decrease at a similar speed with the decrease of confidence for both algorithms.

5 The Sentimental Chapter Classification

According to the scholars' domain knowledge, the initial feature set for this task is limited to content words. But they suggested that proper nouns be excluded from the feature set because most of them are character names. A sentimentalism classifier aims to learn the sentimental language rather than the character designs in particular novels. The sentimentalism collection consists of 184 chapters, among which ninety-five chapters have a high level of sentimentality and eighty-nine have a low level. The original vocabulary contains 19,585 word tokens. Because the average chapter length is much longer than that of the Dickinson poems, the minimum word frequency is arbitrarily set to 5. Again, the Brill tagger is used to extract content words: nouns (without proper nouns), verbs, adjectives, and adverbs. Eventually, the feature set consists of 5,704 words.

Text representation model selection
Boolean feature representations are the best for sentimentalism classification (Table 9). Both 'svm-bool' and 'nb-bool' (see their P-values with asterisks in Table 10) are significantly better than the majority baseline, but their differences from the other SVM and naïve Bayes variations are not significant (Table 10). 'Svm-bool' and 'nb-bool' are therefore used in the following experiments.

Stopword removal
Table 11 lists the classification accuracies with different stopword groups as feature sets. Neither the Brown stopwords nor the function-word groups achieved accuracies significantly higher than the trivial majority baseline for either algorithm. This result confirms the scholars' heuristic that content words are more relevant in this case.
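The Boolean ('bool') representation used by the winning runs can be sketched as below, assuming scikit-learn; the four documents and their labels are toy stand-ins, not the novel chapters. Each feature records presence or absence of a word rather than its count.

```python
# Sketch of the Boolean representation: repeated occurrences of a word
# collapse to a single presence flag. Toy corpus, hypothetical labels.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["sorrow sorrow tears", "beloved agony die",
        "payment ledger account", "wheel difficulty road"]
labels = [1, 1, 0, 0]  # 1 = highly sentimental, 0 = low (toy)

bool_vec = CountVectorizer(binary=True)  # 'svm-bool' style features
Xb = bool_vec.fit_transform(docs)        # repeated 'sorrow' becomes 1

clf = LinearSVC().fit(Xb, labels)
majority_baseline = max(labels.count(0), labels.count(1)) / len(labels)
```

A run's accuracy would then be compared against `majority_baseline` with a significance test, as in Tables 9 and 10.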

Stemming
For sentimentalism classification, the accuracies of both classifiers do not change significantly after stemming. Stemming reduces the feature-set size by 36% (Table 12). Some conflations are good, such as merging 'difficulties' with 'difficulty' and 'wheels' with 'wheel'. Others are bad, such as merging 'wildness' with 'wild' and 'pitying' with 'pity' (Table 13). 'Wildness' is used exclusively in highly sentimental chapters, while 'wild' occurs with similar frequencies in both highly and weakly sentimental chapters.
Following is an example of the word 'wildness' in a sentimental chapter: 'There was a piercing wildness in the cry …'
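The conflation problem can be demonstrated with a crude suffix stripper. This is not the stemmer used in the article, only a self-contained sketch showing how a discriminative word ('wildness', exclusive to sentimental chapters) gets merged with a non-discriminative one ('wild'), neutralizing it as a feature.

```python
# Crude suffix-stripping sketch (illustrative only, not the article's
# stemmer): 'wildness' and 'wild' conflate, as do 'pitying' and 'pity'.
def crude_stem(word):
    for suffix in ("ness", "ing", "ies", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            stem = word[:-len(suffix)]
            return stem + "y" if suffix == "ies" else stem
    return word

# Bad conflations: a class-exclusive word merges with a neutral one.
assert crude_stem("wildness") == crude_stem("wild") == "wild"
assert crude_stem("pitying") == crude_stem("pity") == "pity"
# Good conflation: genuine morphological variants merge.
assert crude_stem("difficulties") == crude_stem("difficulty")
```

After such a merge, the conflated feature inherits the mixed frequency profile of both surface forms, which is why stemming can hurt later feature selection.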

Statistical feature selection
For the naïve Bayes algorithm, the classification accuracy with stemmed features increases from 70.2% to 88.0% after self-feature selection (Table 14). The performance difference is significant (t = 7.796, P < 0.001). The feature-reduction rate is as high as 80%. Feature selection without stemming produces an even larger accuracy improvement (from 65.4% to 92.4%) and a higher reduction rate (90%). However, stemming does not significantly affect the naïve Bayes feature-reduction results. For the SVM algorithm, the classification accuracy with stemmed features fluctuates as the feature-reduction rate increases (Table 15). There is no significant accuracy improvement after feature reduction. Without stemming, however, the SVM classification accuracy steadily improves as the feature-reduction rate increases. With the top 10% of non-stemmed features, the SVM classifier achieves 94.1% accuracy.
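A feature-selection step of this kind can be sketched as follows, assuming scikit-learn. The chi-square scorer below stands in for whatever criterion the classifiers' self-feature selection used, and the data are synthetic: each document gets one telltale class word so that selection has something to find.

```python
# Sketch of statistical feature selection: keep only the top 10% of
# features, mirroring the setting where the SVM peaked at 94.1%.
# Synthetic corpus; chi-square is an assumed stand-in scorer.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
vocab_words = [f"w{i}" for i in range(50)]
docs, labels = [], []
for i in range(20):
    label = i % 2
    words = list(rng.choice(vocab_words, size=15))   # background noise
    words.append("sorrow" if label else "ledger")    # one telltale word
    docs.append(" ".join(words))
    labels.append(label)

X = CountVectorizer().fit_transform(docs)
selector = SelectPercentile(chi2, percentile=10)     # keep top 10%
X_small = selector.fit_transform(X, labels)
clf = LinearSVC().fit(X_small, labels)
```

The telltale words score highest under chi-square and survive the cut, so the classifier loses little despite the roughly 90% reduction.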
Why does stemming affect SVM feature selection in sentimentalism classification but not in eroticism classification? Recall that stemming reduces the features by 13% for SVM eroticism classification, but the reduction rate is 36% for SVM sentimentalism classification. The stemming effect on subsequent feature selection is therefore much stronger in sentimentalism classification.

Figure 7 plots the relation between the feature weights measured by the two feature-selection methods. The points disperse away from the diagonal toward both ends of the axes; in other words, the two weighting measures basically agree on the lightly weighted features but disagree on which features should carry the heaviest weights.

Figure 8 plots the feature ranks and their weights for both classifiers. As in Fig. 2 for the Dickinson erotic poem classification, the SVM feature weights decrease quickly and smoothly from the top ranks to the bottom ranks. This time the relations between feature ranks and weights are similar for naïve Bayes and the SVM, except that the SVM feature weights decrease faster. This is consistent with the finding that the two methods have similar feature-reduction rates for sentimentalism classification.

Figures 9 and 10 plot feature ranks and their frequencies for both classifiers. As in Fig. 3 for the Dickinson erotic poem classification, the top naïve Bayes features (Fig. 10) are all low-frequency words, while the frequencies of the top SVM features (Fig. 9) are more evenly distributed across the range. Because both algorithms use Boolean feature values this time, the frequencies shown in Figs 9 and 10 are the words' document frequencies.7
Both SVM and naïve Bayes classifiers include some sentimental words in their top feature lists, such as 'die', 'sorrow', 'beloved', and 'agony'. However, many features in both lists do not seem 'sentimental' to the literary scholars, such as 'to-morrow', 'paternal', and 'payment'. The novel chapters are generally longer than the Dickinson poems, so it is not surprising to find weakly sentimental text snippets mixed with highly sentimental ones. As a consequence, some words that are not themselves sentimental are also measured as sentimental because of their sentimental context.

Learning curves and confidence curves
Figures 11 and 12 show the learning curves and confidence curves of both classifiers. Figure 11 shows that the SVM learning curve starts with low accuracy but improves quickly as the number of training examples increases. Figure 12 shows that the confidence level of naïve Bayes predictions decreases more slowly than that of the SVM. Overall, naïve Bayes has the better learning curve and confidence curve in sentimentalism classification.
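How a learning curve of this kind is traced can be sketched in a few lines: train on growing slices of the data, score on a fixed held-out set, and watch whether accuracy levels off. The data below are synthetic (one perfectly predictive Boolean feature plus noise), assuming scikit-learn; the article's curves came from the poem and chapter collections.

```python
# Sketch of tracing a learning curve on synthetic Boolean data.
# A curve that keeps rising (no plateau) suggests more training data
# is needed, as observed for the classifiers in this study.
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(1)
n, d = 200, 30
X = rng.integers(0, 2, size=(n, d))
y = X[:, 0]                       # class determined by one feature
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

curve = []
for m in (10, 50, 100, 150):      # increasing training-set sizes
    clf = BernoulliNB().fit(X_train[:m], y_train[:m])
    curve.append(clf.score(X_test, y_test))
```

Plotting `curve` against the training sizes gives the learning curve; a confidence curve instead bins test examples by the classifier's prediction confidence and plots accuracy per bin.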

Conclusion
The evaluation results in this study demonstrate that SVMs are not all winners in literary text classification tasks. Both SVM and naïve Bayes classifiers achieved high accuracies in sentimental chapter classification, but the naïve Bayes classifier outperformed the SVM in erotic poem classification. Self-feature selection helped both algorithms improve their performance in both tasks. However, the two algorithms selected relevant features in different frequency ranges and therefore captured different characteristics of the target classes. The naïve Bayes classifiers prefer words unique to the classes, which are often infrequent. In contrast, SVMs prefer frequent, discriminative words, which are scarce in some genres such as poetry. For the purpose of feature-relevance analysis, the two methods should therefore be used as complements to each other rather than one in place of the other.
High classification accuracy is not necessarily associated with good generalizability. Despite its high accuracy in erotic poem classification, the naïve Bayes classifier is not a good example-based eroticism retrieval tool. Its learning curve does not level off as the number of training examples increases, which indicates limited generalizability; in other words, this classifier is only good for summarizing the characteristics of the training data. Both algorithms show high potential for example-based sentimentalism retrieval because of their quickly rising learning curves and strong confidence in their predictions.
The evaluation results in this study also suggest that arbitrary feature-reduction steps such as stemming and stopword removal should be applied with great care. Some stopwords were highly discriminative features for erotic poem classification. In sentimental chapter classification, stemming undermined subsequent feature selection by aggressively conflating and neutralizing discriminative features.
Overall, while text classification methods are very promising for literary text analysis, empirical findings about these methods obtained in other domains should be carefully re-examined before being applied to the literary domain.