SURFACE at Syracuse University

Experiments were conducted to test several hypotheses on methods for improving document classification for the malicious insider threat problem within the Intelligence Community. Bag-of-words (BOW) representations of documents were compared to Natural Language Processing (NLP) based representations in both the typical and one-class classification problems using the Support Vector Machine algorithm. Results show that the NLP features significantly improved classifier performance over the BOW approach both in terms of precision and recall, while using many fewer features. The one-class algorithm using NLP features demonstrated robustness when tested on new domains.


Introduction
This paper reports on further developments in the research [1,2] that leverages Natural Language Processing (NLP) and Machine Learning (ML) technologies to improve one aspect of security within the Intelligence Community (IC). This would be done by monitoring insiders' workflow documents and alerting the system assurance administrator if the content of the documents shifts away from what is expected, given the insiders' assignments. This capability is being implemented as one piece of a tripartite system prototype within the context of the ARDA-funded project, A Context, Role and Semantic-based Approach for Countering Malicious Insider [3]. In particular, we evaluate the applicability of a one-class categorization algorithm, Support Vector Machines (SVM), which, unlike a regular classifier, is trained on 'typical' examples only and then used to detect both 'typical' and 'atypical' data. This is warranted by the context of the problem, where the potential subject domain of interest to the malicious insider is unknown in advance and, therefore, it is not feasible to provide 'off-topic' examples to train a classifier.

Problem Background
It is known from Subject Matter Experts (SMEs) from the IC that analysts operate within a mission-based context, focused mainly on specific topics of interest (TOIs) and geo-political areas of interest (AOIs). The information accessed by analysts ranges from news articles to analyst reports, official documents, emails, and queries. The role and the task assigned to the analyst dictate the scope of their TOI/AOI. Within this mission-focused context, our hypothesis is that the ML-based text categorization of documents produced by the NLP-based semantic analysis of texts will enable a system to measure the extent to which an insider's document workflow is within the scope of the assigned task, in terms of TOI and AOI.
To illustrate the problem, consider the following "Threat Scenario", which is one of the six developed by the project team, based on a review of known malicious insider cases and consultations with the IC. An analyst with appropriate security clearance works on problems dealing with the Biological Weapons Program (TOI) in Iraq (AOI). For some reason, the analyst begins collecting information on ballistic missiles in North Korea. Since the topic is beyond his assigned task, these actions are covert, interspersed with his 'normal', 'on-topic' communications. Now and then he would query a database and retrieve documents on North Korea's missiles; occasionally, he would send a question to another analyst from the North Korea shop and receive documents via email; to pass the information to his external partners, he would copy data to a CD or print documents out. As these actions involve such textual artifacts as documents, database queries, and emails, analysis of their semantic content should be indicative of which topics are of interest to the analyst. Further comparison of these topics to what is expected, given the analyst's task, would reveal whether they are beyond the expected scope.
In addition to monitoring insiders' communications, semantic analysis can be run ex post facto if an information assurance engineer grows suspicious of an individual. Alternatively, it can help quickly characterize large collections of documents by separating them into semantically driven categories for a wide range of applications.
It is important to note that the system will not replace human supervisors, but assist them by reducing the data to analyze to just the detected 'anomalies'.

Related Work
Until recently, the problem of detecting malicious insider activity was mainly approached from the cyber security standpoint, with systems as the main object of potential attack [4,5]. The 2003 and 2004 Symposia on Intelligence and Security Informatics (ISI) demonstrated an increased appreciation of information as an important factor of national security. As information is often represented through textual artifacts, linguistic analysis has been applied to the problems of cyber security. Sreenath [6] showed how reconstruction of users' queries from their online logs with latent semantic analysis can be applied to detect malicious intent. Studies also looked at linguistic indicators of deception in interview transcripts [7], email messages [8], and online chat [9]. Bengel [10] applied classification algorithms to the task of chat topic detection.
Another line of text classification research addresses situations when providing 'negative' examples for training is not feasible, for example, in intrusion detection [11], adaptive information filtering [12,13], and spam filtering [14]. Recently, research effort has focused on application of a one-class categorization algorithm, which is trained on positive examples only and then tested on the data that contain both positive and negative examples. Conceptually, the task is to acquire all possible knowledge about one class and then apply it to identify examples that do not belong to this class. As the one-class Support Vector Machines (SVM) [15] was shown to outperform other algorithms [12,13,16], it was chosen for our experiments. The novelty of our approach is in evaluating its effectiveness on various sets of features selected to represent documents. In particular, we compared the BOW representation with different combinations of linguistic features generated using NLP techniques.

Proposed Solution
The task of identifying 'off-topic' documents is modeled as a text categorization problem. Categorization models of expected topics are first built from the semantic content of a given set of documents, representing the analyst's 'normal' workflow. New documents are then categorized as on- or off-topic based on their semantic similarity to this Expected Model. The effectiveness of the solution is dependent on how well we can model expected communications, as well as on the accuracy of the categorization model and its generalizability to new documents. The most commonly used document representation has been the BOW [17,18]. It has been shown that the knowledge of statistical distribution of terms in texts is sufficient to achieve high classification performance. However, in situations where the available training data is limited (as is frequently true in real-life applications), classification performance on BOW suffers. Our hypothesis is that the use of fewer, more discriminative linguistic features can outperform the traditional bag-of-words representation, particularly in the case of limited training data.
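The idea of categorizing new documents by their similarity to an Expected Model built only from 'normal' documents can be illustrated with a minimal sketch. It is not the SVM-based method evaluated in this paper: as a stand-in, it uses a simple centroid of term counts and a cosine-similarity threshold. The function names, the toy documents, and the threshold value of 0.2 are all assumptions made for illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Turn a text into a sparse term-frequency vector (a plain bag of words)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dict-like objects."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_expected_model(on_topic_texts):
    """Sum the vectors of the 'normal' workflow documents into one centroid."""
    centroid = Counter()
    for text in on_topic_texts:
        centroid.update(vectorize(text))
    return centroid


def categorize(text, expected_model, threshold=0.2):
    """Label a new document by its similarity to the Expected Model.

    The threshold is an arbitrary illustrative value, not a tuned parameter.
    """
    similarity = cosine(vectorize(text), expected_model)
    return "ON" if similarity >= threshold else "OFF"


# Toy 'normal' workflow for an analyst assigned to Iraq / biological weapons.
training = [
    "iraq biological weapons inspection report",
    "unmovic inspectors visit iraq biological laboratory",
]
model = build_expected_model(training)

on_label = categorize("iraq biological weapons program update", model)
off_label = categorize("north korea ballistic missile test", model)
print(on_label, off_label)
```

Only on-topic documents are needed to build the model, which mirrors the constraint that off-topic examples are unavailable in advance.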
The novelty of the proposed approach is in using linguistic features either extracted or assigned by our NLP-based system [19]. Such features include entities (nouns and noun phrases), named entities (proper names), and their semantic categories (e.g. PERSON, ORGANIZATION). Furthermore, the system can map these features into higher-level concepts from external knowledge sources, particularly those indicative of TOI and AOI. By utilizing these more abstract features, the system can produce document vectors that are well separated in the feature space.
The NLP analysis is performed by TextTagger, a text processing system built at the Center for Natural Language Processing (CNLP) [20]. The system employs a part-of-speech tagger and a sequence of rule-based shallow parsing phases that use lexico-semantic and syntactic clues to identify and categorize entities, named entities, events, as well as relations among them. Next, individual topics and locations are mapped to appropriate categories from knowledge bases. The choice of knowledge bases was driven by the project context. Concept inference for TOI is supported by an ontology developed for the Center for Nonproliferation Studies' (CNS) [21] collection of documents from the weapons of mass destruction (WMD) domain. For the conceptual organization of AOI, we utilize the SPAWAR Gazetteer [22]. Given that analysts usually operate on the country-level of AOI, the inference for geographical concepts is set to the 'Country' level, but other levels of granularity are possible. The entity and event extractions are output as frames, with relation extractions as frame slots.

Sample input sentence: Authorities suspect the Bavarian Liberation Army, an extreme right-wing organization, may be responsible.
Sample extraction:
Bavarian Liberation Army
Country=Austria
CNS_Superclasses=Terrorist-Group

The NLP-extracted features are then used to generate document vectors for machine learning algorithms.
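As a rough illustration of how extraction frames might be turned into document features, the sketch below contrasts a bag-of-words vector with a much smaller vector of inferred concepts. The frame layout (a Python dict), the `AOI=`/`TOI=` feature prefixes, and the mapping of the `Country` and `CNS_Superclasses` slots to AOI and TOI are assumptions made for this example; the actual TextTagger output format is not specified here.

```python
from collections import Counter

# Hypothetical frame, shaped after the sample extraction shown above.
frames = [
    {
        "entity": "Bavarian Liberation Army",
        "Country": "Austria",
        "CNS_Superclasses": "Terrorist-Group",
    },
]


def bow_features(text):
    """Baseline: every unique word in the text becomes a feature."""
    cleaned = text.lower().replace(",", " ").replace(".", " ")
    return Counter(cleaned.split())


def concept_features(extracted_frames):
    """Keep only the inferred AOI and TOI concepts found in the frames."""
    features = Counter()
    for frame in extracted_frames:
        if "Country" in frame:
            features["AOI=" + frame["Country"]] += 1
        if "CNS_Superclasses" in frame:
            features["TOI=" + frame["CNS_Superclasses"]] += 1
    return features


sentence = (
    "Authorities suspect the Bavarian Liberation Army, an extreme "
    "right-wing organization, may be responsible."
)

bow = bow_features(sentence)
concepts = concept_features(frames)
print(len(bow), "bag-of-words features")
print(len(concepts), "concept features:", dict(concepts))
```

Even on a single sentence, the concept representation is far more compact than the word-level one, which is the property the feature-count comparison in the experiments relies on.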

Experimentation Dataset
Experiments were run on a subset of the larger Insider Threat collection created for the project. Its core comes from the CNS collection and covers such topics as WMD and Terrorism, and such genres as newswires, articles, analytic reports, international treaties, emails, and so on. Training and Testing document sets were drawn from the collection based on the project scenarios. These scenarios are synthetic datasets that represent the insiders' workflow through atomic actions (e.g. 'search database', 'open document'), some of which are associated with documents. The scenarios span a period of six months each and include a baseline case (with no malicious activity) and six threat cases. The scenarios cover the workflow of hundreds of insiders with different work roles and tasks; for our experiments, we focused on one analyst from the Iraq/Biological Weapon shop. The above described Threat Scenario set the base for the Training and Testing datasets.
The documents were retrieved in a manner simulating the analysts' work: manually constructed task-specific queries (Figure 2) were run against the Insider Threat collection. Sets of such queries were also included in the Training and Testing datasets.

(a) +UNMOVIC +inspect* +biolog* +Iraq*
(b) +missile +test* North +Korea
Figure 2. Sample task-specific queries.

Both sets included 'noise' (webpages on topics of general interest) as it is realistic to assume that, in the course of their workday, analysts may use the Web for personal reasons as well.
Documents retrieved by the 'North Korea' queries were labeled as OFF-topic. All other documents were labeled as ON-topic, since, for the purposes of the project, it will suffice if the classifier distinguishes the 'off-topic' documents from the rest. The Training set contained only ON-topic documents, whereas the Testing set also included OFF-topic documents. Table 1 shows the content and the volume of the resulting Training and Testing datasets. The relatively small share of OFF-topic documents in the Testing set (only 8.4%), though realistic given the context of the project, represented yet another challenge, as classification algorithms tend to favor more populated classes.
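The labeling rule described above, in which a document's label follows from the query that retrieved it and the training set keeps ON-topic documents only, can be sketched as follows. The document identifiers, the tiny document counts, and the substring test on "Korea" are invented for illustration and do not reflect the actual dataset or tooling.

```python
def label_by_query(query, off_topic_terms=("Korea",)):
    """Return 'OFF' if the retrieving query targets an off-topic area, else 'ON'."""
    return "OFF" if any(term in query for term in off_topic_terms) else "ON"


# (document id, query that retrieved it); ids and counts are invented.
retrieved = [
    ("doc-01", "+UNMOVIC +inspect* +biolog* +Iraq*"),
    ("doc-02", "+UNMOVIC +inspect* +biolog* +Iraq*"),
    ("doc-03", "+missile +test* North +Korea"),
    ("doc-04", "noise: general-interest web page"),
]

labeled = [(doc_id, label_by_query(query)) for doc_id, query in retrieved]

# The one-class setting trains on ON-topic documents only.
training_candidates = [doc_id for doc_id, label in labeled if label == "ON"]

# Share of OFF-topic documents, i.e. the class imbalance of the labeled pool.
off_share = sum(label == "OFF" for _, label in labeled) / len(labeled)
print(training_candidates, f"{off_share:.1%}")
```

Note that 'noise' documents fall into the ON-topic class under this rule, consistent with the goal of separating only the off-topic documents from everything else.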

Classification experiments
For classification experiments, we used an SVM classifier not only because it has been shown to outperform kNN, Naïve Bayes, and other classifiers on the Reuters Collection [23,24], but also because it can handle one-class categorization problems. Experiments were run in LibSVM [25], modified to handle file names in the feature vectors and to compute a confusion matrix for evaluation. We experimented with the following feature sets:
1. Bag-of-words representation (BOW): each unique word in the document is used as a feature in the document vector.
The results of the experiments can be represented in a confusion matrix (Table 2), where TrueON are documents correctly classified as ON-topic; FalseON are OFF-topic documents assigned to the ON-topic class; TrueOFF are correctly detected OFF-topic documents; and FalseOFF are ON-topic documents misclassified into the OFF-topic class. Classifier performance was assessed using standard metrics of precision and recall [26] and a weighted F-score, calculated for each class. Figure 3 shows sample formulas for precision on the ON-topic class (1) and recall of the OFF-topic class (2).
$$\text{Precision}_{ON} = \frac{TrueON}{TrueON + FalseON} \quad (1)$$
$$\text{Recall}_{OFF} = \frac{TrueOFF}{TrueOFF + FalseON} \quad (2)$$
In mainstream text categorization research, the performance focus is usually on the 'positive' class, so the scores (precision, recall, F-measure) are often reported for this class only. The context of our project, however, gives much greater importance to detecting the 'negative' (i.e. potentially malicious) cases, while keeping the rate of 'false alarms' (FalseOFF) down. This provided a rather uncommon task for training the classifier: to aim not only for higher precision on ON-topic, but also for greater recall of OFF-topic. Therefore, in evaluating the classifier, we focused on the scores for the OFF-topic class. For this class, the F-measure was calculated with the weight β=10 (i.e. Recall was weighted as 10 times more important than Precision). The F-score for the ON-topic class was calculated using the standard weight β=1.
Figure 4 shows the F-measure formula used. The actual value of β is not significant as long as it is greater than one, since it then places a higher emphasis on recall than on precision, and the F-score is not used to tune parameters of the learning algorithm.
$$F\text{-score} = \frac{(\beta + 1) \cdot Precision \cdot Recall}{\beta \cdot Precision + Recall}$$
The results (Table 3) demonstrate that, similarly to what was observed in experiments with the regular SVM classifier [2], document representations using AOI/TOI features only (AOI/TOI) or in combination with domain-important categories (AOI/TOI_cat) improve the classifier performance over the baseline (BOW), while using many fewer features. In particular, AOI/TOI shows over 5% improvement in Recall (OFF) while using forty-nine times fewer features. Using a combination of AOI/TOI and category information (AOI/TOI_cat) achieves 16% improvement on Recall (OFF) and over 12% improvement on the weighted F-OFF over the baseline with nine times fewer features than BOW.
Although the decision to switch from the regular to the one-class SVM was guided by the context of our project, it was supported by the significantly higher performance of the one-class SVM on the OFF-topic class (Table 4). Regular SVM suffered from training on a weakly representative set for the OFF-topic class. Considering that the one-class SVM was able to achieve up to 94% recall of 'off-topic' examples with no prior knowledge of what constitutes 'off-topic', the improvement is impressive. The downside of such a high recall of the OFF-topic class, however, was the deteriorated recall of the ON-topic class. In other words, the one-class SVM errs in favor of the previously unknown 'negative' class, thus causing 'false alarms'.
Next, as in our experiments with the regular SVM [2], we wanted to assess how the one-class SVM would perform on a different 'off-topic' domain. We used the same Training set, and the ON-topic part of the Testing set. For the OFF-topic part of the Testing set, the documents were retrieved from the Insider Threat dataset with queries on the topic of 'China/Nuclear weapons' (Table 5). Experimental results (Tables 6 and 7) support the trend observed in the prior experiments.
One-class categorization on the NLP-enhanced document representations achieves superior performance, particularly on the 'off-topic' class, compared to the baseline (BOW). Moreover, the domain change for the 'off-topic' documents does not seem to impact the classifier performance to a significant extent, whereas it did affect the regular SVM. Such robustness is quite reasonable, since the one-class SVM is not biased (via training) towards a particular kind of 'negative' data. Overall, the results show that the one-class SVM performs impressively well, especially on recall of the OFF-topic class. Another important point is that the algorithm appears robust enough to handle different subject domains of 'negative' examples. We believe, therefore, that it can be effectively applied to categorization problems where only 'positive' examples are available. The results also demonstrate that the use of NLP-based features achieves better performance in categorization while using many fewer features than the commonly used bag-of-words representation.
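The per-class evaluation used in this section can be sketched in a few lines: precision and recall computed from the four confusion-matrix cells, and the weighted F-score in the (β+1)·P·R / (β·P + R) form given earlier. The cell counts below are invented purely to exercise the functions and are not results from the experiments; the function names are likewise illustrative.

```python
def ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def weighted_f(precision, recall, beta=1.0):
    """Weighted F-score: (beta + 1) * P * R / (beta * P + R)."""
    return ratio((beta + 1) * precision * recall, beta * precision + recall)


def evaluate(true_on, false_on, true_off, false_off, beta_off=10.0):
    """Per-class scores from the confusion matrix cells.

    false_on:  OFF-topic documents assigned to the ON-topic class.
    false_off: ON-topic documents assigned to the OFF-topic class.
    """
    precision_on = ratio(true_on, true_on + false_on)
    recall_on = ratio(true_on, true_on + false_off)
    precision_off = ratio(true_off, true_off + false_off)
    recall_off = ratio(true_off, true_off + false_on)
    return {
        "precision_on": precision_on,
        "recall_on": recall_on,
        "f_on": weighted_f(precision_on, recall_on, beta=1.0),
        "precision_off": precision_off,
        "recall_off": recall_off,
        "f_off": weighted_f(precision_off, recall_off, beta=beta_off),
    }


# Invented counts, for illustration only.
scores = evaluate(true_on=80, false_on=2, true_off=8, false_off=10)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

With β=10 the OFF-topic F-score sits much closer to recall than to precision, so a classifier that raises many false alarms but misses few off-topic documents still scores well on this measure.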

Conclusion and directions for future research
The experiments described herein show that leveraging one-class SVM with the NLP-extracted features for document representation improves classification effectiveness and efficiency. In future research we will seek to evaluate the impact of different combinations of linguistic features, extractions from text, and concepts inferred from external knowledge bases on categorization accuracy. In addition, to further explore the robustness of the one-class classifier, we plan to test it on a combination of different subject domains for the 'off-topic' class.
The one-class approach is particularly well suited to situations where it is not feasible to provide 'atypical' examples. Overall, the research reported herein holds potential for providing the IC with the analytic tools to recognize anomalous insider activity, as well as to build content profiles of vast document collections when applied in a broader context.