ITEA 2016 was held in Reston, VA, October 3-5!
We gave the talk "Applied Text Analytics for Test and Evaluation."
Your test may produce text or other unstructured data, such as free-form survey fields, mission reports, failure/incident reports, or other known text-intensive data elements. There may also be opportunities to exploit unstructured data from voice-to-text transcriptions of events, previous test reports, web scraping, or other sources of potentially valuable information that is just waiting to be analyzed. Text analytics may not be required if only a few open-ended survey questions are given to a handful of operators, but in many instances the text corpus will be too large to analyze efficiently without automated methods.
This tutorial will provide an introduction to exploring text data in order to discover relationships that are useful in supporting the test process.
We will demonstrate methods to quickly preprocess text data, including parsing, correcting misspellings, creating synonym groups, and specifying custom stop words that we do not want included in the analysis. We can easily create word frequency plots and word clouds that provide some insight into topics and themes. Once we create a document-term matrix (DTM), we have opened the gateway to advanced techniques. Multivariate statistical methods decompose this sparse DTM into a modest number of columns (factors) that allow for topic/theme extraction, grouping similar documents (e.g., survey subjects), grouping closely associated words, and finding which words are closest to a specific word. We will show how the text data can be transformed into a set of predictor variables that help predict a response variable. We will also explore sentiment analysis techniques to characterize overall feelings and attitudes based on the input text, and use association analysis (market basket analysis) to determine commonly occurring word groupings. The workshop will use relevant example surveys and incident reports and demonstrate text analytics using software packages that are common across the test community, such as R and JMP.
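To make the preprocessing and DTM steps concrete, here is a minimal sketch. It is written in Python using only the standard library, not in the R or JMP tools used in the workshop, and it is not the workshop's own code. The comments, the stop word list, the synonym mapping, and the function names (`tokenize`, `build_dtm`, `term_frequencies`) are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical free-text survey comments, invented for illustration.
DOCS = [
    "The display froze during the mission and the screen went blank.",
    "Radio failure occurred twice; the display was fine.",
    "Screen froze again. Operators reported the radio worked well.",
]

# Custom stop words to exclude from the analysis (assumed list).
STOP_WORDS = {"the", "and", "was", "a", "during", "again",
              "went", "occurred", "twice"}

# Synonym groups: map variant terms onto one canonical term (assumed mapping).
SYNONYMS = {"screen": "display", "operators": "operator"}


def tokenize(text):
    """Lowercase, keep alphabetic words, drop stop words, apply synonym groups."""
    words = re.findall(r"[a-z]+", text.lower())
    return [SYNONYMS.get(w, w) for w in words if w not in STOP_WORDS]


def build_dtm(docs):
    """Return (vocab, matrix); matrix[i][j] counts term vocab[j] in document i."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[t] for t in vocab] for c in counts]
    return vocab, matrix


def term_frequencies(vocab, matrix):
    """Total count of each term over all documents, most frequent first."""
    totals = {t: sum(row[j] for row in matrix) for j, t in enumerate(vocab)}
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    vocab, matrix = build_dtm(DOCS)
    # The top terms are what a word frequency plot or word cloud would show.
    for term, count in term_frequencies(vocab, matrix)[:3]:
        print(term, count)
```

Each row of `matrix` is one document and each column is one term, so most entries are zero even in this tiny example. That sparsity is why the factor-decomposition step described above is useful before grouping documents or words.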