Text Mining Course Outline
Text mining introduction and historical perspectives
Data Capture – JMP Data Table
– Text files
– Excel files
– Folder of text files
– Folder of PDF and PPT files
– Emails
– Web crawling
– Twitter NTSB (JMP data table)
Basic text mining – Use of word frequencies
String Processing – Bag of words
– Isolate individual words
– Remove punctuation
– Normalize case
– Remove numbers (example data: Cars slid into curb, simple)
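The string-processing steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not the course's JMP workflow: the function name `bag_of_words` and the sample sentence are made up for the example, and the regular expression assumes plain ASCII English text (contractions such as "don't" would be split).

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts after lowercasing and dropping punctuation and numbers."""
    text = text.lower()                    # normalize case
    text = re.sub(r"[^a-z\s]", " ", text)  # remove punctuation and digits (ASCII only)
    tokens = text.split()                  # isolate individual words
    return Counter(tokens)

# Made-up sentence, used only to exercise the function.
counts = bag_of_words("Cars slid into the curb; 2 cars slid again!")
print(counts.most_common(3))
```

The resulting `Counter` is the bag of words: word order is discarded and only the frequencies are kept.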
Natural Language Processing – Zipf’s Law
– Stopwords
– Custom stopwords
– Collocation, synonyms (find/replace)
– Stem text
– Filter by character length
– Filter out words that do not appear in more than X documents
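A rough sketch of three of the filters above (stopwords, character length, document frequency) follows. The stopword lists, the function name `filter_terms`, and the thresholds are all hypothetical choices for illustration; real work would use a fuller stopword list and tuned cutoffs.

```python
from collections import Counter

STOPWORDS = {"the", "a", "into", "and", "of"}  # illustrative, not a standard list
CUSTOM_STOPWORDS = {"car"}                     # hypothetical domain-specific addition

def filter_terms(docs, min_len=3, x=1):
    """docs: list of token lists. Keep terms that are not stopwords,
    have at least min_len characters, and appear in more than x documents."""
    stop = STOPWORDS | CUSTOM_STOPWORDS
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    keep = {t for t, df in doc_freq.items()
            if t not in stop and len(t) >= min_len and df > x}
    return [[t for t in doc if t in keep] for doc in docs]

docs = [["the", "car", "slid", "into", "curb"],
        ["car", "hit", "the", "curb"],
        ["driver", "slid", "on", "ice"]]
filtered = filter_terms(docs)
print(filtered)
```

Here "car" is dropped as a custom stopword even though it is frequent, while "hit", "driver", and "ice" are dropped because each appears in only one document.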
Document Term Matrix (DTM)
– Representing text with numbers, DTM
– Properties of the DTM
– Transformations of the DTM
— Binary
— Ternary
— Term Frequency
— Log
— tf-idf
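The DTM and its transformations can be illustrated on a toy corpus. The formulas below are common conventions and are assumptions here, not necessarily the exact definitions used in the course software: ternary caps counts at 2, log uses log10(1 + count), and tf-idf multiplies the count by log10(number of documents / number of documents containing the term).

```python
import math
from collections import Counter

# Toy tokenized documents (made up for the example).
docs = [["curb", "slid", "curb"],
        ["slid", "ice"],
        ["curb", "ice", "ice", "ice"]]

vocab = sorted({t for d in docs for t in d})  # column order of the DTM
doc_counts = [Counter(d) for d in docs]

# Term frequency DTM: one row per document, one column per term.
dtm = [[c[t] for t in vocab] for c in doc_counts]

binary = [[1 if c > 0 else 0 for c in row] for row in dtm]
ternary = [[min(c, 2) for c in row] for row in dtm]   # 0, 1, or 2 (more than once)
log_freq = [[math.log10(1 + c) for c in row] for row in dtm]

n_docs = len(dtm)
doc_freq = [sum(1 for row in dtm if row[j] > 0) for j in range(len(vocab))]
tfidf = [[c * math.log10(n_docs / doc_freq[j]) for j, c in enumerate(row)]
         for row in dtm]

print(vocab)
print(dtm)
```

All five weightings share the same sparsity pattern; they differ only in how much extra weight repeated or rare terms receive.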
Statistical Approaches – Latent Semantic Analysis (LSA)
– Singular Value Decomposition (SVD)
– Bivariate plot of singular vectors
– Synonyms
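As a sketch of the LSA step, the snippet below applies a truncated SVD to a small made-up DTM using NumPy (an assumption; the course itself works in JMP). The choice of `k = 2` dimensions and the scaling of coordinates by the singular values are illustrative conventions.

```python
import numpy as np

# Made-up DTM: rows are documents, columns are terms.
dtm = np.array([[2, 0, 1, 0],
                [1, 0, 2, 0],
                [0, 3, 0, 1],
                [0, 1, 0, 2]], dtype=float)

# Thin SVD: dtm = U @ diag(s) @ Vt, singular values in descending order.
U, s, Vt = np.linalg.svd(dtm, full_matrices=False)

k = 2                                 # number of latent dimensions kept
doc_coords = U[:, :k] * s[:k]         # documents in the reduced space
term_coords = Vt[:k, :].T * s[:k]     # terms in the reduced space
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # rank-k approximation of the DTM

print(np.round(doc_coords, 3))
```

Plotting the two columns of `doc_coords` (or `term_coords`) against each other gives the bivariate plot; terms that land close together are candidates for synonyms.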
Topic Analysis/Concept Extraction
– Varimax rotation of document space
– Varimax rotation of the term space (example data: NSF)
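A varimax rotation can be sketched with the widely used SVD-based iteration shown below. This is one standard formulation, offered as an assumption about what the rotation step does rather than the course software's implementation; the function name `varimax`, the tolerance, and the example loadings are made up. The input could be either the document vectors or the term vectors from the SVD.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Return (rotated loadings, orthogonal rotation matrix R)."""
    p, k = loadings.shape
    R = np.eye(k)
    prev = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # Gradient of the varimax criterion with respect to the rotation.
        grad = loadings.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / p)
        u, s, vt = np.linalg.svd(grad)
        R = u @ vt                     # nearest orthogonal matrix to the gradient
        total = s.sum()
        if prev != 0.0 and total / prev < 1 + tol:
            break                      # criterion has stopped improving
        prev = total
    return loadings @ R, R

# Made-up loadings: 4 items on 2 dimensions.
loadings = np.array([[0.7, 0.7],
                     [0.6, 0.8],
                     [0.7, -0.7],
                     [0.8, -0.6]])
rotated, R = varimax(loadings)
print(np.round(rotated, 3))
```

Because `R` is orthogonal, the rotation changes how weight is spread across dimensions but leaves each row's total squared loading unchanged, which is what makes the rotated dimensions easier to read as topics.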
Unsupervised Learning – Clustering Methods
– Ward’s method
– k-means
– Cluster words (V matrix)
– Concept linkage with cluster distances
– Cluster documents (U matrix)
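To make the clustering step concrete, here is a minimal k-means sketch in NumPy applied to made-up two-dimensional document coordinates (standing in for rows of the U matrix). The function name `kmeans`, the random initialization, and the seed are assumptions for the example; Ward's method is not shown.

```python
import numpy as np

def kmeans(points, k, n_iter=50, seed=0):
    """Plain k-means: returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    # Initialize centers at k distinct data points.
    centers = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Distance from every point to every center.
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center; keep the old one if a cluster is empty.
        new_centers = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Two well-separated groups of hypothetical document coordinates.
points = np.array([[0, 0], [0, 1], [1, 0],
                   [10, 10], [10, 11], [11, 10]], dtype=float)
labels, centers = kmeans(points, k=2)
print(labels)
```

Passing rows of the V matrix instead of U would cluster words rather than documents.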
Sentiment analysis
– Positive/Negative Words
– Custom sentiment analysis
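A simple word-list approach to sentiment can be sketched as follows. The positive and negative lists here are tiny made-up examples; a custom sentiment analysis would replace them with domain-specific lists, and the scoring rule (positive count minus negative count) is one simple assumption among many possible ones.

```python
# Illustrative word lists, not a published lexicon.
POSITIVE = {"good", "great", "safe"}
NEGATIVE = {"bad", "crash", "damage"}

def sentiment_score(tokens, positive=POSITIVE, negative=NEGATIVE):
    """Score = number of positive words minus number of negative words."""
    pos = sum(t in positive for t in tokens)
    neg = sum(t in negative for t in tokens)
    return pos - neg

score_a = sentiment_score(["the", "crash", "caused", "bad", "damage"])
score_b = sentiment_score(["a", "great", "and", "safe", "landing"])
# Custom lists can be passed in for a specific domain.
score_c = sentiment_score(["minor", "delay"], negative={"delay"})
print(score_a, score_b, score_c)
```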
Supervised Learning – Classification trees with structured data
– Logistic Regression on structured data
– Classification tree on words
– Graphical methods and cross-tabulation
More Advanced
– Probabilistic Topic Modeling
— Latent Dirichlet Allocation (Variational Expectation Maximization and Gibbs)
— Conditional Topic Modeling
– Custom startlist (example data: NSF)