Text Mining Training

Text Mining Course Outline

Text mining introduction and historical perspectives

Data Capture
– JMP Data Table
– Text files
– Excel files
– Folder of text files
– Folder of pdf and ppt
– Emails
– Web crawling
– Twitter NTSB (JMP data table)

Basic text mining
– Use of word frequencies

String Processing
– Bag of words
– Isolate individual words
– Remove punctuation
– Normalize case
– Remove numbers
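
The string-processing steps above can be sketched in a few lines of Python. This is only an illustration using the standard library, not the JMP workflow taught in the course; the function name bag_of_words and the sample sentence are hypothetical, and the regular expression assumes plain ASCII English text.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Return word counts after normalizing case and dropping punctuation and digits."""
    text = text.lower()                    # normalize case
    text = re.sub(r"[^a-z\s]", " ", text)  # replace punctuation and digits with spaces (ASCII only)
    words = text.split()                   # isolate individual words
    return Counter(words)                  # bag of words: word order is discarded

# Hypothetical sample sentence
counts = bag_of_words("Cars slid into the curb; 2 cars were damaged.")
print(counts.most_common(3))
```

Word order is lost in this representation, which is the defining property of a bag of words.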

Natural Language Processing
– Zipf’s Law
– Stopwords
– Custom stopwords
– Collocation, synonyms (find/replace)
– Stem text
– Filter by character length
– Filter out words that do not appear in more than X documents
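
Three of the filters above (stopwords, character length, and document frequency) can be sketched as follows. The stopword list, the function name filter_terms, the default thresholds, and the sample documents are all hypothetical; a real project would use a fuller stopword list plus custom additions.

```python
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list)
STOPWORDS = {"the", "a", "of", "and", "to", "in", "on"}

def filter_terms(docs, min_len=3, x=1):
    """docs is a list of token lists.

    Keep a term only if it is not a stopword, has at least min_len
    characters, and appears in more than x documents.
    """
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document
    keep = {
        term for term, df in doc_freq.items()
        if term not in STOPWORDS and len(term) >= min_len and df > x
    }
    return [[t for t in tokens if t in keep] for tokens in docs]

# Hypothetical tokenized documents
docs = [
    ["the", "engine", "failed", "in", "flight"],
    ["engine", "fire", "on", "the", "runway"],
    ["pilot", "reported", "engine", "fire"],
]
filtered = filter_terms(docs)
print(filtered)
```

Raising x shrinks the vocabulary quickly, because most terms in a corpus are rare, which is what Zipf’s Law predicts.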

Document Term Matrix (DTM)
– Representing text with numbers, DTM
– Properties of the DTM
– Transformations of the DTM
— Binary
— Ternary
— Term Frequency
— Log
— tf-idf
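
The DTM weightings listed above can be sketched in plain Python. The definitions used here are assumptions, since tools differ in the details: ternary is taken as 0, 1, or 2 for absent, once, or more than once; log as log10(1 + count); and tf-idf as count times log10(number of documents / number of documents containing the term). The function names build_dtm and transform and the sample documents are hypothetical.

```python
import math

def build_dtm(docs):
    """Return (vocabulary, DTM) with one row per document and one column per term."""
    vocab = sorted({t for tokens in docs for t in tokens})
    dtm = [[tokens.count(t) for t in vocab] for tokens in docs]
    return vocab, dtm

def transform(dtm, kind):
    """Apply one of the weightings: binary, ternary, frequency, log, tfidf."""
    n_docs = len(dtm)
    n_terms = len(dtm[0])
    # Number of documents that contain each term
    df = [sum(1 for row in dtm if row[j] > 0) for j in range(n_terms)]

    def weight(count, j):
        if kind == "binary":
            return 1 if count > 0 else 0
        if kind == "ternary":
            return min(count, 2)             # 0, 1, or 2 (more than once)
        if kind == "frequency":
            return count
        if kind == "log":
            return math.log10(1 + count)
        if kind == "tfidf":
            if df[j] == 0:
                return 0.0                   # term absent from every document
            return count * math.log10(n_docs / df[j])
        raise ValueError(f"unknown weighting: {kind}")

    return [[weight(c, j) for j, c in enumerate(row)] for row in dtm]

# Hypothetical tokenized documents
docs = [["engine", "fire", "fire", "fire"], ["engine", "runway"]]
vocab, dtm = build_dtm(docs)
for kind in ("binary", "ternary", "frequency", "log", "tfidf"):
    print(kind, transform(dtm, kind))
```

Under this tf-idf definition, a term that appears in every document (here "engine") gets weight zero, which is why the weighting emphasizes terms that discriminate between documents.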

Statistical Approaches
– Latent Semantic Analysis (LSA)
– Singular Value Decomposition (SVD)
– Bivariate plot of SVDs
– Synonyms

Topic Analysis/Concept Extraction
– Varimax rotation of document space
– Varimax rotation of the term space

Unsupervised Learning
– Clustering Methods
– Ward’s method
– k-means
– Cluster words (V matrix)
– Concept linkage with cluster distances
– Cluster documents (U matrix)

Sentiment analysis
– Positive/Negative Words
– Custom sentiment analysis
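
A word-list approach to sentiment can be sketched as below: count the positive and negative words in a document and take the difference. The word lists and the function name sentiment_score are hypothetical; a custom analysis would swap in domain-specific lists, and this simple count ignores negation ("not good").

```python
# Tiny illustrative word lists (assumptions, not a published lexicon)
POSITIVE = {"good", "great", "excellent", "reliable"}
NEGATIVE = {"bad", "poor", "terrible", "broken"}

def sentiment_score(tokens, positive=POSITIVE, negative=NEGATIVE):
    """Return (positive count) - (negative count) for a list of lowercase tokens."""
    pos = sum(1 for t in tokens if t in positive)
    neg = sum(1 for t in tokens if t in negative)
    return pos - neg

review = "great service but the part arrived broken and the manual was poor".split()
print(sentiment_score(review))

# Custom sentiment: pass domain-specific lists instead of the defaults
custom_score = sentiment_score(review, positive={"service"}, negative={"broken"})
print(custom_score)
```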

Supervised Learning
– Classification trees with structured data
– Logistic Regression on structured data
– Classification tree on words
– Graphical methods and cross-tabulation

More Advanced
– Probabilistic Topic Modeling
— Latent Dirichlet Allocation (Variational Expectation Maximization and Gibbs)
— Conditional Topic Modeling
– Custom startlist