Structured information refers to computerized information that can be easily interpreted and used by computer programs supporting a range of tasks, including process automation, data fusion, decision making, and information filtering and retrieval. For example, the information stored in a relational database is structured, whereas a web page containing text, video, and images is usually unstructured. The majority of information and knowledge being authored, especially on the Internet, is unstructured and intended for human consumption. However, the gap between structured and unstructured information can be narrowed through automated semantic content analysis and standardization. Machine learning and natural language processing (NLP) play a crucial role in structuring textual information and thereby supporting descriptive and predictive analytics. This tutorial will describe the categorization of unstructured content, deep linguistic processing techniques for extracting evidence from categorized unstructured content (e.g., text in the form of blogs, reviews, emails, and human intelligence), and techniques for model-based situation and risk assessment and prediction based on the extracted evidence.
This tutorial consists of four parts:
- Statistical and data mining techniques for descriptive and predictive analytics
- Text categorization via data mining techniques
- Information extraction via natural language processing
- Case study demonstrations
The first part of the tutorial will cover a diverse range of traditional statistical and data mining techniques, including cluster analysis, regression analysis, Bayesian belief networks, decision trees, and neural networks. These techniques are suitable for modeling uncertain knowledge in different situations, and their suitability will be discussed in each case.
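As a small taste of the clustering techniques covered here, the following is a minimal sketch of k-means (Lloyd's algorithm) in pure Python; the 2-D points, the choice of k = 2, and the initial centroids are illustrative assumptions, and real analyses would use a library such as scikit-learn.

```python
import math

def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm: assign each point to its nearest centroid,
    then move each centroid to the mean of its assigned points."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for p in points:
            # Nearest centroid by Euclidean distance
            idx = min(range(len(centroids)),
                      key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Recompute each centroid; keep the old one if its cluster is empty
        centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return centroids, clusters

# Two obvious groups of points and deliberately distant starting centroids
points = [(1, 1), (1.5, 2), (8, 8), (8.5, 9)]
centroids, clusters = kmeans(points, centroids=[(0, 0), (10, 10)])
```

On this toy data the algorithm converges in one iteration to centroids at the means of the two groups.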
The second part of the tutorial will cover various text categorization techniques, including a simple bag-of-words-based naive Bayesian classifier, vector-based methods such as latent semantic analysis, kernel-based methods such as support vector machines, and a number of unsupervised methods.
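The bag-of-words naive Bayesian classifier mentioned above can be sketched in a few lines of pure Python; the two-document training set and the Laplace (add-one) smoothing choice are assumptions made for the example.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (text, label). Returns class priors, per-class word
    counts, and the overall vocabulary."""
    priors, counts, vocab = Counter(), {}, set()
    for text, label in docs:
        priors[label] += 1
        words = text.lower().split()
        counts.setdefault(label, Counter()).update(words)
        vocab.update(words)
    return priors, counts, vocab

def classify(text, priors, counts, vocab):
    """Pick the label maximizing log P(label) + sum log P(word | label)."""
    total = sum(priors.values())
    best, best_lp = None, -math.inf
    for label in priors:
        n = sum(counts[label].values())
        lp = math.log(priors[label] / total)
        for w in text.lower().split():
            # Laplace (add-one) smoothing over the vocabulary
            lp += math.log((counts[label][w] + 1) / (n + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

model = train([("great product love it", "pos"),
               ("terrible waste of money", "neg")])
label = classify("love this great thing", *model)
```

Even with unseen words ("this", "thing"), smoothing keeps every probability nonzero, and the seen positive words tip the decision.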
The third part of the tutorial will cover Natural Language Processing (NLP), which is concerned with the computer-based analysis (understanding) and generation of human language, in its written form (text) or spoken form (speech). We will focus on the automatic analysis of unstructured text documents, i.e., the automatic construction, from streams of characters, of more formal and structured representations. We will present the multiple processing levels involved in this task, namely tokenization, morphological analysis, part-of-speech (POS) tagging, parsing (syntactic analysis), and semantic analysis. We will use the Stanford Parser, a statistical parser, as a means of illustrating these different levels of analysis.
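Tokenization, the first of these processing levels, can be sketched with a single regular expression; this is a deliberately naive version (contractions and abbreviations are not handled), whereas full pipelines such as the Stanford Parser apply many more rules.

```python
import re

def tokenize(text):
    """Split a character stream into word tokens and punctuation tokens:
    runs of word characters, or any single non-word, non-space character."""
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The parser's output, surprisingly, was clean.")
```

Note how punctuation becomes separate tokens and the possessive is split crudely; handling such cases well is exactly why dedicated tokenizers exist.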
The fourth part of the tutorial will provide examples and prototype demos involving text parsing, evidence structuring, follow-on descriptive and predictive analytics, and sentiment analysis.
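As an indication of the kind of sentiment analysis the demos cover, here is a bare-bones lexicon-based scorer; the word lists and the single negation rule are illustrative assumptions, not the tutorial's actual method.

```python
# Tiny, hand-picked opinion lexicons (assumptions for this sketch)
POSITIVE = {"good", "great", "excellent", "love"}
NEGATIVE = {"bad", "poor", "terrible", "hate"}

def sentiment(text):
    """Score +1/-1 per opinion word, flipping polarity right after 'not'."""
    score, negate = 0, False
    for word in text.lower().split():
        w = word.strip(".,!?")
        if w == "not":
            negate = True
            continue
        if w in POSITIVE:
            score += -1 if negate else 1
        elif w in NEGATIVE:
            score += 1 if negate else -1
        negate = False
    return score

s = sentiment("The service was not bad, the food was great!")
```

"not bad" contributes +1 rather than -1, showing why even shallow sentiment analysis needs some linguistic context beyond word counting.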
The intended audience includes:
- Developers of systems for opinion mining and sentiment analysis
- Business analytics practitioners from both academia and industry
- Developers of military intelligence systems for handling human intelligence
Lesson 1: Statistical Data Mining – Introduction to Probability and Statistics, Regression Analysis, Cluster Analysis, Bayesian Belief Networks, Neural Networks, Decision Trees.
Lesson 2: Text Categorization – Naive Bayesian Classifier, Latent Semantic Analysis, Support Vector Machine, Probabilistic Latent Semantic Analysis, Latent Dirichlet Allocation.
Lesson 3: Natural Language Processing – Tokenization, Morphological Analysis, Parsing, Named-Entity Recognition, Coreference Resolution, Semantic Analysis, Stemming.
Lesson 4: Evidence Structuring for Situation Assessment, Sentiment Analysis.
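To give one concrete illustration from the lesson topics above, stemming can be sketched as crude suffix stripping; real systems use the Porter or Snowball algorithms (available, for instance, through NLTK), and the suffix list and minimum-stem-length rule here are assumptions for the example.

```python
def stem(word):
    """Strip the first matching suffix, longest candidates first,
    keeping at least a three-letter stem."""
    for suffix in ("ization", "ational", "ingly", "ing", "edly", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

stems = [stem(w) for w in ["parsing", "parsed", "parses", "categorization"]]
```

All three inflections of "parse" reduce to the same stem, which is the point of stemming: collapsing surface variants so that categorization and retrieval treat them as one term.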