II.1.4 Isolating Interesting Patterns
Interestingness with Respect to Distributions and Proportions
II.1.5 Analyzing Document Collections over Time
From Context Relationships to Trend Graphs
Context Phrases and Context Relationships
Handling Dynamically Updated Data
The Borders Incremental Text Mining Algorithm
II.1.6 Citations and Notes
II.2 USING BACKGROUND KNOWLEDGE FOR TEXT MINING
II.2.1 Domains and Background Knowledge
II.2.4 Introducing Background Knowledge into Text Mining Systems
II.2.5 Real-World Example: FACT
General Approach and Functionality
Experimental Performance Results
II.2.6 Citations and Notes
II.3 TEXT MINING QUERY LANGUAGES
II.3.1 Real-World Example: KDTL
II.3.2 KDTL Query Examples
II.3.3 KDTL Query Interface Implementations
II.3.4 Citations and Notes
III Text Mining Preprocessing Techniques
III.1 TASK-ORIENTED APPROACHES
III.1.1 General Purpose NLP Tasks
III.1.2 Problem-Dependent Tasks: Text Categorization and Information Extraction
General Information Extraction
IV.1 APPLICATIONS OF TEXT CATEGORIZATION
IV.1.1 Indexing of Texts Using Controlled Vocabulary
IV.1.2 Document Sorting and Text Filtering
IV.1.3 Hierarchical Web Page Categorization
IV.2 DEFINITION OF THE PROBLEM
IV.2.1 Single-Label versus Multilabel Categorization
IV.2.2 Document-Pivoted versus Category-Pivoted Categorization
IV.2.3 Hard versus Soft Categorization
IV.3 DOCUMENT REPRESENTATION
IV.3.2 Dimensionality Reduction by Feature Extraction
IV.4 KNOWLEDGE ENGINEERING APPROACH TO TC
IV.5 MACHINE LEARNING APPROACH TO TC
IV.5.1 Probabilistic Classifiers
IV.5.2 Bayesian Logistic Regression
IV.5.3 Decision Tree Classifiers
IV.5.4 Decision Rule Classifiers
IV.5.5 Regression Methods
IV.5.6 The Rocchio Methods
IV.5.8 Example-Based Classifiers
IV.5.9 Support Vector Machines
IV.5.10 Classifier Committees: Bagging and Boosting
IV.6 USING UNLABELED DATA TO IMPROVE CLASSIFICATION
IV.7 EVALUATION OF TEXT CLASSIFIERS
IV.7.1 Performance Measures
IV.7.2 Benchmark Collections
IV.7.3 Comparison among Classifiers
V.1 CLUSTERING TASKS IN TEXT ANALYSIS
V.1.1 Improving Search Recall
V.1.2 Improving Search Precision
V.1.4 Query-Specific Clustering
V.2 THE GENERAL CLUSTERING PROBLEM
V.2.1 Problem Representation
V.2.2 Similarity Measures
V.3 CLUSTERING ALGORITHMS
V.3.2 EM-based Probabilistic Clustering Algorithm
V.3.3 Hierarchical Agglomerative Clustering (HAC)
V.3.4 Other Clustering Algorithms
V.4 CLUSTERING OF TEXTUAL DATA
V.4.1 Representation of Text Clustering Problems
V.4.2 Dimension Reduction with Latent Semantic Indexing
V.4.3 Singular Value Decomposition
Using SVD for Dimension Reduction
Using Naïve Bayes Mixture Models with the EM Clustering Algorithm
V.4.4 Data Abstraction in Text Clustering
V.4.5 Evaluation of Text Clustering
VI Information Extraction
VI.1 INTRODUCTION TO INFORMATION EXTRACTION
VI.1.1 Elements That Can Be Extracted from Text
VI.2 HISTORICAL EVOLUTION OF IE: THE MESSAGE UNDERSTANDING CONFERENCES AND TIPSTER
VI.2.1 Named Entity Recognition
VI.2.2 Template Element Task
VI.2.3 Template Relationship (TR) Task
VI.2.4 Scenario Template (ST)
VI.2.5 Coreference Task (CO)
VI.2.6 Some Notes about IE Evaluation
VI.3.1 Case 1: Simplistic Tagging, News Domain
VI.3.2 Case 2: Natural Disasters Domain
VI.3.3 Case 3: Terror-Related Article, MUC-4
VI.3.4 Case 4: Technology-Related Article, TIPSTER-Style Tagging
VI.3.5 Case 5: Comprehensive Stage-by-Stage Example
VI.4 ARCHITECTURE OF IE SYSTEMS
VI.4.1 Information Flow in an IE System
Processing the Initial Lexical Content: Tokenization and Lexical Analysis
Proper Name Identification
VI.5.1 Pronominal Anaphora
VI.5.2 Proper Names Coreference
VI.5.4 Predicate Nominative
VI.5.6 Function–Value Coreference
VI.5.9 Part–Whole Coreference
VI.5.10 Approaches to Anaphora Resolution
VI.5.10.1 Hobbs Algorithm
VI.5.11 CogNIAC (Baldwin 1995)
VI.5.11.1 Kennedy and Boguraev
VI.5.11.3 Evaluation of Knowledge-Poor Approaches
VI.5.11.4 Machine Learning Approaches
VI.6 INDUCTIVE ALGORITHMS FOR IE
VI.6.3 The (LP)2 Algorithm
VI.6.4 Experimental Evaluation
VI.7.1 Introduction to Structural IE
VI.7.2 Overall Problem Definition
VI.7.3 The Visual Elements Perceptual Grouping Subtask
VI.7.4 Problem Formulation for the Perceptual Grouping Subtask
VI.7.5 Algorithm for Constructing a Document O-Tree
VI.7.6 Structural Mapping
VI.7.8 Experimental Results
VII Probabilistic Models for Information Extraction
VII.1 HIDDEN MARKOV MODELS
VII.1.1 The Three Classic Problems Related to HMMs
VII.1.2 The Forward–Backward Procedure
VII.1.3 The Viterbi Algorithm
VII.1.4 The Training of the HMM
VII.1.5 Dealing with Training Data Sparseness
VII.2 STOCHASTIC CONTEXT-FREE GRAMMARS
VII.3 MAXIMAL ENTROPY MODELING
VII.3.1 Computing the Parameters of the Model
VII.4 MAXIMAL ENTROPY MARKOV MODELS
VII.4.1 Training the MEMM
VII.5 CONDITIONAL RANDOM FIELDS
VII.5.1 The Three Classic Problems Relating to CRF
VII.5.2 Computing the Conditional Probability
VII.5.3 Finding the Most Probable Label Sequence
VIII Preprocessing Applications Using Probabilistic and Hybrid Approaches
VIII.1 APPLICATIONS OF HMM TO TEXTUAL ANALYSIS
VIII.1.1 Using HMM to Extract Fields from Whole Documents
VIII.1.2 Learning HMM Structure from Data
VIII.1.3 Nymble: An HMM with Context-Dependent Probabilities
VIII.2 USING MEMM FOR INFORMATION EXTRACTION
VIII.3 APPLICATIONS OF CRFs TO TEXTUAL ANALYSIS
VIII.3.1 POS-Tagging with Conditional Random Fields
VIII.3.2 Shallow Parsing with Conditional Random Fields
VIII.4 TEG: USING SCFG RULES FOR HYBRID STATISTICAL–KNOWLEDGE-BASED IE
VIII.4.1 Introduction to a Hybrid System
VIII.4.2 TEG: Bridging the Gap between Statistical and Rule-Based IE Systems
VIII.4.3 Syntax of a TEG Rulebook
VIII.4.5 Additional Features
VIII.4.6 Example of Real Rules
VIII.4.7 Experimental Evaluation of TEG
The MUC-7 Corpus Evaluation – Comparison with HMM-based NER
ACE-2 Evaluation: Extracting Relationships
VIII.5.1 Introduction to Bootstrapping: The AutoSlog-TS Approach
VIII.5.2 Mutual Bootstrapping
VIII.5.3 Metabootstrapping
Evaluation of the Metabootstrapping Algorithm
VIII.5.4 Using Strong Syntactic Heuristics
VIII.5.4.1 Evaluation of the Strong Syntactic Heuristics
VIII.5.4.2 Using Cotraining
VIII.5.5 The Basilisk Algorithm
VIII.5.5.1 Evaluation of Basilisk on Single-Category Bootstrapping
VIII.5.5.2 Using Multiclass Bootstrapping
VIII.5.5.3 Evaluation of the Multiclass Bootstrapping
VIII.5.6 Bootstrapping by Using Term Categorization
IX Presentation-Layer Considerations for Browsing and Query Refinement
IX.1.1 Displaying and Browsing Distributions
IX.1.2 Displaying and Exploring Associations
IX.1.3 Navigation and Exploration by Means of Concept Hierarchies
IX.1.4 Concept Hierarchy and Taxonomy Editors
IX.1.5 Clustering Tools to Aid Data Exploration
IX.2 ACCESSING CONSTRAINTS AND SIMPLE SPECIFICATION FILTERS AT THE PRESENTATION LAYER
IX.3 ACCESSING THE UNDERLYING QUERY LANGUAGE
X Visualization Approaches
X.1.1 Citations and Notes
X.2 ARCHITECTURAL CONSIDERATIONS
X.2.1 Citations and Notes
X.3 COMMON VISUALIZATION APPROACHES FOR TEXT MINING
X.3.2 Simple Concept Graphs
Simple Concept Set Graphs
Simple Concept Association Graphs
Similarity Functions for Simple Concept Association Graphs
Equivalence Classes, Partial Orderings, Redundancy Filters
Typical Interactive Operations Using Simple Concept Graphs
Drawbacks of Simple Concept Graphs
Multiple Circle Graph and Combination Graph Approaches
X.3.6 Self-Organizing Map (SOM) Approaches
X.3.8 Three-Dimensional (3-D) Effects
X.3.10 Citations and Notes
X.4 VISUALIZATION TECHNIQUES IN LINK ANALYSIS
X.4.1 Practical Approaches Using Generic Visualization Tools
Applications to Link Detection and General Effectiveness of Fisheye Approaches
X.4.3 Spring-Embedded Network Graphs
X.4.4 Critical Path and Pathway Analysis Graphs
X.4.5 Citations and Notes
X.5 REAL-WORLD EXAMPLE: THE DOCUMENT EXPLORER SYSTEM
X.5.1 Presentation-Layer Elements
Visual Administrative Tools: Term Hierarchy Editor
The Knowledge Discovery Toolkit
X.5.2 Citations and Notes
XI.1.1 Running Example: 9/11 Hijackers
XI.2 AUTOMATIC LAYOUT OF NETWORKS
XI.2.1 Force-Directed Graph Layout Algorithms
Kamada and Kawai’s (KK) Method
Fruchterman–Reingold (FR) Method
XI.2.2 Drawing Large Graphs
XI.3 PATHS AND CYCLES IN GRAPHS
XI.4.2 Closeness Centrality
XI.4.3 Betweenness Centrality
XI.4.4 Eigenvector Centrality
XI.4.6 Network Centralization
XI.5 PARTITIONING OF NETWORKS
Algorithm for Finding the Main Core
XI.5.2 Classic Graph Analysis Algorithms
Strong and Weak Components
Biconnected Components and Articulation Points
XI.5.3 Equivalence between Entities
Finding the Best Block Model
Block Modeling of the Hijacker Network
XI.6 PATTERN MATCHING IN NETWORKS
XI.7 SOFTWARE PACKAGES FOR LINK ANALYSIS
XII Text Mining Applications
XII.1 GENERAL CONSIDERATIONS
XII.1.1 Background Knowledge
XII.1.2 Generalized Background Knowledge versus Specialized Background Knowledge
XII.1.3 Leveraging Preset Queries and Constraints in Generalized Browsing Interfaces
XII.1.4 Specialized Visualization Approaches
XII.1.5 Citations and Notes
XII.2 CORPORATE FINANCE: MINING INDUSTRY LITERATURE FOR BUSINESS INTELLIGENCE
XII.2.1 Industry Analyzer: Basic Architecture and Functionality
Data and Background Knowledge Sources
Core Mining Operations and Refinement Constraints
Presentation Layer – GUI and Visualization Tools
XII.2.2 Application Usage Scenarios
Examining the Biotech Industry Trade Press for Information on Merger Activity
Exploring Corporate Earnings Announcements
Exploring Available Information about Drugs Still in Clinical Trials
XII.2.3 Citations and Notes
XII.3 A “HORIZONTAL” TEXT MINING APPLICATION: PATENT ANALYSIS SOLUTION LEVERAGING A COMMERCIAL TEXT ANALYTICS PLATFORM
XII.3.1 Patent Researcher: Basic Architecture and Functionality
Data and Background Knowledge Sources
Core Mining Operations and Refinement Constraints
Presentation Layer – GUI and Visualization Tools
XII.3.2 Application Usage Scenarios
Looking at the Frequency Distributions among Patents in the Document Collection
Exploring Trends in Issued Patents
XII.3.3 Citations and Notes
XII.4 LIFE SCIENCES RESEARCH: MINING BIOLOGICAL PATHWAY INFORMATION WITH GENEWAYS
XII.4.1 GeneWays: Basic Architecture and Functionality
Data and Background Knowledge Sources
Core Mining Operations and Presentation Layer Elements
XII.4.2 Implementation and Typical Usage
XII.4.3 Citations and Notes
APPENDIX A DIAL: A Dedicated Information Extraction Language for Text Mining
A.1 WHAT IS THE DIAL LANGUAGE?
A.2 INFORMATION EXTRACTION IN THE DIAL ENVIRONMENT
A.4 CONCEPT AND RULE STRUCTURE
A.6.5 Character-Level Regular Expressions
A.7.1 Comparison Constraints
A.7.2 Boolean Constraints
A.9 COMPLETE DIAL EXAMPLES
A.9.1 Extracting People Names Based on Title/Position
A.9.2 Extracting Lists of People Names Based on a Preceding Verb
A.9.3 Using a Thesaurus to Extract Location Names
A.9.4 Creating a Thesaurus of Local People Names
A.9.5 A Simplified Anaphora Resolution Rule for Resolving a Person’s Pronoun
A.9.6 Anaphoric Family Relation
A.9.7 Meeting between People