The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data

Publication subtitle: Advanced Approaches in Analyzing Unstructured Data

Author: Ronen Feldman; James Sanger  

Publisher: Cambridge University Press

Publication year: 2006

E-ISBN: 9780511331855

P-ISBN (Paperback): 9780521836579

Subject: TP39 computer application

Keyword: Computer applications

Language: ENG

Description

Text mining is a new and exciting area of computer science research that tries to solve the crisis of information overload by combining techniques from data mining, machine learning, natural language processing, information retrieval, and knowledge management. Similarly, link detection, a rapidly evolving approach to the analysis of text that shares and builds upon many of the key elements of text mining, also provides new tools for people to better leverage their burgeoning textual data resources. The Text Mining Handbook presents a comprehensive discussion of the state of the art in text mining and link detection. In addition to providing an in-depth examination of core text mining and link detection algorithms and operations, the book examines advanced pre-processing techniques, knowledge representation considerations, and visualization approaches. Finally, the book explores current real-world, mission-critical applications of text mining and link detection in such varied fields as M&A business intelligence, genomics research, and counter-terrorism activities.

Chapter

II.1.4 Isolating Interesting Patterns

Interestingness with Respect to Distributions and Proportions

II.1.5 Analyzing Document Collections over Time

Trend Analysis

Ephemeral Associations

Deviation Detection

From Context Relationships to Trend Graphs

Context Phrases and Context Relationships

The Context Graph

The Trend Graph

Handling Dynamically Updated Data

The Borders Incremental Text Mining Algorithm

II.1.6 Citations and Notes

Sections II.1–II.1.1

Section II.1.2

Section II.1.3

Section II.1.4

Section II.1.5

II.2 USING BACKGROUND KNOWLEDGE FOR TEXT MINING

II.2.1 Domains and Background Knowledge

II.2.2 Domain Ontologies

II.2.3 Domain Lexicons

II.2.4 Introducing Background Knowledge into Text Mining Systems

II.2.5 Real-World Example: FACT

General Approach and Functionality

System Architecture

Implementation

Experimental Performance Results

II.2.6 Citations and Notes

Section II.2.1

Section II.2.2

Section II.2.3

Sections II.2.4–II.2.5

II.3 TEXT MINING QUERY LANGUAGES

II.3.1 Real-World Example: KDTL

II.3.2 KDTL Query Examples

II.3.3 KDTL Query Interface Implementations

II.3.4 Citations and Notes

Sections II.3–II.3.2

III Text Mining Preprocessing Techniques

III.1 TASK-ORIENTED APPROACHES

III.1.1 General Purpose NLP Tasks

Tokenization

Part-of-Speech Tagging

Syntactical Parsing

Shallow Parsing

III.1.2 Problem-Dependent Tasks: Text Categorization and Information Extraction

III.2 FURTHER READING

POS Tagging

Shallow Parsing

Constituency Grammars

Dependency Grammars

General Information Extraction

IV Categorization

IV.1 APPLICATIONS OF TEXT CATEGORIZATION

IV.1.1 Indexing of Texts Using Controlled Vocabulary

IV.1.2 Document Sorting and Text Filtering

IV.1.3 Hierarchical Web Page Categorization

IV.2 DEFINITION OF THE PROBLEM

IV.2.1 Single-Label versus Multilabel Categorization

IV.2.2 Document-Pivoted versus Category-Pivoted Categorization

IV.2.3 Hard versus Soft Categorization

IV.3 DOCUMENT REPRESENTATION

IV.3.1 Feature Selection

IV.3.2 Dimensionality Reduction by Feature Extraction

IV.4 KNOWLEDGE ENGINEERING APPROACH TO TC

IV.5 MACHINE LEARNING APPROACH TO TC

IV.5.1 Probabilistic Classifiers

IV.5.2 Bayesian Logistic Regression

IV.5.3 Decision Tree Classifiers

IV.5.4 Decision Rule Classifiers

IV.5.5 Regression Methods

IV.5.6 The Rocchio Methods

IV.5.7 Neural Networks

IV.5.8 Example-Based Classifiers

IV.5.9 Support Vector Machines

IV.5.10 Classifier Committees: Bagging and Boosting

IV.6 USING UNLABELED DATA TO IMPROVE CLASSIFICATION

IV.7 EVALUATION OF TEXT CLASSIFIERS

IV.7.1 Performance Measures

IV.7.2 Benchmark Collections

IV.7.3 Comparison among Classifiers

IV.8 CITATIONS AND NOTES

Section IV.1

Section IV.2

Section IV.3

Sections IV.5.3–IV.5.4

Section IV.5.5

Section IV.5.8

Section IV.5.9

Section IV.5.10

Additional Algorithms

Section IV.7

V Clustering

V.1 CLUSTERING TASKS IN TEXT ANALYSIS

V.1.1 Improving Search Recall

V.1.2 Improving Search Precision

V.1.3 Scatter/Gather

V.1.4 Query-Specific Clustering

V.2 THE GENERAL CLUSTERING PROBLEM

V.2.1 Problem Representation

V.2.2 Similarity Measures

V.3 CLUSTERING ALGORITHMS

V.3.1 K-Means Algorithm

V.3.2 EM-based Probabilistic Clustering Algorithm

V.3.3 Hierarchical Agglomerative Clustering (HAC)

V.3.4 Other Clustering Algorithms

V.4 CLUSTERING OF TEXTUAL DATA

V.4.1 Representation of Text Clustering Problems

V.4.2 Dimension Reduction with Latent Semantic Indexing

V.4.3 Singular Value Decomposition

Using SVD for Dimension Reduction

Medoids

Using Naïve Bayes Mixture Models with the EM Clustering Algorithm

V.4.4 Data Abstraction in Text Clustering

V.4.5 Evaluation of Text Clustering

V.5 CITATIONS AND NOTES

Section V.1

Section V.3

Section V.4

VI Information Extraction

VI.1 INTRODUCTION TO INFORMATION EXTRACTION

VI.1.1 Elements That Can Be Extracted from Text

VI.2 HISTORICAL EVOLUTION OF IE: THE MESSAGE UNDERSTANDING CONFERENCES AND TIPSTER

VI.2.1 Named Entity Recognition

VI.2.2 Template Element Task

VI.2.3 Template Relationship (TR) Task

VI.2.4 Scenario Template (ST)

VI.2.5 Coreference Task (CO)

VI.2.6 Some Notes about IE Evaluation

VI.3 IE EXAMPLES

VI.3.1 Case 1: Simplistic Tagging, News Domain

VI.3.2 Case 2: Natural Disasters Domain

VI.3.3 Case 3: Terror-Related Article, MUC-4

VI.3.4 Case 4: Technology-Related Article, TIPSTER-Style Tagging

VI.3.5 Case 5: Comprehensive Stage-by-Stage Example

VI.4 ARCHITECTURE OF IE SYSTEMS

VI.4.1 Information Flow in an IE System

Processing the Initial Lexical Content: Tokenization and Lexical Analysis

Proper Name Identification

Shallow Parsing

Building Relations

Inferencing

VI.5 ANAPHORA RESOLUTION

VI.5.1 Pronominal Anaphora

VI.5.2 Proper Names Coreference

VI.5.3 Apposition

VI.5.4 Predicate Nominative

VI.5.5 Identical Sets

VI.5.6 Function–Value Coreference

VI.5.7 Ordinal Anaphora

VI.5.8 One-Anaphora

VI.5.9 Part–Whole Coreference

VI.5.10 Approaches to Anaphora Resolution

VI.5.10.1 Hobbs Algorithm

VI.5.11 CogNIAC (Baldwin 1995)

VI.5.11.1 Kennedy and Boguraev

VI.5.11.2 Mitkov

VI.5.11.3 Evaluation of Knowledge-Poor Approaches

VI.5.11.4 Machine Learning Approaches

VI.6 INDUCTIVE ALGORITHMS FOR IE

VI.6.1 WHISK

VI.6.2 BWI

VI.6.3 The (LP)² Algorithm

VI.6.4 Experimental Evaluation

VI.7 STRUCTURAL IE

VI.7.1 Introduction to Structural IE

VI.7.2 Overall Problem Definition

VI.7.3 The Visual Elements Perceptual Grouping Subtask

VI.7.4 Problem Formulation for the Perceptual Grouping Subtask

VI.7.5 Algorithm for Constructing a Document O-Tree

VI.7.6 Structural Mapping

VI.7.6.1 Basic Algorithm

VI.7.7 Templates

VI.7.8 Experimental Results

VI.8 FURTHER READING

Section VI.1

Section VI.4

Section VI.5

Section VI.6

VII Probabilistic Models for Information Extraction

VII.1 HIDDEN MARKOV MODELS

VII.1.1 The Three Classic Problems Related to HMMs

VII.1.2 The Forward–Backward Procedure

VII.1.3 The Viterbi Algorithm

VII.1.4 The Training of the HMM

VII.1.5 Dealing with Training Data Sparseness

VII.2 STOCHASTIC CONTEXT-FREE GRAMMARS

VII.2.1 Using SCFGs

VII.3 MAXIMAL ENTROPY MODELING

VII.3.1 Computing the Parameters of the Model

VII.4 MAXIMAL ENTROPY MARKOV MODELS

VII.4.1 Training the MEMM

VII.5 CONDITIONAL RANDOM FIELDS

VII.5.1 The Three Classic Problems Relating to CRF

VII.5.2 Computing the Conditional Probability

VII.5.3 Finding the Most Probable Label Sequence

VII.5.4 Training the CRF

VII.6 FURTHER READING

Section VII.1

Section VII.2

Section VII.3

Section VII.4

Section VII.5

VIII Preprocessing Applications Using Probabilistic and Hybrid Approaches

VIII.1 APPLICATIONS OF HMM TO TEXTUAL ANALYSIS

VIII.1.1 Using HMM to Extract Fields from Whole Documents

VIII.1.2 Learning HMM Structure from Data

VIII.1.3 Nymble: An HMM with Context-Dependent Probabilities

VIII.2 USING MEMM FOR INFORMATION EXTRACTION

VIII.3 APPLICATIONS OF CRFs TO TEXTUAL ANALYSIS

VIII.3.1 POS-Tagging with Conditional Random Fields

VIII.3.2 Shallow Parsing with Conditional Random Fields

VIII.4 TEG: USING SCFG RULES FOR HYBRID STATISTICAL–KNOWLEDGE-BASED IE

VIII.4.1 Introduction to a Hybrid System

VIII.4.2 TEG: Bridging the Gap between Statistical and Rule-Based IE Systems

VIII.4.3 Syntax of a TEG Rulebook

VIII.4.4 TEG Training

VIII.4.5 Additional Features

VIII.4.6 Example of Real Rules

VIII.4.7 Experimental Evaluation of TEG

The MUC-7 Corpus Evaluation – Comparison with HMM-based NER

ACE-2 Evaluation: Extracting Relationships

VIII.5 BOOTSTRAPPING

VIII.5.1 Introduction to Bootstrapping: The AutoSlog-TS Approach

VIII.5.2 Mutual Bootstrapping

VIII.5.3 Metabootstrapping

Evaluation of the Metabootstrapping Algorithm

VIII.5.4 Using Strong Syntactic Heuristics

VIII.5.4.1 Evaluation of the Strong Syntactic Heuristics

VIII.5.4.2 Using Cotraining

VIII.5.5 The Basilisk Algorithm

VIII.5.5.1 Evaluation of Basilisk on Single-Category Bootstrapping

VIII.5.5.2 Using Multiclass Bootstrapping

VIII.5.5.3 Evaluation of the Multiclass Bootstrapping

VIII.5.6 Bootstrapping by Using Term Categorization

VIII.5.7 Summary

VIII.6 FURTHER READING

Section VIII.1

Section VIII.2

Section VIII.3

Section VIII.4

Section VIII.5

IX Presentation-Layer Considerations for Browsing and Query Refinement

IX.1 BROWSING

IX.1.1 Displaying and Browsing Distributions

IX.1.2 Displaying and Exploring Associations

IX.1.3 Navigation and Exploration by Means of Concept Hierarchies

IX.1.4 Concept Hierarchy and Taxonomy Editors

IX.1.5 Clustering Tools to Aid Data Exploration

IX.2 ACCESSING CONSTRAINTS AND SIMPLE SPECIFICATION FILTERS AT THE PRESENTATION LAYER

IX.3 ACCESSING THE UNDERLYING QUERY LANGUAGE

IX.4 CITATIONS AND NOTES

Section IX.1

Section IX.2

Section IX.3

X Visualization Approaches

X.1 INTRODUCTION

X.1.1 Citations and Notes

X.2 ARCHITECTURAL CONSIDERATIONS

X.2.1 Citations and Notes

X.3 COMMON VISUALIZATION APPROACHES FOR TEXT MINING

X.3.1 Overview

X.3.2 Simple Concept Graphs

Simple Concept Set Graphs

Simple Concept Association Graphs

Similarity Functions for Simple Concept Association Graphs

Equivalence Classes, Partial Orderings, Redundancy Filters

Typical Interactive Operations Using Simple Concept Graphs

Drawbacks of Simple Concept Graphs

X.3.3 Histograms

X.3.4 Line Graphs

X.3.5 Circle Graphs

Category-Connecting Maps

Multiple Circle Graph and Combination Graph Approaches

X.3.6 Self-Organizing Map (SOM) Approaches

WEBSOM

SOM Algorithm

X.3.7 Hyperbolic Trees

X.3.8 Three-Dimensional (3-D) Effects

X.3.9 Hybrid Tools

X.3.10 Citations and Notes

Sections X.3.1–X.3.3

Sections X.3.4–X.3.7

Sections X.3.8–X.3.9

X.4 VISUALIZATION TECHNIQUES IN LINK ANALYSIS

X.4.1 Practical Approaches Using Generic Visualization Tools

X.4.2 “Fisheye” Diagrams

Distorting Fisheye Views

Filtering Fisheye Views

Applications to Link Detection and General Effectiveness of Fisheye Approaches

X.4.3 Spring-Embedded Network Graphs

X.4.4 Critical Path and Pathway Analysis Graphs

X.4.5 Citations and Notes

Sections X.4–X.4.3

Sections X.4.4–X.4.5

X.5 REAL-WORLD EXAMPLE: THE DOCUMENT EXPLORER SYSTEM

X.5.1 Presentation-Layer Elements

Visual Administrative Tools: Term Hierarchy Editor

The Knowledge Discovery Toolkit

Browsers

Visualization Tools

X.5.2 Citations and Notes

XI Link Analysis

XI.1 PRELIMINARIES

XI.1.1 Running Example: 9/11 Hijackers

XI.2 AUTOMATIC LAYOUT OF NETWORKS

XI.2.1 Force-Directed Graph Layout Algorithms

Kamada and Kawai’s (KK) Method

Fruchterman–Reingold (FR) Method

XI.2.2 Drawing Large Graphs

XI.3 PATHS AND CYCLES IN GRAPHS

XI.4 CENTRALITY

XI.4.1 Degree Centrality

XI.4.2 Closeness Centrality

XI.4.3 Betweenness Centrality

XI.4.4 Eigenvector Centrality

XI.4.5 Power Centrality

XI.4.6 Network Centralization

XI.4.7 Summary Diagram

XI.5 PARTITIONING OF NETWORKS

XI.5.1 Cores

Algorithm for Finding the Main Core

XI.5.2 Classic Graph Analysis Algorithms

Strong and Weak Components

Biconnected Components and Articulation Points

XI.5.3 Equivalence between Entities

Structural Equivalence

Regular Equivalence

XI.5.4 Block Modeling

Formal Notations

Finding the Best Block Model

Block Modeling of the Hijacker Network

XI.6 PATTERN MATCHING IN NETWORKS

XI.7 SOFTWARE PACKAGES FOR LINK ANALYSIS

XI.7.1 Pajek

XI.7.2 UCINET

XI.7.3 NetMiner

XI.8 CITATIONS AND NOTES

Section XI.1

Section XI.2

Section XI.4

Section XI.5

Section XI.6

XII Text Mining Applications

XII.1 GENERAL CONSIDERATIONS

XII.1.1 Background Knowledge

XII.1.2 Generalized Background Knowledge versus Specialized Background Knowledge

XII.1.3 Leveraging Preset Queries and Constraints in Generalized Browsing Interfaces

XII.1.4 Specialized Visualization Approaches

XII.1.5 Citations and Notes

XII.2 CORPORATE FINANCE: MINING INDUSTRY LITERATURE FOR BUSINESS INTELLIGENCE

XII.2.1 Industry Analyzer: Basic Architecture and Functionality

Data and Background Knowledge Sources

Preprocessing Operations

Core Mining Operations and Refinement Constraints

Presentation Layer – GUI and Visualization Tools

XII.2.2 Application Usage Scenarios

Examining the Biotech Industry Trade Press for Information on Merger Activity

Exploring Corporate Earnings Announcements

Exploring Available Information about Drugs Still in Clinical Trials

XII.2.3 Citations and Notes

XII.3 A “HORIZONTAL” TEXT MINING APPLICATION: PATENT ANALYSIS SOLUTION LEVERAGING A COMMERCIAL TEXT ANALYTICS PLATFORM

XII.3.1 Patent Researcher: Basic Architecture and Functionality

Data and Background Knowledge Sources

Preprocessing Operations

Core Mining Operations and Refinement Constraints

Presentation Layer – GUI and Visualization Tools

XII.3.2 Application Usage Scenarios

Looking at the Frequency Distributions among Patents in the Document Collection

Exploring Trends in Issued Patents

XII.3.3 Citations and Notes

XII.4 LIFE SCIENCES RESEARCH: MINING BIOLOGICAL PATHWAY INFORMATION WITH GENEWAYS

XII.4.1 GeneWays: Basic Architecture and Functionality

Data and Background Knowledge Sources

Preprocessing Operations

Core Mining Operations and Presentation Layer Elements

XII.4.2 Implementation and Typical Usage

XII.4.3 Citations and Notes

APPENDIX A DIAL: A Dedicated Information Extraction Language for Text Mining

A.1 WHAT IS THE DIAL LANGUAGE?

A.2 INFORMATION EXTRACTION IN THE DIAL ENVIRONMENT

A.3 TEXT TOKENIZATION

A.4 CONCEPT AND RULE STRUCTURE

A.4.1 Context

A.5 PATTERN MATCHING

A.6 PATTERN ELEMENTS

A.6.1 String Constants

A.6.2 Wordclass Names

A.6.3 Thesaurus Names

A.6.4 Concept Names

A.6.5 Character-Level Regular Expressions

A.6.6 Character Classes

A.6.7 Scanner Properties

A.6.8 Token Elements

A.7 RULE CONSTRAINTS

A.7.1 Comparison Constraints

A.7.2 Boolean Constraints

A.8 CONCEPT GUARDS

A.9 COMPLETE DIAL EXAMPLES

A.9.1 Extracting People Names Based on Title/Position

A.9.2 Extracting Lists of People Names Based on a Preceding Verb

A.9.3 Using a Thesaurus to Extract Location Names

A.9.4 Creating a Thesaurus of Local People Names

A.9.5 A Simplified Anaphora Resolution Rule for Resolving a Person’s Pronoun

A.9.6 Anaphoric Family Relation

A.9.7 Meeting between People

Bibliography

Index
