Web Data Management

Authors: Serge Abiteboul; Ioana Manolescu; Philippe Rigaux

Publisher: Cambridge University Press

Publication year: 2011

E-ISBN: 9781139211444

Subject: TP39 Computer applications

Keyword: computer applications

Language: ENG

Description

The Internet and World Wide Web have revolutionized access to information. Users now store information across multiple platforms, from personal computers to smartphones and websites. As a consequence, data management concepts, methods, and techniques are increasingly focused on distribution concerns. Now that information largely resides in the network, so do the tools that process this information. This book explains the foundations of XML with a focus on data distribution. It covers the many facets of distributed data management on the Web, such as description logics, that are already emerging in today's data integration applications and herald tomorrow's semantic Web. It also introduces the machinery used to manipulate the unprecedented amount of data collected on the Web. Several 'Putting into Practice' chapters describe detailed practical applications of the technologies and techniques. The book will serve as an introduction to the new global information systems for Web professionals and master's-level courses.

Chapter

MOTIVATION FOR THE BOOK

SCOPE AND ORGANIZATION OF THE BOOK

Part 1: Modeling Web Data

Part 2: Web Data Semantics and Integration

Part 3: Building Web Scale Applications

INTENDED AUDIENCE

COMPANION WEB SITE

ACKNOWLEDGMENTS

PART 1: Modeling Web Data

1 Data Model

1.1 SEMISTRUCTURED DATA

1.2 XML

1.2.1 XML Documents

1.2.2 Serialized and Tree-Based Forms

1.2.3 XML Syntax

Elements and Text

Attributes

Well-Formed XML Document

1.2.4 Typing and Namespaces

1.2.5 To Type or Not to Type

1.3 WEB DATA MANAGEMENT WITH XML

1.3.1 Data Exchange

1.3.2 Data Integration

1.4 THE XML WORLD

1.4.1 XML Dialects

1.4.2 XML Standards

Programming Interfaces: SAX and DOM

Query Languages: XPath, XQuery

1.6.2 XML Standards

2 XPath and XQuery

2.1 INTRODUCTION

2.2 BASICS

2.2.1 XPath and XQuery Data Model for Documents

2.2.2 The XQuery Model (Continued) and Sequences

2.2.3 Specifying Paths in a Tree: XPath

2.2.4 A First Glance at XQuery Expressions

2.2.5 XQuery vs XSLT

2.3 XPATH

2.3.1 Steps and Path Expressions

2.3.2 Evaluation of Path Expressions

2.3.3 Generalities on Axes and Node Tests

2.3.4 Axes

2.3.5 Node Tests and Abbreviations

2.3.6 Predicates

Conversions in XPath

Conversion to a Boolean

Converting a Node Set to a String

2.3.7 XPath 2.0

2.4 FLWOR EXPRESSIONS IN XQUERY

2.4.1 Defining Variables: The for and let Clauses

2.4.2 Filtering: The where Clause

2.4.3 The return Clause

2.4.4 Advanced Features of XQuery

2.5 XPATH FOUNDATIONS

2.5.1 A Relational View of an XML Tree

2.5.2 Navigational XPath

2.5.3 Evaluation

2.5.4 Expressiveness and First-Order Logic

2.5.5 Other XPath Fragments

2.6 FURTHER READING

XPath

XQuery

XPath Foundations

2.7 EXERCISES

3 Typing

3.1 MOTIVATING TYPING

Dynamic and Static Typing

3.2 AUTOMATA

3.2.1 Automata on Words

3.2.2 Automata on Ranked Trees

3.2.3 Unranked Trees

3.2.4 Trees and Monadic Second-Order Logic

3.3 SCHEMA LANGUAGES FOR XML

3.3.1 Document Type Definitions

3.3.2 XML Schema

3.3.3 Other Schema Languages for XML

3.4 TYPING GRAPH DATA

3.4.1 Graph Semistructured Data

3.4.2 Graph Bisimulation

3.4.3 Data Guides

3.5 FURTHER READING

Schema Inference and Static Typing

Automata

Schema Languages for XML

Typing Languages

Typing Graph Data

3.6 EXERCISES

4 XML Query Evaluation

4.1 FRAGMENTING XML DOCUMENTS ON DISK

4.2 XML NODE IDENTIFIERS

4.2.1 Region-Based Identifiers

4.2.2 Dewey-Based Identifiers

4.2.3 Structural Identifiers and Updates

4.3 XML QUERY EVALUATION TECHNIQUES

4.3.1 Structural Join

4.3.2 Optimizing Structural Join Queries

4.3.3 Holistic Twig Joins

4.4 FURTHER READING

4.5 EXERCISES

5 Putting into Practice: Managing an XML Database with EXIST

5.1 PREREQUISITES

5.2 INSTALLING EXIST

5.3 GETTING STARTED WITH EXIST

5.4 RUNNING XPATH AND XQUERY QUERIES WITH THE SANDBOX

5.4.1 XPath

5.4.2 XQuery

5.4.3 Complement: XPath and XQuery Operators and Functions

Operators

5.5 PROGRAMMING WITH EXIST

5.5.1 Using the XML:DB API with EXIST

5.5.2 Accessing EXIST with Web Services

5.6 PROJECTS

5.6.1 Getting Started

5.6.2 Shakespeare Opera Omnia

5.6.3 MusicXML Online

Data

Core Functions

Advanced Options

6 Putting into Practice: Tree Pattern Evaluation Using SAX

6.1 TREE-PATTERN DIALECTS

6.2 CTP EVALUATION

6.3 EXTENSIONS TO RICHER TREE PATTERNS

PART 2: Web Data Semantics and Integration

7 Ontologies, RDF, and OWL

7.1 INTRODUCTION

7.2 ONTOLOGIES BY EXAMPLE

7.3 RDF, RDFS, AND OWL

7.3.1 Web Resources, URI, Namespaces

7.3.2 RDF

RDF Syntax: RDF Triplets

RDF Semantics

7.3.3 RDFS: RDF Schema

Syntax of RDFS

RDFS Semantics

7.3.4 OWL

Expressing Class Disjointness Constraints

Union and Intersection

Class and Property Equivalence

Functional Constraints

Intentional Class Definitions

7.4 ONTOLOGIES AND (DESCRIPTION) LOGICS

7.4.1 Preliminaries: The DL Jargon

FOL Semantics of DL

Reasoning Problems Considered in DLs

7.4.2 ALC: The Prototypical DL

7.4.3 Simple DLs for Which Reasoning Is Polynomial

7.4.4 The DL-LITE Family: A Good Trade-off

7.5 FURTHER READING

7.6 EXERCISES

8 Querying Data Through Ontologies

8.1 INTRODUCTION

8.2 QUERYING RDF DATA: NOTATION AND SEMANTICS

8.3 QUERYING THROUGH RDFS ONTOLOGIES

8.4 ANSWERING QUERIES THROUGH DL-LITE ONTOLOGIES

8.4.1 DL-LITE

8.4.2 Consistency Checking

8.4.3 Answer Set Evaluation

8.4.4 Impact of Combining DL-LITE_R and DL-LITE_F on Query Answering

8.5 FURTHER READING

8.6 EXERCISES

9 Data Integration

9.1 INTRODUCTION

9.2 CONTAINMENT OF CONJUNCTIVE QUERIES

9.3 GLOBAL-AS-VIEW MEDIATION

9.4 LOCAL-AS-VIEW MEDIATION

9.4.1 The Bucket Algorithm

Bucket Creation

Construction of Candidate Rewritings

9.4.2 The Minicon Algorithm

First Step of Minicon: Creation of MCDs

Second Step of Minicon: Combination of the MCDs

9.4.3 The Inverse Rules Algorithm

9.4.4 Discussion

9.5 ONTOLOGY-BASED MEDIATORS

9.5.1 Adding Functionality Constraints

9.5.2 Query Rewriting Using Views in DL-LITE

9.6 PEER-TO-PEER DATA MANAGEMENT SYSTEMS

9.6.1 Answering Queries Using GLAV Mappings Is Undecidable

Reduction from a Decision Problem B to a Decision Problem B'

The Dependency Implication Problem

Undecidability of the GLAV Query Answering Problem

9.6.2 Decentralized DL-LITE

9.7 FURTHER READING

9.8 EXERCISES

10 Putting into Practice: Wrappers and Data Extraction with XSLT

10.1 EXTRACTING DATA FROM WEB PAGES

10.2 RESTRUCTURING DATA

11 Putting into Practice: Ontologies in Practice (by Fabian M. Suchanek)

11.1 EXPLORING AND INSTALLING YAGO

11.2 QUERYING YAGO

11.3 WEB ACCESS TO ONTOLOGIES

11.3.1 Cool URIs

11.3.2 Linked Data

12 Putting into Practice: Mashups with YAHOO! PIPES and XProc

12.1 YAHOO! PIPES: A GRAPHICAL MASHUP EDITOR

12.2 XPROC: AN XML PIPELINE LANGUAGE

PART 3: Building Web Scale Applications

13 Web Search

13.1 THE WORLD WIDE WEB

13.2 PARSING THE WEB

13.2.1 Crawling the Web

Discovering URLs

Deduplicating Web Pages

Crawling Ethics

Design Issues

13.2.2 Text Preprocessing

Tokenization

Stemming

Stop-Word Removal

13.3 WEB INFORMATION RETRIEVAL

13.3.1 Inverted Files

Content of Inverted Lists

Assessing Document Relevance

13.3.2 Answering Keyword Queries

Boolean Queries

Ranked Queries: Basic Algorithm

Fagin’s Threshold Algorithm

13.3.3 Large-Scale Indexing with Inverted Files

Performance of Inverted Files

Building and Updating an Inverted File

Indexing Dynamic Collections

Compression of Inverted Lists

Variable Byte Encoding

Variable Bit Encoding

13.3.4 Clustering

13.3.5 Beyond Classical IR

13.4 WEB GRAPH MINING

13.4.1 PageRank

Online Computation

13.4.2 HITS

13.4.3 Spamdexing

13.4.4 Discovering Communities on the Web

13.5 HOT TOPICS IN WEB SEARCH

Web 2.0

Deep Web

Information Extraction

13.6 FURTHER READING

Web Standards

Web Parsing and Indexing

Graph Mining

The Deep Web and Information Extraction

13.7 EXERCISES

14 An Introduction to Distributed Systems

14.1 BASICS OF DISTRIBUTED SYSTEMS

14.1.1 Networking Infrastructures

14.1.2 Performance of a Distributed Storage System

14.1.3 Data Replication and Consistency

14.2 FAILURE MANAGEMENT

14.2.1 Failure Recovery

14.2.2 Distributed Transactions

14.3 REQUIRED PROPERTIES OF A DISTRIBUTED SYSTEM

14.3.1 Reliability

14.3.2 Scalability

14.3.3 Availability

14.3.4 Efficiency

14.3.5 Putting Everything Together: The CAP Theorem

14.4 PARTICULARITIES OF P2P NETWORKS

14.5 CASE STUDY: A DISTRIBUTED FILE SYSTEM FOR VERY LARGE FILES

14.5.1 Large-Scale File System

14.5.2 Architecture

14.5.3 Failure Handling

14.6 FURTHER READING

15 Distributed Access Structures

15.1 HASH-BASED STRUCTURES

Dynamicity

Location of the Hash Directory

15.1.1 Distributed Linear Hashing

Linear Hashing

Distributed Linear Hashing

Reducing Maintenance Cost by Lazy Adjustment

Details on the LH* Algorithms

15.1.2 Consistent Hashing

Distributing Data with Consistent Hashing

Refinements

The Hash Directory

15.1.3 Case Study: CHORD

Overview

Routing Tables

CHORD Operations

15.2 DISTRIBUTED INDEXING: SEARCH TREES

15.2.1 Design Issues

15.2.2 Case Study: BATON

Kernel Structure

Performance

Routing Tables

BATON Operations

15.2.3 Case Study: BIGTABLE

Structure Overview

Distribution Strategy

Adjustment of the Client Image

Persistence

15.3 FURTHER READING

15.4 EXERCISES

16 Distributed Computing with MAPREDUCE and PIG

16.1 MAPREDUCE

16.1.1 Programming Model

16.1.2 The Programming Environment

16.1.3 MAPREDUCE Internals

16.2 PIG

16.2.1 A Simple Session

16.2.2 The Data Model

16.2.3 The Operators

16.2.4 Using MAPREDUCE to Optimize PIG Programs

16.3 FURTHER READING

16.4 EXERCISES

17 Putting into Practice: Full-Text Indexing with LUCENE (by Nicolas Travers)

17.1 PRELIMINARY: A LUCENE SANDBOX

17.2 INDEXING PLAIN TEXT WITH LUCENE -- A FULL EXAMPLE

17.2.1 The Main Program

17.2.2 Create the Index

17.2.3 Adding Documents

17.2.4 Searching the Index

17.2.5 LUCENE Querying Syntax

17.3 PUT IT INTO PRACTICE!

17.3.1 Indexing a Directory Content

17.3.2 Web Site Indexing (Project)

17.4 LUCENE - TUNING THE SCORING (PROJECT)

18 Putting into Practice: Recommendation Methodologies (by Alban Galland)

18.1 INTRODUCTION TO RECOMMENDATION SYSTEMS

18.2 PREREQUISITES

18.3 DATA ANALYSIS

18.4 GENERATING SOME RECOMMENDATIONS

18.4.1 Global Recommendation

18.4.2 User-Based Collaborative Filtering

Recommendation

18.4.3 Item-Based Collaborative Filtering

18.5 PROJECTS

18.5.1 Scaling

18.5.2 The Probabilistic Way

18.5.3 Improving Recommendation

19 Putting into Practice: Large-Scale Data Management with HADOOP

19.1 INSTALLING AND RUNNING HADOOP

19.2 RUNNING MAPREDUCE JOBS

19.3 PIGLATIN SCRIPTS

19.4 RUNNING IN CLUSTER MODE (OPTIONAL)

19.4.1 Configuring HADOOP in Cluster Mode

19.4.2 Starting, Stopping, and Managing HADOOP

19.5 EXERCISES

20 Putting into Practice: COUCHDB, a JSON Semistructured Database

20.1 INTRODUCTION TO THE COUCHDB DOCUMENT DATABASE

20.1.1 JSON, a Lightweight Semistructured Format

Key-Value Pairs

Complex Values: Objects and Arrays

JSON Documents

20.1.2 COUCHDB, Architecture, and Principles

20.1.3 Preliminaries: Set Up Your COUCHDB Environment

20.1.4 Adding Data

20.1.5 Views

20.1.6 Querying Views

20.1.7 Distribution Strategies: Master–Master, Master–Slave, and Shared–Nothing

Replication

Distribution Options

Conflict Management

Shared-Nothing Architecture

20.2 PUTTING COUCHDB INTO PRACTICE!

20.2.1 Exercises

20.2.2 Project: Build a Distributed Bibliographic Database with COUCHDB

20.3 FURTHER READING

Bibliography

Index
