Understanding Machine Learning: From Theory to Algorithms

Publication subtitle: From Theory to Algorithms

Authors: Shai Shalev-Shwartz; Shai Ben-David

Publisher: Cambridge University Press

Publication year: 2014

E-ISBN: 9781139950619

P-ISBN(Paperback): 9781107057135

Subject: TP181 automatic reasoning, machine learning

Keyword: computer applications

Language: ENG



Understanding Machine Learning

Description

Machine learning is one of the fastest growing areas of computer science, with far-reaching applications. The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way. The book provides a theoretical account of the fundamentals underlying machine learning and the mathematical derivations that transform these principles into practical algorithms. Following a presentation of the basics, the book covers a wide array of central topics unaddressed by previous textbooks. These include a discussion of the computational complexity of learning and the concepts of convexity and stability; important algorithmic paradigms including stochastic gradient descent, neural networks, and structured output learning; and emerging theoretical concepts such as the PAC-Bayes approach and compression-based bounds. Designed for advanced undergraduates or beginning graduates, the text makes the fundamentals and algorithms of machine learning accessible to students and non-expert readers in statistics, computer science, mathematics and engineering.
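Among the algorithmic paradigms the description mentions, stochastic gradient descent is perhaps the simplest to illustrate. The following is a minimal sketch, not code from the book: it fits a one-dimensional least-squares model y ≈ w·x by stepping along the gradient of the loss on one randomly drawn example at a time. All function names and data here are illustrative assumptions.

```python
import random

def sgd(data, lr=0.1, steps=100, seed=0):
    """Minimal stochastic gradient descent for 1-D least squares.

    data: list of (x, y) pairs. At each step, pick one example at
    random and take a gradient step on its squared loss (w*x - y)**2.
    """
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x, y = rng.choice(data)
        grad = 2 * (w * x - y) * x  # d/dw of (w*x - y)**2
        w -= lr * grad
    return w

# Fit y = 2x from noiseless samples; w should approach 2.
data = [(x, 2 * x) for x in [1.0, 2.0, 3.0]]
w = sgd(data)
```

The book's treatment (Chapter 14) analyzes this kind of update for convex losses, including step-size choices and averaging schemes that this toy version omits.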

Chapter

3.2.2 The Scope of Learning Problems Modeled

3.3 Summary

3.4 Bibliographic Remarks

3.5 Exercises

4 Learning via Uniform Convergence

4.1 Uniform Convergence Is Sufficient for Learnability

4.2 Finite Classes Are Agnostic PAC Learnable

4.3 Summary

4.4 Bibliographic Remarks

4.5 Exercises

5 The Bias-Complexity Trade-off

5.1 The No-Free-Lunch Theorem

5.1.1 No-Free-Lunch and Prior Knowledge

5.2 Error Decomposition

5.3 Summary

5.4 Bibliographic Remarks

5.5 Exercises

6 The VC-Dimension

6.1 Infinite-Size Classes Can Be Learnable

6.2 The VC-Dimension

6.3 Examples

6.3.1 Threshold Functions

6.3.2 Intervals

6.3.3 Axis Aligned Rectangles

6.3.4 Finite Classes

6.3.5 VC-Dimension and the Number of Parameters

6.4 The Fundamental Theorem of PAC Learning

6.5 Proof of Theorem 6.7

6.5.1 Sauer's Lemma and the Growth Function

6.5.2 Uniform Convergence for Classes of Small Effective Size

6.6 Summary

6.7 Bibliographic Remarks

6.8 Exercises

7 Nonuniform Learnability

7.1 Nonuniform Learnability

7.1.1 Characterizing Nonuniform Learnability

7.2 Structural Risk Minimization

7.3 Minimum Description Length and Occam's Razor

7.3.1 Occam's Razor

7.4 Other Notions of Learnability – Consistency

7.5 Discussing the Different Notions of Learnability

7.5.1 The No-Free-Lunch Theorem Revisited

7.6 Summary

7.7 Bibliographic Remarks

7.8 Exercises

8 The Runtime of Learning

8.1 Computational Complexity of Learning

8.1.1 Formal Definition*

8.2 Implementing the ERM Rule

8.2.1 Finite Classes

8.2.2 Axis Aligned Rectangles

8.2.3 Boolean Conjunctions

8.2.4 Learning 3-Term DNF

8.3 Efficiently Learnable, but Not by a Proper ERM

8.4 Hardness of Learning*

8.5 Summary

8.6 Bibliographic Remarks

8.7 Exercises

Part 2 From Theory to Algorithms

9 Linear Predictors

9.1 Halfspaces

9.1.1 Linear Programming for the Class of Halfspaces

9.1.2 Perceptron for Halfspaces

9.1.3 The VC Dimension of Halfspaces

9.2 Linear Regression

9.2.1 Least Squares

9.2.2 Linear Regression for Polynomial Regression Tasks

9.3 Logistic Regression

9.4 Summary

9.5 Bibliographic Remarks

9.6 Exercises

10 Boosting

10.1 Weak Learnability

10.1.1 Efficient Implementation of ERM for Decision Stumps

10.2 AdaBoost

10.3 Linear Combinations of Base Hypotheses

10.3.1 The VC-Dimension of L(B,T)

10.4 AdaBoost for Face Recognition

10.5 Summary

10.6 Bibliographic Remarks

10.7 Exercises

11 Model Selection and Validation

11.1 Model Selection Using SRM

11.2 Validation

11.2.1 Hold Out Set

11.2.2 Validation for Model Selection

11.2.3 The Model-Selection Curve

11.2.4 k-Fold Cross Validation

11.2.5 Train-Validation-Test Split

11.3 What to Do If Learning Fails

11.4 Summary

11.5 Exercises

12 Convex Learning Problems

12.1 Convexity, Lipschitzness, and Smoothness

12.1.1 Convexity

12.1.2 Lipschitzness

12.1.3 Smoothness

12.2 Convex Learning Problems

12.2.1 Learnability of Convex Learning Problems

12.2.2 Convex-Lipschitz/Smooth-Bounded Learning Problems

12.3 Surrogate Loss Functions

12.4 Summary

12.5 Bibliographic Remarks

12.6 Exercises

13 Regularization and Stability

13.1 Regularized Loss Minimization

13.1.1 Ridge Regression

13.2 Stable Rules Do Not Overfit

13.3 Tikhonov Regularization as a Stabilizer

13.3.1 Lipschitz Loss

13.3.2 Smooth and Nonnegative Loss

13.4 Controlling the Fitting-Stability Trade-off

13.5 Summary

13.6 Bibliographic Remarks

13.7 Exercises

14 Stochastic Gradient Descent

14.1 Gradient Descent

14.1.1 Analysis of GD for Convex-Lipschitz Functions

14.2 Subgradients

14.2.1 Calculating Subgradients

14.2.2 Subgradients of Lipschitz Functions

14.2.3 Subgradient Descent

14.3 Stochastic Gradient Descent (SGD)

14.3.1 Analysis of SGD for Convex-Lipschitz-Bounded Functions

14.4 Variants

14.4.1 Adding a Projection Step

14.4.2 Variable Step Size

14.4.3 Other Averaging Techniques

14.4.4 Strongly Convex Functions*

14.5 Learning with SGD

14.5.1 SGD for Risk Minimization

14.5.2 Analyzing SGD for Convex-Smooth Learning Problems

14.5.3 SGD for Regularized Loss Minimization

14.6 Summary

14.7 Bibliographic Remarks

14.8 Exercises

15 Support Vector Machines

15.1 Margin and Hard-SVM

15.1.1 The Homogenous Case

15.1.2 The Sample Complexity of Hard-SVM

15.2 Soft-SVM and Norm Regularization

15.2.1 The Sample Complexity of Soft-SVM

15.2.2 Margin and Norm-Based Bounds versus Dimension

15.2.3 The Ramp Loss*

15.3 Optimality Conditions and "Support Vectors"*

15.4 Duality*

15.5 Implementing Soft-SVM Using SGD

15.6 Summary

15.7 Bibliographic Remarks

15.8 Exercises

16 Kernel Methods

16.1 Embeddings into Feature Spaces

16.2 The Kernel Trick

16.2.1 Kernels as a Way to Express Prior Knowledge

16.2.2 Characterizing Kernel Functions*

16.3 Implementing Soft-SVM with Kernels

16.4 Summary

16.5 Bibliographic Remarks

16.6 Exercises

17 Multiclass and Ranking

17.1 One-versus-All and All-Pairs

17.2 Linear Multiclass Predictors

17.2.1 How to Construct

17.2.2 Cost-Sensitive Classification

17.2.3 ERM

17.2.4 Generalized Hinge Loss

17.2.5 Multiclass SVM and SGD

17.3 Structured Output Prediction

17.4 Ranking

17.4.1 Linear Predictors for Ranking

17.5 Bipartite Ranking and Multivariate Performance Measures

17.5.1 Linear Predictors for Bipartite Ranking

17.6 Summary

17.7 Bibliographic Remarks

17.8 Exercises

18 Decision Trees

18.1 Sample Complexity

18.2 Decision Tree Algorithms

18.2.1 Implementations of the Gain Measure

18.2.2 Pruning

18.2.3 Threshold-Based Splitting Rules for Real-Valued Features

18.3 Random Forests

18.4 Summary

18.5 Bibliographic Remarks

18.6 Exercises

19 Nearest Neighbor

19.1 k Nearest Neighbors

19.2 Analysis

19.2.1 A Generalization Bound for the 1-NN Rule

19.2.2 The "Curse of Dimensionality"

19.3 Efficient Implementation*

19.4 Summary

19.5 Bibliographic Remarks

19.6 Exercises

20 Neural Networks

20.1 Feedforward Neural Networks

20.2 Learning Neural Networks

20.3 The Expressive Power of Neural Networks

20.3.1 Geometric Intuition

20.4 The Sample Complexity of Neural Networks

20.5 The Runtime of Learning Neural Networks

20.6 SGD and Backpropagation

20.7 Summary

20.8 Bibliographic Remarks

20.9 Exercises

Part 3 Additional Learning Models

21 Online Learning

21.1 Online Classification in the Realizable Case

21.1.1 Online Learnability

21.2 Online Classification in the Unrealizable Case

21.2.1 Weighted-Majority

21.3 Online Convex Optimization

21.4 The Online Perceptron Algorithm

21.5 Summary

21.6 Bibliographic Remarks

21.7 Exercises

22 Clustering

22.1 Linkage-Based Clustering Algorithms

22.2 k-Means and Other Cost Minimization Clusterings

22.2.1 The k-Means Algorithm

22.3 Spectral Clustering

22.3.1 Graph Cut

22.3.2 Graph Laplacian and Relaxed Graph Cuts

22.3.3 Unnormalized Spectral Clustering

22.4 Information Bottleneck*

22.5 A High-Level View of Clustering

22.6 Summary

22.7 Bibliographic Remarks

22.8 Exercises

23 Dimensionality Reduction

23.1 Principal Component Analysis (PCA)

23.1.1 A More Efficient Solution for the Case d ≫ m

23.1.2 Implementation and Demonstration

23.2 Random Projections

23.3 Compressed Sensing

23.3.1 Proofs*

23.4 PCA or Compressed Sensing?

23.5 Summary

23.6 Bibliographic Remarks

23.7 Exercises

24 Generative Models

24.1 Maximum Likelihood Estimator

24.1.1 Maximum Likelihood Estimation for Continuous Random Variables

24.1.2 Maximum Likelihood and Empirical Risk Minimization

24.1.3 Generalization Analysis

24.2 Naive Bayes

24.3 Linear Discriminant Analysis

24.4 Latent Variables and the EM Algorithm

24.4.1 EM as an Alternate Maximization Algorithm

24.4.2 EM for Mixture of Gaussians (Soft k-Means)

24.5 Bayesian Reasoning

24.6 Summary

24.7 Bibliographic Remarks

24.8 Exercises

25 Feature Selection and Generation

25.1 Feature Selection

25.1.1 Filters

25.1.2 Greedy Selection Approaches

25.1.3 Sparsity-Inducing Norms

25.2 Feature Manipulation and Normalization

25.2.1 Examples of Feature Transformations

25.3 Feature Learning

25.3.1 Dictionary Learning Using Auto-Encoders

25.4 Summary

25.5 Bibliographic Remarks

25.6 Exercises

Part 4 Advanced Theory

26 Rademacher Complexities

26.1 The Rademacher Complexity

26.1.1 Rademacher Calculus

26.2 Rademacher Complexity of Linear Classes

26.3 Generalization Bounds for SVM

26.4 Generalization Bounds for Predictors with Low l1 Norm

26.5 Bibliographic Remarks

27 Covering Numbers

27.1 Covering

27.1.1 Properties

27.2 From Covering to Rademacher Complexity via Chaining

27.3 Bibliographic Remarks

28 Proof of the Fundamental Theorem of Learning Theory

28.1 The Upper Bound for the Agnostic Case

28.2 The Lower Bound for the Agnostic Case

28.2.1 Showing That m(ε, δ) ≥ 0.5 log(1/(4δ))/ε²

28.2.2 Showing That m(ε, 1/8) ≥ 8d/ε²

28.3 The Upper Bound for the Realizable Case

28.3.1 From ε-Nets to PAC Learnability

29 Multiclass Learnability

29.1 The Natarajan Dimension

29.2 The Multiclass Fundamental Theorem

29.2.1 On the Proof of Theorem 29.3

29.3 Calculating the Natarajan Dimension

29.3.1 One-versus-All Based Classes

29.3.2 General Multiclass-to-Binary Reductions

29.3.3 Linear Multiclass Predictors

29.4 On Good and Bad ERMs

29.5 Bibliographic Remarks

29.6 Exercises

30 Compression Bounds

30.1 Compression Bounds

30.2 Examples

30.2.1 Axis Aligned Rectangles

30.2.2 Halfspaces

30.2.3 Separating Polynomials

30.2.4 Separation with Margin

30.3 Bibliographic Remarks

31 PAC-Bayes

31.1 PAC-Bayes Bounds

31.2 Bibliographic Remarks

31.3 Exercises

A Technical Lemmas

B Measure Concentration

B.1 Markov's Inequality

B.2 Chebyshev's Inequality

B.3 Chernoff's Bounds

B.4 Hoeffding's Inequality

B.5 Bennet's and Bernstein's Inequalities

B.5.1 Application

B.6 Slud's Inequality

B.7 Concentration of χ² Variables

C Linear Algebra

C.1 Basic Definitions

C.2 Eigenvalues and Eigenvectors

C.3 Positive Definite Matrices

C.4 Singular Value Decomposition (SVD)

References

Index
