3.2.2 The Scope of Learning Problems Modeled
3.4 Bibliographic Remarks
4 Learning via Uniform Convergence
4.1 Uniform Convergence Is Sufficient for Learnability
4.2 Finite Classes Are Agnostic PAC Learnable
4.4 Bibliographic Remarks
5 The Bias-Complexity Trade-off
5.1 The No-Free-Lunch Theorem
5.1.1 No-Free-Lunch and Prior Knowledge
5.4 Bibliographic Remarks
6.1 Infinite-Size Classes Can Be Learnable
6.3.1 Threshold Functions
6.3.3 Axis Aligned Rectangles
6.3.5 VC-Dimension and the Number of Parameters
6.4 The Fundamental Theorem of PAC Learning
6.5.1 Sauer's Lemma and the Growth Function
6.5.2 Uniform Convergence for Classes of Small Effective Size
6.7 Bibliographic Remarks
7 Nonuniform Learnability
7.1 Nonuniform Learnability
7.1.1 Characterizing Nonuniform Learnability
7.2 Structural Risk Minimization
7.3 Minimum Description Length and Occam's Razor
7.4 Other Notions of Learnability – Consistency
7.5 Discussing the Different Notions of Learnability
7.5.1 The No-Free-Lunch Theorem Revisited
7.7 Bibliographic Remarks
8 The Runtime of Learning
8.1 Computational Complexity of Learning
8.2 Implementing the ERM Rule
8.2.2 Axis Aligned Rectangles
8.2.3 Boolean Conjunctions
8.2.4 Learning 3-Term DNF
8.3 Efficiently Learnable, but Not by a Proper ERM
8.4 Hardness of Learning*
8.6 Bibliographic Remarks
Part 2 From Theory to Algorithms
9.1.1 Linear Programming for the Class of Halfspaces
9.1.2 Perceptron for Halfspaces
9.1.3 The VC Dimension of Halfspaces
9.2.2 Linear Regression for Polynomial Regression Tasks
9.5 Bibliographic Remarks
10.1.1 Efficient Implementation of ERM for Decision Stumps
10.3 Linear Combinations of Base Hypotheses
10.3.1 The VC-Dimension of L(B,T)
10.4 AdaBoost for Face Recognition
10.6 Bibliographic Remarks
11 Model Selection and Validation
11.1 Model Selection Using SRM
11.2.2 Validation for Model Selection
11.2.3 The Model-Selection Curve
11.2.4 k-Fold Cross Validation
11.2.5 Train-Validation-Test Split
11.3 What to Do If Learning Fails
12 Convex Learning Problems
12.1 Convexity, Lipschitzness, and Smoothness
12.2 Convex Learning Problems
12.2.1 Learnability of Convex Learning Problems
12.2.2 Convex-Lipschitz/Smooth-Bounded Learning Problems
12.3 Surrogate Loss Functions
12.5 Bibliographic Remarks
13 Regularization and Stability
13.1 Regularized Loss Minimization
13.2 Stable Rules Do Not Overfit
13.3 Tikhonov Regularization as a Stabilizer
13.3.2 Smooth and Nonnegative Loss
13.4 Controlling the Fitting-Stability Trade-off
13.6 Bibliographic Remarks
14 Stochastic Gradient Descent
14.1.1 Analysis of GD for Convex-Lipschitz Functions
14.2.1 Calculating Subgradients
14.2.2 Subgradients of Lipschitz Functions
14.2.3 Subgradient Descent
14.3 Stochastic Gradient Descent (SGD)
14.3.1 Analysis of SGD for Convex-Lipschitz-Bounded Functions
14.4.1 Adding a Projection Step
14.4.2 Variable Step Size
14.4.3 Other Averaging Techniques
14.4.4 Strongly Convex Functions*
14.5.1 SGD for Risk Minimization
14.5.2 Analyzing SGD for Convex-Smooth Learning Problems
14.5.3 SGD for Regularized Loss Minimization
14.7 Bibliographic Remarks
15 Support Vector Machines
15.1.1 The Homogeneous Case
15.1.2 The Sample Complexity of Hard-SVM
15.2 Soft-SVM and Norm Regularization
15.2.1 The Sample Complexity of Soft-SVM
15.2.2 Margin and Norm-Based Bounds versus Dimension
15.3 Optimality Conditions and "Support Vectors"*
15.5 Implementing Soft-SVM Using SGD
15.7 Bibliographic Remarks
16.1 Embeddings into Feature Spaces
16.2.1 Kernels as a Way to Express Prior Knowledge
16.2.2 Characterizing Kernel Functions*
16.3 Implementing Soft-SVM with Kernels
16.5 Bibliographic Remarks
17 Multiclass and Ranking
17.1 One-versus-All and All-Pairs
17.2 Linear Multiclass Predictors
17.2.2 Cost-Sensitive Classification
17.2.4 Generalized Hinge Loss
17.2.5 Multiclass SVM and SGD
17.3 Structured Output Prediction
17.4.1 Linear Predictors for Ranking
17.5 Bipartite Ranking and Multivariate Performance Measures
17.5.1 Linear Predictors for Bipartite Ranking
17.7 Bibliographic Remarks
18.2 Decision Tree Algorithms
18.2.1 Implementations of the Gain Measure
18.2.3 Threshold-Based Splitting Rules for Real-Valued Features
18.5 Bibliographic Remarks
19.2.1 A Generalization Bound for the 1-NN Rule
19.2.2 The "Curse of Dimensionality"
19.3 Efficient Implementation*
19.5 Bibliographic Remarks
20.1 Feedforward Neural Networks
20.2 Learning Neural Networks
20.3 The Expressive Power of Neural Networks
20.3.1 Geometric Intuition
20.4 The Sample Complexity of Neural Networks
20.5 The Runtime of Learning Neural Networks
20.6 SGD and Backpropagation
20.8 Bibliographic Remarks
Part 3 Additional Learning Models
21.1 Online Classification in the Realizable Case
21.1.1 Online Learnability
21.2 Online Classification in the Unrealizable Case
21.3 Online Convex Optimization
21.4 The Online Perceptron Algorithm
21.6 Bibliographic Remarks
22.1 Linkage-Based Clustering Algorithms
22.2 k-Means and Other Cost Minimization Clusterings
22.2.1 The k-Means Algorithm
22.3.2 Graph Laplacian and Relaxed Graph Cuts
22.3.3 Unnormalized Spectral Clustering
22.4 Information Bottleneck*
22.5 A High-Level View of Clustering
22.7 Bibliographic Remarks
23 Dimensionality Reduction
23.1 Principal Component Analysis (PCA)
23.1.1 A More Efficient Solution for the Case d >> m
23.1.2 Implementation and Demonstration
23.4 PCA or Compressed Sensing?
23.6 Bibliographic Remarks
24.1 Maximum Likelihood Estimator
24.1.1 Maximum Likelihood Estimation for Continuous Random Variables
24.1.2 Maximum Likelihood and Empirical Risk Minimization
24.1.3 Generalization Analysis
24.3 Linear Discriminant Analysis
24.4 Latent Variables and the EM Algorithm
24.4.1 EM as an Alternate Maximization Algorithm
24.4.2 EM for Mixture of Gaussians (Soft k-Means)
24.7 Bibliographic Remarks
25 Feature Selection and Generation
25.1.2 Greedy Selection Approaches
25.1.3 Sparsity-Inducing Norms
25.2 Feature Manipulation and Normalization
25.2.1 Examples of Feature Transformations
25.3.1 Dictionary Learning Using Auto-Encoders
25.5 Bibliographic Remarks
26 Rademacher Complexities
26.1 The Rademacher Complexity
26.1.1 Rademacher Calculus
26.2 Rademacher Complexity of Linear Classes
26.3 Generalization Bounds for SVM
26.4 Generalization Bounds for Predictors with Low ℓ1 Norm
26.5 Bibliographic Remarks
27.2 From Covering to Rademacher Complexity via Chaining
27.3 Bibliographic Remarks
28 Proof of the Fundamental Theorem of Learning Theory
28.1 The Upper Bound for the Agnostic Case
28.2 The Lower Bound for the Agnostic Case
28.2.1 Showing That m(ε, δ) ≥ 0.5 log(1/(4δ))/ε²
28.2.2 Showing That m(ε, 1/8) ≥ 8d/ε²
28.3 The Upper Bound for the Realizable Case
28.3.1 From ε-Nets to PAC Learnability
29 Multiclass Learnability
29.1 The Natarajan Dimension
29.2 The Multiclass Fundamental Theorem
29.2.1 On the Proof of Theorem 29.3
29.3 Calculating the Natarajan Dimension
29.3.1 One-versus-All Based Classes
29.3.2 General Multiclass-to-Binary Reductions
29.3.3 Linear Multiclass Predictors
29.4 On Good and Bad ERMs
29.5 Bibliographic Remarks
30.2.1 Axis Aligned Rectangles
30.2.3 Separating Polynomials
30.2.4 Separation with Margin
30.3 Bibliographic Remarks
31.2 Bibliographic Remarks
B.2 Chebyshev's Inequality
B.4 Hoeffding's Inequality
B.5 Bennett's and Bernstein's Inequalities
B.7 Concentration of χ² Variables
C.2 Eigenvalues and Eigenvectors
C.3 Positive Definite Matrices
C.4 Singular Value Decomposition (SVD)