An Introduction to Categorical Data Analysis

Publication year: 2018

E-ISBN: 9781119405276

P-ISBN(Paperback): 9781119405269

Subject: O211 probability (probability theory)

Keyword: Categorical Response Data Probability Distributions Categorical Data Statistical Discrete Data Bayesian Inference R Software Contingency Tables Probability Structure Odds Ratio Chi-Squared Tests of Independence Ordinal Variables Exact Frequentist Generalized Linear Models Statistical Inference Model Checking Logistic Regression ROC Curves Binary Regression Models Model Selection Penalized Likelihood Conditional Likelihood Linear Probability Probit Models Logit Models Lin

Language: ENG

Contents

Preface

About the Companion Website

1 Introduction

1.1 CATEGORICAL RESPONSE DATA

1.1.1 Response Variable and Explanatory Variables

1.1.2 Binary–Nominal–Ordinal Scale Distinction

1.1.3 Organization of this Book

1.2 PROBABILITY DISTRIBUTIONS FOR CATEGORICAL DATA

1.2.1 Binomial Distribution

1.2.2 Multinomial Distribution

1.3 STATISTICAL INFERENCE FOR A PROPORTION

1.3.1 Likelihood Function and Maximum Likelihood Estimation

1.3.2 Significance Test About a Binomial Parameter

1.3.3 Example: Surveyed Opinions About Legalized Abortion

1.3.4 Confidence Intervals for a Binomial Parameter

1.3.5 Better Confidence Intervals for a Binomial Proportion

1.4 STATISTICAL INFERENCE FOR DISCRETE DATA

1.4.1 Wald, Likelihood-Ratio, and Score Tests

1.4.2 Example: Wald, Score, and Likelihood-Ratio Binomial Tests

1.4.3 Small-Sample Binomial Inference and the Mid P-Value

1.5 BAYESIAN INFERENCE FOR PROPORTIONS

1.5.1 The Bayesian Approach to Statistical Inference

1.5.2 Bayesian Binomial Inference: Beta Prior Distributions

1.5.3 Example: Opinions about Legalized Abortion, Revisited

1.5.4 Other Prior Distributions

1.6 USING R SOFTWARE FOR STATISTICAL INFERENCE ABOUT PROPORTIONS

1.6.1 Reading Data Files and Installing Packages

1.6.2 Using R for Statistical Inference about Proportions

1.6.3 Summary: Choosing an Inference Method

Exercises

2 Analyzing Contingency Tables

2.1 PROBABILITY STRUCTURE FOR CONTINGENCY TABLES

2.1.1 Joint, Marginal, and Conditional Probabilities

2.1.2 Example: Sensitivity and Specificity

2.1.3 Statistical Independence of Two Categorical Variables

2.1.4 Binomial and Multinomial Sampling

2.2 COMPARING PROPORTIONS IN 2×2 CONTINGENCY TABLES

2.2.1 Difference of Proportions

2.2.2 Example: Aspirin and Incidence of Heart Attacks

2.2.3 Ratio of Proportions (Relative Risk)

2.2.4 Using R for Comparing Proportions in 2×2 Tables

2.3 THE ODDS RATIO

2.3.1 Properties of the Odds Ratio

2.3.2 Example: Odds Ratio for Aspirin Use and Heart Attacks

2.3.3 Inference for Odds Ratios and Log Odds Ratios

2.3.4 Relationship Between Odds Ratio and Relative Risk

2.3.5 Example: The Odds Ratio Applies in Case-Control Studies

2.3.6 Types of Studies: Observational Versus Experimental

2.4 CHI-SQUARED TESTS OF INDEPENDENCE

2.4.1 Pearson Statistic and the Chi-Squared Distribution

2.4.2 Likelihood-Ratio Statistic

2.4.3 Testing Independence in Two-Way Contingency Tables

2.4.4 Example: Gender Gap in Political Party Affiliation

2.4.5 Residuals for Cells in a Contingency Table

2.4.6 Partitioning Chi-Squared Statistics

2.4.7 Limitations of Chi-Squared Tests

2.5 TESTING INDEPENDENCE FOR ORDINAL VARIABLES

2.5.1 Linear Trend Alternative to Independence

2.5.2 Example: Alcohol Use and Infant Malformation

2.5.3 Ordinal Tests Usually Have Greater Power

2.5.4 Choice of Scores

2.5.5 Trend Tests for r×2 and 2×c and Nominal–Ordinal Tables

2.6 EXACT FREQUENTIST AND BAYESIAN INFERENCE

2.6.1 Fisher’s Exact Test for 2×2 Tables

2.6.2 Example: Fisher’s Tea Tasting Colleague

2.6.3 Conservatism for Actual P(Type I Error); Mid P-Values

2.6.4 Small-Sample Confidence Intervals for Odds Ratio

2.6.5 Bayesian Estimation for Association Measures

2.6.6 Example: Bayesian Inference in a Small Clinical Trial

2.7 ASSOCIATION IN THREE-WAY TABLES

2.7.1 Partial Tables

2.7.2 Example: Death Penalty Verdicts and Race

2.7.3 Simpson’s Paradox

2.7.4 Conditional and Marginal Odds Ratios

2.7.5 Homogeneous Association

Exercises

3 Generalized Linear Models

3.1 COMPONENTS OF A GENERALIZED LINEAR MODEL

3.1.1 Random Component

3.1.2 Linear Predictor

3.1.3 Link Function

3.1.4 Ordinary Linear Model: GLM with Normal Random Component

3.2 GENERALIZED LINEAR MODELS FOR BINARY DATA

3.2.1 Linear Probability Model

3.2.2 Logistic Regression Model

3.2.3 Example: Snoring and Heart Disease

3.2.4 Using R to Fit Generalized Linear Models for Binary Data

3.2.5 Data Files: Ungrouped or Grouped Binary Data

3.3 GENERALIZED LINEAR MODELS FOR COUNTS AND RATES

3.3.1 Poisson Distribution for Counts

3.3.2 Poisson Loglinear Model

3.3.3 Example: Female Horseshoe Crabs and their Satellites

3.3.4 Overdispersion: Greater Variability than Expected

3.4 STATISTICAL INFERENCE AND MODEL CHECKING

3.4.1 Wald, Likelihood-Ratio, and Score Inference Use the Likelihood Function

3.4.2 Example: Political Ideology and Belief in Evolution

3.4.3 The Deviance of a GLM

3.4.4 Model Comparison Using the Deviance

3.4.5 Residuals Comparing Observations to the Model Fit

3.5 FITTING GENERALIZED LINEAR MODELS

3.5.1 The Fisher Scoring Algorithm Fits GLMs

3.5.2 Bayesian Methods for Generalized Linear Models

3.5.3 GLMs: A Unified Approach to Statistical Analysis

Exercises

4 Logistic Regression

4.1 THE LOGISTIC REGRESSION MODEL

4.1.1 The Logistic Regression Model

4.1.2 Odds Ratio and Linear Approximation Interpretations

4.1.3 Example: Whether a Female Horseshoe Crab Has Satellites

4.1.4 Logistic Regression with Retrospective Studies

4.1.5 Normally Distributed X Implies Logistic Regression for Y

4.2 STATISTICAL INFERENCE FOR LOGISTIC REGRESSION

4.2.1 Confidence Intervals for Effects

4.2.2 Significance Testing

4.2.3 Fitted Values and Confidence Intervals for Probabilities

4.2.4 Why Use a Model to Estimate Probabilities?

4.3 LOGISTIC REGRESSION WITH CATEGORICAL PREDICTORS

4.3.1 Indicator Variables Represent Categories of Predictors

4.3.2 Example: Survey about Marijuana Use

4.3.3 ANOVA-Type Model Representation of Factors

4.3.4 Tests of Conditional Independence and of Homogeneity for Three-Way Contingency Tables

4.4 MULTIPLE LOGISTIC REGRESSION

4.4.1 Example: Horseshoe Crabs with Color and Width Predictors

4.4.2 Model Comparison to Check Whether a Term is Needed

4.4.3 Example: Treating Color as Quantitative or Binary

4.4.4 Allowing Interaction between Explanatory Variables

4.4.5 Effects Depend on Other Explanatory Variables in Model

4.5 SUMMARIZING EFFECTS IN LOGISTIC REGRESSION

4.5.1 Probability-Based Interpretations

4.5.2 Marginal Effects and Their Average

4.5.3 Standardized Interpretations

4.6 SUMMARIZING PREDICTIVE POWER: CLASSIFICATION TABLES, ROC CURVES, AND MULTIPLE CORRELATION

4.6.1 Summarizing Predictive Power: Classification Tables

4.6.2 Summarizing Predictive Power: ROC Curves

4.6.3 Summarizing Predictive Power: Multiple Correlation

Exercises

5 Building and Applying Logistic Regression Models

5.1 STRATEGIES IN MODEL SELECTION

5.1.1 How Many Explanatory Variables Can the Model Handle?

5.1.2 Example: Horseshoe Crab Satellites Revisited

5.1.3 Stepwise Variable Selection Algorithms

5.1.4 Purposeful Selection of Explanatory Variables

5.1.5 Example: Variable Selection for Horseshoe Crabs

5.1.6 AIC and the Bias/Variance Tradeoff

5.2 MODEL CHECKING

5.2.1 Goodness of Fit: Model Comparison Using the Deviance

5.2.2 Example: Goodness of Fit for Marijuana Use Survey

5.2.3 Goodness of Fit: Grouped versus Ungrouped Data and Continuous Predictors

5.2.4 Residuals for Logistic Models with Categorical Predictors

5.2.5 Example: Graduate Admissions at University of Florida

5.2.6 Standardized versus Pearson and Deviance Residuals

5.2.7 Influence Diagnostics for Logistic Regression

5.2.8 Example: Heart Disease and Blood Pressure

5.3 INFINITE ESTIMATES IN LOGISTIC REGRESSION

5.3.1 Complete and Quasi-Complete Separation: Perfect Discrimination

5.3.2 Example: Infinite Estimate for Toy Example

5.3.3 Sparse Data and Infinite Effects with Categorical Predictors

5.3.4 Example: Risk Factors for Endometrial Cancer Grade

5.4 BAYESIAN INFERENCE, PENALIZED LIKELIHOOD, AND CONDITIONAL LIKELIHOOD FOR LOGISTIC REGRESSION

5.4.1 Bayesian Modeling: Specification of Prior Distributions

5.4.2 Example: Risk Factors for Endometrial Cancer Revisited

5.4.3 Penalized Likelihood Reduces Bias in Logistic Regression

5.4.4 Example: Risk Factors for Endometrial Cancer Revisited

5.4.5 Conditional Likelihood and Conditional Logistic Regression

5.4.6 Conditional Logistic Regression and Exact Tests for Contingency Tables

5.5 ALTERNATIVE LINK FUNCTIONS: LINEAR PROBABILITY AND PROBIT MODELS

5.5.1 Linear Probability Model

5.5.2 Example: Political Ideology and Belief in Evolution

5.5.3 Probit Model and Normal Latent Variable Model

5.5.4 Example: Snoring and Heart Disease Revisited

5.5.5 Latent Variable Models Imply Binary Regression Models

5.5.6 CDFs and Shapes of Curves for Binary Regression Models

5.6 SAMPLE SIZE AND POWER FOR LOGISTIC REGRESSION

5.6.1 Sample Size for Comparing Two Proportions

5.6.2 Sample Size in Logistic Regression Modeling

5.6.3 Example: Modeling the Probability of Heart Disease

Exercises

6 Multicategory Logit Models

6.1 BASELINE-CATEGORY LOGIT MODELS FOR NOMINAL RESPONSES

6.1.1 Baseline-Category Logits

6.1.2 Example: What Do Alligators Eat?

6.1.3 Estimating Response Probabilities

6.1.4 Checking Multinomial Model Goodness of Fit

6.1.5 Example: Belief in Afterlife

6.1.6 Discrete Choice Models

6.1.7 Example: Shopping Destination Choice

6.2 CUMULATIVE LOGIT MODELS FOR ORDINAL RESPONSES

6.2.1 Cumulative Logit Models with Proportional Odds

6.2.2 Example: Political Ideology and Political Party Affiliation

6.2.3 Inference about Cumulative Logit Model Parameters

6.2.4 Increased Power for Ordinal Analyses

6.2.5 Example: Happiness and Family Income

6.2.6 Latent Variable Linear Models Imply Cumulative Link Models

6.2.7 Invariance to Choice of Response Categories

6.3 CUMULATIVE LINK MODELS: MODEL CHECKING AND EXTENSIONS

6.3.1 Checking Ordinal Model Goodness of Fit

6.3.2 Cumulative Logit Model without Proportional Odds

6.3.3 Simpler Interpretations Use Probabilities

6.3.4 Example: Modeling Mental Impairment

6.3.5 A Latent Variable Probability Comparison of Groups

6.3.6 Cumulative Probit Model

6.3.7 R² Based on the Latent Variable Model

6.3.8 Bayesian Inference for Multinomial Models

6.3.9 Example: Modeling Mental Impairment Revisited

6.4 PAIRED-CATEGORY LOGIT MODELING OF ORDINAL RESPONSES

6.4.1 Adjacent-Categories Logits

6.4.2 Example: Political Ideology Revisited

6.4.3 Sequential Logits

6.4.4 Example: Tonsil Size and Streptococcus

Exercises

7 Loglinear Models for Contingency Tables and Counts

7.1 LOGLINEAR MODELS FOR COUNTS IN CONTINGENCY TABLES

7.1.1 Loglinear Model of Independence for Two-Way Contingency Tables

7.1.2 Interpretation of Parameters in the Independence Model

7.1.3 Example: Happiness and Belief in Heaven

7.1.4 Saturated Model for Two-Way Contingency Tables

7.1.5 Loglinear Models for Three-Way Contingency Tables

7.1.6 Two-Factor Parameters Describe Conditional Associations

7.1.7 Example: Student Alcohol, Cigarette, and Marijuana Use

7.2 STATISTICAL INFERENCE FOR LOGLINEAR MODELS

7.2.1 Chi-Squared Goodness-of-Fit Tests

7.2.2 Cell Standardized Residuals for Loglinear Models

7.2.3 Significance Tests about Conditional Associations

7.2.4 Confidence Intervals for Conditional Odds Ratios

7.2.5 Bayesian Fitting of Loglinear Models

7.2.6 Loglinear Models for Higher-Dimensional Contingency Tables

7.2.7 Example: Automobile Accidents and Seat Belts

7.2.8 Interpreting Three-Factor Interaction Terms

7.2.9 Statistical Versus Practical Significance: Dissimilarity Index

7.3 THE LOGLINEAR–LOGISTIC MODEL CONNECTION

7.3.1 Using Logistic Models to Interpret Loglinear Models

7.3.2 Example: Auto Accident Data Revisited

7.3.3 Condition for Equivalent Loglinear and Logistic Models

7.3.4 Loglinear/Logistic Model Selection Issues

7.4 INDEPENDENCE GRAPHS AND COLLAPSIBILITY

7.4.1 Independence Graphs

7.4.2 Collapsibility Conditions for Contingency Tables

7.4.3 Example: Loglinear Model Building for Student Substance Use

7.4.4 Collapsibility and Logistic Models

7.5 MODELING ORDINAL ASSOCIATIONS IN CONTINGENCY TABLES

7.5.1 Linear-by-Linear Association Model

7.5.2 Example: Linear-by-Linear Association for Sex Opinions

7.5.3 Ordinal Significance Tests of Independence

7.6 LOGLINEAR MODELING OF COUNT RESPONSE VARIABLES

7.6.1 Count Regression Modeling of Rate Data

7.6.2 Example: Death Rates for Lung Cancer Patients

7.6.3 Negative Binomial Regression Models

7.6.4 Example: Female Horseshoe Crab Satellites Revisited

Exercises

8 Models for Matched Pairs

8.1 COMPARING DEPENDENT PROPORTIONS FOR BINARY MATCHED PAIRS

8.1.1 McNemar Test Comparing Marginal Proportions

8.1.2 Estimating the Difference between Dependent Proportions

8.2 MARGINAL MODELS AND SUBJECT-SPECIFIC MODELS FOR MATCHED PAIRS

8.2.1 Marginal Models for Marginal Proportions

8.2.2 Example: Environmental Opinions Revisited

8.2.3 Subject-Specific and Population-Averaged Tables

8.2.4 Conditional Logistic Regression for Matched-Pairs

8.2.5 Logistic Regression for Matched Case-Control Studies

8.3 COMPARING PROPORTIONS FOR NOMINAL MATCHED-PAIRS RESPONSES

8.3.1 Marginal Homogeneity for Baseline-Category Logit Models

8.3.2 Example: Coffee Brand Market Share

8.3.3 Using the Cochran–Mantel–Haenszel Test to Test Marginal Homogeneity

8.3.4 Symmetry and Quasi-Symmetry Models for Square Contingency Tables

8.3.5 Example: Coffee Brand Market Share Revisited

8.4 COMPARING PROPORTIONS FOR ORDINAL MATCHED-PAIRS RESPONSES

8.4.1 Marginal Homogeneity and Cumulative Logit Marginal Model

8.4.2 Example: Recycle or Drive Less to Help the Environment?

8.4.3 An Ordinal Quasi-Symmetry Model

8.4.4 Example: Recycle or Drive Less Revisited?

8.5 ANALYZING RATER AGREEMENT

8.5.1 Example: Agreement on Carcinoma Diagnosis

8.5.2 Cell Residuals for Independence Model

8.5.3 Quasi-Independence Model

8.5.4 Quasi Independence and Odds Ratios Summarizing Agreement

8.5.5 Kappa Summary Measure of Agreement

8.6 BRADLEY–TERRY MODEL FOR PAIRED PREFERENCES

8.6.1 The Bradley–Terry Model and Quasi-Symmetry

8.6.2 Example: Ranking Men Tennis Players

Exercises

9 Marginal Modeling of Correlated, Clustered Responses

9.1 MARGINAL MODELS VERSUS SUBJECT-SPECIFIC MODELS

9.1.1 Marginal Models for a Clustered Binary Response

9.1.2 Example: Repeated Responses on Similar Survey Questions

9.1.3 Subject-Specific Models for a Repeated Response

9.2 MARGINAL MODELING: THE GENERALIZED ESTIMATING EQUATIONS (GEE) APPROACH

9.2.1 Quasi-Likelihood Methods

9.2.2 Generalized Estimating Equation Methodology: Basic Ideas

9.2.3 Example: Opinion about Legalized Abortion Revisited

9.2.4 Limitations of GEE Compared to ML

9.3 MARGINAL MODELING FOR CLUSTERED MULTINOMIAL RESPONSES

9.3.1 Example: Insomnia Study

9.3.2 Alternative GEE Specification of Working Association

9.4 TRANSITIONAL MODELING, GIVEN THE PAST

9.4.1 Transitional Models with Explanatory Variables

9.4.2 Example: Respiratory Illness and Maternal Smoking

9.4.3 Group Comparisons Treating Initial Response as a Covariate

9.5 DEALING WITH MISSING DATA

9.5.1 Missing at Random: Impact on ML and GEE Methods

9.5.2 Multiple Imputation: Monte Carlo Prediction of Missing Data

Exercises

10 Random Effects: Generalized Linear Mixed Models

10.1 RANDOM EFFECTS MODELING OF CLUSTERED CATEGORICAL DATA

10.1.1 The Generalized Linear Mixed Model (GLMM)

10.1.2 A Logistic GLMM for Binary Matched Pairs

10.1.3 Example: Environmental Opinions Revisited

10.1.4 Differing Effects in GLMMs and Marginal Models

10.1.5 Model Fitting for GLMMs

10.1.6 Inference for Model Parameters and Prediction

10.2 EXAMPLES: RANDOM EFFECTS MODELS FOR BINARY DATA

10.2.1 Small-Area Estimation of Binomial Probabilities

10.2.2 Example: Estimating Basketball Free Throw Success

10.2.3 Example: Opinions about Legalized Abortion Revisited

10.2.4 Item Response Models: The Rasch Model

10.2.5 Choice of Marginal Model or Random Effects Model

10.3 EXTENSIONS TO MULTINOMIAL RESPONSES AND MULTIPLE RANDOM EFFECT TERMS

10.3.1 Example: Insomnia Study Revisited

10.3.2 Meta-Analysis: Bivariate Random Effects for Association Heterogeneity

10.4 MULTILEVEL (HIERARCHICAL) MODELS

10.4.1 Example: Two-Level Model for Student Performance

10.4.2 Example: Smoking Prevention and Cessation Study

10.5 LATENT CLASS MODELS

10.5.1 Independence Given a Latent Categorical Variable

10.5.2 Example: Latent Class Model for Rater Agreement

Exercises

11 Classification and Smoothing

11.1 CLASSIFICATION: LINEAR DISCRIMINANT ANALYSIS

11.1.1 Classification with Fisher’s Linear Discriminant Function

11.1.2 Example: Horseshoe Crab Satellites Revisited

11.1.3 Discriminant Analysis Versus Logistic Regression

11.2 CLASSIFICATION: TREE-BASED PREDICTION

11.2.1 Classification Trees

11.2.2 Example: A Classification Tree for Horseshoe Crab Mating

11.2.3 How Does the Classification Tree Grow?

11.2.4 Pruning a Tree and Checking Prediction Accuracy

11.2.5 Classification Trees Versus Logistic Regression and Discriminant Analysis

11.3 CLUSTER ANALYSIS FOR CATEGORICAL RESPONSES

11.3.1 Measuring Dissimilarity Between Observations

11.3.2 Hierarchical Clustering Algorithm and Dendrograms

11.3.3 Example: Clustering States on Presidential Elections

11.4 SMOOTHING: GENERALIZED ADDITIVE MODELS

11.4.1 Generalized Additive Models

11.4.2 Example: GAMs for Horseshoe Crab Data

11.4.3 How Much Smoothing? The Bias/Variance Tradeoff

11.4.4 Example: Smoothing to Portray Probability of Kyphosis

11.5 REGULARIZATION FOR HIGH-DIMENSIONAL CATEGORICAL DATA (LARGE p)

11.5.1 Penalized-Likelihood Methods and Lq-Norm Smoothing

11.5.2 Implementing the Lasso

11.5.3 Example: Predicting Opinion on Abortion with Student Survey

11.5.4 Why Shrink ML Estimates Toward 0?

11.5.5 Issues in Variable Selection (Dimension Reduction)

11.5.6 Controlling the False Discovery Rate

11.5.7 Large p also Makes Bayesian Inference Challenging

Exercises

12 A Historical Tour of Categorical Data Analysis

The Pearson–Yule Association Controversy

R.A. Fisher’s Contributions

Logistic Regression

Multiway Contingency Tables and Loglinear Models

Final Comments

Appendix: Software for Categorical Data Analysis

A.1 R FOR CATEGORICAL DATA ANALYSIS

A.2 SAS FOR CATEGORICAL DATA ANALYSIS

Chapters 1–2: Introduction and Contingency Tables

Chapters 3–5: Generalized Linear Models and Logistic Regression

Chapters 6–7: Multicategory Logit Models and Loglinear Models

Chapter 8: Matched Pairs

Chapters 9–10: Marginal Models and Random Effects Models (GLMMs)

Chapter 11: Non-Model-Based Classification and Clustering

A.3 STATA FOR CATEGORICAL DATA ANALYSIS

Chapters 1–2: Introduction and Contingency Tables

Chapters 3–5: Generalized Linear Models and Logistic Regression

Chapters 6–7: Multicategory Logit Models and Loglinear Models

Chapters 8–11: Correlated Observations, Advanced Methods

A.4 SPSS FOR CATEGORICAL DATA ANALYSIS

Chapters 1–2: Introduction and Contingency Tables

Chapters 3–5: Generalized Linear Models and Logistic Regression

Chapters 6–7: Multicategory Logit Models and Loglinear Models

Chapters 8–11: Correlated Observations, Advanced Methods

Brief Solutions to Odd-Numbered Exercises

Chapter 1

Chapter 2

Chapter 3

Chapter 4

Chapter 5

Chapter 6

Chapter 7

Chapter 8

Chapter 9

Chapter 10

Chapter 11

Bibliography

Examples Index

Subject Index

EULA
