1.3 Summary of Important Points
Part 1 Features and Representations
2 Social Interaction in Temporary Gatherings
2.1 Introduction: Group and Crowd Behavior in Context
2.2 Social Interaction: A Typology and Some Definitions
2.2.1 Unfocused Interaction
2.2.2 Common-Focused Interaction
2.2.3 Jointly-Focused Interaction
2.3 Temporary Gatherings: A Taxonomy and Some Examples
2.3.1 Small Gatherings - Semi/Private Encounters and Group Life
2.3.2 Medium Gatherings - Semi/Public Occasions and Community Life
2.3.3 Large Gatherings - Public Events and Collective Life
2.4 Conclusion: Microsociology Applied to Computer Vision
3 Group Detection and Tracking Using Sociological Features
3.3 Sociological Features
3.3.1.2 Person Velocity and Direction
3.3.1.3 Head & Body Orientation
3.3.2 High-Level Features
3.3.2.1 3D Subjective View Frustum
3.3.2.2 Transactional Segment-Based Frustum
3.4.1 Game-Theoretic Conversational Grouping Model
3.4.2 The Dirichlet Process Mixture Model
3.5.1 DPF for Group Tracking
Individual Proposal p(X_{t+1} | X_{0:t}, y_{0:t+1})
Joint Observation Distribution p(y_t | X_t, T_t)
Joint Individual Distribution p(X_{t+1} | X_{0:t}, T_t)
Joint Group Proposal p(T_{t+1} | X_{0:t+1}, T_t)
3.6.1 Results of Group Detection
3.6.1.2 Evaluation Metrics
3.6.1.3 Comparing Methods
3.6.1.4 Performance Evaluation
3.6.2 Results of Group Tracking
3.6.2.2 Evaluation Metrics
3.6.2.3 Comparing Methods
3.6.2.4 Performance Analysis
4 Exploring Multitask and Transfer Learning Algorithms for Head Pose Estimation in Dynamic Multiview Scenarios
4.2.1 Head Pose Estimation from Low-Resolution Images
4.3 TL and MTL for Multiview Head Pose Estimation
4.3.2 Transfer Learning for HPE
4.3.2.1 Head-Pan Classification Under Varying Head-Tilt
4.3.2.2 Head-Pan Classification Under Target Motion
4.3.3 Multitask Learning for HPE
5 The Analysis of High Density Crowds in Videos
5.2.1 Crowd Motion Modeling and Segmentation
5.2.2 Estimating Density of People in a Crowded Scene
5.2.3 Crowd Event Modeling and Recognition
5.2.4 Detecting and Tracking in a Crowded Scene
5.3 Data-Driven Crowd Analysis in Videos
5.3.1 Off-Line Analysis of Crowd Video Database
5.3.1.1 Low-Level Representation
5.3.1.2 Mid-Level Representation
5.3.2.1 Global Crowded Scene Matching
5.3.2.2 Local Crowd Patch Matching
5.3.3 Transferring Learned Crowd Behaviors
5.3.4 Experiments and Results
5.4 Density-Aware Person Detection and Tracking in Crowds
5.4.1.1 Tracking Detections
5.5 CrowdNet: Learning a Representation for High Density Crowds in Videos
5.5.2 Overview of the Approach
5.5.3 Crowd Patch Mining in Videos
5.5.5 Learning a Representation for High Density Crowds
5.6 Conclusions and Directions for Future Research
6 Tracking Millions of Humans in Crowded Spaces
6.4 Human Detection in 3D
6.6.1 Social Affinity Map - SAM
6.6.3 Tracklet Association Method
6.6.5 Coarse-to-Fine Data Association
6.7.1 Large-Scale Evaluation
7 Subject-Centric Group Feature for Person Reidentification
7.3.2 Person-Group Feature
7.3.2.1 In-Group Position Signature
7.3.2.2 Metric of Person-Group Feature
7.3.3 Person Reidentification with Person-Group Feature
7.4.1 Features Evaluation
7.4.1.1 Group Extraction Evaluation
7.4.1.2 Group Features Evaluation
7.4.2 Comparison with Baseline Approaches
7.4.3 Comparison with Group-Based Approaches
Part 2 Group and Crowd Behavior Modeling
8 From Groups to Leaders and Back
8.2 Modeling and Observing Groups and Their Leaders in Literature
8.2.1 Sociological Perspective
8.2.2 Computational Approaches
8.3 Technical Preliminaries and Structured Output Prediction
8.3.2 Stochastic Optimization
8.4 The Tools of the Trade in Social and Structured Crowd Analysis
8.4.1 Socially Constrained Structural Learning for Groups Detection in Crowd
8.4.1.2 SSVM Adaptation to Group Detection
8.4.2 Learning to Identify Group Leaders in Crowd
8.4.2.2 SSVM Adaptation to Leader Identification
8.5 Results on Visual Localization of Groups and Leaders
8.6 The Predictive Power of Leaders in Social Groups
8.6.1 Experimental Settings
8.6.2 Leader Centrality in Feature Space
8.6.2.1 Group Recovery Guarantees
8.6.2.2 Validation and Results
9 Learning to Predict Human Behavior in Crowded Scenes
9.2.1 Human-Human Interactions
9.2.2 Activity Forecasting
9.2.3 RNN Models for Sequence Prediction
9.3 Forecasting with Social Forces Model
9.3.2 Modeling Social Sensitivity
9.3.2.1 Social Sensitivity Feature
9.3.3 Forecasting with Social Sensitivity
9.4 Forecasting with Recurrent Neural Network
9.4.1.1 Social Pooling of Hidden States
9.4.1.2 Position Estimation
9.4.1.3 Occupancy Map Pooling
9.4.1.4 Inference for Path Prediction
9.4.2 Implementation Details
9.5.1 Analyzing the Predicted Paths
9.5.2 Discussions and Limitations
10 Deep Learning for Scene-Independent Crowd Analysis
10.2 Large Scale Crowd Datasets
10.2.1 Shanghai World Expo'10 Crowd Dataset
10.2.2.1 Crowd Video Construction
10.2.2.2 Crowd Attribute Annotation
Collecting Crowd Attributes from Web Tags
Crowd Attribute Annotation
10.2.3 User Study on Crowd Attribute
10.3 Crowd Counting and Density Estimation
10.3.1.1 Normalized Crowd Density Map for Training
10.3.2 Nonparametric Fine-Tuning Method for Target Scene
10.3.2.1 Candidate Fine-Tuning Scene Retrieval
10.3.2.2 Local Patch Retrieval
10.3.2.3 Experimental Results
10.4 Attributes for Crowded Scene Understanding
10.4.2 Slicing Convolutional Neural Network
10.4.2.1 Semantic Selectiveness of Feature Maps
10.4.2.2 Feature Map Pruning
10.4.2.3 Semantic Temporal Slices
10.4.3 S-CNN Deep Architecture
10.4.3.1 Single Branch of S-CNN Model
10.4.3.2 Combined S-CNN Model
10.4.4.1 Experimental Setting
10.4.4.2 Ablation Study of S-CNN
Level of Semantics and Temporal Range
Single Branch Model vs. Combined Model
10.4.4.3 Comparison with State-of-the-Art Methods
11 Physics-Inspired Models for Detecting Abnormal Behaviors in Crowded Scenes
11.2 Crowd Anomaly Detection: A General Review
11.3 Physics-Inspired Crowd Models
11.3.1 Social Force Models
11.3.3 Crowd Energy Models
11.3.4 Substantial Derivative
11.4.1 The Substantial Derivative Model
11.4.1.1 Substantial Derivative in Fluid Mechanics
11.4.1.2 Modeling Pedestrian Motion Dynamics
11.4.1.3 Estimation of Local and Convective Forces from Videos
11.5 Experimental Results
11.5.2 Effect of Sampled Patches
11.5.3 Comparison to State-of-the-Art
12.3 Activity Forecasting as Optimal Control
12.3.1 Toward Decision-Theoretic Models
12.3.2 Markov Decision Processes and Optimal Control
12.3.3 Maximum Entropy Inverse Optimal Control (MaxEnt IOC)
12.4 Single Agent Trajectory Forecasting in Static Environment
12.5 Multiagent Trajectory Forecasting
12.6 Dual-Agent Interaction Forecasting
Part 3 Metrics, Benchmarks and Systems
13 Integrating Computer Vision Algorithms and Ontologies for Spectator Crowd Behavior Analysis
13.2 Computer Vision and Ontology
13.3 An Extension of the dolce Ontology for Spectator Crowd
13.3.1 Modeling the Spectator Crowd and the Playground in dolce
13.3.2 A Tractable Fragment of dolce
13.4 Reasoning on the Temporal Alignment of Stands and Playground
13.4.1 A New Description Logic for Video Interpretation
13.4.1.1 ALCTemp: Syntax and Semantics
13.4.1.2 Reasoning Services for Video Interpretation
13.4.2 An Example of Application of the Integrated Approach
14 SALSA: A Multimodal Dataset for the Automated Analysis of Free-Standing Social Interactions
14.2.1 Unimodal Approaches
14.2.1.1 Vision-Based Approaches
14.2.1.2 Audio-Based Approaches
14.2.1.3 Wearable-Sensor Based Approaches
14.2.2 Multimodal Approaches
14.3 Spotting the Research Gap
14.3.2 ASIA Methodologies
14.3.2.1 Human Tracking and Pose Estimation
14.3.2.2 Speech Processing
14.3.2.3 F-Formation Detection
14.4.1 Scenario and Roles
14.4.3.2 Personality Data
14.5 Experiments on SALSA
14.5.1 Visual Tracking of Multiple Targets
14.5.2 Head and Body Pose Estimation from Visual Data
14.5.3 F-Formation Detection
14.6 Conclusions and Future Work
15 Zero-Shot Crowd Behavior Recognition
15.2.2 Zero-Shot Learning
15.2.3 Multilabel Learning
15.2.4 Multilabel Zero-Shot Learning
15.3.1 Probabilistic Zero-Shot Prediction
15.3.2 Modeling Attribute Relation from Context
15.3.2.1 Learning Attribute Relatedness from Text Corpora
15.3.2.2 Context Learning from Visual Cooccurrence
15.4.1 Zero-Shot Multilabel Behavior Inference
15.4.1.1 Experimental Settings
15.4.1.2 Comparative Evaluation
State-of-the-Art ZSL Models
Context-Aware Multilabel ZSL Models
15.4.2 Transfer Zero-Shot Recognition in Violence Detection
15.4.2.1 Experiment Settings
Zero-Shot Recognition Models
15.4.2.2 Results and Analysis
15.5.2 Qualitative Illustration of Contextual Cooccurrence Prediction
16.2 Metrics in the Literature
16.3.1 Detection Accuracy Measures
16.3.2 Cardinality Driven Measures
16.4.3 Detection Accuracy Measures
16.4.4 Cardinality Driven Measures
17 Realtime Pedestrian Tracking and Prediction in Dense Crowds
17.2.2 Pedestrian Tracking with Motion Models
17.2.3 Path Prediction and Robot Navigation
17.3.1 Realtime Multiperson Tracking
17.4 Mixture Motion Model
17.4.1 Overview and Notations
17.4.2 Particle Filter for Tracking
17.4.3 Parametrized Motion Model
17.4.4 Mixture of Motion Models
17.5 Realtime Pedestrian Path Prediction
17.5.1 Global Movement Pattern
17.5.2 Local Movement Pattern
17.6 Implementation and Results
17.6.1 Pedestrian Tracking
17.6.4 Pedestrian Prediction
17.6.6 Long-Term Prediction Accuracy
17.6.7 Varying the Pedestrian Density
17.6.8 Comparison with Prior Methods