Chapter 1: Formal analysis techniques for reliable GPU programming
Warps and lock-step execution
3 Correctness issues in GPU programming
Lack of forward progress guarantees
4 The need for effective tools
4.1 A Taxonomy of Current Tools
4.2 Canonical Schedules and the Two-Thread Reduction
Race freedom implies determinism
Detecting races: "all for one and one for all"
Restricting to a canonical schedule
Reduction to a pair of threads
4.3 Symbolic Bug-Finding Case Study: GKLEE
4.4 Verification Case Study: GPUVerify
GPUs will become more pervasive
Current tools show promise
Solving basic correctness issues
Clarity from vendors and standards bodies
Chapter 2: SnuCL: A unified OpenCL framework for heterogeneous clusters
3 Overview of SnuCL framework
3.1 Limitations of OpenCL
3.4.1 Processing synchronization commands
4 Memory management in SnuCL Cluster
4.1 Space Allocation to Memory Objects
4.2 Minimizing Copying Overhead
4.3 Processing Memory Commands
4.4 Consistency Management
4.5 Detecting Memory Objects Written by a Kernel
5 SnuCL extensions to OpenCL
6.1 Evaluation Methodology
6.2.1 Scalability on the medium-scale GPU cluster
6.2.2 Scalability on the large-scale CPU cluster
Chapter 3: Thread communication and synchronization on massively parallel GPUs
2 Coarse-Grained Communication and Synchronization
2.1 Global Barrier at the Kernel Level
2.2 Local Barrier at the Work-Group Level
2.3 Implicit Barrier at the Wavefront Level
3 Built-In Atomic Functions on Regular Variables
4 Fine-Grained Communication and Synchronization
4.1 Memory Consistency Model
4.1.1 Sequential consistency
4.1.2 Relaxed consistency
4.2 The OpenCL 2.0 Memory Model
4.2.1 Relationships between two memory operations
4.2.2 Special atomic operations and stand-alone memory fence
4.2.3 Release and acquire semantics
4.2.4 Memory order parameters
4.2.5 Memory scope parameters
5 Conclusion and Future Research Direction
Chapter 4: Software-level task scheduling on GPUs
1 Introduction, Problem Statement, and Context
2 Nondeterministic behaviors caused by the hardware
2.3 P3: Nondeterministic Across Runs
2.4 P4: Oblivious to Nonuniform Data Sharing
2.5 P5: Serializing Multikernel Co-Runs
3 SM-centric transformation
3.1.1 SM-centric task selection
3.1.2 Filling-retreating scheme
4 Scheduling-enabled optimizations
4.2 Affinity-Based Scheduling
4.3 SM Partitioning for Multi-Kernel Co-Runs
5 Other scheduling work on GPUs
6 Conclusion and future work
Chapter 5: Data placement on GPUs
3 Memory specification through MSL
4.1 Deriving Data Access Patterns: A Hybrid Approach
4.1.1 Reuse distance model
4.1.2 Staging code to be agnostic to placement
5.1 Lightweight Performance Model
5.2 Finding the Best Placements
5.2.1 Dealing with phases
6.1 Results With Irregular Benchmarks
6.2 Results With Regular Benchmarks
Part 2: Algorithms and applications
Chapter 6: Biological sequence analysis on GPU
2 Pairwise Sequence Comparison and Sequence-Profile Comparison
2.1 Pairwise Sequence Comparison
2.1.1 Smith-Waterman algorithm
Phase 1: Create the similarity matrix
Phase 2: Obtain the best local alignment
Smith-Waterman variations
2.1.2 Basic Local Alignment Search Tool
2.2 Sequence-Profile Comparison
2.2.1 Hidden Markov models
2.2.2 The Viterbi algorithm
3 Design Aspects of GPU Solutions for Biological Sequence Analysis
3.1 Task Parallelism vs Data Parallelism
3.2 Sorting Sequence Database to Achieve Load Balance
3.3 Use of GPU Memory Hierarchy
3.4 GPU Solution Used as a Filter
4 GPU Solutions for Pairwise Sequence Comparison
4.1 GPU Solutions Using Exact Algorithms
4.1.1 Manavski and Valle [27]
4.1.2 Ligowski and Rudnicki [38]
4.1.3 Błażewicz et al. [29]
4.1.7 Sandes and de Melo [39-41] and Sandes et al. [42]
4.1.8 Comparative overview
4.2 GPU Solutions Using BLAST
4.2.1 Vouzis and Sahinidis [31]
4.2.4 Comparative overview
5 GPU Solutions for Sequence-Profile Comparison
5.1 GPU Solutions Using the Viterbi Algorithm
5.1.3 Walters et al. [33]
5.1.5 Ganesan et al. [49]
5.1.6 Ferraz and Moreano [36]
5.2 GPU Solutions Using the MSV Algorithm
5.2.2 Cheng and Butler [37]
5.2.3 Araújo Neto and Moreano [50]
6 Conclusion and Perspectives
Chapter 7: Graph algorithms on GPUs
1 Graph representation for GPUs
2 Graph traversal algorithms: the breadth-first search (BFS)
2.1 The Frontier-Based Parallel Implementation of BFS
3 The single-source shortest path (SSSP) problem
3.1 The SSSP Implementations for GPUs
3.2 H-BF: An Efficient Implementation of the Bellman-Ford Algorithm
4.1 The APSP Implementations for GPUs
5 Load Balancing and Memory Accesses: Issues and Management Techniques
5.1 Static Mapping Techniques
5.1.1 Work-items to threads
5.2 Semidynamic Mapping Techniques
5.2.1 Dynamic virtual warps + dynamic parallelism
5.3 Dynamic Mapping Techniques
5.4 The Multiphase Search Technique
Chapter 8: GPU alignment of two and three sequences
1.2 Alignment of Three Sequences
3.1 Smith-Waterman Algorithm
3.2 Computing the Score of the Best Local Alignment
3.3 Computing the Best Local Alignment
3.3.4 Memory requirements
4 Alignment of three sequences
4.1 Three-Sequence Alignment Algorithm
4.2 Computing the Score of the Best Alignment
GPU computational strategy
4.3 Computing the Best Alignment
4.4.1 Computing the score of the best alignment
4.4.2 Computing the alignment
Chapter 9: Augmented block Cimmino distributed algorithm for solving tridiagonal systems on GPU
2 ABCD Solver for tridiagonal systems
3 GPU implementation and optimization
3.1 QR Method and Givens Rotation
3.2 Sparse Storage Format
3.3 Coalesced Memory Access
3.4.1 Padding of the augmented matrix
3.4.2 Padding for Givens rotation
4.1 Comparison With CPU Implementation
4.2 Speedup by Memory Coalescing
5 Conclusion and future work
Chapter 10: GPU computing applied to linear and mixed-integer programming
2 Operations Research in Practice
3 Exact Optimization Algorithms
3.2.2 Multiple-choice knapsack problem
3.3.2 Flow-shop scheduling problem
3.3.3 Traveling salesman problem
4.1.1 The traveling salesman problem
4.1.2 Scheduling problems
4.2 Ant Colony Optimization
4.4.1 Deep greedy switching
4.4.2 Multiobjective local search
4.4.3 Constructive heuristic
Chapter 11: GPU-accelerated shortest paths
1.2 Shortest Path Problems
1.3 Floyd-Warshall's APSP
1.5 Implementation Challenges in a Distributed GPU Environment
3.2 All-Pairs Shortest Path Algorithms
3.2.1 A centralized approach for graphs with negative weights
3.2.2 A decentralized approach for better scaling and improved memory distribution
3.3 Single-Pair Shortest Path Query Algorithm
3.4 Fast Shortest Path Primitives on GPU
3.4.1 In situ tropical product
3.4.2 Ex situ tropical product
4 Computational Complexity Analysis
4.1 Analysis of the Centralized APSP
4.2 Analysis of the Decentralized APSP
4.3 Analysis of the SPSP Query Algorithm
5 Experiments and Results
5.1 Partitioned APSP With Real Edge Weights
5.2 Better-Scaling Partitioned APSP Using Dijkstra
5.3 SPSP Query Using Dijkstra
5.4 Discussion and Perspectives
Chapter 12: GPU sorting algorithms
1.2 CUDA Programming Model
1.4 Computation Models for GPUs
2 Generic Programming Strategies for GPU
3.4 Other Sorting Algorithms
3.6 Sorting Algorithms for Large Data Sets
3.7 Comparison of Sorting Methods
Chapter 13: A floating-point compression algorithm for GPUs
2.2 Measuring Throughput and Energy
2.4 Algorithmic Components
3.1 Synthesis Analysis and Derivation of MPC
3.1.1 Observations about individual algorithms
3.1.2 Observations about the Best algorithm
3.1.3 Derivation of the MPC algorithm
3.3 Compression and Decompression Speed
3.4 Compression and Decompression Energy
4 Summary and Conclusions
Chapter 14: Adaptive sparse matrix representation for efficient matrix-vector multiplication
2 Sparse matrix-vector multiplication
3 GPU architecture and programming model
3.1 Hardware Architectures
4 Optimization principles for SpMV
5 Platform (Adaptive Runtime System)
6.1 Parameter Setting Experiments
6.2 Adaptive Runtime System
Part 3: Architecture and performance
Chapter 15: A framework for accelerating bottlenecks in GPU execution with assist warps
Use of CABA for compression
Unutilized compute resources
Unutilized on-chip memory
4.2 Design of the CABA Framework
4.2.1 Hardware-based management of threads
4.2.2 Register file/shared memory allocation
4.2.3 Programmer/developer interface
4.3 Main Hardware Additions
Dynamic feedback and throttling
Communication and control
4.5 Applications of the CABA Framework
5 A Case for CABA: Data Compression
5.1 Mapping Compression Algorithms Into Assist Warps
5.1.2 Mapping BDI to CABA
5.1.3 Implementing other algorithms
5.1.4 Implementing the FPC algorithm
Implementing the C-Pack algorithm
5.2 Walkthrough of CABA-Based Compression
5.2.1 The decompression mechanism
5.2.2 The compression mechanism
5.3 Realizing Data Compression
5.3.1 Initial setup and profiling
5.3.2 Memory controller changes
7.1 Effect on Performance and Bandwidth Utilization
7.3 Effect of Enabling Different Compression Algorithms
7.4 Sensitivity to Peak Main Memory Bandwidth
7.5 Selective Cache Compression With CABA
8 Other Uses of the CABA Framework
Speculative precomputation
Handling interrupts and exceptions
Profiling and instrumentation
Memory latency and bandwidth optimizations in GPUs
Chapter 16: Accelerating GPU accelerators through neural algorithmic transformation
Why hardware acceleration?
Why not reuse CPU neural accelerators?
2 Neural transformation for GPUs
2.1 Safe Programming Interface
3 Instruction-set-architecture design
4 Neural accelerator: design and integration
4.1 Integrating the Neural Accelerator Into GPUs
4.2 Executing Neurally Transformed Threads
4.3 Orchestrating Neurally Enhanced SIMD Lanes
5 Controlling quality trade-offs
6.1 Applications and Neural Transformation
Evaluation/training data sets
Cycle-accurate simulations
Energy modeling and overhead
Performance and energy benefits
Opportunity for further improvements
Sensitivity to accelerator speed
Sensitivity to off-chip bandwidth
Controlling quality trade-offs
Comparison with prior CPU neural acceleration
Chapter 17: Need for heterogeneous NoC architectures with GPGPUs: A case study with photonic interconnects
2.1 Characterization of Traffic Demands of Heterogeneous Multicore Chips
2.2 Recent Photonic NoC Architectures
2.3 Waveguides, Laser Sources, and Optical Couplers
2.4 Microring Resonator (MRR)
3 The Need for Heterogeneous Interconnections
3.1 Photonic Interconnect Architecture Model
4 Characterization of GPGPU Performance
4.2 Speedup Analysis With Varying Bandwidth
4.4 Bandwidth Selection Policy
4.5 Case Study With DBA in a Heterogeneous Multicore Chip With a Photonic NoC
4.5.2 Performance evaluation of the d-HetPNoC
4.5.3 Case studies with synthetic and real application-based traffic patterns
Chapter 18: Accurately modeling GPGPU frequency scaling with the CRISP performance model
2 Motivation and related work
2.1 CPU-Based Performance Models for DVFS
Memory/computation overlap
Complex stall classification
3 GPGPU DVFS performance model
3.1 Critical Stalled Path
3.3 Compute/Store Path Portion
3.5 Parameterizing the Model
3.6 Hardware Mechanism and Overhead
5.1 Execution Time Prediction
5.3 Simplified CRISP Model
Part 4: Power and reliability
Chapter 19: Energy and power considerations of GPUs
1.1 Compute- and Memory-Bound Codes on GPUs
1.2 Regular and Irregular Codes on GPUs
2.2 System, Compiler, and Power Measurement
3 Power profiling of regular and irregular programs
3.1 Idealized Power Profile
3.2 Power Profiles of Regular Codes
3.3 Power Profiles of Irregular Codes
3.4 Comparing Expected, Regular, and Irregular Profiles
4 Affecting power and energy on GPUs
4.1 Algorithm Implementation
4.3 Core and Memory Clock-Frequency Scaling
4.5 Source-Code Optimizations
4.5.1 None vs all optimizations
4.5.2 Effectiveness of optimizations
4.5.3 Lowest and highest settings
4.5.4 Energy efficiency vs performance
4.5.5 Most biased optimizations
Chapter 20: Architecting the last-level cache for GPUs using STT-MRAM nonvolatile memory
2.1 GPU and Its Memory Hierarchy
4 Two-Part L2 Cache Architecture
4.2 The Proposed Two-Part Cache Architecture
5 Dynamic Write Threshold Detection Mechanism
LR cache size and structure
LR and HR cache retention times
7.2 Parallel vs Sequential Search Mechanism
Chapter 21: Power management of mobile GPUs
2 GPU Power Management for Mobile Games
2.1 State-of-the-Art Techniques
Texture-directed gaming governor
Control-theoretic gaming governor
Regression-based gaming governor
2.3 Power-Performance Trade-Off
3 GPU Power Management for GPGPU Applications
3.1 State-of-the-Art Techniques
3.2.1 Hardware environment
3.2.2 Software environment
OpenCL code partitioning across CPU and GPU
3.3 GPGPU Applications on Mobile GPUs
3.4 DVFS for Improving Power/Energy Efficiency
3.5 Design Space Exploration
3.5.1 Work-group size manipulation
3.5.2 Execution time and power estimation
Concurrent execution and effect of memory contention
CPU-GPU partitioning ratio
4.1 Open Problems for Gaming Applications
4.2 Open Problems for GPGPU Applications
Chapter 22: Advances in GPU reliability research
1.1 Radiation-Induced Soft Errors
1.2 GPUs as HPC Coprocessors
1.3 Evidence of Soft Errors in HPC Systems
2 Evaluating GPU reliability
2.1 Architectural and Program Vulnerability Analysis
2.3 Accelerated Particle Beam Testing
3 Hardware reliability enhancements
3.1 Protecting SRAM Structures
3.2 Protecting Pipeline Structures
3.3 Taking Advantage of Underutilization
4 Software reliability enhancements
4.1 Redundant Multithreading
4.2 Symptom-Based Fault Detection and Dynamic Reliability Management
Chapter 23: Addressing hardware reliability challenges in general-purpose GPUs
3 Modeling and Characterizing GPGPUs' Reliability in the Presence of Soft Errors [25]
3.1 Microarchitecture-Level Soft-Error Vulnerability Analysis
3.2 Soft-Error Vulnerability Modeling in GPGPU Microarchitecture
3.3 Soft-Error Vulnerability Analysis on GPGPUs
3.3.1 Experimental methodologies
3.3.2 Soft-error vulnerability of the GPGPU microarchitecture structures
Analysis of structures' AVF in the SM
Vulnerability variations at streaming multiprocessor level
3.3.3 Impact of architecture optimizations on GPGPU microarchitecture-level soft-error vulnerability
4 RISE: Improving the Streaming Processors' Reliability Against Soft Errors in GPGPUs [36]
4.1.1 Request-pending-aware Full-RISE
The observation of resource contention among memory requests
The concept of request-pending-aware Full-RISE
4.1.2 Idle-SM-aware Full-RISE
4.1.3 The implementation of Full-RISE
4.2.1 The concept of Partial-RISE
4.2.2 The implementation of Partial-RISE
4.3 RISE: Putting It All Together
4.4.2 Effectiveness of RISE
4.4.3 Sensitivity analysis
5 Mitigating the Susceptibility of GPGPUs to PVs [43]
5.1 Background: Highly Banked Register File in GPGPUs
5.2 Frequency Optimization for GPGPUs' Register File Under PVs
5.2.1 Modeling PV impact on GPGPUs' register file
5.2.2 Variable-latency subbanks (VL-SB) in GPGPUs' register file
Extending VL-RF to GPGPUs' RF
Variable-latency subbanks (VL-SB) in RF
5.2.3 Register file bank reorganization
5.2.5 Feasibility in alternative register file architecture
5.3 Mitigating the IPC Degradation Under VL-SB and RF-BRO
5.3.1 Register mapping under VL-SB and RF-BRO
5.3.2 Fast-bank-aware register mapping
5.3.3 Fast-warp-aware scheduling (FWAS) policy
Evaluation of VL-SB with RF-BRO
5.4.2 Overall performance improvement