Chapter 1: Formal analysis techniques for reliable GPU programming
Warps and lock-step execution
3 Correctness issues in GPU programming
Lack of forward progress guarantees
4 The need for effective tools
4.1 A Taxonomy of Current Tools
4.2 Canonical Schedules and the Two-Thread Reduction
Race freedom implies determinism
Detecting races: "all for one and one for all"
Restricting to a canonical schedule
Reduction to a pair of threads
4.3 Symbolic Bug-Finding Case Study: GKLEE
4.4 Verification Case Study: GPUVerify
GPUs will become more pervasive
Current tools show promise
Solving basic correctness issues
Clarity from vendors and standards bodies
Chapter 2: SnuCL: A unified OpenCL framework for heterogeneous clusters
3 Overview of SnuCL framework
3.1 Limitations of OpenCL
3.4.1 Processing synchronization commands
4 Memory management in SnuCL Cluster
4.1 Space Allocation to Memory Objects
4.2 Minimizing Copying Overhead
4.3 Processing Memory Commands
4.4 Consistency Management
4.5 Detecting Memory Objects Written by a Kernel
5 SnuCL extensions to OpenCL
6.1 Evaluation Methodology
6.2.1 Scalability on the medium-scale GPU cluster
6.2.2 Scalability on the large-scale CPU cluster
Chapter 3: Thread communication and synchronization on massively parallel GPUs
2 Coarse-Grained Communication and Synchronization
2.1 Global Barrier at the Kernel Level
2.2 Local Barrier at the Work-Group Level
2.3 Implicit Barrier at the Wavefront Level
3 Built-In Atomic Functions on Regular Variables
4 Fine-Grained Communication and Synchronization
4.1 Memory Consistency Model
4.1.1 Sequential consistency
4.1.2 Relaxed consistency
4.2 The OpenCL 2.0 Memory Model
4.2.1 Relationships between two memory operations
4.2.2 Special atomic operations and stand-alone memory fence
4.2.3 Release and acquire semantics
4.2.4 Memory order parameters
4.2.5 Memory scope parameters
5 Conclusion and Future Research Direction
Chapter 4: Software-level task scheduling on GPUs
1 Introduction, Problem Statement, and Context
2 Nondeterministic behaviors caused by the hardware
2.3 P3: Nondeterministic Across Runs
2.4 P4: Oblivious to Nonuniform Data Sharing
2.5 P5: Serializing Multikernel Co-Runs
3 SM-centric transformation
3.1.1 SM-centric task selection
3.1.2 Filling-retreating scheme
4 Scheduling-enabled optimizations
4.2 Affinity-Based Scheduling
4.3 SM Partitioning for Multi-Kernel Co-Runs
5 Other scheduling work on GPUs
6 Conclusion and future work
Chapter 5: Data placement on GPUs
3 Memory specification through MSL
4.1 Deriving Data Access Patterns: A Hybrid Approach
4.1.1 Reuse distance model
4.1.2 Staging code to be agnostic to placement
5.1 Lightweight Performance Model
5.2 Finding the Best Placements
5.2.1 Dealing with phases
6.1 Results With Irregular Benchmarks
6.2 Results With Regular Benchmarks
Part 2: Algorithms and applications
Chapter 6: Biological sequence analysis on GPU
2 Pairwise Sequence Comparison and Sequence-Profile Comparison
2.1 Pairwise Sequence Comparison
2.1.1 Smith-Waterman algorithm
Phase 1: Create the similarity matrix
Phase 2: Obtain the best local alignment
Smith-Waterman variations
2.1.2 Basic Local Alignment Search Tool
2.2 Sequence-Profile Comparison
2.2.1 Hidden Markov models
2.2.2 The Viterbi algorithm
3 Design Aspects of GPU Solutions for Biological Sequence Analysis
3.1 Task Parallelism vs Data Parallelism
3.2 Sorting Sequence Database to Achieve Load Balance
3.3 Use of GPU Memory Hierarchy
3.4 GPU Solution Used as a Filter
4 GPU Solutions for Pairwise Sequence Comparison
4.1 GPU Solutions Using Exact Algorithms
4.1.1 Manavski and Valle [27]
4.1.2 Ligowski and Rudnicki [38]
4.1.3 Błażewicz et al. [29]
4.1.7 Sandes and de Melo [39-41] and Sandes et al. [42]
4.1.8 Comparative overview
4.2 GPU Solutions Using BLAST
4.2.1 Vouzis and Sahinidis [31]
4.2.4 Comparative overview
5 GPU Solutions for Sequence-Profile Comparison
5.1 GPU Solutions Using the Viterbi Algorithm
5.1.3 Walters et al. [33]
5.1.5 Ganesan et al. [49]
5.1.6 Ferraz and Moreano [36]
5.2 GPU Solutions Using the MSV Algorithm
5.2.2 Cheng and Butler [37]
5.2.3 Araújo Neto and Moreano [50]
6 Conclusion and Perspectives
Chapter 7: Graph algorithms on GPUs
1 Graph representation for GPUs
2 Graph traversal algorithms: the breadth-first search (BFS)
2.1 The Frontier-Based Parallel Implementation of BFS
3 The single-source shortest path (SSSP) problem
3.1 The SSSP Implementations for GPUs
3.2 H-BF: An Efficient Implementation of the Bellman-Ford Algorithm
4.1 The APSP Implementations for GPUs
5 Load Balancing and Memory Accesses: Issues and Management Techniques
5.1 Static Mapping Techniques
5.1.1 Work-items to threads
5.2 Semidynamic Mapping Techniques
5.2.1 Dynamic virtual warps + dynamic parallelism
5.3 Dynamic Mapping Techniques
5.4 The Multiphase Search Technique
Chapter 8: GPU alignment of two and three sequences
1.2 Alignment of Three Sequences
3.1 Smith-Waterman Algorithm
3.2 Computing the Score of the Best Local Alignment
3.3 Computing the Best Local Alignment
3.3.4 Memory requirements
4 Alignment of three sequences
4.1 Three-Sequence Alignment Algorithm
4.2 Computing the Score of the Best Alignment
GPU computational strategy
4.3 Computing the Best Alignment
4.4.1 Computing the score of the best alignment
4.4.2 Computing the alignment
Chapter 9: Augmented block Cimmino distributed algorithm for solving tridiagonal systems on GPU
2 ABCD Solver for tridiagonal systems
3 GPU implementation and optimization
3.1 QR Method and Givens Rotation
3.2 Sparse Storage Format
3.3 Coalesced Memory Access
3.4.1 Padding of the augmented matrix
3.4.2 Padding for Givens rotation
4.1 Comparison With CPU Implementation
4.2 Speedup by Memory Coalescing
5 Conclusion and future work
Chapter 10: GPU computing applied to linear and mixed-integer programming
2 Operations Research in Practice
3 Exact Optimization Algorithms
3.2.2 Multiple-choice knapsack problem
3.3.2 Flow-shop scheduling problem
3.3.3 Traveling salesman problem
4.1.1 The traveling salesman problem
4.1.2 Scheduling problems
4.2 Ant Colony Optimization
4.4.1 Deep greedy switching
4.4.2 Multiobjective local search
4.4.3 Constructive heuristic
Chapter 11: GPU-accelerated shortest paths
1.2 Shortest Path Problems
1.3 Floyd-Warshall's APSP
1.5 Implementation Challenges in a Distributed GPU Environment
3.2 All-Pairs Shortest Path Algorithms
3.2.1 A centralized approach for graphs with negative weights
3.2.2 A decentralized approach for better scaling and improved memory distribution
3.3 Single-Pair Shortest Path Query Algorithm
3.4 Fast Shortest Path Primitives on GPU
3.4.1 In situ tropical product
3.4.2 Ex situ tropical product
4 Computational Complexity Analysis
4.1 Analysis of the Centralized APSP
4.2 Analysis of the Decentralized APSP
4.3 Analysis of the SPSP Query Algorithm
5 Experiments and Results
5.1 Partitioned APSP With Real Edge Weights
5.2 Better-Scaling Partitioned APSP Using Dijkstra
5.3 SPSP Query Using Dijkstra
5.4 Discussion and Perspectives
Chapter 12: GPU sorting algorithms
1.2 CUDA Programming Model
1.4 Computation Models for GPUs
2 Generic Programming Strategies for GPU
3.4 Other Sorting Algorithms
3.6 Sorting Algorithms for Large Data Sets
3.7 Comparison of Sorting Methods
Chapter 13: A floating-point compression algorithm for GPUs
2.2 Measuring Throughput and Energy
2.4 Algorithmic Components
3.1 Synthesis Analysis and Derivation of MPC
3.1.1 Observations about individual algorithms
3.1.2 Observations about the Best algorithm
3.1.3 Derivation of the MPC algorithm
3.3 Compression and Decompression Speed
3.4 Compression and Decompression Energy
4 Summary and Conclusions
Chapter 14: Adaptive sparse matrix representation for efficient matrix-vector multiplication
2 Sparse matrix-vector multiplication
3 GPU architecture and programming model
3.1 Hardware Architectures
4 Optimization principles for SpMV
5 Platform (Adaptive Runtime System)
6.1 Parameter Setting Experiments
6.2 Adaptive Runtime System
Part 3: Architecture and performance
Chapter 15: A framework for accelerating bottlenecks in GPU execution with assist warps
Use of CABA for compression
Unutilized compute resources
Unutilized on-chip memory
4.2 Design of the CABA Framework
4.2.1 Hardware-based management of threads
4.2.2 Register file/shared memory allocation
4.2.3 Programmer/developer interface
4.3 Main Hardware Additions
Dynamic feedback and throttling
Communication and control
4.5 Applications of the CABA Framework
5 A Case for CABA: Data Compression
5.1 Mapping Compression Algorithms Into Assist Warps
5.1.2 Mapping BDI to CABA
5.1.3 Implementing other algorithms
5.1.4 Implementing the FPC algorithm
Implementing the C-Pack algorithm
5.2 Walkthrough of CABA-Based Compression
5.2.1 The decompression mechanism
5.2.2 The compression mechanism
5.3 Realizing Data Compression
5.3.1 Initial setup and profiling
5.3.2 Memory controller changes
7.1 Effect on Performance and Bandwidth Utilization
7.3 Effect of Enabling Different Compression Algorithms
7.4 Sensitivity to Peak Main Memory Bandwidth
7.5 Selective Cache Compression With CABA
8 Other Uses of the CABA Framework
Speculative precomputation
Handling interrupts and exceptions
Profiling and instrumentation
Memory latency and bandwidth optimizations in GPUs
Chapter 16: Accelerating GPU accelerators through neural algorithmic transformation
Why hardware acceleration?
Why not reuse CPU neural accelerators?
2 Neural transformation for GPUs
2.1 Safe Programming Interface
3 Instruction-set-architecture design
4 Neural accelerator: design and integration
4.1 Integrating the Neural Accelerator Into GPUs
4.2 Executing Neurally Transformed Threads
4.3 Orchestrating Neurally Enhanced SIMD Lanes
5 Controlling quality trade-offs
6.1 Applications and Neural Transformation
Evaluation/training data sets
Cycle-accurate simulations
Energy modeling and overhead
Performance and energy benefits
Opportunity for further improvements
Sensitivity to accelerator speed
Sensitivity to off-chip bandwidth
Controlling quality trade-offs
Comparison with prior CPU neural acceleration
Chapter 17: Need for heterogeneous NoC architectures with GPGPUs: A case study with photonic interconnects
2.1 Characterization of Traffic Demands of Heterogeneous Multicore Chips
2.2 Recent Photonic NoC Architectures
2.3 Waveguides, Laser Sources, and Optical Couplers
2.4 Microring Resonator (MRR)
3 The Need for Heterogeneous Interconnections
3.1 Photonic Interconnect Architecture Model
4 Characterization of GPGPU Performance
4.2 Speedup Analysis With Varying Bandwidth
4.4 Bandwidth Selection Policy
4.5 Case Study With DBA in a Heterogeneous Multicore Chip With a Photonic NoC
4.5.2 Performance evaluation of the d-HetPNoC
4.5.3 Case studies with synthetic and real application-based traffic patterns
Chapter 18: Accurately modeling GPGPU frequency scaling with the CRISP performance model
2 Motivation and related work
2.1 CPU-Based Performance Models for DVFS
Memory/computation overlap
Complex stall classification
3 GPGPU DVFS performance model
3.1 Critical Stalled Path
3.3 Compute/Store Path Portion
3.5 Parameterizing the Model
3.6 Hardware Mechanism and Overhead
5.1 Execution Time Prediction
5.3 Simplified CRISP Model
Part 4: Power and reliability
Chapter 19: Energy and power considerations of GPUs
1.1 Compute- and Memory-Bound Codes on GPUs
1.2 Regular and Irregular Codes on GPUs
2.2 System, Compiler, and Power Measurement
3 Power profiling of regular and irregular programs
3.1 Idealized Power Profile
3.2 Power Profiles of Regular Codes
3.3 Power Profiles of Irregular Codes
3.4 Comparing Expected, Regular, and Irregular Profiles
4 Affecting power and energy on GPUs
4.1 Algorithm Implementation
4.3 Core and Memory Clock-Frequency Scaling
4.5 Source-Code Optimizations
4.5.1 None vs all optimizations
4.5.2 Effectiveness of optimizations
4.5.3 Lowest and highest settings
4.5.4 Energy efficiency vs performance
4.5.5 Most biased optimizations
Chapter 20: Architecting the last-level cache for GPUs using STT-MRAM nonvolatile memory
2.1 GPU and Its Memory Hierarchy
4 Two-Part L2 Cache Architecture
4.2 The Proposed Two-Part Cache Architecture
5 Dynamic Write Threshold Detection Mechanism
LR cache size and structure
LR and HR cache retention times
7.2 Parallel vs Sequential Search Mechanism
Chapter 21: Power management of mobile GPUs
2 GPU Power Management for Mobile Games
2.1 State-of-the-Art Techniques
Texture-directed gaming governor
Control-theoretic gaming governor
Regression-based gaming governor
2.3 Power-Performance Trade-Off
3 GPU Power Management for GPGPU Applications
3.1 State-of-the-Art Techniques
3.2.1 Hardware environment
3.2.2 Software environment
OpenCL code partitioning across CPU and GPU
3.3 GPGPU Applications on Mobile GPUs
3.4 DVFS for Improving Power/Energy Efficiency
3.5 Design Space Exploration
3.5.1 Work-group size manipulation
3.5.2 Execution time and power estimation
Concurrent execution and effect of memory contention
CPU-GPU partitioning ratio
4.1 Open Problems for Gaming Applications
4.2 Open Problems for GPGPU Applications
Chapter 22: Advances in GPU reliability research
1.1 Radiation-Induced Soft Errors
1.2 GPUs as HPC Coprocessors
1.3 Evidence of Soft Errors in HPC Systems
2 Evaluating GPU reliability
2.1 Architectural and Program Vulnerability Analysis
2.3 Accelerated Particle Beam Testing
3 Hardware reliability enhancements
3.1 Protecting SRAM Structures
3.2 Protecting Pipeline Structures
3.3 Taking Advantage of Underutilization
4 Software reliability enhancements
4.1 Redundant Multithreading
4.2 Symptom-Based Fault Detection and Dynamic Reliability Management
Chapter 23: Addressing hardware reliability challenges in general-purpose GPUs
3 Modeling and Characterizing GPGPUs' Reliability in the Presence of Soft Errors [25]
3.1 Microarchitecture-Level Soft-Error Vulnerability Analysis
3.2 Soft-Error Vulnerability Modeling in GPGPU Microarchitecture
3.3 Soft-Error Vulnerability Analysis on GPGPUs
3.3.1 Experimental methodologies
3.3.2 Soft-error vulnerability of the GPGPU microarchitecture structures
Analysis of structures' AVF in the SM
Vulnerability variations at streaming multiprocessor level
3.3.3 Impact of architecture optimizations on GPGPU microarchitecture-level soft-error vulnerability
4 RISE: Improving the Streaming Processors' Reliability Against Soft Errors in GPGPUs [36]
4.1.1 Request-pending-aware Full-RISE
The observation of resource contention among memory requests
The concept of request-pending-aware Full-RISE
4.1.2 Idle-SM-aware Full-RISE
4.1.3 The implementation of Full-RISE
4.2.1 The concept of Partial-RISE
4.2.2 The implementation of Partial-RISE
4.3 RISE: Putting It All Together
4.4.2 Effectiveness of RISE
4.4.3 Sensitivity analysis
5 Mitigating the Susceptibility of GPGPUs to PVs [43]
5.1 Background: Highly Banked Register File in GPGPUs
5.2 Frequency Optimization for GPGPUs' Register File Under PVs
5.2.1 Modeling PV impact on GPGPUs' register file
5.2.2 Variable-latency subbanks (VL-SB) in GPGPUs' register file
Extending VL-RF to GPGPUs' RF
Variable-latency subbanks (VL-SB) in RF
5.2.3 Register file bank reorganization
5.2.5 Feasibility in alternative register file architecture
5.3 Mitigating the IPC Degradation Under VL-SB and RF-BRO
5.3.1 Register mapping under VL-SB and RF-BRO
5.3.2 Fast-bank-aware register mapping
5.3.3 Fast-warp-aware scheduling (FWAS) policy
Evaluation of VL-SB with RF-BRO
5.4.2 Overall performance improvement