cuPTW: Leveraging Idle Compute Units for Massively Parallel GPU Page Table Walks

Abstract
Virtual memory has become a cornerstone of modern GPUs, enabling unified address spaces and advanced memory management techniques. However, the performance of address translation has emerged as a critical bottleneck, particularly under irregular workloads with massive memory footprints, where frequent TLB misses and costly page table walks dominate total memory access latency. Prior work has primarily focused on improving TLB effectiveness or optimizing page table walks through batching and coalescing, but these approaches remain limited by the lack of locality and memory bandwidth constraints. In this work, we propose Compute Unit Page Table Walk (cuPTW), a novel address translation architecture that repurposes idle GPU compute resources to accelerate page table walks. Specifically, we introduce (1) a single-threaded synchronous cuPTW, which offloads translation requests to idle functional units within compute units, and (2) two optimizations that further reduce latency and improve throughput by caching page table walks in local data store memory (cuPTW-SW) and parallelizing them across multiple SIMD lanes (cuPTW-MT). When combined, cuPTW-Full transforms low-parallelism page table walks into a massively parallel computation task. Our evaluation across 15 representative GPU workloads, including deep learning, graph analytics and scientific simulations, demonstrates that cuPTW-Full achieves a performance speedup of 4.43× on average (up to 76.09×) by improving page table walk throughput by 9.92× on average over our baseline GPU architecture. Compared to state-of-the-art GPU address translation proposals, cuPTW-Full achieves a 1.97× to 2.08× average speedup.
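To make the abstract's central idea concrete, the following is a minimal sketch of a radix page table walk and a lockstep batch of walks, one per "lane". It is an illustration only, not the cuPTW design: the 4-level layout, the 9 index bits per level, the 4 KiB page size, and the names `level_index`, `map_page`, `walk`, and `walk_batch` are all assumptions made for this example.

```python
from typing import Dict, List, Optional

# Assumed layout for illustration only (not taken from the paper).
LEVELS = 4            # depth of the radix page table
BITS_PER_LEVEL = 9    # 512 entries per table
PAGE_SHIFT = 12       # 4 KiB pages
INDEX_MASK = (1 << BITS_PER_LEVEL) - 1
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def level_index(vaddr: int, level: int) -> int:
    """Return the table index used at `level` (0 is the root)."""
    shift = PAGE_SHIFT + BITS_PER_LEVEL * (LEVELS - 1 - level)
    return (vaddr >> shift) & INDEX_MASK


def map_page(root: Dict, vaddr: int, frame: int) -> None:
    """Install a virtual-page to physical-frame mapping, creating tables as needed."""
    node = root
    for level in range(LEVELS - 1):
        node = node.setdefault(level_index(vaddr, level), {})
    node[level_index(vaddr, LEVELS - 1)] = frame


def walk(root: Dict, vaddr: int) -> Optional[int]:
    """Walk one address level by level; each step depends on the previous one."""
    node = root
    for level in range(LEVELS):
        node = node.get(level_index(vaddr, level))
        if node is None:          # unmapped: a real system would raise a page fault
            return None
    return (node << PAGE_SHIFT) | (vaddr & OFFSET_MASK)


def walk_batch(root: Dict, vaddrs: List[int]) -> List[Optional[int]]:
    """Advance many independent walks in lockstep, one table level per step.

    Each list position stands in for one lane; within a step, the lookups
    are independent of each other, which is what makes them parallelizable.
    """
    nodes: List = [root] * len(vaddrs)
    for level in range(LEVELS):
        nodes = [
            None if n is None else n.get(level_index(va, level))
            for n, va in zip(nodes, vaddrs)
        ]
    return [
        None if n is None else (n << PAGE_SHIFT) | (va & OFFSET_MASK)
        for n, va in zip(nodes, vaddrs)
    ]
```

A single walk is a chain of dependent lookups, so it offers little parallelism on its own; the batch version shows how many outstanding walks can proceed side by side, since the lookups at each level do not depend on one another.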
Publication
In Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 10, Issue 2

Authors
Zihang Chen
(he/him)
PhD student at HKUST(GZ)
Zihang Chen is a third-year PhD student in microelectronics at The Hong Kong University of Science and Technology (Guangzhou), supervised by Prof. Jiayi Huang and Prof. Hongyuan Liu. He received his bachelor's degree in Computer Science and Technology from Nanjing University of Science and Technology in 2019. His research interests include computer architecture, microprocessor design, microarchitecture exploration, CUDA programming, and machine learning systems. He previously interned with the XiangShan group at the Beijing Institute of Open Source Chip, where he worked on high-performance RISC-V processors and implemented novel architectures on simulators. He is currently a visiting student at Ghent University, supervised by Prof. Lieven Eeckhout.