cuPTW: Leveraging Idle Compute Units for Massively Parallel GPU Page Table Walks

Jun 10, 2026

Zihang Chen

Tianao Ge

Lieven Eeckhout

Hongyuan Liu

Jiayi Huang



Abstract

Virtual memory has become a cornerstone of modern GPUs, enabling unified address spaces and advanced memory management techniques. However, the performance of address translation has emerged as a critical bottleneck, particularly under irregular workloads with massive memory footprints, where frequent TLB misses and costly page table walks dominate total memory access latency. Prior work has primarily focused on improving TLB effectiveness or optimizing page table walks through batching and coalescing, but these approaches remain limited by the lack of locality and memory bandwidth constraints. In this work, we propose Compute Unit Page Table Walk (cuPTW), a novel address translation architecture that repurposes idle GPU compute resources to accelerate page table walks. Specifically, we introduce (1) a single-threaded synchronous cuPTW, which offloads translation requests to idle functional units within compute units, and (2) two optimizations that further reduce latency and improve throughput by caching page table walks in local data store memory (cuPTW-SW) and parallelizing them across multiple SIMD lanes (cuPTW-MT). When combined, cuPTW-Full transforms low-parallelism page table walks into a massively parallel computation task. Our evaluation across 15 representative GPU workloads, including deep learning, graph analytics and scientific simulations, demonstrates that cuPTW-Full achieves a performance speedup of 4.43× on average (up to 76.09×) by improving page table walk throughput by 9.92× on average over our baseline GPU architecture. Compared to state-of-the-art GPU address translation proposals, cuPTW-Full achieves a 1.97× to 2.08× average speedup.
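To make the terms in the abstract concrete, the following is a minimal software sketch of a radix page table walk, a small walk cache, and a batch of independent walks. It is not the cuPTW design. The 4-level layout with 9 index bits per level and 4 KiB pages is an assumption borrowed from common x86-64-style tables, and `PageTable`, `walk`, `walk_batch`, and the dictionary-based walk cache are hypothetical names chosen for illustration. The batch loop runs sequentially here and merely stands in for independent walks that hardware could spread across SIMD lanes.

```python
from typing import Dict, List, Optional, Tuple

# Assumed layout (not taken from the paper): 4 levels, 9 index bits each, 4 KiB pages.
LEVELS = 4
INDEX_BITS = 9
PAGE_SHIFT = 12


def level_index(vaddr: int, level: int) -> int:
    """Return the table index used at `level` (0 is the root)."""
    shift = PAGE_SHIFT + INDEX_BITS * (LEVELS - 1 - level)
    return (vaddr >> shift) & ((1 << INDEX_BITS) - 1)


class PageTable:
    """Hypothetical radix page table built from nested dictionaries."""

    def __init__(self) -> None:
        self.root: Dict[int, object] = {}

    def map(self, vaddr: int, frame: int) -> None:
        """Map the page containing `vaddr` to physical frame number `frame`."""
        node = self.root
        for level in range(LEVELS - 1):
            node = node.setdefault(level_index(vaddr, level), {})
        node[level_index(vaddr, LEVELS - 1)] = frame


def walk(pt: PageTable, vaddr: int,
         walk_cache: Dict[int, dict]) -> Tuple[Optional[int], int]:
    """Translate `vaddr`; return (physical address or None, table accesses)."""
    # The walk cache maps the upper-level address prefix to the leaf table,
    # so a hit skips the three upper-level accesses.
    prefix = vaddr >> (PAGE_SHIFT + INDEX_BITS)
    accesses = 0
    node = walk_cache.get(prefix)
    if node is None:
        node = pt.root
        for level in range(LEVELS - 1):
            accesses += 1  # one dependent access per upper level
            node = node.get(level_index(vaddr, level))
            if node is None:
                return None, accesses  # unmapped: translation fault
        walk_cache[prefix] = node
    accesses += 1  # leaf-level access
    frame = node.get(level_index(vaddr, LEVELS - 1))
    if frame is None:
        return None, accesses
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    return (frame << PAGE_SHIFT) | offset, accesses


def walk_batch(pt: PageTable,
               vaddrs: List[int]) -> List[Tuple[Optional[int], int]]:
    """Walk a batch of addresses that share one walk cache.

    The walks are independent of each other, which is the property that lets
    a parallel design issue many of them at once; this loop is sequential.
    """
    walk_cache: Dict[int, dict] = {}
    return [walk(pt, v, walk_cache) for v in vaddrs]
```

In this sketch, two addresses that fall in the same 2 MiB region share a walk-cache entry, so the second walk needs one table access instead of four. Each walk is a chain of dependent accesses, while separate walks do not depend on one another, which is the kind of parallelism the abstract refers to.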

Type

Conference paper

Publication

In Proceedings of the ACM on Measurement and Analysis of Computing Systems, Volume 10, Issue 2

Last updated on Jun 10, 2026

GPU Architecture

Authors

Zihang Chen (he/him)

PhD student at HKUST(GZ)

Zihang Chen is a PhD student in microelectronics at The Hong Kong University of Science and Technology (Guangzhou). He received his bachelor's degree in Computer Science and Technology from Nanjing University of Science and Technology in 2019. His research interests include computer architecture, microprocessor design, microarchitecture exploration, CUDA programming, and machine learning systems. He previously interned with the XiangShan group at the Beijing Institute of Open Source Chip, where he worked on developing a high-performance RISC-V processor and implemented novel architectures on simulators. He is now a third-year PhD student, supervised by Prof. Jiayi Huang and Prof. Hongyuan Liu, and is currently a visiting student at Ghent University, supervised by Prof. Lieven Eeckhout.

