Aftermath: Debugging Crashes and TDRs on the GPU

“Device Removed”: the bane of every PC rendering programmer’s existence. “The GPU has crashed and who knows why?” If you’ve said this (or something similar, profanity included), then this blog post is for you!

The Premise

The recent introduction of low-level graphics APIs has presented a new challenge for the developers adopting them: avoid crashing the GPU. With previous high-level graphics APIs, the onus was mostly on the driver to make sure that all resources in use are correctly paged in (otherwise a page fault), that the virtual addresses used are valid (otherwise a segmentation fault), and that dependencies and waits are in order (otherwise a timeout). Put simply, it was the driver’s job not to crash the GPU! With this shift in paradigm we’re seeing an increasing need for the ability to debug the GPU. Much of this need has been met by the updates to Microsoft’s debug layer (especially GPU-Based Validation). However, we still need another line of defence for when an issue slips through the cracks, and there’s still no option for debugging crashes that happen “in the wild” (after a game has shipped).

GPUs are hyper-parallel, closed systems, with thousands of commands running simultaneously in a super-scalar pipeline. If a fault arises, it can be difficult to know what caused it, considering that the number of possible faulting commands is likely in the millions. For the same reason, it’s difficult to pre-empt a problem. So, what measures can be taken to debug the GPU when a crash does occur?

The Solution

We’re proud to announce “Aftermath” (short for Aftermath Debugger), a lightweight library that acts as the final line of defence against GPU crashes. As the name suggests, Aftermath gives the user insight into why the GPU crashed after the crash has occurred. It is a GPU-based, post-mortem debugging aid. It’s lightweight and performs well enough to ship in a game, and it can be hooked easily into any pre-existing crash telemetry system, allowing developers to home in on why the GPU might be crashing in the wild.

How Does it Work?

As said previously, Aftermath is a lightweight library! So lightweight, in fact, that at present the entire API consists of just three calls: Initialise, SetEventMarker and GetData (although that may change as we add more features). The idea is to first initialise the library and then (assuming initialisation was successful) insert markers inline with the command stream. After a crash has occurred, those markers can be used to trace how far the GPU got through the command stream (that’s where the GetData call comes in). Furthermore, Aftermath allows the user to query why the GPU crashed (was this a page fault, a timeout, or something else?).
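To make the marker idea concrete, here is a minimal CPU-only sketch in C++. It is not the Aftermath API: MockAftermath, Command, CrashData, FaultReason, Submit, Execute and SimulateCrashingFrame are all hypothetical names invented for this illustration, and the signatures of the three calls are assumptions. The mock simply walks a list of commands, remembers the last marker it passed, and stops at the first command flagged as faulting.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical fault categories, for illustration only.
enum class FaultReason { None, PageFault, Timeout, Unknown };

// One entry in the mock command stream: either a marker or a unit of work.
struct Command {
    bool isMarker;
    std::string label;   // marker name, or a description of the work
    FaultReason fault;   // fault this command triggers (None = runs fine)
};

// What the mock reports after a crash.
struct CrashData {
    std::string lastMarker;  // empty if no marker was reached
    FaultReason reason;
};

// Hypothetical stand-in that mimics the three-call workflow on the CPU.
class MockAftermath {
public:
    bool Initialise() { initialised_ = true; return initialised_; }

    // Record a marker inline with the work that follows it.
    void SetEventMarker(const std::string& name) {
        assert(initialised_ && "Initialise() must succeed before adding markers");
        stream_.push_back(Command{true, name, FaultReason::None});
    }

    // Queue some work; optionally flag it as the command that will fault.
    void Submit(const std::string& work, FaultReason fault = FaultReason::None) {
        stream_.push_back(Command{false, work, fault});
    }

    // Pretend to be the GPU: run commands in order and stop at the first fault.
    // Returns true if the whole stream completed.
    bool Execute() {
        for (const Command& cmd : stream_) {
            if (cmd.isMarker) {
                lastMarker_ = cmd.label;   // the "GPU" has reached this marker
            } else if (cmd.fault != FaultReason::None) {
                reason_ = cmd.fault;       // crash: nothing after this runs
                return false;
            }
        }
        return true;
    }

    // Post-mortem query: how far did we get, and why did we stop?
    CrashData GetData() const { return CrashData{lastMarker_, reason_}; }

private:
    bool initialised_ = false;
    std::vector<Command> stream_;
    std::string lastMarker_;
    FaultReason reason_ = FaultReason::None;
};

// Build a frame with three marked passes, one of which page-faults.
CrashData SimulateCrashingFrame() {
    MockAftermath aftermath;
    if (!aftermath.Initialise()) {
        return CrashData{"", FaultReason::Unknown};
    }
    aftermath.SetEventMarker("ShadowPass");
    aftermath.Submit("draw shadow casters");
    aftermath.SetEventMarker("GBufferPass");
    aftermath.Submit("draw scene geometry", FaultReason::PageFault);
    aftermath.SetEventMarker("LightingPass");
    aftermath.Submit("resolve lighting");

    aftermath.Execute();          // returns false here: the mock "device" is lost
    return aftermath.GetData();   // last marker reached plus the fault reason
}
```

Calling SimulateCrashingFrame() yields a last marker of "GBufferPass" and a reason of PageFault, which narrows the search from the whole frame to the work submitted after that marker. In this sketch, finer-grained markers give a tighter bracket around the faulting command.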

Want More?

Check out the Aftermath product page here: NVIDIA Aftermath