For developers optimizing high-performance GPU kernels on AMD hardware, the open-source release of rocprof-trace-decoder is a significant step forward. The library decodes the binary .att files generated by rocprofv3's thread trace feature into structured data that shows which shader instructions ran, how long each took, and where GPU occupancy bottlenecks occur.
Built for AMD Radeon 6000/7000/9000 series and Instinct MI200/MI300 GPUs, the decoder transforms raw hardware instrumentation data into actionable insights. Previously confined to proprietary toolchains, this library now gives developers full control over trace analysis workflows.
To integrate it, simply build with CMake:
cmake -B build && cmake --build build -j$(nproc)
Once installed (by default to /opt/rocm/lib), the resulting shared library is loaded dynamically by rocprofiler-sdk. If it lives somewhere else, pass --att-library-path /your/path when invoking rocprofv3.
Advanced users can embed the decoder directly via the ROCm SDK API:
rocprofiler_thread_trace_decoder_handle_t decoder{};
auto status = rocprofiler_thread_trace_decoder_create(&decoder, "/opt/rocm/lib");
The project includes comprehensive testing: unit tests, integration suites, and sanitizer builds (ASan/UBSan) to ensure reliability. Code coverage is supported via gcov/lcov for rigorous validation.
For GPU kernel developers, this means you can now trace performance regressions down to individual wavefront instructions without vendor lock-in. Whether you're tuning matrix multiplications or custom compute kernels, you can see at the instruction level which ALU operations stall and why LDS bank conflicts occur.
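To illustrate what instruction-level correlation can look like once a trace is decoded, here is a small sketch that groups per-instruction events by program counter and ranks them by accumulated stall cycles. The InstEvent and PcSummary structs and the rank_by_stall function are hypothetical, simplified stand-ins invented for this example. They are not the decoder's actual record layout or API, which should be taken from the rocprofiler-sdk headers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// HYPOTHETICAL, simplified view of one decoded instruction event.
struct InstEvent
{
    std::uint64_t pc;            // code offset of the instruction
    std::uint64_t stall_cycles;  // cycles spent waiting before issue
    std::uint64_t total_cycles;  // stall plus execution cycles
};

// Totals accumulated over every event that shares one program counter.
struct PcSummary
{
    std::uint64_t pc           = 0;
    std::uint64_t hits         = 0;
    std::uint64_t stall_cycles = 0;
    std::uint64_t total_cycles = 0;
};

// Returns up to `top_n` summaries, highest total stall first.
// Ties are broken by ascending pc so the order is deterministic.
std::vector<PcSummary> rank_by_stall(const std::vector<InstEvent>& events, std::size_t top_n)
{
    std::unordered_map<std::uint64_t, PcSummary> by_pc;
    for(const auto& e : events)
    {
        PcSummary& s = by_pc[e.pc];
        s.pc = e.pc;
        ++s.hits;
        s.stall_cycles += e.stall_cycles;
        s.total_cycles += e.total_cycles;
    }

    std::vector<PcSummary> ranked;
    ranked.reserve(by_pc.size());
    for(const auto& kv : by_pc)
        ranked.push_back(kv.second);

    std::sort(ranked.begin(), ranked.end(), [](const PcSummary& a, const PcSummary& b) {
        if(a.stall_cycles != b.stall_cycles) return a.stall_cycles > b.stall_cycles;
        return a.pc < b.pc;
    });

    if(ranked.size() > top_n) ranked.resize(top_n);
    return ranked;
}
```

Running the same ranking on traces from two builds of a kernel and comparing the top entries is one simple way to see which instructions account for a regression.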
This isn't just another profiling tool; it's the first open window into AMD GPU hardware execution traces. The beta version number (0.1.7) signals active development, so expect deeper integration with TheRock and expanded device support in coming releases.