Sunday, September 25, 2011

Texture Semaphore for r500

I just finished some improvements to the r300g instruction scheduler to make better use of the texture semaphore. Instructions that need to read texture data use the texture semaphore to tell the ALU to delay execution until that data has been fetched from the texture unit. Previously in the r300g compiler, all instructions were using this semaphore, so even instructions that didn't need texture data were waiting for it to be fetched. With these improvements, we are able to prefetch texture data by placing instructions that don't depend on texture data directly after texture lookups, so they execute while the data is being fetched. This should lead to some performance improvements for certain kinds of shaders.

In Lightsmark, there is one shader in particular that really benefits from this optimization, and I'm getting about a 33% speedup in overall FPS with these new changes on my RV515. I'm curious to see what kind of performance improvements this brings for Lightsmark on other cards, and whether there are other applications that benefit. Unfortunately, though, this optimization is only available on r500 cards, so r300/r400 users are out of luck.
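To make the idea concrete, here is a toy list scheduler in Python. It is only a sketch of the general technique and not the r300g code: the `Inst` type, the `schedule` function, and the priority rule are all made up for illustration, and it assumes every destination is written exactly once so that only read-after-write dependencies matter. Among the instructions whose inputs are ready, it issues texture lookups first, then ALU instructions that do not read a texture result, and only then the ALU instructions that have to wait for texture data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Inst:
    op: str            # "TEX" marks a texture lookup; anything else is ALU work
    dst: str           # destination name (hypothetical, written exactly once)
    srcs: tuple = ()   # names of the values this instruction reads


def schedule(insts):
    """Toy scheduler: hide texture latency behind independent ALU work."""
    defined = {i.dst for i in insts}                     # values produced in this program
    tex_dsts = {i.dst for i in insts if i.op == "TEX"}   # values that come from textures
    done = set()
    remaining = list(insts)
    out = []

    def ready(inst):
        # A source is available if it is a program input or already computed.
        return all(s in done or s not in defined for s in inst.srcs)

    def priority(inst):
        if inst.op == "TEX":
            return 0   # start fetches as early as possible
        if not any(s in tex_dsts for s in inst.srcs):
            return 1   # ALU work that does not read a texture result
        return 2       # ALU work that must wait on texture data

    while remaining:
        candidates = [i for i in remaining if ready(i)]
        nxt = min(candidates, key=priority)   # first candidate with the best priority
        out.append(nxt)
        remaining.remove(nxt)
        done.add(nxt.dst)
    return out


prog = [
    Inst("TEX", "t0", ("in0",)),
    Inst("MUL", "t1", ("t0", "c0")),        # needs the texture result
    Inst("ADD", "t2", ("c0", "c1")),        # independent of the texture
    Inst("MUL", "t3", ("t2", "t2")),        # independent of the texture
    Inst("MAD", "out", ("t1", "t3", "c1")),
]

for inst in schedule(prog):
    print(inst.op, inst.dst)
```

In this example the ADD and the second MUL move up to sit right after the TEX, so only the first MUL and the final MAD have to wait for the fetch.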

If anyone is interested, I've pushed the code to the tex-sem branch of my fdo git repo (http://cgit.freedesktop.org/~tstellar/mesa/). When testing this out, you can make use of a new environment variable called RADEON_TEX_GROUP, which defines the maximum number of texture lookups to submit at the same time. The default is 8, because it gave me the best Lightsmark performance on my card, but different values might be better for other application / GPU combinations. To set the maximum number of texture lookups to 12, just do this:

RADEON_TEX_GROUP=12 ./your_app

The values I used for testing were 4, 8, and 12. It probably won't help to go any lower than 4, and I doubt anything higher than 16 will have much of an effect.
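For anyone wondering what a group limit like this means in practice, here is a small Python sketch. It is not how the driver reads or applies the variable; `max_tex_group` and `group_lookups` are hypothetical helpers, and falling back to the default on a missing, non-numeric, or non-positive value is my assumption for the sketch. It reads RADEON_TEX_GROUP with the default of 8 and splits a list of pending texture lookups into batches no larger than the limit.

```python
import os

DEFAULT_TEX_GROUP = 8  # the default mentioned in the post


def max_tex_group(environ=os.environ):
    """Return the group limit; fall back to the default on bad input (assumption)."""
    raw = environ.get("RADEON_TEX_GROUP")
    if raw is None:
        return DEFAULT_TEX_GROUP
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_TEX_GROUP
    return value if value > 0 else DEFAULT_TEX_GROUP


def group_lookups(lookups, limit):
    """Split pending texture lookups into batches of at most `limit` entries."""
    return [lookups[i:i + limit] for i in range(0, len(lookups), limit)]


pending = [f"TEX{n}" for n in range(10)]
for batch in group_lookups(pending, max_tex_group()):
    print(batch)
```

A smaller limit means more, shorter batches; a larger limit submits more lookups before the ALU instructions that consume them.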

There are also a few other optimizations in this branch, namely a smarter instruction scheduler and the re-enabling of the register rename pass, which enhances the effect of all the compiler optimizations. If you are interested, give this branch a try and let me know how it works for you.


Monday, April 18, 2011

Updates to the New R300 Register Allocator

I just pushed an updated version of the new r300 register allocator to http://cgit.freedesktop.org/~tstellar/mesa/. The branch is called new-register-allocator-v2. This new version contains support for loops and a few bug fixes. It has been rebased to include the floating-point texture additions, so it can now be tested on those apps that need floating-point textures.

Monday, March 28, 2011

New Register Allocator in the R300 Compiler

I'm mostly finished with a new and improved register allocator for fragment shaders in the R300 compiler. I still need to clean up the code and add comments, but otherwise it is ready for testing. The new allocator takes advantage of a register allocation algorithm designed for irregular architectures from a paper by Johan Runeson and Sven-Olof Nyström. Eric Anholt implemented this algorithm and added it to mesa, so all drivers could make use of it.

The new register allocator can pack one- and two-component register writes together into the same register to make full use of the four-component temporary registers that the programs have access to. For example, a program like this:

ADD TEMP[0].x, CONST[0].x, CONST[0].x
MUL TEMP[1].x, TEMP[0].x, TEMP[0].x
MUL TEMP[2].x, TEMP[1].x, TEMP[1].x
MAD OUT[0].x, TEMP[0].x, TEMP[1].x, TEMP[2].x

will now be transformed to this:

ADD TEMP[0].x, CONST[0].x, CONST[0].x
MUL TEMP[0].y, TEMP[0].x, TEMP[0].x
MUL TEMP[0].z, TEMP[0].y, TEMP[0].y
MAD OUT[0].x, TEMP[0].x, TEMP[0].y, TEMP[0].z

This will have a big impact on shaders that use a lot of scalar values. Some of the bigger shaders in Lightsmark use 30-50% fewer registers with the new register allocator on my RV515. I also get an improvement in fps from ~4.75 to ~5.30, which is a bit over 10%, but with fps that low I'm not sure the difference is really significant. I'd be interested to see the results on other cards with different games and benchmarks. If anyone wants to test it out, the code is in the new-register-allocator branch here.
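The packing transformation above can be mimicked with a few lines of Python. This is only a toy and not the allocator itself: the real one works on the compiler's internal representation with register classes and liveness information, while this sketch (with a hypothetical `pack_scalars` function) just rewrites text, assumes every temporary is a scalar written to `.x`, and assumes all temporaries are live at the same time so that no slot is ever reused.

```python
import re

CHANNELS = "xyzw"
TEMP_X = re.compile(r"TEMP\[(\d+)\]\.x")   # matches scalar temporaries like TEMP[2].x


def pack_scalars(lines):
    """Rename scalar temporaries so that four of them share one register."""
    slots = {}   # old temporary index -> (new register index, channel)

    def rename(match):
        old = int(match.group(1))
        if old not in slots:
            n = len(slots)                        # next free scalar slot
            slots[old] = (n // 4, CHANNELS[n % 4])
        reg, chan = slots[old]
        return f"TEMP[{reg}].{chan}"

    return [TEMP_X.sub(rename, line) for line in lines]


program = [
    "ADD TEMP[0].x, CONST[0].x, CONST[0].x",
    "MUL TEMP[1].x, TEMP[0].x, TEMP[0].x",
    "MUL TEMP[2].x, TEMP[1].x, TEMP[1].x",
    "MAD OUT[0].x, TEMP[0].x, TEMP[1].x, TEMP[2].x",
]

for line in pack_scalars(program):
    print(line)
```

Run on the example program, this takes the temporary register count from three down to one, which is the kind of saving the statistics report.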

If you run programs with the environment variable RADEON_DEBUG=pstat set, they will print out statistics from the compiled shaders that are useful for evaluating the effectiveness of the new register allocator.