Sunday, September 25, 2011

Texture Semaphore for r500

I just finished some improvements to the r300g instruction scheduler to make better use of the texture semaphore. The texture semaphore is used by instructions that need to read texture data to tell the ALU to delay execution until the desired texture data has been fetched from the texture unit. Previously in the r300g compiler, all instructions were using this semaphore, so even instructions that didn't need texture data were waiting for it to be fetched. With these improvements, we are able to prefetch texture data by placing instructions that don't depend on texture data directly after texture look ups, so they execute while the data is being fetched. This should lead to some performance improvements for certain kinds of shaders. In Lightsmark, there is one shader in particular that really benefits from this optimization, and I'm getting about a 33% speed up in overall FPS, with these new changes on my RV515. I'm curious to see what kind of performance improvements this brings for Lightsmark on other cards and even if there are other applications that benefit. Unfortunately, though, this optimization is only available on r500 cards, so r300 /r400 users are out of luck.

If anyone is interested, I've pushed the code to the tex-sem branch of my fdo git repo (http://cgit.freedesktop.org/~tstellar/mesa/) . When testing this out you can make use of a new environment variable called RADEON_TEX_GROUP, which defines the maximum number of texture lookups to submit at the same time. The default is 8, because it gave me the best Lightsmark performance on my card, but different values might be better for other applications / GPU combinations. To set the maximum number of texture lookups to 12, just do this:

RADEON_TEX_GROUP=12 ./your_app

The values I used for testing were 4, 8, and 12. It probably won't help to go any lower than 4, and I doubt anything higher than 16 will have much of an effect.

There are also a few other optimizations in this branch namely, a smarter instruction scheduler, and the re-enabling of the register rename pass which enhances the effect of all the compiler optimizations. If you are interested, give this branch a try and let me know how it works for you.