Saturday, November 13, 2010

Bug Fixes for the sched-perf Branch

I just pushed a rebased version of the sched-perf branch to a new branch called sched-perf-rebase at git://anongit.freedesktop.org/~tstellar/mesa

This new branch contains bug fixes for the old branch and has no piglit regressions vs. master on my RC410 and RV515 cards. In fact, this branch passes one more test than master on both cards.

This new branch should reduce fragment shader program size by about 10-20%. Shaders with branches should see the most improvement. There are three major changes to the compiler that are driving these improvements.

The first change is that the dataflow analysis for the optimization passes has been unified in a single function, rc_get_readers(). This saves us from having to redo dataflow analysis in every pass, and it made it really easy to add the new optimization passes in this branch.

The second change makes better use of the scalar unit. Fragment shader instructions for R300-R500 cards are actually composed of two sub-instructions: one vector and one scalar. The vector instruction writes to the xyz components of a register and the scalar instruction writes to the w component. Currently, in the master branch, an instruction like MOV Temp[0].x, Temp[1].x is treated as a vector instruction, since it writes to the x component. This wastes the vector unit on what is actually a scalar instruction. One of the optimizations I added converts MOV Temp[0].x, Temp[1].x to MOV Temp[0].w, Temp[1].x, which allows us to make use of the scalar unit and leaves the vector unit free for actual vector instructions. Since there are usually more vector instructions than scalar ones, we can usually fill this empty vector slot with another instruction, which reduces the overall program size by one instruction.
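To make the idea concrete, here is a minimal Python sketch of that conversion. It is not the r300 compiler's code: the Mov class, uses_w() and move_to_scalar_slot() are names made up for this illustration, it only models MOV instructions, and it assumes straight-line code in which the moved component is written exactly once.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Mov:                 # hypothetical model of "MOV Temp[dst].mask, Temp[src].swizzle"
    dst: int               # destination temporary index
    mask: str              # components written, e.g. "x" or "xyz"
    src: int               # source temporary index
    swizzle: str           # components read from the source

def uses_w(program, reg):
    """True if any instruction already writes or reads the w component of reg."""
    return any((i.dst == reg and "w" in i.mask) or
               (i.src == reg and "w" in i.swizzle) for i in program)

def move_to_scalar_slot(program):
    """Retarget single-component xyz writes to w when w is otherwise unused."""
    out = list(program)
    for n in range(len(out)):
        inst = out[n]
        if len(inst.mask) != 1 or inst.mask not in "xyz":
            continue                       # not a one-component vector write
        writes = [i for i in out if i.dst == inst.dst and inst.mask in i.mask]
        if len(writes) != 1 or uses_w(out, inst.dst):
            continue                       # assumption: only the easy case is handled
        old = inst.mask
        out[n] = replace(inst, mask="w")
        # Later readers of the old component must now read w instead.
        for m in range(n + 1, len(out)):
            if out[m].src == inst.dst:
                out[m] = replace(out[m], swizzle=out[m].swizzle.replace(old, "w"))
    return out

program = [Mov(0, "x", 1, "x"), Mov(2, "xyz", 3, "xyz"), Mov(4, "xy", 0, "xx")]
converted = move_to_scalar_slot(program)
```

In this example the first MOV ends up writing Temp[0].w, and the last instruction reads Temp[0].ww instead of Temp[0].xx, so a scheduler would be free to pair the first MOV with the xyz write in the second instruction.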

The third big change is converting the code to a quasi static single assignment (SSA) form prior to instruction scheduling. SSA basically means that each register is only written once. The main advantage of SSA is that it makes dataflow analysis much easier; however, in the r300 compiler we aren't really using it for dataflow analysis. We are using it because it helps our scheduler do a better job of pairing instructions and making use of the vector and scalar units on every cycle. I say quasi-SSA because you can't really turn vector instructions into SSA unless you break them apart into individual scalar instructions. For example, with vector instructions you might run into cases like this:

MOV Temp[4].x, Temp[5].x
MOV Temp[4].y, Temp[6].x
MOV Temp[7].xy, Temp[4].xy

In true SSA, each register is only written one time so we would need to rewrite the 2nd instruction like this:

MOV Temp[4].x, Temp[5].x
MOV Temp[5].y, Temp[6].x
MOV Temp[7].xy, Temp[4].xy

Oops, now we broke the program. Instruction 3 reads from Temp[4].y, but that component is no longer written. We could change instruction 3 to MOV Temp[7].xy, Temp[5].xy, but then it would read from Temp[5].x, which this code doesn't write either. So, in the r300 compiler we convert everything to SSA unless we see code like the example above. In that case we just ignore it and don't bother trying to rewrite it.
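A rough Python sketch of this "rename unless a read merges several writes" rule is below. It is only an illustration under simplifying assumptions: instructions are plain (dst, mask, src, swizzle) tuples, there is no flow control, and quasi_ssa() and first_free are hypothetical names, not anything from the r300 compiler.

```python
def quasi_ssa(program, first_free):
    """Rename register writes to fresh temporaries where that is safe.

    program    -- list of (dst, mask, src, swizzle) tuples (hypothetical format)
    first_free -- index of the first unused temporary register
    """
    # Pass 1: find registers where a single read combines components
    # produced by different writes, like Temp[4] in the example above.
    last_write = {}          # (reg, component) -> index of the writing instruction
    skip = set()
    for n, (dst, mask, src, swizzle) in enumerate(program):
        writers = {last_write.get((src, c)) for c in swizzle}
        if len(writers) > 1:
            skip.add(src)    # leave this register alone
        for c in mask:
            last_write[(dst, c)] = n

    # Pass 2: give every write to a remaining register a fresh name and
    # point later reads at the newest name for the components they read.
    current = {}             # (original reg, component) -> current name
    next_reg = first_free
    out = []
    for dst, mask, src, swizzle in program:
        src = current.get((src, swizzle[0]), src)
        if dst not in skip:
            for c in mask:
                current[(dst, c)] = next_reg
            dst = next_reg
            next_reg += 1
        out.append((dst, mask, src, swizzle))
    return out

# The problem case from the text: Temp[4] is built from two partial writes.
tricky = [(4, "x", 5, "x"), (4, "y", 6, "x"), (7, "xy", 4, "xy")]
print(quasi_ssa(tricky, first_free=8))
```

With the three instructions from the example, both writes to Temp[4] are left untouched and only the destination of the third instruction gets a fresh register. A register that is completely overwritten between its reads would instead get a new name for each write.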

As I mentioned earlier, these compiler optimizations reduce program size by about 10-20%. Here is an example from the piglit test glsl-fs-atan3:

Category                     master   sched-perf-rebase   fglrx
Total Instructions              111                  93      60
Vector Instructions              81                  65      47
Scalar Instructions              27                  37      47
Flow Control Instructions        20                  20       7
Presubtract Operations            3                   4       4
Temporary Registers              10                   9       6

The fglrx results come from the AMD Shader Analyzer v1.42.

So about a 15% decrease in shader size for this test, but we are still quite far away from fglrx. The good news, however, is that I can see lots of areas for improvement. The big gap between the r300 compiler and fglrx is mostly because the way we use flow control instructions is very inefficient, and in this shader, it costs us about 16 instructions. There are a few other optimizations we could be doing better too.
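For anyone who wants to check the arithmetic, here is a small Python sketch. It assumes the total instruction counts for this test read as 111 for master, 93 for sched-perf-rebase and 60 for fglrx; percent_reduction() is just a helper name chosen for this example.

```python
def percent_reduction(before, after):
    """Relative decrease from before to after, in percent."""
    return 100.0 * (before - after) / before

# Assumed totals for glsl-fs-atan3: master, sched-perf-rebase, fglrx.
master, rebase, fglrx = 111, 93, 60

print(round(percent_reduction(master, rebase), 1))  # master -> sched-perf-rebase
print(round(percent_reduction(rebase, fglrx), 1))   # remaining gap to fglrx
```

The first number works out to roughly 16%, in line with the "about 15%" figure, and the second shows how much further the program would have to shrink to match fglrx.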

I'm really not a GPU performance expert, so I don't know how smaller shader programs will translate to better performance, at least in terms of frames per second. Smaller shaders mean less data needs to be submitted to the graphics processor, so that should help, but I think most of the performance bottlenecks are elsewhere in the driver.

I'm going to do more testing of the sched-perf-rebase branch before I merge it with master, but I feel pretty good about it now. Also, as a bonus, while working on these performance improvements I found and fixed 5 non-performance-related bugs, which I hope will resolve some of the outstanding r300g fdo bugs.

Monday, November 8, 2010

r300 Compiler Optimization Improvements

I just pushed a branch called sched-perf to git://anongit.freedesktop.org/~tstellar/mesa
It contains various optimization improvements:
  • Handling of flow control instructions in dataflow analysis.
  • More aggressive use of presubtract operations.
  • Some scheduler improvements.

I'm seeing about a 10% decrease in shader program size in most piglit tests with this branch, but I haven't done much testing with real applications. I added a debug option a few weeks ago for dumping shader stats (RADEON_DEBUG=pstat), which I've been using with piglit and is helpful for comparing compiler performance between different branches.

Tuesday, August 3, 2010

Back to Loops

I spent the last 3 weeks working on adding presubtract support to the r300 compiler. It turned out to be quite an undertaking. I had to make some major changes to some core parts of the compiler to get it working. I think it is pretty stable at the moment, but I would like to refactor some of the code and add support for the add and subtract operations before I merge it into master. I would really like to create a sort of optimization framework that makes writing new optimizations a lot easier, so that will be part of what I do when I add the remaining presubtract operations. I'll probably pick this up again after my GSoC project is finished. For now, I am going to focus on loops again. Right now, I am working on handling breaks and continues for r500 fragment shaders. Once I get that working, I'll see what I can do about loops in vertex shaders.

Monday, July 19, 2010

R300 Presubtract

For the last week I've been trying to get presubtract operations working for the r300 compiler. Presubtract operations are basically "free" instructions that modify source values before they are sent to the ALU. The four presubtract operations for r300 cards are (1 - src0), (src1 + src0), (src1 - src0), and (1 - 2 * src0). At this point the compiler only uses (1 - src0), but now that I have one working, adding the others shouldn't be too hard. I had to make some major changes to the compiler to get this working, so I am going to let it sit in its own branch (presub branch at http://cgit.freedesktop.org/~tstellar/mesa/) and test it out for a while before I merge it into the master branch.
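As a plain illustration of what those four operations compute, here is a short Python sketch. The PRESUBTRACT_OPS table and the mul_with_presub() helper are invented for this example and only model the arithmetic, not how the hardware or the compiler encodes it.

```python
# The four presubtract operations listed above, modeled as plain functions.
PRESUBTRACT_OPS = {
    "1 - src0":     lambda src0, src1: 1 - src0,
    "src1 + src0":  lambda src0, src1: src1 + src0,
    "src1 - src0":  lambda src0, src1: src1 - src0,
    "1 - 2 * src0": lambda src0, src1: 1 - 2 * src0,
}

def mul_with_presub(op, src0, src1, other):
    """Model one MUL whose first operand comes from a presubtract operation.

    Without presubtract, (1 - a) * b would need a separate instruction to
    compute 1 - a before the MUL; with it, the subtraction rides along.
    """
    return PRESUBTRACT_OPS[op](src0, src1) * other

# (1 - 0.25) * 4.0 as a single modeled instruction; src1 is unused by this op.
result = mul_with_presub("1 - src0", 0.25, 0.0, 4.0)
```

Here result is 3.0, the same value a separate subtract followed by a multiply would give, which is why folding the subtraction into the source operand saves an instruction.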

Thursday, July 8, 2010

Bug 25109: Why I love FOSS

I just pushed commit 3724a2e65f5b3aa6e123889342a3e9c4d05903f5 to the mesa master branch that fixes this bug. I filed this bug 8 months ago as a user without knowing anything about mesa or the r300 driver, and today I fixed it! How cool is that?

Friday, July 2, 2010

Hardware Loops Take 2

A few weeks ago I began working on using the hardware loop capabilities for fragment shaders on R500 cards. My original plan was to use the specialized loop instructions provided by the graphics card, but as it turned out, the documentation for these instructions was a little confusing (or so I thought), and I could never get them to work the way I wanted. So, instead I ended up using JUMP instructions to execute loops the same way you would if you were generating code for a CPU. This is an OK solution, but it makes it very difficult to generate code for loops that have continue or break statements.

After taking a few days off from loops, I decided to give the specialized loop instructions another try. I went back and reviewed the documentation and it still did not make sense to me, so I decided to ask Alex Deucher, who works at AMD, for some clarification. As it turns out, the documentation was fine. Alex pointed out a short but very important part of the documentation that I had overlooked. I've probably read the documentation one hundred times, but I always missed that one crucial part!!! Thanks, Alex.

I will start working on hardware loop instructions again soon, but first I am going to take a little detour to fix a bug in the compiler's instruction scheduler that is preventing me from playing civ4 and causing problems with Compiz for some people.

Sunday, June 6, 2010

r300 Loop Emulation

I have been making good progress on implementing loop emulation for the r300 compiler. I just published a branch containing loop emulation code here: http://cgit.freedesktop.org/~tstellar/mesa/

For loops like for(i=0; i<10; i++), the compiler is able to figure out how many iterations the loop will have and then unroll it that many times. It can't handle every possible loop, but I think I have the most common ones covered. For the rest of the loops, which don't have a known number of iterations at compile time, the compiler will just unroll the loop until it hits the maximum instruction limit.
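The two cases can be sketched in a few lines of Python. This is an assumption-laden illustration rather than the actual loop emulation code: trip_count() and unroll() are hypothetical names, the loop body is just a list of instruction strings, and only loops of the form for(i=init; i<limit; i+=step) with integer constants are recognized.

```python
def trip_count(init, limit, step):
    """Iterations of for(i=init; i<limit; i+=step), or None if not computable."""
    if step <= 0:
        return None                          # this sketch gives up on such loops
    return max(0, (limit - init + step - 1) // step)

def unroll(body, count, max_instructions):
    """Repeat the (non-empty) loop body count times.

    If count is None (unknown at compile time), repeat the body as many
    times as fit under max_instructions.
    """
    fits = max_instructions // len(body)
    if count is None:
        count = fits
    elif count > fits:
        raise ValueError("unrolled loop exceeds the instruction limit")
    return body * count

body = ["ADD", "MUL"]                        # placeholder instructions
known = unroll(body, trip_count(0, 10, 1), max_instructions=64)
unknown = unroll(body, None, max_instructions=7)
```

For the for(i=0; i<10; i++) loop the body is emitted 10 times, giving 20 instructions. In the unknown case with a made-up limit of 7 instructions, the body is emitted 3 times, since a fourth copy would not fit.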

Thursday, May 27, 2010

Google Summer of Code Begins

I have started working on my Google Summer of Code project, which is to improve the GLSL compiler for the open source r300 driver. Right now, I am working on emulating loops in the compiler backend for cards that don't have hardware looping instructions.