The new register allocator can pack one- and two-component register writes into the same register, making full use of the four-component temporary registers that programs have access to. For example, a program like this:
ADD TEMP[0].x, CONST[0].x, CONST[0].x
MUL TEMP[1].x, TEMP[0].x, TEMP[0].x
MUL TEMP[2].x, TEMP[1].x, TEMP[1].x
MAD OUT[0].x, TEMP[0].x, TEMP[1].x, TEMP[2].x
will now be transformed to this:
ADD TEMP[0].x, CONST[0].x, CONST[0].x
MUL TEMP[0].y, TEMP[0].x, TEMP[0].x
MUL TEMP[0].z, TEMP[0].y, TEMP[0].y
MAD OUT[0].x, TEMP[0].x, TEMP[0].y, TEMP[0].z
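To make the packing idea concrete, here is a minimal sketch in Python. It is not the allocator from the branch: the function name `pack_scalar_temps` is made up for illustration, it treats the program as plain text, it only touches single-component TEMP operands, and it does no liveness analysis, so a component is never reused once it has been handed out.

```python
import re

COMPONENTS = "xyzw"
# Matches a single-component temporary such as TEMP[1].x.
# The trailing \b leaves multi-component operands like TEMP[1].xy untouched.
TEMP_RE = re.compile(r"TEMP\[(\d+)\]\.([xyzw])\b")


def pack_scalar_temps(program):
    """Give each distinct scalar temporary the next free vec4 component.

    Hypothetical helper for illustration only. Slots are handed out in order
    of first appearance and never reused (no live-range tracking).
    """
    slots = {}  # (old index, old component) -> (new index, new component)

    def remap(match):
        key = (int(match.group(1)), match.group(2))
        if key not in slots:
            n = len(slots)
            # Four scalars fit into one four-component register.
            slots[key] = (n // 4, COMPONENTS[n % 4])
        index, component = slots[key]
        return f"TEMP[{index}].{component}"

    return [TEMP_RE.sub(remap, line) for line in program]


program = [
    "ADD TEMP[0].x, CONST[0].x, CONST[0].x",
    "MUL TEMP[1].x, TEMP[0].x, TEMP[0].x",
    "MUL TEMP[2].x, TEMP[1].x, TEMP[1].x",
    "MAD OUT[0].x, TEMP[0].x, TEMP[1].x, TEMP[2].x",
]

packed = pack_scalar_temps(program)
for line in packed:
    print(line)
```

For the example program this yields the same four instructions as the transformed listing above, using one temporary register instead of three. A real allocator would also have to track live ranges so that components can be reused and so that packed values never overlap with wider writes.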
This will have a big impact on shaders that use a lot of scalar values. Some of the bigger shaders in Lightsmark use 30-50% fewer registers with the new register allocator on my RV515. I also get an improvement in fps from ~4.75 to ~5.30, which is a little over 10%, but with fps that low I'm not sure the difference is really significant. I'd be interested to see results on other cards with different games and benchmarks. If anyone wants to test it out, the code is in the new-register-allocator branch here.
If you run programs with the environment variable RADEON_DEBUG=pstat set, they will print statistics about the compiled shaders that are useful for evaluating the effectiveness of the new register allocator.
Some benchmarks are completed at http://www.phoronix.com/vr.php?view=15852
Any chance of having the branch updated against master? It would be interesting to try this with really demanding apps, like the unigine demos now that floating is merged.
I'll try to push a rebased version after I finish adding support for loops. I'm not sure how long that will take though.
I've just pushed a rebased branch called new-register-allocator-v2.