If anyone is interested, I've pushed the code to the tex-sem branch of my fdo git repo (http://cgit.freedesktop.org/~tstellar/mesa/). When testing it, you can use a new environment variable called RADEON_TEX_GROUP, which sets the maximum number of texture lookups to submit at the same time. The default is 8, because that gave me the best Lightsmark performance on my card, but different values might be better for other application / GPU combinations. To set the maximum number of texture lookups to 12, just do this:
RADEON_TEX_GROUP=12 ./your_app
The values I used for testing were 4, 8, and 12. It probably won't help to go any lower than 4, and I doubt anything higher than 16 will have much of an effect.
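To compare several values in one go, a small shell loop along these lines should work. This is only a sketch: the function name sweep_tex_group and the TEX_GROUPS override variable are made up for illustration, and the default list simply repeats the 4, 8, and 12 mentioned above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the branch): run a command once per
# RADEON_TEX_GROUP value so the results can be compared side by side.
# TEX_GROUPS is an assumed override variable; it defaults to "4 8 12".
sweep_tex_group() {
    for group in ${TEX_GROUPS:-4 8 12}; do
        echo "RADEON_TEX_GROUP=$group"
        # env sets the variable for this one invocation only.
        env RADEON_TEX_GROUP="$group" "$@"
    done
}
```

After sourcing the function, `sweep_tex_group ./your_app` runs the application three times, and `TEX_GROUPS="8 16" sweep_tex_group ./your_app` tries a different set of values.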
There are also a few other optimizations in this branch, namely a smarter instruction scheduler and the re-enabling of the register rename pass, which enhances the effect of all the compiler optimizations. If you are interested, give this branch a try and let me know how it works for you.