In forward rendering (FR) we use an uber-shader to handle many lights within one shader program, but there is a problem. Objects in a single scene can be influenced by different numbers of lights, so we have to set up a shader per object. When we draw an object affected by a different number of lights, we switch the renderer to another shader. This operation is very expensive because it causes a lot of state switching inside the driver. In a real-world application we can "sit" inside the driver for half of the rendering time, up to 10 ms, which is really bad.
If we want to use multiple lights within one shader, the first thing that comes to mind is the "for" operator. Let's imagine something like this:
    /// Here we iterate through all lights to calculate lighting
    /// Parameter MAXLIGHTS is the maximum number of lights per object
    for( int i = 0; i < MAXLIGHTS; i++ )
    {
        if( pixel influenced by light[i] )
        {
            /// Some calculations to do lighting
            ...
            /// Write result
            ...
        }
    }

This is a really basic light iteration, and it has some caveats: the 'for' loop and the 'if' condition, which is first-priority branching. Let's recall some technical facts about the GPU. First and foremost, the GPU DOESN'T have a command stack, so executing interdependent conditions is painful. Second is the "wavefront" problem, where processors wait for each other because one depends on the results of another (this is true when some processors enter a deeper condition branch than the others, and false when all processors take the same branches). All of these problems are in the GPU itself (but I believe IHVs will rework their GPUs and get past them). A programmer may think "The problem is the N-th condition branch/loop! F*** it!" and delete the unnecessary loop or branch. But stop! Let's think: Parallax Occlusion Mapping has branching with a lot of 'if's and 'while's and works well on old video cards like the NVIDIA GeForce 6xxx/7xxx and ATI Radeon X1xxx, so the cause must be something else.
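To make the "wavefront" point a bit more concrete, here is a deliberately simplified CPU-side model in C++. It is only a sketch: it assumes that a group of processors runs in lockstep and that the whole group pays for the lighting body whenever at least one of them takes the 'if'. Real hardware is more complicated, and the names kMaxLights, LightMask, groupBodyExecutions and laneBodyExecutions are made up for this illustration.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for MAXLIGHTS from the pseudocode above.
constexpr int kMaxLights = 4;

// One bit per light: bit i set means "this pixel is influenced by light[i]".
using LightMask = std::uint32_t;

// Simplified lockstep model (an assumption, not a hardware description):
// for every loop iteration, the whole group executes the lighting body
// if at least one lane takes the branch.
// Returns how many times the group executes the body.
int groupBodyExecutions(const std::vector<LightMask>& lanes) {
    int executions = 0;
    for (int i = 0; i < kMaxLights; ++i) {
        bool anyLaneActive = false;
        for (LightMask mask : lanes) {
            if (mask & (1u << i)) {
                anyLaneActive = true;
                break;
            }
        }
        if (anyLaneActive) {
            ++executions;
        }
    }
    return executions;
}

// Work a single lane would need on its own: the number of lights
// that actually influence its pixel.
int laneBodyExecutions(LightMask mask) {
    int count = 0;
    for (int i = 0; i < kMaxLights; ++i) {
        if (mask & (1u << i)) {
            ++count;
        }
    }
    return count;
}
```

In this model, four lanes that are all lit by light 0 only cost the group one execution of the body, while four lanes that are each lit by a different light cost the group four executions, even though every single lane needs just one. That is the difference between identical and diverging branches described above.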
Another option is to unroll the loop into first-priority (top-level) branches without any second-priority (nested) branches, so the code becomes linear. This is how it could look:
    /// Check first light
    if( ( lights > 0 ) && ( pixel influenced by light[0] ) )
    {
        /// Some calculations to do lighting
        ...
        /// Write result
        ...
    }
    /// Check second light
    if( ( lights > 1 ) && ( pixel influenced by light[1] ) )
    {
        /// Some calculations to do lighting
        ...
        /// Write result
        ...
    }
    ...
    /// Check N light
    if( ( lights > N ) && ( pixel influenced by light[N] ) )
    {
        /// Some calculations to do lighting
        ...
        /// Write result
        ...
    }
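If you want to convince yourself that the unrolled form computes the same thing as the loop, a small CPU-side sketch in C++ is enough. Everything here is an assumption made for illustration: the Light and Point structs, the distance test standing in for "pixel influenced by light[i]", the linear falloff standing in for "some calculations to do lighting", and kMaxLights fixed at 4. It says nothing about GPU performance; it only checks that both forms give the same result.

```cpp
#include <array>
#include <cmath>

// Hypothetical stand-in for MAXLIGHTS.
constexpr int kMaxLights = 4;

struct Light {
    float x, y, z;    // position
    float radius;     // range of influence
    float intensity;  // scalar brightness
};

struct Point {
    float x, y, z;
};

using LightArray = std::array<Light, kMaxLights>;

float distanceTo(const Light& l, const Point& p) {
    const float dx = l.x - p.x, dy = l.y - p.y, dz = l.z - p.z;
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}

// Stand-in for "pixel influenced by light[i]": a plain range test.
bool influences(const Light& l, const Point& p) {
    return distanceTo(l, p) < l.radius;
}

// Stand-in for the lighting calculation: linear falloff with distance.
float contribution(const Light& l, const Point& p) {
    return l.intensity * (1.0f - distanceTo(l, p) / l.radius);
}

// Loop form: an 'if' nested inside a 'for'.
float shadeLoop(const LightArray& lights, int count, const Point& p) {
    float result = 0.0f;
    for (int i = 0; i < kMaxLights; ++i) {
        if (i < count && influences(lights[i], p)) {
            result += contribution(lights[i], p);
        }
    }
    return result;
}

// Unrolled form: only top-level branches, one per light slot.
float shadeUnrolled(const LightArray& lights, int count, const Point& p) {
    float result = 0.0f;
    if (count > 0 && influences(lights[0], p)) result += contribution(lights[0], p);
    if (count > 1 && influences(lights[1], p)) result += contribution(lights[1], p);
    if (count > 2 && influences(lights[2], p)) result += contribution(lights[2], p);
    if (count > 3 && influences(lights[3], p)) result += contribution(lights[3], p);
    return result;
}
```

Calling shadeLoop and shadeUnrolled with the same light array, light count and point should return the same value, because both accumulate the same contributions in the same order.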
This unrolled code is quite ugly, but the performance of this method is the same as Deferred Shading, because we do not suffer from shader and uniform switching. The shader is already an uber-shader, so we don't need millions of shader compilations for the various material-light combinations. We also use a Constant Buffer for this method, so we write everything once on the CPU and use it once on the GPU. This is how we can keep our renderer from being CPU-bound.
I know this is really crazy, but it works. I don't know whether THIS type of forward renderer is used by any company; my method may not even be new.
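The "write everything once on the CPU" part could look roughly like the C++ sketch below. The layout is my assumption, not a description of any particular engine: GpuLight, LightBlock, packLights and the limit of 8 lights are hypothetical names and values, and the 16-byte alignment is just a typical requirement for constant/uniform buffer data. The actual upload call depends on your graphics API and is left out.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical fixed capacity, matching the unrolled branches in the shader.
constexpr std::size_t kMaxLights = 8;

// One light as the shader would see it, padded to 16-byte multiples.
struct alignas(16) GpuLight {
    float position[3];
    float radius;
    float color[3];
    float intensity;
};

// The whole block that is written once on the CPU and then uploaded.
struct alignas(16) LightBlock {
    std::int32_t lightCount;     // the "lights" value tested by the shader
    std::int32_t padding[3];     // keeps the array on a 16-byte boundary
    GpuLight lights[kMaxLights];
};

// Packs up to kMaxLights lights into a block; extra lights are dropped.
LightBlock packLights(const std::vector<GpuLight>& sceneLights) {
    LightBlock block{};  // zero-initialised, so unused slots have radius 0
    const std::size_t n = std::min(sceneLights.size(), kMaxLights);
    block.lightCount = static_cast<std::int32_t>(n);
    std::copy_n(sceneLights.begin(), n, block.lights);
    return block;
}
```

The point of the fixed-size block is that the shader and its buffer layout never change with the number of lights: only lightCount and the array contents do, so there is nothing to switch between objects.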