21.7.12

Some additional notes to the previous post

Multithreading in OpenGL is a hard thing: even with the Fence extension, the multiple-context approach can "eat" all CPU processing time because of synchronization overhead. Instead, we can have a single context attached to a separate thread and buffer commands for execution by that thread. One rendering command for the command buffer looks like this:
struct rendering_command
{
    rendering_operation op;
    unsigned int gl_id;
    unsigned char* data;
};
Let's see what we have here:
1) rendering_operation identifies a single command for the OpenGL rendering system; it just specifies which event is requested.
2) gl_id is an OpenGL object identifier.
3) data is a pointer to the data storage needed by this rendering event.
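Just to illustrate the idea, here is a minimal sketch of how the rendering thread (the one that owns the GL context) could translate such a command into GL calls. The enum values, the helper name and the hardcoded texture size are placeholders of mine, not part of the definition above:
/// Hypothetical dispatch on the thread that owns the OpenGL context.
/// Assumes the usual OpenGL headers (or a loader such as GLEW) are included.
void execute_command(const rendering_command& cmd)
{
    switch (cmd.op)
    {
    case rendering_operation::upload_texture_data:
        glBindTexture(GL_TEXTURE_2D, cmd.gl_id);
        /// cmd.data is assumed to hold pixels prepared by the producer thread;
        /// 64x64 is only a placeholder size for this sketch
        glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, 64, 64,
                        GL_RGBA, GL_UNSIGNED_BYTE, cmd.data);
        break;
    case rendering_operation::destroy_texture:
        glDeleteTextures(1, &cmd.gl_id);
        break;
    /// ... other operations
    }
}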

This is just a very simple rendering command definition; you can make it more complex by adding storage for additional data used by the renderer. For example, you can add unsigned int param1, param2, param3 to the structure to store parameters such as texture size when you want to schedule a texture upload or modification, where param1 is the width, param2 the height and param3 the depth, as sketched below.
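A minimal sketch of such an extended command could look like this (the field names and comments are just my reading of the parameters described above):
struct rendering_command
{
    rendering_operation op;
    unsigned int gl_id;
    unsigned char* data;
    /// Operation-specific parameters; for a texture upload:
    /// param1 = width, param2 = height, param3 = depth
    unsigned int param1;
    unsigned int param2;
    unsigned int param3;
};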
For command buffer storage you can use a queue, but it should be made thread-safe via critical sections or mutexes to prevent data races, or you can simply use concurrent_queue from Intel Threading Building Blocks[1].
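For reference, a minimal mutex-protected wrapper around std::queue might look like the sketch below; tbb::concurrent_queue can be dropped in instead and removes the explicit locking. This is just one possible variant, not the only correct one:
#include <mutex>
#include <queue>

/// Very small thread-safe command buffer: the game/update thread pushes,
/// the rendering thread pops and executes.
class command_buffer
{
public:
    void push(const rendering_command& cmd)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        queue_.push(cmd);
    }

    bool try_pop(rendering_command& out)
    {
        std::lock_guard<std::mutex> lock(mutex_);
        if (queue_.empty())
            return false;
        out = queue_.front();
        queue_.pop();
        return true;
    }

private:
    std::queue<rendering_command> queue_;
    std::mutex mutex_;
};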

The conclusion is simple: there is no standard way to handle such a multithreading situation with OpenGL. You can use the method proposed by NVIDIA in "Optimized OpenGL texture transfers"[2], or use the method proposed here. Small texture chunks (1x1 - 64x64) work great with this method; for large ones it's better to use NVIDIA's.

References:
[1]: http://threadingbuildingblocks.org/
[2]: http://nvidia.fullviewmedia.com/gtc2012/0515-J2-S0356.html

Optimized texture transfers with OpenGL


Updated: see some additional notes to this post

26.4.12

Crazy Forward Renderer

I want to share a pretty crazy idea for rendering many lights using Forward Rendering. We all know that we use LPP (Light Pre-Pass) or DS (Deferred Shading) if we want many lights... But GPU memory bandwidth is the worst bottleneck we could imagine, so a DS/LPP implementation can become bandwidth-bound (and with MSAA the situation gets even worse). Sure, we can optimize the implementation using various tricks: Depth Bounds Test (hardware or emulated through shaders), Stencil Testing, Scissors, tiling on the CPU or in a Compute Shader, etc. But what about lightweight scenes? Yes, we can have tons of lights using those techniques (LPP, DS). But do we always need tons of lights in a scene? In real scenes we have around 8-12 lights overall in the camera's frustum, while LPP and DS can handle thousands. Because of bandwidth it can be better to use Forward Rendering (FR) instead of LPP or DS and get better frame rates (not always, but sometimes this rule holds).
With FR we use an uber-shader to handle many lights within one shader program, but there is a problem. In one scene we can have objects influenced by different numbers of lights, so we need to set up a shader for every object; when we draw another object with a different number of influencing lights, we switch the renderer to another shader. This operation is very expensive and causes a lot of state switching inside the driver. In a real-world application we can "sit" inside the driver for half the rendering time, up to 10 ms, which is really bad.

If we want to handle multiple lights within one shader, the first thing we remember is the 'for' loop. Let's imagine something like this:
/// Here we iterate through all lights to calculate lighting
/// Parameter MAXLIGHTS is the maximum number of lights per object
for( int i = 0; i < MAXLIGHTS; i++ ) {
    if( pixel influenced by light[i] ) {
        /// Some calculations to do lighting
        ...
        /// Write result
        ...
    }
}
This is a really basic iterate-and-calculate lighting loop, but it has some caveats: the 'for' loop and the first-level 'if' branching. Let's remember some technical facts about GPUs. First and main: a GPU doesn't have a call/branch stack the way a CPU does, so deeply nested, interdependent conditions are painful to execute. Second is the "wavefront" problem, where processors wait for each other: if one processor takes a deeper branch than its neighbours, the whole group stalls on it (this doesn't hurt when all processors take the same branch). These problems are inherent to current GPUs (though I believe IHVs will rework their GPUs and get past them). A programmer might think "The problem is in branch/loop N!" and delete the seemingly unnecessary loop or branch. But stop and think: Parallax Occlusion Mapping branches heavily, with a lot of 'if's and 'while's, and it works fine even on old video cards like the NVIDIA GeForce 6xxx/7xxx and ATI Radeon X1xxx, so something else is going on.

Another option is to unroll the loop into first-level branches without any nested branches, so the code becomes linear. This is how it could look:
/// Check first light
if( ( lights > 0 ) && ( pixel influenced by light[0] ) ) {
    /// Some calculations to do lighting
    ...
    /// Write result
    ...
}
/// Check second light
if( ( lights > 1 ) && ( pixel influenced by light[1] ) ) {
    /// Some calculations to do lighting
    ...
    /// Write result
    ...
}
...
/// Check light N
if( ( lights > N ) && ( pixel influenced by light[N] ) ) {
    /// Some calculations to do lighting
    ...
    /// Write result
    ...
}

This is quite ugly, but the performance of this method is on par with Deferred Shading, because we do not suffer from shader and uniform switching. The shader is already an uber-shader, so we don't need millions of shader compilations for the various material-light combinations. We also use a Constant Buffer for this method, so we write everything once on the CPU and read it on the GPU. This is how we can keep our renderer from becoming CPU-bound.
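As an illustration of the Constant Buffer part, the per-object light data can be packed into one structure and uploaded with a single call; the layout, the names and the 8-light limit below are my own assumptions, not a fixed recipe:
/// Hypothetical per-object light block, mirrored by a std140 uniform block
/// in the uber-shader. Assumes a GL 3.x header/loader is included.
struct light_data
{
    float position[4];
    float color[4];
};

struct object_lights
{
    light_data lights[8];   /// assumed per-object light limit
    int        light_count;
    int        padding[3];  /// keep 16-byte alignment for std140
};

/// Fill the struct once on the CPU and upload it with a single call,
/// instead of setting dozens of individual uniforms per object.
void upload_lights(GLuint ubo, const object_lights& data)
{
    glBindBuffer(GL_UNIFORM_BUFFER, ubo);
    glBufferSubData(GL_UNIFORM_BUFFER, 0, sizeof(object_lights), &data);
    glBindBuffer(GL_UNIFORM_BUFFER, 0);
}
The uber-shader then reads the same data through a uniform block bound with glUniformBlockBinding/glBindBufferBase and uses light_count and lights[i] inside the unrolled branches.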

I know this is really crazy, but it works. I don't know whether this type of Forward Renderer is used by any company; my method may not even be new.

19.4.12

Which versions of OpenGL are you programming with?

G-Truc wants to know which version of OpenGL you are programming with. Even if you are not sure, or are using a flavour of OpenGL ES or WebGL, take a moment to answer this poll.

3.4.12

Blog update #1

The first news is about S.T.A.L.K.E.R. On the 1st of April (and that's not a joke) I found a video on YouTube where Alexey Sytyanov (game designer and screenwriter of S.T.A.L.K.E.R. and S.T.A.L.K.E.R. 2) told us what had been happening with S.T.A.L.K.E.R. 2.
1) There wasn't much information; to every question about the game he answered "I can't talk about it now" or "No comments", and all the other answers were in the same vein.
2) The leaked art is genuine S.T.A.L.K.E.R. 2 art, and some of it describes the first levels of the game, as Alexey claimed.
3) Sergei Grigorovich (owner of GSC Game World) closed the studio before the team could show a presentation of the first developments of S.T.A.L.K.E.R. 2. Alexey said that this (Sergei's statement about closing GSC) happened on the very day the team wanted to show the presentation: "A few hours before the presentation Grigorovich announced the closure. Despite this we showed the presentation, everyone liked it, but there was a lot of regret that it would be impossible to bring to life. We had a really good, interesting story line, better than in all previous S.T.A.L.K.E.R. games."
I know this post is not really within the "range" of this blog, but I like GSC and its talented developers, who gave us Venom: Codename Outbreak, Cossacks and S.T.A.L.K.E.R. All the best to GSC; these are really bad times for the studio, but they did release S.T.A.L.K.E.R.: Shadow Of Chernobyl after all those years of waiting... I think they will release S.T.A.L.K.E.R. 2 in the future, despite all the troubles.

If you understand Russian, see for yourself: Interview with Alexey Sytyanov
There is also a translated (thank you, Google Translate!) online-readable version of the interview: Interview with Alexey Sytyanov (print version)

The second news is about my engine and the game built on it. There are no screenshots of game levels or other test levels yet. I am just "playing" with the engine, trying to get nice graphics with balanced quality and speed.