15.11.14

Just thinking: modern GAPI

Disclaimer: this is another useless post without much technical information.

I wanted to put this text in the previous post, but it was hard to divide it logically, so I split it into two posts. This post holds my thoughts about graphics APIs and their evolution/revolution. In the distant past we programmed the graphics hardware ourselves, which gave us direct-to-metal access: it sounded good, but it was impractical and porting to other hardware was a lot of trouble. So Microsoft made Direct3D as a graphics abstraction, and SGI made IrisGL, which was later opened up and turned into OpenGL. Both of them are the gold standard for graphics. They do a lot to make graphics programming easier, and we can see the richest graphics in today's games (like Crysis by Crytek or Battlefield by DICE). Abstraction is made for simplification, but graphics got richer, and now we have another problem: huge CPU usage at high numbers of draw calls.

That's why there is Mantle by AMD and Metal by Apple (iOS-only for now), and Direct3D 12 is on the way. They are made to drop this draw-call overhead by moving more work onto programmers. It is really funny to see that the idea behind "low-level" Direct3D 12 goes back to Direct3D 1: we build a command buffer, fill it with commands, and then the driver consumes this buffer. The same could be said about Mantle and Metal: they are just as "low-level", with roughly the same functionality.

Alex St. John has written a good post comparing Direct3D and Metal, with some thoughts about OpenGL. I'd like to talk about OpenGL too... It's stuck. Literally. I know the Khronos Group tries to push OpenGL to the top of the graphics industry, and they are doing a good job (though OpenGL still lacks some cool features, like context-independent concurrency), but the main problem is the IHVs. Every IHV makes its own implementation, because OpenGL is only a specification and it doesn't have a runtime component like Direct3D has. It's really bad that there is no one at Khronos who could bang the table and say "it must be as I say!" to stop this nonsense of varying OpenGL implementations. I don't want to talk too much about this, because there are already a lot of negative posts with the same opinion about OpenGL. Anyway, I'll still wait for OpenGL NG; I believe the Khronos Group will make it.

That's enough "void" talking, let's move on to my thoughts about a modern graphics API.
First of all, the API must provide a view of a resource as raw memory. This means we would work with it like in our regular programs via malloc/realloc/free, just like in CUDA. The second thought is a continuation of the first: we must control the allocated chunk of memory ourselves, placing the needed memory barriers or doing sequentially consistent read/write operations where required. Third, the API must provide an interface to control the GPU scheduler. That would give us continuous kernel execution like in CUDA, where we can write a kernel that is able to launch other kernels. The third point means we would no longer be CPU-bound. Ideally, it would be the greatest thing to program the GPU like a CPU. The best reference for this approach is CUDA.
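Just to make these three points concrete, here is a purely hypothetical C++ sketch; none of these functions or types exist anywhere, they only illustrate the kind of interface I have in mind:
#include <cstddef>
#include <cstdint>

/// Hypothetical handles, only for illustration
struct Dim3 { unsigned int x, y, z; };
typedef unsigned int KernelHandle;

/// 1) Resources as raw memory, managed like malloc/realloc/free
void* gpuMalloc( std::size_t size );
void* gpuRealloc( void* memory, std::size_t newSize );
void  gpuFree( void* memory );

/// 2) Explicit control over synchronization of that memory
void  gpuMemoryBarrier();
void  gpuStoreSeqCst( std::uint32_t* address, std::uint32_t value );

/// 3) Access to the GPU scheduler: a kernel may enqueue other kernels,
///    in the spirit of CUDA dynamic parallelism
void  gpuLaunchFromDevice( KernelHandle kernel, Dim3 grid, Dim3 block, void** arguments );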

That's all I wanted to say. This is the last useless post; next time I'll write about our engine.

30.6.14

It was so quiet here...

...because there was no reason to write another useless/pointless post. Now I have a lot to share, so let's start.

Engine update:
NanoEngine will be open-sourced forever. Yes, that's it. It will be under the BSD license. For now, the engine lives in a Bitbucket repository and is not complete. Foundation is the most stable and complete module of the engine, while the others are not, or exist only in draft quality on my PC. The complete list of modules will be:
    - Foundation: where all low-level (in terms of the engine) routines live, like compiler definitions, containers, memory allocators, threading primitives, the multitasking system, etc.

    - Rendering, Audio, Input: these are foundation-like low-level implementations, which wrap the selected APIs in a common form. They are just back-ends for the high-level implementations.

    - Framework: one of the glue elements of the engine; it glues all low-level modules together and boots them up for further use by the higher-level modules. There is also a central dispatching engine, which automatically calculates the number of jobs to use.

    - RML: a centralized resource management layer, the place where all resources are processed asynchronously.

    - Graphics: the high-level rendering implementation, based on the Rendering module, with more specific implementations of features needed by the application.

    - Sound: the high-level audio implementation, based on the Audio module. It features event-based audio playback and custom sound effects, which are applied to specific sounds or to the places where sounds are playing.

    - Physics: a wrapper around ODE calls.

    - Game: the highest level of the engine, which implements game mechanics and scripting basics.

    - Launcher: there is not much to say, it's just a launcher which starts up the Framework module. All platform-specific window handling is connected here with other parts of the engine, like Rendering and Input.

Current progress (26.06.2014):
    - Foundation: ~90% - (In repository)
    - Rendering: ~70% (In repository)
    - Audio: ~30% (In repository)
    - Input: ~10% (Not in repository)
    - Framework: ~20% (In repository)
    - RML: ~50% (Not in repository)
    - Graphics: ~30% (Not in repository)
    - Sound: N/A
    - Physics: ~10% (Not in repository)
    - Game: ~10% (Not in repository)
    - Launcher: ~70% (In repository)


Engine code:
Two things always collide: code complexity and code clarity. I am using a lot of templates in the engine. Yes, templates are one of the cool features of the language, but it's also vital to use them only where they are needed.
Let's consider this example from the code:
/// Intrusive smart pointer declaration
template< typename Type, typename Policy > class IntrusivePtr;
/// Policy declaration
template< typename Type, typename Tag > class SmartPtrPolicy;
/// Policy tags
struct SmartPointerRelaxedTag {};
struct SmartPointerThreadSafeTag {};

/// Use case example
IntrusivePtr< MyClass, SmartPtrPolicy< MyClass, SmartPointerRelaxedTag > > myClassPtr;
As you can see, there is a big problem with templates (in terms of textual complexity). But they are used for synchronization specialization, i.e. automatic code path selection at compile time for threading purposes. One might say it's better to change this from a fully type-templated specialization to a value-templated one, making something like this:
template< typename Type, Threading::MemoryOrder accessOrder, Threading::MemoryOrder writeOrder > class IntrusivePtr;
// Use case example
IntrusivePtr< MyClass, Threading::MemoryOrderEnum::Acquire, Threading::MemoryOrderEnum::Release > myClassPtr;
As you can see, it is as bad as... I don't know... as the worst thing you can imagine: too many characters to write, lines too long, so the code becomes harder to read and understand because of the "extreme navigation". Another supposed pro might be automatic exclusion of dead code at compile time, but not every compiler will do it, because you need to ask for it yourself. In Visual C++ you can specify /O2, but that works for the release build, not the debug one, where you want to explore the generated code without any optimizations. So that's why I'm using the type-templated version with a tag selector: the good old method, which is not abandoned even nowadays.

That was something like an introduction to the engine's specializations. I made this example with a templated smart pointer to show that the engine will make use of such pointers for automatic garbage collection. There are various types of them: IntrusivePtr with intrusive reference counting inside the class (deletion also needs to be specified inside the class), and SharedPtr with a self-allocated reference counter. Both of them use SmartPtrPolicy to select the threading model.
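To give an idea of what the tag-based selection buys us, here is a minimal sketch (not the engine's actual code) of how SmartPtrPolicy could be specialized on the tags, assuming the pointed-to class exposes a suitable referenceCount member:
#include <atomic>

/// Builds on the declarations shown above.
/// Relaxed (single-threaded) policy: plain integer reference counting
template< typename Type >
class SmartPtrPolicy< Type, SmartPointerRelaxedTag >
{
public:
    static void AddRef( Type* object )  { ++object->referenceCount; }
    static bool Release( Type* object ) { return --object->referenceCount == 0; }
};

/// Thread-safe policy: atomic reference counting with acquire/release ordering
template< typename Type >
class SmartPtrPolicy< Type, SmartPointerThreadSafeTag >
{
public:
    static void AddRef( Type* object )
    {
        object->referenceCount.fetch_add( 1, std::memory_order_relaxed );
    }
    static bool Release( Type* object )
    {
        return object->referenceCount.fetch_sub( 1, std::memory_order_acq_rel ) == 1;
    }
};
The idea is that IntrusivePtr would simply call SmartPtrPolicy< Type, Tag >::AddRef/Release, so the right code path is chosen at compile time without any runtime cost.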
As mentioned in the Engine Update section, the engine is aimed at aggressive multithreading, so we need containers for inter-thread communication. There are various list, queue and ring buffer implementations: lock-free SPSC (single-producer/single-consumer), lock-free MPSC (multi-producer/single-consumer), lock-free MPMC (multi-producer/multi-consumer), and blocking ones.
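As an illustration of the simplest of these, here is a minimal sketch of a bounded lock-free SPSC ring-buffer queue (not the engine's actual container): exactly one thread calls Push() and exactly one other thread calls Pop().
#include <atomic>
#include <cstddef>

template< typename T, std::size_t Capacity >
class SpscQueue
{
public:
    SpscQueue() : head( 0 ), tail( 0 ) {}

    /// Producer side
    bool Push( const T& value )
    {
        const std::size_t currentTail = tail.load( std::memory_order_relaxed );
        const std::size_t nextTail    = ( currentTail + 1 ) % Capacity;
        if( nextTail == head.load( std::memory_order_acquire ) )
            return false; // queue is full
        buffer[ currentTail ] = value;
        tail.store( nextTail, std::memory_order_release );
        return true;
    }

    /// Consumer side
    bool Pop( T& value )
    {
        const std::size_t currentHead = head.load( std::memory_order_relaxed );
        if( currentHead == tail.load( std::memory_order_acquire ) )
            return false; // queue is empty
        value = buffer[ currentHead ];
        head.store( ( currentHead + 1 ) % Capacity, std::memory_order_release );
        return true;
    }

private:
    T buffer[ Capacity ];
    std::atomic< std::size_t > head;
    std::atomic< std::size_t > tail;
};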

That's all for the code section for now, because I don't want to share more details before uploading the new version to Git. Also, if you want to see the code, I'll give you access to our git storage, just email me: techzdev@4dvisionzgamesz.com (remove all "z").

21.7.12

Some additional notes to the previous post

Multithreading in OpenGL is a hard thing; even with the Fence extension it can "eat" all the CPU time because of sync overhead - that's the multiple-contexts approach. But we can have just one context attached to a separate thread and buffer commands for execution by that thread. One rendering command in the command buffer looks like this:
struct rendering_command
{
    rendering_operation op;
    unsigned int gl_id;
    unsigned char* data;
};
Let's see what we have here:
1) rendering_operation is a single command for the OpenGL rendering system; it just specifies the event.
2) gl_id is an OpenGL object identifier.
3) data is a pointer to the data storage needed by this rendering event.

This is just a very simple rendering command definition; you can make it more complex by adding storage for additional data used by the renderer. For example, you can add unsigned int param1, param2, param3 to the structure to store parameters like texture size when you want to schedule a texture upload or modification, where param1 is the width, param2 the height, and param3 the depth, as in the sketch below.
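A possible extended version might look like this (the field meanings are just the ones described above):
struct rendering_command
{
    rendering_operation op;       /// what to do (e.g. update a texture)
    unsigned int        gl_id;    /// OpenGL object identifier
    unsigned int        param1;   /// e.g. texture width
    unsigned int        param2;   /// e.g. texture height
    unsigned int        param3;   /// e.g. texture depth
    unsigned char*      data;     /// payload needed by the command
};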
For command buffer storage you can use a queue, but it is better to make it thread-safe via critical sections or mutexes to prevent data races, or just use concurrent_queue from Intel Threading Building Blocks[1].
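To make the whole flow clearer, here is a rough sketch (assuming the extended rendering_command above and a hypothetical rendering_operation value for texture updates) of a mutex-protected command queue: any thread can submit commands, and only the thread that owns the GL context executes them.
#include <mutex>
#include <queue>
#include <GL/gl.h>   /// or the platform-specific OpenGL header

std::mutex                      command_mutex;
std::queue< rendering_command > command_queue;

/// Called from any thread
void submit_command( const rendering_command& command )
{
    std::lock_guard< std::mutex > lock( command_mutex );
    command_queue.push( command );
}

/// Called on the rendering thread, which owns the only GL context
void execute_commands()
{
    for( ;; )
    {
        rendering_command command;
        {
            std::lock_guard< std::mutex > lock( command_mutex );
            if( command_queue.empty() )
                return;
            command = command_queue.front();
            command_queue.pop();
        }

        switch( command.op )
        {
        case rendering_operation_update_texture: /// hypothetical enum value
            glBindTexture( GL_TEXTURE_2D, command.gl_id );
            glTexSubImage2D( GL_TEXTURE_2D, 0, 0, 0,
                             command.param1, command.param2,
                             GL_RGBA, GL_UNSIGNED_BYTE, command.data );
            break;
        /// ...other rendering_operation values are handled here...
        default:
            break;
        }
    }
}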

The conclusion is simple: there is no standard way to handle such a multithreading situation with OpenGL. You can use the method proposed by NVIDIA in "Optimized OpenGL texture transfers"[2], or use the method proposed here. Small texture chunks (1x1 - 64x64) work great with this method; for large ones it's better to use NVIDIA's.

References:
[1]: http://threadingbuildingblocks.org/
[2]: http://nvidia.fullviewmedia.com/gtc2012/0515-J2-S0356.html

Optimized texture transfers with OpenGL


Updated: see some additional notes to this post

26.4.12

Crazy Forward Renderer

I want to share a very crazy idea for rendering many lights using Forward Rendering. We all know that we use LPP (Light Pre-Pass) or DS (Deferred Shading) if we want many lights... But GPU memory bandwidth is the worst thing we could imagine, so a DS/LPP implementation can become bottlenecked by it (and with MSAA the situation gets even worse). OK, we can optimize the implementation using various tricks: depth bounds test (hardware or emulated through shaders), stencil testing, scissors, tiling on the CPU or in a compute shader, etc. But what about lightweight scenes? Yes, we can have tons of lights using those (LPP, DS) techniques. But do we always need tons of lights in a scene? In real scenes we have 8-12 lights overall in the camera's frustum, while LPP and DS can handle thousands. Because of bandwidth it can be better to use Forward Rendering (FR) instead of LPP or DS and get better frame rates (not always, but sometimes this rule holds).
With FR we use an uber-shader to handle many lights within one shader program, but there is one problem. In a scene, objects can be influenced by different numbers of lights, so we need to set up a shader per object. When we draw another object with a different number of influencing lights, we switch the renderer to another shader. This operation is very expensive and causes a lot of state switching inside the driver. In a real-world application we can "sit" inside the driver for half the rendering time, up to 10 ms; that's really bad.

If we want to handle multiple lights within a shader, the first thing we remember is the 'for' loop. Let's imagine something like this:
/// Here we iterate through all lights to calculate lighting
/// Parameter MAXLIGHTS is the maximum number of lights per object
for( int i = 0; i < MAXLIGHTS; i++ ) {
    if( pixel influenced by light[i] ) {
    /// Some calculations to do lighting
    ...
    /// Write result
    ...
    }
}
This is a really basic iterate-and-calculate lighting loop, and it has some caveats: the 'for' loop and the first-level 'if' branch. Let's recall some technical details about GPUs. First and foremost: the GPU DOESN'T have a call stack, so executing interdependent conditions is painful. Second is the "wavefront" problem, when processors wait for each other because one depends on results from another processor (this is true when processors go into a deeper condition branch than the previous one, but false when the condition branches are the same on all processors). All these problems exist in GPUs (though I believe IHVs will rework their GPUs and jump over them). A programmer may think "The problem is in the Nth branch/loop! F*** it!" and delete the unnecessary loop/branch. But stop! Let's think: Parallax Occlusion Mapping has branching with a lot of 'if's and 'while's and works well even on old video cards like the NVIDIA GeForce 6xxx/7xxx and ATI Radeon X1xxx, so something else is going on.

Another option is to unroll the loop into first-level branches without any nested branches, so the code becomes linear. This is how it could look:
/// Check first light
if( ( lights > 0 ) && ( pixel influenced by light[0] ) ) {
    /// Some calculations to do lighting
    ...
    /// Write result
    ...
}
/// Check second light
if( ( lights > 1 ) && ( pixel influenced by light[1] ) ) {
    /// Some calculations to do lighting
    ...
    /// Write result
    ...
}
...
/// Check the Nth light (index n)
if( ( lights > n ) && ( pixel influenced by light[n] ) ) {
    /// Some calculations to do lighting
    ...
    /// Write result
    ...
}

This is quite ugly, but the performance of this method is on par with Deferred Shading, because we don't suffer from shader and uniform switching. This shader is already an uber-shader, so we don't need millions of shader compilations for the various material-light combinations. We also use a constant buffer for this method, so we write everything once on the CPU and use it on the GPU, as sketched below. This is how we can keep our renderer from being CPU-bound.
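For the OpenGL flavour of that "constant buffer", a rough sketch of the CPU side might look like this (the structure layout and names are illustrative, not from a real renderer): all lights are written into a uniform buffer once per frame and the buffer is bound once for all draw calls.
#include <GL/glew.h>   /// any loader that exposes the GL 3.1+ uniform buffer entry points

#define MAX_LIGHTS 8

struct gpu_light
{
    float position[4];
    float color[4];
};

struct frame_lights
{
    int       light_count;
    int       padding[3];              /// keep 16-byte (std140-friendly) alignment
    gpu_light lights[ MAX_LIGHTS ];
};

/// Called once per frame, after frame_lights has been filled on the CPU
void upload_frame_lights( GLuint light_buffer, const frame_lights& lights )
{
    glBindBuffer( GL_UNIFORM_BUFFER, light_buffer );
    glBufferData( GL_UNIFORM_BUFFER, sizeof( lights ), &lights, GL_DYNAMIC_DRAW );
    glBindBufferBase( GL_UNIFORM_BUFFER, 0, light_buffer ); /// binding point 0 for the shader
}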

I know this is really crazy, but it works. I don't know whether THIS type of forward renderer is used by any company; my method may not even be new.

19.4.12

Which versions of OpenGL are you programming with?

G-Truc wants to know which version of OpenGL you are programming with. Even if you are not sure, or are using a flavour of OpenGL ES or WebGL, take a moment to answer this poll.

3.4.12

Blog update #1

The first news is about S.T.A.L.K.E.R. On the 1st of April (that's not a joke) I found a video on YouTube where Alexey Sytyanov (game designer and scriptwriter of S.T.A.L.K.E.R. and S.T.A.L.K.E.R. 2) told us what was happening with S.T.A.L.K.E.R. 2. 1) There was not much information; to every question about the game he answered "I can't talk about it now", "No comments", and all the other answers were like the first two. 2) The leaked art is genuine S.T.A.L.K.E.R. 2 art, and some of it describes the first levels of the game, as Alexey claimed. 3) Sergei Grigorovich (owner of GSC Game World) closed the studio before the team could show the presentation of the first S.T.A.L.K.E.R. 2 developments. Alexey said that this (Sergei's statement about closing GSC) happened on the very day the team wanted to show the presentation: "A few hours before the presentation Grigorovich announced the closing. Despite this we showed the presentation, everyone liked it, but there were too many regrets that it would be impossible to bring it to life. We had a really good, interesting story line, better than in all previous S.T.A.L.K.E.R. games".
I know this post is not in the scope of this blog, but I like GSC and its talented developers, who gave us Venom: Codename Outbreak, Cossacks, and S.T.A.L.K.E.R. All the best to GSC; these are really bad times for the studio, but they released S.T.A.L.K.E.R.: Shadow of Chernobyl after all those years of waiting... I think they will release S.T.A.L.K.E.R. 2, despite all the troubles, in the future.

If you understand Russian, see for yourself: Interview with Alexey Sytyanov
There is also a translated (thank you, Google Translate!) online-readable version of the interview: Interview with Alexey Sytyanov (print version)

The second news is about my engine and the game built on it. There are no screenshots of game levels or other test levels yet. I am just "playing" with the engine, trying to get nice graphics with balanced quality and speed.