
GLSL IF speed vs multiply factor

I know this has been asked in general terms, but the answer is always "it depends", so I'm asking a concrete question in the hope of getting a concrete answer.

I know how costly ifs can be in GLSL: they can be really expensive, and some hardware even executes the code on both sides of the branch.

I have a fragment shader from an example (a dual-paraboloid shadow map) that uses ifs to determine which map to use and to compute the depth. It would be very easy to replace those ifs with a multiplier. The question is: since there is texture sampling inside the fragment shader, which would be faster, using an if or using a multiplier to filter out the unused data?

These are the two candidate versions:

IF version:

// alpha is computed on the fly and cannot be replaced

// Plain float literals: GLSL ES 1.00 accepts neither "0" for a float nor the "f" suffix
float depth = 0.0;
float mydepth = 0.0;

if (alpha >= 0.5)
{
    depth = texture2D(ShadowFrontS, P0.xy).x;
    mydepth = P0.z;
}
else
{
    depth = texture2D(ShadowBackS, P1.xy).x;
    mydepth = P1.z;
}

Filter version:

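// 1.0 when alpha >= 0.5 (front map), 0.0 otherwise (back map), matching the IF version
float mlt = step(0.5, alpha);

// Both maps are always sampled; the multiplier zeroes out the unused one
float depth = texture2D(ShadowFrontS, P0.xy).x * mlt;
float mydepth = P0.z * mlt;
mlt = 1.0 - mlt;
depth += texture2D(ShadowBackS, P1.xy).x * mlt;
mydepth += P1.z * mlt;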

P.S.: I'm targeting desktop and mobile devices, so performance on low-end hardware is a must.

Gusman asked Sep 20 '25 06:09


1 Answer

Branching is not "evil" per se on massively SIMD architectures. If all the threads in a "bunch" (NVIDIA calls them warps) follow the same code path, i.e. they all take the same branches, everything is fine.

Only when a branch is taken by some threads within that bunch and not by others must both sides of the branch be executed; the calculations and data fetches that are not relevant for the current thread are then discarded.

In your case it takes some careful profiling to see which variant benefits your GPU more, but my gut instinct tells me it's actually the branching version. Why? Usually the value that decides a branch depends on the screen-space position, and large contiguous areas of fragments often share the same code path. Performance penalties therefore occur only for those "bunches" that cover a bordering region, and these bunches usually span only a small block of pixels (8×8 or 16×16).

The shader you have there is not compute-limited (i.e. limited by the computational capabilities of the GPU) but memory-bandwidth-limited, i.e. limited by the throughput of the GPU's memory link, because of the texture2D fetch operations. In that case, reducing the number of fetches, and thereby the required memory bandwidth, will probably benefit your program more than reducing the number of computations.

The branchless (multiplier) variant of your shader will always fetch both textures; the branching one will do that only within the bordering regions. So based on that heuristic, I'd guess that your branching variant is actually the better choice.

But to be sure, you have to profile it.
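To make that heuristic concrete, here is a rough sketch in Python of a toy cost model. It counts only warp-level texture-fetch instructions, not real GPU time, and it assumes that each "bunch" maps to a square screen tile. The function name, the 64×64 target, the 8×8 tile size and the boundary at x = 36 are all hypothetical values chosen for illustration.

```python
# Toy cost model: counts warp-level texture-fetch instructions, not real GPU time.
def count_fetch_instructions(width, height, tile, takes_front):
    """Return (branching, branchless) fetch-instruction counts.

    takes_front(x, y) -> bool stands in for the shader's `alpha >= 0.5` test.
    Each tile x tile block of pixels is assumed to be one "bunch" (warp).
    """
    branching = 0
    branchless = 0
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            # Set of code paths taken by the pixels of this tile.
            paths = {
                takes_front(x, y)
                for y in range(ty, min(ty + tile, height))
                for x in range(tx, min(tx + tile, width))
            }
            # A coherent tile runs one side of the branch; a divergent tile runs both.
            branching += len(paths)
            # The multiplier version samples both shadow maps in every tile.
            branchless += 2
    return branching, branchless


if __name__ == "__main__":
    # Hypothetical 64x64 target split by a vertical boundary at x = 36,
    # which cuts through exactly one column of 8x8 tiles.
    branching, branchless = count_fetch_instructions(64, 64, 8, lambda x, y: x < 36)
    print(branching, branchless)  # prints "72 128"
```

In this model only the eight tiles straddling the boundary pay for both fetches, while the multiplier version pays for both in all 64 tiles. Real hardware differs in many ways (caching, latency hiding, tile shapes), so this illustrates the argument and does not replace profiling.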

datenwolf answered Sep 22 '25 19:09