Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speedup an if-else ladder C++

I have a piece of code that demands execution speed over anything else. By using the high_resolution_clock() from std::chrono I found out that this switch() into if-else() ladder is taking over 70% of my execution time. Is there any way to speed this up?

I'm using gcc with -O3 optimization during compiling.

I looked into a similar question: If else ladder optimisation but I can't use a return statement as it would exit the outer loop which I can't.

switch(RPL_OPTION) {
            case 0:
                for(int k = 0; k < WINDOW_SIZE; k++) {
                    if(ans[k] >= upper_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(1);

                    } else if(ans[k] < lower_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(1);

                    } else {
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(0);
                    }
                }
                break;
            case 1:
                for(int k = 0; k < WINDOW_SIZE; k++) {
                    if(ans[k] >= upper_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(RPL_CONST);
                        flag_output.push_back(1);

                    } else if(ans[k] < lower_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(RPL_CONST);
                        flag_output.push_back(1);

                    } else {
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(0);
                    }
                }
                break;
            case 2:
                for(int k = 0; k < WINDOW_SIZE; k++) {
                    if(ans[k] >= upper_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(upper_th);
                        flag_output.push_back(1);

                    } else if(ans[k] < lower_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(lower_th);
                        flag_output.push_back(1);

                    } else {
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(0);
                    }
                }
                break;
            case 3:
                //Generating a gaussian noise distribution with 0 mean and 1 std deviation
                default_random_engine generator(time(0));
                normal_distribution<float> dist(0,1);

                for(int k = 0; k < WINDOW_SIZE; k++) {
                    if(ans[k] >= upper_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Calling a random sample from the distribution and calculating a noise value
                        filtered_output.push_back(dist(generator)*sigma);
                        flag_output.push_back(1);
                        continue;

                    } else if(ans[k] < lower_th) {
                        //Increasing flag counter
                        flag_count++;
                        //Calling a random sample from the distribution and calculating a noise value
                        filtered_output.push_back(dist(generator)*sigma);
                        flag_output.push_back(1);
                        continue;

                    } else {
                        //Adding the filtered value to the output vector
                        filtered_output.push_back(ans[k]);
                        flag_output.push_back(0);
                    }
                }
                break;
        }
like image 854
Psynapse_261 Avatar asked Oct 24 '25 19:10

Psynapse_261


2 Answers

A few optimizations that come to mind:

  1. vector.push_back() or emplace_back(), even with reserve(), are poison for performance because no compiler is able to vectorize the code. We can work with plain C pointers instead or just preallocate.

  2. Generating the random engine and distribution in the last case may have significant cost if this code is called repeatedly. We can hoist this out of the code. Note that this will also avoid issues with the repeated initialization for which you use a low-resolution time function.

  3. This may be unnecessary but rewriting the code a bit may allow more compiler optimizations, especially by turning things into conditional move-instructions and reducing the number of branches.

/* TODO: We have better ways of initializing generators but that is
 * unrelated to its performance
 * I'm lazy and turn this into a static variable. Better use a
 * different pattern (like up in the stack somewhere)
 * but you get the idea
 */
static default_random_engine generator(time(0));
static normal_distribution<float> dist(0,1);

std::size_t output_pos = filtered_output.size();
filtered_output.resize(output_pos + WINDOW_SIZE);
flag_output.resize(output_pos + WINDOW_SIZE);

switch(RPL_OPTION) {
case 0:
    for(int k = 0; k < WINDOW_SIZE; k++) {
        auto ansk = ans[k];
        int flag = (ansk >= upper_th) | (ansk < lower_th);
        flag_count += flag;
        filtered_output[output_pos + k] = ansk;
        flag_output[output_pos + k] = flag;
    }
    break;
case 1:
    for(int k = 0; k < WINDOW_SIZE; k++) {
        auto ansk = ans[k];
        int flag = (ansk >= upper_th) | (ansk < lower_th);
        flag_count += flag;
        // written carefully to help compiler turning this into a CMOV
        auto filtered = flag ? RPL_CONST : ansk;
        filtered_output[output_pos + k] = filtered;
        flag_output[output_pos + k] = flag;
    }
    break;
case 2:
    for(int k = 0; k < WINDOW_SIZE; k++) {
        auto ansk = ans[k];
        int flag = (ansk >= upper_th) | (ansk < lower_th);
        flag_count += flag;
        auto filtered = ansk < lower_th ? lower_th : ansk;
        filtered = ansk >= upper_th ? upper_th : filtered;
        filtered_output[output_pos + k] = filtered;
        flag_output[output_pos + k] = flag;
    }
    break;
case 3:
    for(int k = 0; k < WINDOW_SIZE; k++) {
        // optimized under the assumption that flag is usually 1
        auto ansk = ans[k];
        auto random = dist(generator) * sigma;
        int flag = (ansk >= upper_th) | (ansk < lower_th);
        auto filtered = flag ? random : ansk;
        filtered_output[output_pos + k] = filtered;
        flag_output[output_pos + k] = flag;
    }
    break;
}

Analyzing compiler output

I checked the resulting code with Godbolt. Cases 0-2 do vectorize. However, a lot hinges on good alias detection. So this needs to be analyzed in the context of the full function containing this code. Particular pain points are

  • Potential alias between ans and filtered_output. That is hard to avoid but I think compilers should be able to create code that check against this
  • Potential alias between the thresholds + RPL_CONST and the filtered_output. When in doubt, copy the inputs into a local variable (which the compiler can prove to be alias free). Just marking them const may not be enough
  • Potential alias between flag_count and flag_output, depending on the types. Again, better use a local variable for the count, then copy it to its output, if required

As for case 3, computing a random sample is expensive enough that my optimization may degrade performance if the inputs are usually within limits. That needs benchmarking. The longer I think about it, losing a few clock cycles on a mis-predict is probably much less time than computing a sample without using it.

Removing redundant code

The resulting code is highly redundant. We could move the switch-case into the loop but that messes with the vectorization. Instead, we can use a template function pattern.


class Filter
{
    int WINDOW_SIZE;
    float upper_th, lower_th, sigma, RPL_CONST;
    std::default_random_engine generator;
    std::normal_distribution<float> dist;

    template<class FilterOp>
    int apply(std::vector<float>& filtered_output,
              std::vector<int>& flag_output,
              const std::vector<float>& ans, FilterOp filter)
    {
        // move stuff into local variables to help with alias detection
        const int WINDOW_SIZE = this->WINDOW_SIZE;
        const float upper_th = this->upper_th, lower_th = this->lower_th;
        const std::size_t output_pos = filtered_output.size() - WINDOW_SIZE;
        int flag_count = 0;
        for(int k = 0; k < WINDOW_SIZE; k++) {
            auto ansk = ans[k];
            int flag = (ansk >= upper_th) | (ansk < lower_th);
            flag_count += flag;
            filtered_output[output_pos + k] = filter(ansk, flag);
            flag_output[output_pos + k] = flag;
        }
        return flag_count;
    }
public:
    int operator()(int RPL_OPTION,
              std::vector<float>& filtered_output,
              std::vector<int>& flag_output,
              const std::vector<float>& ans)
    {
        std::size_t output_pos = filtered_output.size();
        filtered_output.resize(output_pos + WINDOW_SIZE);
        flag_output.resize(output_pos + WINDOW_SIZE);
        switch(RPL_OPTION) {
        case 0:
            return apply(filtered_output, flag_output, ans,
                [](float ansk, int flag) noexcept -> float {
                    return ansk;
            });
        case 1:
            return apply(filtered_output, flag_output, ans,
                [RPL_CONST=this->RPL_CONST](float ansk, int flag) noexcept -> float {
                    return flag ? RPL_CONST : ansk;
            });
        case 2:
            return apply(filtered_output, flag_output, ans,
                [lower_th=this->lower_th, upper_th=this->upper_th](
                      float ansk, int flag) noexcept -> float {
                    auto filtered = ansk < lower_th ? lower_th : ansk;
                    return ansk >= upper_th ? upper_th : filtered;
            });
         case 3:
            return apply(filtered_output, flag_output, ans,
                [this](float ansk, int flag) noexcept -> float {
                    return flag ? dist(generator)*sigma : ansk;
            });
         default: return 0;
       }
    }
};
like image 159
Homer512 Avatar answered Oct 27 '25 09:10

Homer512


I am nearly 98% sure that the if-else ladder is not the problem.

The std::vectors (or whatever container you use) push_back function with tons of reallocations and data copying is for me the main candidate for optimization.

Please use the reserve function to allocate the needed memory beforehand.

Then move out all invariant stuff, like

default_random_engine generator(time(0));
normal_distribution<float> dist(0,1);

But without more example code, it is hard to judge.

A profiler will give you better results. Timer functions will not help a lot here.

like image 24
Armin Montigny Avatar answered Oct 27 '25 09:10

Armin Montigny



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!