Consider these two implementations of a function that drops the first size elements from the front of a vector:
template <typename T>
std::vector<T> drop(int size, const std::vector<T>& coll) {
    if (size < 0) return std::vector<T>();
    auto sized = size > coll.size() ? coll.size() : size;
    typename std::vector<T>::const_iterator first = coll.begin() + sized;
    typename std::vector<T>::const_iterator last = coll.end();
    return std::vector<T>(first, last);
}

template <typename T>
std::vector<T> drop2(int size, std::vector<T> coll) {
    if (size < 0) return std::vector<T>();
    auto sized = size > coll.size() ? coll.size() : size;
    coll.erase(coll.begin(), coll.begin() + sized);
    return coll;
}
In both versions a new std::vector is allocated (in the second, the argument is passed by value rather than by reference, so a copy is made on the way in). In the second the result is produced by erase(), while in the first the result is constructed from iterators into the original vector.
Is there any reason to believe that one of these would be meaningfully different in performance than the other?
Also, is RVO a guarantee in either or both of these?
EDIT:
Here is a test I did, which shows the first one is quite a bit slower than the second:
template <typename F>
void dropExample(F f) {
    std::cout << "drop example" << std::endl;
    auto t1 = Clock::now();
    for (auto x : range(100000)) {
        f(2, range(100));
    }
    auto t2 = Clock::now();
    std::cout << "Delta t2-t1: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " ms" << std::endl;
}
Calling:
dropExample(drop<int>);
dropExample(drop2<int>);
produces the following output:
drop example
Delta t2-t1: 625 ms
drop example
Delta t2-t1: 346 ms
No matter how many iterations I add to the loop, the ratio stays roughly the same, even for runs lasting tens of seconds.
EDIT 2:
I have augmented the test with an lvalue argument, as suggested in the comments:
template <typename F, typename T>
void dropExample2(F f, T vec) {
    std::cout << "drop example 2" << std::endl;
    auto t1 = Clock::now();
    for (auto x : range(1000)) {
        f(2, vec);
    }
    auto t2 = Clock::now();
    std::cout << "Delta t2-t1: "
              << std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1).count()
              << " ms" << std::endl;
}
Then in main:
int main(int argc, const char* argv[]) {
    auto testrange = range(100000);
    dropExample(drop<int>);
    dropExample(drop2<int>);
    dropExample2(drop<int>, testrange);
    dropExample2(drop2<int>, testrange);
    return 0;
}
Output still suggests the second is much faster:
drop example
Delta t2-t1: 564 ms
drop example
Delta t2-t1: 375 ms
drop example 2
Delta t2-t1: 2318 ms
drop example 2
Delta t2-t1: 698 ms
Here are supplementary functions used in the example:
std::vector<int> range(int start, int end, int step);

std::vector<int> range(int start, int end) {
    if (end < start) {
        return range(start, end, -1);
    } else if (start == end) {
        return std::vector<int>{start};
    } else {
        std::vector<int> nums(end - start);
        std::iota(nums.begin(), nums.end(), start);
        return nums;
    }
}

std::vector<int> range(int end) {
    return range(0, end);
}

std::vector<int> range(int start, int end, int step) {
    std::vector<int> nums{start};
    auto next = start + step;
    while ((next < end && start <= end && step > 0) ||
           (next > end && start > end && step < 0)) {
        nums.push_back(next);
        next += step;
    }
    return nums;
}
The first one is almost certainly faster, unless you are feeding drop an rvalue, in which case you'd have to measure.
Suppose you have N elements to start with and M elements to drop.
Your second version copies all N elements into the by-value parameter, only to destroy M of them again in the call to erase, while the first copies just the N - M elements that survive. How big a difference that makes depends on what T is, but I doubt the first one would ever be slower.
The second version will also use more memory, since erase does not reallocate: the returned vector keeps the capacity of the original, whereas the first version allocates only what it needs.
EDIT
Your current test is flawed because you pass the vector to be subsetted as a temporary, allowing the compiler to move-construct the input parameter of drop2 and thus elide the copy completely. Simply changing:
for (auto x : range(100000))
    f(200, range(10000));
to
auto v = range(10000);
for (auto x : range(100000))
    f(200, v);
changed the results quite dramatically. However, the second method was still faster for me until the vectors were much larger. It is also worth noting that because you are using int, the different methods could be optimized down to a memcpy and a couple of pointer manipulations.
drop could become simply a memcpy of (coll.size() - size) * sizeof(int) bytes, while drop2 could become a memcpy of coll.size() * sizeof(int) bytes. This is because the destructor for an int is a no-op, so the call to erase could be reduced to subtracting size from the internal __last pointer of the vector.
If all you are interested in is primitive types like this, then that's okay, but if you also want an optimal implementation for, say, std::string, then its destructor and copy constructor become very important factors. I have tried with std::vector<int> as the element type and, while everything is slower overall, drop2 still seems to be faster for smaller sizes; the threshold at which drop becomes more efficient is lower, however. I very much doubt that pure memcpys are what we are seeing here, so the code we end up running is in some in-between state between plain memcpys and what we wrote verbatim.
I guess in the end we are testing the compiler's ability to optimize different functions (std::uninitialized_copy, the iterator-based std::move, calling get_allocator().destroy(p) in a loop on trivial and non-trivial types, and so on). All I can say at this point is that the results can vary quite wildly, in what gets optimized and by how much, for even seemingly small changes in the code.
I am, however, still surprised that drop2 runs faster than drop, even if only for ranges under a certain size.