I recently switched rendering in my current project from drawing out of a client-memory array to VBOs. To my surprise, the frame rate dropped significantly, from 60 fps to 30 fps, when drawing a model with 1,200 vertices 8 times. Further profiling showed that glDrawElements took 10 times as long with VBOs as it did when drawing from memory.
I am really puzzled why this is happening. Does anyone know what could be the cause for a performance decrease?
I am testing on an iPhone 5 running iOS 6.1.2.
I've isolated my VBO handling into a single function, where I create the vertex and index buffers once, statically, at the top of the function. I can switch between normal and VBO rendering with an #ifdef USE_VBO:
- (void)drawDuck:(Toy*)toy reflection:(BOOL)reflection
{
    ModelOBJ* model = _duck[0].model;
    int stride = sizeof(ModelOBJ::Vertex);
#define USE_VBO
#ifdef USE_VBO
    static bool vboInitialized = false;
    static unsigned int vbo, ibo;
    if (!vboInitialized) {
        vboInitialized = true;
        // Generate VBO
        glGenBuffers(1, &vbo);
        int numVertices = model->getNumberOfVertices();
        glBindBuffer(GL_ARRAY_BUFFER, vbo);
        glBufferData(GL_ARRAY_BUFFER, stride*numVertices, model->getVertexBuffer(), GL_STATIC_DRAW);
        glBindBuffer(GL_ARRAY_BUFFER, 0);
        // Generate index buffer
        glGenBuffers(1, &ibo);
        int numIndices = model->getNumberOfIndices();
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
        glBufferData(GL_ELEMENT_ARRAY_BUFFER, sizeof(unsigned short)*numIndices, model->getIndexBuffer(), GL_STATIC_DRAW);
        glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0);
    }
#endif
    [self setupDuck:toy reflection:reflection];
#ifdef USE_VBO
    // Draw with VBO
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, ibo);
    glEnableVertexAttribArray(GC_SHADER_ATTRIB_POSITION);
    glEnableVertexAttribArray(GC_SHADER_ATTRIB_NORMAL);
    glEnableVertexAttribArray(GC_SHADER_ATTRIB_TEX_COORD);
    glVertexAttribPointer(GC_SHADER_ATTRIB_POSITION, 3, GL_FLOAT, GL_FALSE, stride, (void*)offsetof(ModelOBJ::Vertex, position));
    glVertexAttribPointer(GC_SHADER_ATTRIB_TEX_COORD, 2, GL_FLOAT, GL_FALSE, stride, (void*)offsetof(ModelOBJ::Vertex, texCoord));
    glVertexAttribPointer(GC_SHADER_ATTRIB_NORMAL, 3, GL_FLOAT, GL_FALSE, stride, (void*)offsetof(ModelOBJ::Vertex, normal));
    glDrawElements(GL_TRIANGLES, model->getNumberOfIndices(), GL_UNSIGNED_SHORT, 0);
    glBindBuffer(GL_ARRAY_BUFFER, 0);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, 0);
#else
    // Draw with array
    glEnableVertexAttribArray(GC_SHADER_ATTRIB_POSITION);
    glEnableVertexAttribArray(GC_SHADER_ATTRIB_NORMAL);
    glEnableVertexAttribArray(GC_SHADER_ATTRIB_TEX_COORD);
    glVertexAttribPointer(GC_SHADER_ATTRIB_POSITION, 3, GL_FLOAT, GL_FALSE, stride, model->getVertexBuffer()->position);
    glVertexAttribPointer(GC_SHADER_ATTRIB_TEX_COORD, 2, GL_FLOAT, GL_FALSE, stride, model->getVertexBuffer()->texCoord);
    glVertexAttribPointer(GC_SHADER_ATTRIB_NORMAL, 3, GL_FLOAT, GL_FALSE, stride, model->getVertexBuffer()->normal);
    glDrawElements(GL_TRIANGLES, model->getNumberOfIndices(), GL_UNSIGNED_SHORT, model->getIndexBuffer());
#endif
}
ModelOBJ::Vertex is just 3, 2, and 3 floats for position, texture coordinate, and normal. Indices are unsigned short.
UPDATE: I've now wrapped the draw setup (i.e. the attribute binding calls) into a VAO, and performance is now OK, even slightly better than drawing from main memory. So my conclusion is that VBO support without VAOs is broken on iOS. Is that assumption correct?
It is likely that the driver was falling back to software vertex submission (a CPU copy from the VBO into the command buffer). This can be worse than using vertex arrays in client memory, because client memory is usually cached, while VBO contents are typically in write-combined memory on iOS.
When using the CPU Sampler in Instruments, you'll see a lot of time underneath glDrawArrays/glDrawElements in gleRunVertexSubmitARM.
The most common reason to fall back to software (CPU) submission is an unaligned attribute (current iOS devices require each attribute to be 4-byte aligned), but that doesn't appear to be the case for the 3 attributes you've shown. After that, the next most common cause is mixing client arrays and buffer objects in a single vertex array configuration.
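If you want to rule out the alignment cause explicitly, a check along these lines may help. This is only a sketch: the Vertex struct below is a hypothetical stand-in for ModelOBJ::Vertex (3, 2, and 3 floats, as described in the question), and PackedVertex, isFourByteAligned and assertAttributeAligned are illustrative names, not part of any GL or iOS API.

```cpp
#include <cassert>   // assert
#include <cstddef>   // offsetof, std::size_t
#include <cstdint>   // std::uint8_t

// Hypothetical stand-in for ModelOBJ::Vertex: position, texCoord, normal.
struct Vertex {
    float position[3];
    float texCoord[2];
    float normal[3];
};

// True when a byte offset or stride is a multiple of 4.
constexpr bool isFourByteAligned(std::size_t bytes) {
    return bytes % 4 == 0;
}

// Compile-time checks for the layout used in the question.
static_assert(sizeof(float) == 4, "this sketch assumes 4-byte floats");
static_assert(isFourByteAligned(sizeof(Vertex)), "stride is not 4-byte aligned");
static_assert(isFourByteAligned(offsetof(Vertex, position)), "position is misaligned");
static_assert(isFourByteAligned(offsetof(Vertex, texCoord)), "texCoord is misaligned");
static_assert(isFourByteAligned(offsetof(Vertex, normal)), "normal is misaligned");

// Counter-example: a byte-packed layout whose float attribute starts at offset 3.
#pragma pack(push, 1)
struct PackedVertex {
    std::uint8_t color[3];
    float position[3];
};
#pragma pack(pop)

// Debug-build guard: call with the same offset and stride that you pass to
// glVertexAttribPointer when the layout is only known at run time.
inline void assertAttributeAligned(std::size_t offset, std::size_t stride) {
    assert(isFourByteAligned(offset) && "attribute offset is not 4-byte aligned");
    assert(isFourByteAligned(stride) && "vertex stride is not 4-byte aligned");
}
```

For the layout in the question all of the static checks pass, which is consistent with alignment not being the culprit here. A layout like PackedVertex would be the kind that could trigger the fallback.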
In this case, you probably have a stray vertex attribute binding: some other array element is likely still enabled and pointing to a client array, causing everything to fall off the hardware DMA path. By creating a VAO, you've either switched away from the misconfigured default VAO, or you are still trying to enable a client array but are being saved because client arrays are deprecated and do not function when used with VAOs (the call throws an INVALID_OPERATION error instead).
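One way to hunt for such a stray binding is to snapshot the state of every attribute slot right before the draw call and look for slots that are enabled but have no buffer bound. The following is a rough sketch rather than a drop-in tool: AttribState and findClientArrayAttribs are hypothetical names, and the GL queries are shown only in a comment so that the filtering logic stays self-contained.

```cpp
#include <cstddef>   // std::size_t
#include <vector>

// Snapshot of one generic vertex attribute slot (hypothetical helper type).
struct AttribState {
    bool enabled;            // value of GL_VERTEX_ATTRIB_ARRAY_ENABLED
    unsigned bufferBinding;  // value of GL_VERTEX_ATTRIB_ARRAY_BUFFER_BINDING; 0 means client memory
};

// With a current GL context, the snapshot could be filled like this:
//   GLint maxAttribs = 0;
//   glGetIntegerv(GL_MAX_VERTEX_ATTRIBS, &maxAttribs);
//   for (GLint i = 0; i < maxAttribs; ++i) {
//       GLint enabled = 0, binding = 0;
//       glGetVertexAttribiv(i, GL_VERTEX_ATTRIB_ARRAY_ENABLED, &enabled);
//       glGetVertexAttribiv(i, GL_VERTEX_ATTRIB_ARRAY_BUFFER_BINDING, &binding);
//       attribs.push_back({enabled != 0, static_cast<unsigned>(binding)});
//   }

// Returns the indices of attributes that are enabled but still source their
// data from client memory (no buffer object bound).
std::vector<int> findClientArrayAttribs(const std::vector<AttribState>& attribs) {
    std::vector<int> stray;
    for (std::size_t i = 0; i < attribs.size(); ++i) {
        if (attribs[i].enabled && attribs[i].bufferBinding == 0) {
            stray.push_back(static_cast<int>(i));
        }
    }
    return stray;
}
```

If this reports any index while you intend to draw purely from VBOs, disabling that slot with glDisableVertexAttribArray (or finding the code that left it enabled) would be the next thing to try.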