
Emitting from socket.io during XML parsing blocks until processing complete

I have a Node.js/Express/socket.io setup with an Angular client.

Messages from the server to the client seem to be blocked until a long-running process completes, and then they arrive all at once. Why is this?

One of the functions of the site is for a user to upload an XML file for processing. The files are not insignificant, and processing takes around 20 seconds.

I am using a SAX wrapper (xml-flow) to parse the XML, which raises events at the end of relevant XML tags. These events call a callback, which in turn calls socket.emit() to report progress to the client.

Everything seems to be hooked up ok, but the progress messages seem to be blocked somehow until the parsing ends, at which point they all arrive at once at the client.

Using breakpoints, I have identified that the socket.emit() calls do take place at regular intervals, and I am unaware of any batching mechanism.

If it helps, I'd be happy to post any other bits of code you think might be relevant.

socket.on('analysis:request', function (data) {
    // this message arrives immediately
    socket.emit('analysis:status', 'Request acknowledged');

    // this process takes about 20 seconds, and uses the callback every 2-3 seconds
    uploader.processFile(data, function () {
        return {
            statusUpdate: function (message) {
                // these messages arrive all at once at the end of processing
                socket.emit('analysis:status', message);
            }
        }
    });
});

var admZip = require('adm-zip');  // zip extraction
var flow = require('xml-flow');   // streaming XML parser
var stream = require('stream');

function processFile(filename, callback) {

    callback().statusUpdate("Unzipping file");
    var steps = [];

    var zip = new admZip(__dirname + '/' + filename);
    var entries = zip.getEntries();
    var entry = entries[0]; // getEntries() already returns entry objects

    var bufferStream = new stream.PassThrough();
    bufferStream.end(entry.getData()); // getData() already returns a Buffer

    callback().statusUpdate("Processing stream");
    var xmlStream = flow(bufferStream, { strict: true, preserveMarkup: flow.NEVER, simplifyNodes: false, normalize: false });

    xmlStream.on('end', function () {
        storeResults(steps, callback);
    });

    xmlStream.on('error', function (ex) {
        console.log('xml-flow error', ex);
    });

    xmlStream.on('tag:Step', function (element) {
        steps.push(element);
        if (steps.length % 50 === 0) {
            callback().statusUpdate("Caching step " + steps.length);
        }
    });
}

Update:

Well, I investigated both of the given answers and unfortunately did not come up with a satisfactory result.

Using process.nextTick didn't really help much, as the long-running process didn't seem to give up the tick until it was finished anyway.

Using sax-async does achieve what I want, but takes about three times as long.

So, given the choices, I did a lot more work on optimising the parsing (pre-parsing, selective parsing, etc.) and managed to get it down to around 12 seconds. I can just about live with telling the user to wait without being able to indicate progress until completion.

Bounty awarded to kio, as it led me to sax-async which did actually work, just not fast enough :(

asked Aug 31 '25 by paul


2 Answers

cause of the problem

It looks like this is a CPU-intensive task, and it probably blocks the event loop completely until the computation finishes. That is because a Node.js application normally runs JavaScript on a single thread.

in general, about CPU-intensive tasks

Look at this answer for more details: https://stackoverflow.com/a/17957474/4138339

When you have CPU-intensive tasks:

  1. Try to split them into smaller pieces and call process.nextTick between them. See the documentation for more details: https://nodejs.org/api/process.html#process_process_nexttick_callback_arg
  2. Use child process workers. See the documentation for more details: https://nodejs.org/api/child_process.html#child_process_asynchronous_process_creation.
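Point 1 can be sketched roughly as follows (a minimal illustration, not code from the answer; processInBatches, items, and handleItem are made-up names). One caveat: process.nextTick callbacks run before pending I/O, so if the goal is to let queued socket writes flush, setImmediate is usually the better yield point.

```javascript
// Sketch: process a large array in small batches, yielding to the event
// loop between batches so queued I/O (such as pending socket writes)
// gets a chance to run.
function processInBatches(items, handleItem, batchSize, done) {
  let index = 0;
  (function runBatch() {
    const end = Math.min(index + batchSize, items.length);
    for (; index < end; index++) {
      handleItem(items[index]);
    }
    if (index < items.length) {
      setImmediate(runBatch); // yield to the event loop between batches
    } else {
      done();
    }
  })();
}
```

Only the first batch runs synchronously; every later batch is deferred to a subsequent event-loop turn.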

solution for the OP's problem

I focused on the xml-flow module, but it is possible that some earlier task is also intensive. I looked at the xml-flow source code, and I think I have a solution for your specific case.

When the 'tag:Step' event is emitted, the parser does not wait for your callback to finish. But we can pause xml-flow for a while, do other things, and resume. I do not have your application, so I cannot write an exactly working example; you will have to try to write it yourself.

To pause the stream, use xmlStream.pause(), then call your callback, and after a while resume the stream with xmlStream.resume(). For a quick check you can call xmlStream.resume() in a timeout, but for production it is better to use process.nextTick.
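Applied to the 'tag:Step' handler from the question, that could look roughly like this (an untested sketch; attachStepHandler is a made-up wrapper, and it assumes xml-flow's pause()/resume() behave as described). I have used setImmediate rather than process.nextTick here, so the resume happens after pending I/O has had a chance to run:

```javascript
// Sketch: pause the parser every 50 steps, emit the progress message,
// then resume on a later event-loop turn so the queued socket write
// can actually go out before parsing continues.
function attachStepHandler(xmlStream, steps, callback) {
  xmlStream.on('tag:Step', function (element) {
    steps.push(element);
    if (steps.length % 50 === 0) {
      xmlStream.pause();                 // stop parsing for now
      callback().statusUpdate('Caching step ' + steps.length);
      setImmediate(function () {         // after pending I/O has run
        xmlStream.resume();
      });
    }
  });
}
```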

answered Sep 03 '25 by Krzysztof Sztompka


.emit in Node.js is NOT async. It might seem like it, but it is not truly async: when it is called, all of its event handlers are executed synchronously, one by one.

It looks like xml-flow is a wrapper around sax-js, which in turn appears to do everything synchronously, despite the .emit calls.

You'd need to write your own wrapper (or fork xml-flow and change it). Of course, someone has already done something like that: sax-async

answered Sep 03 '25 by kio