
How to read a zip stream as fast as possible using threads?

I have a 1 GB zip file containing about 2000 text files. I want to read all files and all lines as fast as possible.

    try (ZipFile zipFile = new ZipFile("file.zip")) {
        zipFile.stream().parallel().forEach(entry -> readAllLines(entry)); //reading with BufferedReader.readLine();
    }

Result: stream.parallel() is about 30-50% faster than a normal stream.

Question: could I improve performance even further if, instead of reading the stream through the parallel API, I explicitly started my own threads to read from the file?

membersound asked Oct 14 '25 20:10


2 Answers

Maybe. Keep in mind that switching threads is somewhat expensive and parallel() of Java 8 is pretty good.

Uncompressing ZIP streams is CPU intensive, so running more threads than you have cores won't make things faster. If you create your own executor service and carefully balance the number of threads against the number of cores, you might be able to find a better sweet spot than Java 8's parallel().
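A minimal sketch of that idea, assuming a fixed thread pool and one shared ZipFile as in the question. The class name ZipLineCounter and its methods are made up for illustration, and counting lines stands in for the question's readAllLines(entry). The thread count is a parameter so different values can be measured against parallel().

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of the original post.
public class ZipLineCounter {

    // Counts the lines of one entry; a placeholder for real per-line processing.
    static long countLines(ZipFile zip, ZipEntry entry) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(zip.getInputStream(entry), StandardCharsets.UTF_8))) {
            return reader.lines().count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Reads every non-directory entry on a fixed pool of `threads` workers.
    static long countAllLines(Path zipPath, int threads)
            throws IOException, InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (ZipFile zip = new ZipFile(zipPath.toFile())) {
            List<Callable<Long>> tasks = new ArrayList<>();
            for (ZipEntry entry : Collections.list(zip.entries())) {
                if (!entry.isDirectory()) {
                    tasks.add(() -> countLines(zip, entry));
                }
            }
            long total = 0;
            // invokeAll returns only after every task has finished,
            // so the ZipFile is not closed while workers still read from it.
            for (Future<Long> result : pool.invokeAll(tasks)) {
                total += result.get();
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        // Starting point: one thread per core; tune from there by measuring.
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println(countAllLines(Path.of(args[0]), cores));
    }
}
```

Whether this beats parallel() depends on the machine and the archive, so it is worth timing a few thread counts around the core count rather than assuming one value.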

The other thing left is a better buffering strategy for reading the file, but that's not easy for ZIP archives. You can try ZipInputStream instead of ZipFile, but it's not easy to combine the java.io stream API with Java 8's Stream API (see "(de)compressing files using NIO").
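For comparison, here is a rough sketch of the ZipInputStream route with a large read buffer. It is single-threaded by nature, because ZipInputStream walks the archive front to back. The class name SequentialZipReader and the bufferSize parameter are assumptions made for this example, and line counting again stands in for real processing.

```java
import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of the original post.
public class SequentialZipReader {

    // Reads all entries in archive order through one buffered stream.
    static long countAllLines(Path zipPath, int bufferSize) throws IOException {
        long total = 0;
        try (ZipInputStream zin = new ZipInputStream(
                new BufferedInputStream(Files.newInputStream(zipPath), bufferSize))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                if (entry.isDirectory()) {
                    continue;
                }
                // The reader is deliberately not closed here: closing it
                // would also close zin and end the iteration early.
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(zin, StandardCharsets.UTF_8));
                while (reader.readLine() != null) {
                    total++;
                }
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // 1 MiB buffer as an arbitrary starting value to experiment with.
        System.out.println(countAllLines(Path.of(args[0]), 1 << 20));
    }
}
```

This trades parallelism for sequential disk access, so it is only likely to help when I/O rather than decompression is the bottleneck.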

Aaron Digulla answered Oct 17 '25 09:10


I recently ran into this problem and solved it by creating a separate ZipFile instance for each worker thread, like this:

    // Collect the entry names once, using a short-lived ZipFile.
    List<String> entryNames;
    try (ZipFile file = new ZipFile(path)) {
        entryNames = file.stream()
            .map(ZipEntry::getName)
            .collect(Collectors.toList());
    }

    // Every ZipFile opened by a worker is recorded here so it can be closed later.
    Queue<ZipFile> files = new ConcurrentLinkedQueue<>();

    // Each worker thread lazily opens its own ZipFile on first use.
    ThreadLocal<ZipFile> ctx = ThreadLocal.withInitial(() -> {
        try {
            ZipFile file = new ZipFile(path);
            files.add(file);
            return file;
        }
        catch (IOException ignored) {
            return null; // this thread will skip its entries
        }
    });

    try {
        entryNames.parallelStream().forEach(entryName -> {
            try {
                ZipFile file = ctx.get();
                if (file == null) return;
                ZipEntry entry = new ZipEntry(entryName);
                if (entry.isDirectory()) return;
                try (InputStream in = file.getInputStream(entry)) {
                    byte[] bytes = in.readAllBytes(); // requires Java 9+
                    // process bytes
                }
            }
            catch (IOException ignored) {}
        });
    }
    finally {
        // Close all per-thread ZipFile instances.
        for (ZipFile file : files) {
            try { file.close(); } catch (IOException ignored) {}
        }
    }

relent95 answered Oct 17 '25 10:10