Storing lots of small files: archive vs. filesystem

I am creating an application that requires a lot of image thumbnails (around 3,000 files, 5-25 KB each). Because speed is essential, I plan on loading these images into memory when the application starts. At runtime, new thumbnails will be downloaded and added to the collection.

I could store them all in a folder, but reading thousands of files into memory when a program starts hardly seems efficient.

My second option would be to save them in some kind of (compressed) archive. That would make both storage and loading more efficient (I think). However, new files will be added regularly, and adding them to an archive will probably not go as smoothly as just saving them in a folder.

Is storing a cache of small files in a (compressed) archive a bad idea or not? Are ZIP files the way to go? Would I be better off using uncompressed archives (and if so, what kind)?

All image files will be JPEGs.

Thanks in advance!

EDIT: I am considering dropping the "load everything into memory on application start" idea, which simplifies my question a little. My initial plan to put everything in one big file now seems less beneficial, since the problem of too many files in one directory can be solved by hashing them into subdirectories.
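For illustration, a minimal sketch of that hashing idea in Python (the folder name, file name, and fan-out here are made up, not from the question): derive a subdirectory from a hash of the filename so no single folder holds more than a few hundred files.

```python
import hashlib
from pathlib import Path

def thumb_path(root: Path, name: str, fanout: int = 2) -> Path:
    """Map a thumbnail name to root/<first `fanout` hex chars of its hash>/<name>,
    spreading files over up to 256 subdirectories when fanout is 2."""
    digest = hashlib.md5(name.encode("utf-8")).hexdigest()
    return root / digest[:fanout] / name

# usage: store a downloaded thumbnail in its hashed bucket
root = Path("thumbs")
jpeg_bytes = b"\xff\xd8\xff\xe0"  # placeholder for real JPEG data
target = thumb_path(root, "cat_0001.jpg")
target.parent.mkdir(parents=True, exist_ok=True)
target.write_bytes(jpeg_bytes)
```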

asked by Rapsey
1 Answer

Small files don't compress especially well, and JPEGs are already compressed, so you may not gain much from a compressed archive.

While loading the files will be fast because they are smaller, decompression adds time. You'd have to experiment to see which is faster.
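A rough way to run that experiment (a Python sketch; the folder and archive paths are assumptions, and both would need to exist with your thumbnails in them) is to time reading every file straight from disk against reading every member out of a ZIP:

```python
import time
import zipfile
from pathlib import Path

def time_plain(folder: Path) -> float:
    """Read every .jpg in the folder directly from disk."""
    start = time.perf_counter()
    for p in folder.glob("*.jpg"):
        p.read_bytes()
    return time.perf_counter() - start

def time_zip(archive: Path) -> float:
    """Read every member of a ZIP archive (stored or deflated)."""
    start = time.perf_counter()
    with zipfile.ZipFile(archive) as zf:
        for name in zf.namelist():
            zf.read(name)
    return time.perf_counter() - start

print("plain folder:", time_plain(Path("thumbs_flat")))
print("zip archive: ", time_zip(Path("thumbs.zip")))
```

Building the test archive with ZIP_STORED (no compression) as well as the default deflate would also show how much the decompression step itself costs.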

I would think the real issues would relate to the efficiency of the file system when it comes to iterating over all the little files, especially if they are all in one folder. Windows is notorious for being pretty inefficient when folders contain lots of files.

I would consider doing something like writing them all out into one uncompressed file that could be streamed into memory -- not necessarily contiguous memory, as that might be a problem -- but the idea is to put them all in one file. Then write some kind of index that ties a file name or other identifier to an offset, so the location of each image can be looked up quickly.

New images could be added at the end, and the index updated appropriately.
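A minimal sketch of that packed-file-plus-index idea in Python (the file names, JSON index format, and helper names are all illustrative choices, not a finished design): append each JPEG to one data file and record its offset and length in a small index.

```python
import json
from pathlib import Path

DATA = Path("thumbs.pack")   # all JPEG bytes, back to back
INDEX = Path("thumbs.idx")   # name -> [offset, length], stored as JSON

def load_index() -> dict:
    return json.loads(INDEX.read_text()) if INDEX.exists() else {}

def add_thumbnail(name: str, jpeg_bytes: bytes, index: dict) -> None:
    """Append the image to the pack file and record where it landed."""
    with DATA.open("ab") as f:
        f.seek(0, 2)              # make sure tell() reports the end of the file
        offset = f.tell()
        f.write(jpeg_bytes)
    index[name] = [offset, len(jpeg_bytes)]
    INDEX.write_text(json.dumps(index))

def get_thumbnail(name: str, index: dict) -> bytes:
    """Seek straight to the recorded offset and read one image back."""
    offset, length = index[name]
    with DATA.open("rb") as f:
        f.seek(offset)
        return f.read(length)

# usage
index = load_index()
add_thumbnail("cat_0001.jpg", b"\xff\xd8\xff\xe0", index)  # placeholder bytes
print(len(get_thumbnail("cat_0001.jpg", index)))
```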

It isn't fancy, but fancy is exactly what you're trying to avoid here. An archive or even a file system gives you lots of power and flexibility, but at the cost of efficiency. When you know exactly what you want to do, simple is often better.

I would implement one solution that reads files from a single folder and another that divides them into subfolders and sub-subfolders so that no folder contains more than 100 or so files, then time both so you have something to compare against. I suspect a simple indexed file would be fast enough that you wouldn't even need to pre-load the images as you're suggesting -- just retrieve them as you need them and keep them in memory once they're loaded.
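For that "retrieve on demand, keep it once loaded" part, something as simple as a dict-backed lazy cache would do. A sketch, assuming thumbnails sit in a hypothetical "thumbs" folder (the loader could just as well be the get_thumbnail helper from the packed-file sketch above):

```python
from pathlib import Path

class ThumbnailCache:
    """Load each thumbnail on first request, then serve it from memory."""

    def __init__(self, loader):
        self._loader = loader   # any callable: name -> bytes
        self._cache = {}

    def get(self, name: str) -> bytes:
        if name not in self._cache:
            self._cache[name] = self._loader(name)
        return self._cache[name]

# usage: read from disk on first access, from memory afterwards
cache = ThumbnailCache(lambda name: (Path("thumbs") / name).read_bytes())
data = cache.get("cat_0001.jpg")
```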

answered by Craig
