 

Read multiple lines from a large file in non-ascending order

I have a very large text file (over 1 GB) and a list of integers that represent line numbers. I need to produce another file containing, in the given order, the text of those lines from the original file.

Example of original large file:

ogfile line 1
some text here
another line
blah blah

So when I get a List of "2,4,4,1" the output file should read:

some text here
blah blah
blah blah
ogfile line 1

I have tried:

string lineString = File.ReadLines(filename).Skip(lineNumList[i] - 1).Take(1).First();

but this takes way too long, because the file has to be read from the start and skipped line by line for every lookup, then reread for the next one... and we are talking millions of lines in the 1 GB file, while my List<int> holds thousands of line numbers.

Is there a better/faster way to read a single line, or have the reader skip to a specific line number without "skipping" line by line?

asked by Tizz

1 Answer

The high-order bit here is: you are trying to solve a database problem using text files. Databases are designed to solve big data problems; text files, as you've discovered, are terrible at random access. Use a database, not a text file.
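For illustration, here is one possible sketch of the database route, assuming you can take a dependency on the Microsoft.Data.Sqlite NuGet package; the table name, file paths, and method names below are placeholders, not part of the original answer. The idea is to import the file once with the line number as the primary key, then fetch lines by number in any order.

using System.IO;
using Microsoft.Data.Sqlite; // assumed NuGet dependency

// Sketch only: import the file once, then read arbitrary lines by number.
void ImportToSqlite(string pathIn, string dbPath)
{
  using var connection = new SqliteConnection($"Data Source={dbPath}");
  connection.Open();

  using (var create = connection.CreateCommand())
  {
    create.CommandText =
      "CREATE TABLE IF NOT EXISTS lines (line_no INTEGER PRIMARY KEY, text TEXT)";
    create.ExecuteNonQuery();
  }

  // One transaction for the whole import; millions of autocommitted inserts would be slow.
  using var transaction = connection.BeginTransaction();
  using var insert = connection.CreateCommand();
  insert.Transaction = transaction;
  insert.CommandText = "INSERT INTO lines (line_no, text) VALUES ($n, $t)";
  insert.Parameters.AddWithValue("$n", 0);
  insert.Parameters.AddWithValue("$t", "");

  int lineNumber = 0;
  foreach (var lineText in File.ReadLines(pathIn))
  {
    insert.Parameters["$n"].Value = ++lineNumber;
    insert.Parameters["$t"].Value = lineText;
    insert.ExecuteNonQuery();
  }
  transaction.Commit();
}

// Indexed lookup by 1-based line number; returns null if the line does not exist.
string ReadLineByNumber(SqliteConnection connection, int lineNumber)
{
  using var select = connection.CreateCommand();
  select.CommandText = "SELECT text FROM lines WHERE line_no = $n";
  select.Parameters.AddWithValue("$n", lineNumber);
  return (string)select.ExecuteScalar();
}

The import is a one-time cost; after that, each lookup is an indexed primary-key query rather than a scan of the whole file.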

If you are hell-bent upon using a text file, what you have to do is take advantage of stuff you know about the likely problem parameters. For example, if you know that, as you imply, there are ~1M lines, each line is ~1KB, and the set of lines to extract is ~0.1% of the total lines, then you can come up with an efficient solution like this:

  • Make a set containing the line numbers to be read. The set must be fast to check for membership.
  • Make a dictionary that maps from line numbers to line contents. This must be fast to look up by key and fast to add new key/value pairs.
  • Read each line of the file one at a time; if the line number is in the set, add the contents to the dictionary.
  • Now iterate the list of line numbers and map the dictionary contents; now we have a sequence of strings.
  • Dump that sequence to the destination file.

We have five operations, so hopefully it is around five lines of code.

void DoIt(string pathIn, IEnumerable<int> lineNumbers, string pathOut)
{
  // Fast membership test for the wanted line numbers.
  var lines = new HashSet<int>(lineNumbers);
  // One pass over the file; keep only the wanted lines, keyed by their
  // 1-based line number (Select's index is 0-based, hence the + 1).
  var dict = File.ReadLines(pathIn)
    .Select((lineText, index) => new KeyValuePair<int, string>(index + 1, lineText))
    .Where(p => lines.Contains(p.Key))
    .ToDictionary(p => p.Key, p => p.Value);
  // Write the lines out in the requested order, duplicates and all.
  File.WriteAllLines(pathOut, lineNumbers.Select(i => dict[i]));
}

OK, got it in six. Pretty good.
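A hypothetical call site, using the line numbers from the question (the file paths are placeholders):

// Illustrative only; the paths are placeholders.
var lineNumbers = new List<int> { 2, 4, 4, 1 };
DoIt(@"C:\temp\huge.txt", lineNumbers, @"C:\temp\extracted.txt");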


Notice that I made use of all those assumptions; if the assumptions are violated then this stops being a good solution. In particular we assume that the dictionary is going to be small compared to the size of the input file. If that is not true, then you'll need a more sophisticated technique to get efficiencies.
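One more sophisticated technique, sketched below under the assumption that the file is UTF-8 or ASCII with '\n' or '\r\n' line endings (the method name is illustrative), is to keep only byte offsets in memory: a first sequential pass records where each wanted line starts, and a second pass seeks directly to those offsets in request order, so memory use is a few bytes per requested line rather than the line text itself.

using System.Collections.Generic;
using System.IO;
using System.Text;

// Sketch only: trades line text in memory for 8-byte offsets.
// Assumes UTF-8 (or ASCII) content and '\n' / '\r\n' line endings.
void ExtractBySeeking(string pathIn, IReadOnlyList<int> lineNumbers, string pathOut)
{
  var wanted = new HashSet<int>(lineNumbers);
  var offsets = new Dictionary<int, long>(); // line number -> byte offset of line start

  // Pass 1: one sequential scan, recording where each wanted line begins.
  using (var input = new BufferedStream(File.OpenRead(pathIn)))
  {
    int currentLine = 1;
    long position = 0;
    if (wanted.Contains(1)) offsets[1] = 0;
    int b;
    while ((b = input.ReadByte()) != -1)
    {
      position++;
      if (b == '\n')
      {
        currentLine++;
        if (wanted.Contains(currentLine)) offsets[currentLine] = position;
      }
    }
  }

  // Pass 2: jump straight to each requested line, in request order.
  using (var input = File.OpenRead(pathIn))
  using (var output = new StreamWriter(pathOut))
  {
    foreach (int n in lineNumbers)
    {
      input.Position = offsets[n];
      var bytes = new List<byte>();
      int b;
      while ((b = input.ReadByte()) != -1 && b != '\n')
        bytes.Add((byte)b);
      output.WriteLine(Encoding.UTF8.GetString(bytes.ToArray()).TrimEnd('\r'));
    }
  }
}

The first pass is still a full scan of the file, but the working set stays tiny even when the requested lines themselves are long.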

Conversely, can we extract additional efficiencies? Yes, provided we know facts about likely inputs. Suppose for example we know that the same file will be iterated several times but with different line number sets, but those sets are likely to have overlap. In that case we can re-use dictionaries instead of rebuilding them. That is, suppose a previous operation has left a Dictionary<int, string> computed for lines (10, 20, 30, 40) and file X. If a request then comes in for lines (30, 20, 10) for file X, we already have the dictionary in memory.
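As a rough sketch of that idea (the class and method names are illustrative, and it assumes the file does not change between requests), a per-file cache only rescans for line numbers it has not already extracted:

using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical per-file cache; assumes the file is not modified between requests.
class LineCache
{
  private readonly Dictionary<string, Dictionary<int, string>> _cache =
    new Dictionary<string, Dictionary<int, string>>();

  public IEnumerable<string> GetLines(string path, IEnumerable<int> lineNumbers)
  {
    if (!_cache.TryGetValue(path, out var known))
      _cache[path] = known = new Dictionary<int, string>();

    var requested = lineNumbers.ToList();
    var missing = new HashSet<int>(requested.Where(n => !known.ContainsKey(n)));

    // Only rescan the file if this request asks for lines we have not seen before.
    if (missing.Count > 0)
    {
      int lineNumber = 0;
      foreach (var lineText in File.ReadLines(path))
      {
        lineNumber++;
        if (missing.Contains(lineNumber))
          known[lineNumber] = lineText;
      }
    }

    return requested.Select(n => known[n]);
  }
}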

The key thing I want to get across in this answer is that you must know something about the inputs in order to build an efficient solution; the more restrictions you can articulate on the inputs, the more efficient a solution you can build. Take advantage of all the knowledge you have about the problem domain.

answered by Eric Lippert
