How to store multiple random blobs in a single file?

Question

I am generating random blobs (which in my case is a std::string made out of a std::vector<unsigned char> of arbitrary non-zero length). Blobs are random in the sense that they might be of any size within a given size range, and they contain random unsigned characters.

I am trying to write these blobs to a file (say Blobs.txt) in such a way that each line in my file gives me one of the generated blobs. For example, the first line of Blobs.txt will contain the first generated blob, the last line of Blobs.txt will contain the last generated blob, and so on.

The problem that I'm currently facing is that the blobs themselves might contain newline characters (\n). Hence, the number of lines in my Blobs.txt might not be equal to the number of blobs generated.

As blobs are random, I cannot make any assumptions about what each of these blobs looks like.

I see that the problem is caused by two different uses of the newline character in a single file: one inside the blob content and the other as a separator between blobs. To solve this, I either have to replace the newline character in my blob content with something else before writing to Blobs.txt, or I have to change the blob separator in Blobs.txt from the newline character to something that cannot appear in my blob content (not sure if the second solution is even possible).

I know that I am going to store a fixed number of blobs in Blobs.txt.

I was thinking of appending some explicit fixed length numbering prior to each blob in Blobs.txt file, e.g., "001:<blob 1 string>" but this does not seem to solve the problem because "001:" itself might be present in some blob, and we don't know the blob sizes beforehand.

One possible solution is to write the offsets to a separate file (say Blob_Offsets.txt) where I store the size of each of these blobs, and use these to seek in Blobs.txt. I don't want to rely on in-memory structures because once they are gone, I won't be able to make sense of Blobs.txt.

I was wondering if it is really needed to create a separate offset file because this will increase the number of disk accesses while reading a blob? Is there a better way to solve this problem?

jkb · Accepted Answer

Are you committed to using one "line" per blob? I think you're going to have a tough time with that, especially if you have new-lines in your blob.

I think I would use a binary approach instead. Open the file in binary mode so that new-lines don't get translated to CR-LF (in Windows). Write a binary count, followed by your blob, using stream.write. When you read the file, use stream.read to read the count value, then to read that many bytes into your blob.

Enlico · Answer

each line in my file will give me one of those generated blobs

and

change the blob separator in my Blobs.txt file from newline character to something else

are incompatible. The things that you want to be shown on different lines in the text editor have to be separated by \n (on Linux, or \r\n on Windows, or whatever); you can't use another character, because the editor will not interpret it as a line break and therefore will not show them on separate lines.

Given this, you can only change the \n characters which are not meant to be line separators in the first place, i.e. those that randomly happen to be in the blobs. Escaping them via \ could be a way, e.g.

#include <iostream>
#include <string>
#include <range/v3/action/join.hpp>
#include <range/v3/view/transform.hpp>

using namespace ranges::views;
using namespace ranges::actions;

int main()
{
    // the string that unfortunately contains \n already
    std::string s{"hello\nworld"};

    // indeed this prints 2 lines
    std::cout << s << std::endl;

    // function to escape \n and \ itself
    auto constexpr escape = [](char c){
        return c == '\n' ? "\\n" :
              (c == '\\' ? "\\\\" : std::string{c});
    };

    std::string s2 = s | transform(escape) // this is a range of strings now
                       | join; // so we join all the strings together

    std::cout << s2 << std::endl; // prints 1 line
}

As suggested in the comments, I've escaped the control character via \, and since \ is the escaping character, I have to escape it too if I meet it in the string, which means that a \ (a "true" backslash in the text) must become \\ (a "true" double backslash).

The escaping is easy, because you want to escape single characters; the un-escaping is a bit more complicated, because escaped characters are not 1 character each, but 2: the escaping character and the escaped character. So to unescape you must group the escaping and escaped characters together. E.g. the characters in the raw string (see below, the Trivia part) R"(hel\\nlo\nworld)", where you have escaped a \ obtaining \\ and a newline obtaining \n, should be grouped like in ["h","e","l",R"(\\)","n","l","o",R"(\n)","w","o","r","l","d"], which I think you can do via ranges::views::group_by; then you would ranges::views::transform by leaving strings of length 1 unaltered, and unescaping the strings of length 2 (i.e. R"(\\)" and R"(\n)" would become \ and a newline respectively); then you would join.
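For comparison, here is a rough sketch of the same round trip without range-v3, using plain loops. It swaps the grouping idea for a single pass that treats every backslash as the start of a two-character pair. The function names escape and unescape are only illustrative, and the sketch assumes the input to unescape was produced by escape:

```cpp
#include <cstddef>
#include <string>

// Escape: newline -> backslash + 'n', backslash -> two backslashes.
std::string escape(const std::string& raw) {
    std::string out;
    out.reserve(raw.size());
    for (char c : raw) {
        if (c == '\n')      out += "\\n";
        else if (c == '\\') out += "\\\\";
        else                out += c;
    }
    return out;
}

// Unescape: inverse of escape(). A backslash always starts a
// two-character pair; the second character decides the result.
std::string unescape(const std::string& escaped) {
    std::string out;
    out.reserve(escaped.size());
    for (std::size_t i = 0; i < escaped.size(); ++i) {
        if (escaped[i] == '\\' && i + 1 < escaped.size()) {
            ++i;  // consume the escaped character
            out += (escaped[i] == 'n') ? '\n' : escaped[i];
        } else {
            out += escaped[i];
        }
    }
    return out;
}
```

The escaped string never contains a real newline, so each blob fits on one line, and unescape(escape(s)) gives back s. Other bytes that an editor might treat specially (for example \r) are not handled here and would need their own escape rules.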

Trivia

Instead of

    std::string s{"hel\\nlo\nworld"};

you can use a raw string literal (number 6 here) to avoid escaping characters:

    std::string s{R"(hel\nlo
world)"};

where \n is truly just \ followed by the letter n, whereas for the line break I've truly pressed Enter in the middle of the string; if your editor allows you, you can put that character in the text; in Vim it comes out like this

    std::string s{R"(hel\nlo^Mworld)"};

where the ^M is a single control character, obtained via Ctrl-V followed by Enter (strictly speaking, Vim shows a carriage return, \r, this way).

Tags:

c++

c++17

Romy
