Split string into key-value pairs using C++

Tags: c++, dictionary

I have a string like this:

"CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567"

Now ": " splits key from value while \n separates the pairs. I want to add the key-value pairs to a map in C++.

Is there an efficient way to do this, with performance in mind?

Viking asked Nov 15 '25 02:11


2 Answers

I have two methods here. The first is the easy, obvious one that I use all the time (performance is rarely an issue). The second is likely more efficient, though I have not done any formal timings.

In my informal tests, the second method was about 3 times faster.

#include <map>
#include <string>
#include <sstream>
#include <iostream>

std::map<std::string, std::string> mappify1(std::string const& s)
{
    std::map<std::string, std::string> m;

    std::string key, val;
    std::istringstream iss(s);

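    // Read the key up to ':', skip the whitespace after it, then read the
    // value up to the end of the line.
    while(std::getline(std::getline(iss, key, ':') >> std::ws, val))
        m[key] = val;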

    return m;
}

std::map<std::string, std::string> mappify2(std::string const& s)
{
    std::map<std::string, std::string> m;

    std::string::size_type key_pos = 0;
    std::string::size_type key_end;
    std::string::size_type val_pos;
    std::string::size_type val_end;

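    while((key_end = s.find(':', key_pos)) != std::string::npos)
    {
        // Skip the ':' and any spaces after it to find where the value starts.
        if((val_pos = s.find_first_not_of(": ", key_end)) == std::string::npos)
            break;

        // The value ends at the next newline, or at the end of the string.
        if((val_end = s.find('\n', val_pos)) == std::string::npos)
            val_end = s.size();

        m.emplace(s.substr(key_pos, key_end - key_pos), s.substr(val_pos, val_end - val_pos));

        // The next key starts right after the newline.
        key_pos = val_end + 1;

        if(val_end == s.size())
            break;
    }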

    return m;
}
 
int main()
{
    std::string s = "CA: ABCD\nCB: ABFG\nCC: AFBV\nCD: 4567";

    std::cout << "mappify1: " << '\n';

    auto m = mappify1(s);
    for(auto const& p: m)
        std::cout << '{' << p.first << " => " << p.second << '}' << '\n';

    std::cout << "mappify2: " << '\n';

    m = mappify2(s);
    for(auto const& p: m)
        std::cout << '{' << p.first << " => " << p.second << '}' << '\n';
}

Output:

mappify1: 
{CA => ABCD}
{CB => ABFG}
{CC => AFBV}
{CD => 4567}
mappify2: 
{CA => ABCD}
{CB => ABFG}
{CC => AFBV}
{CD => 4567}
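
The "about 3 times faster" figure depends on compiler, flags and input, so it is worth re-measuring locally. Below is a minimal timing sketch, assuming C++11 or later. The helper name time_parser, its parameters and the sink counter are hypothetical and not part of the answer above. It accepts any callable that takes a std::string const& and returns something with a size() member, which fits both mappify1 and mappify2.

```cpp
#include <chrono>
#include <cstddef>
#include <string>

// Hypothetical helper: calls parse(input) `reps` times and returns the
// total elapsed time. The result sizes are accumulated into `sink` to make
// it harder for the compiler to optimize the calls away.
template <typename Parser>
std::chrono::nanoseconds time_parser(Parser parse, std::string const& input,
                                     int reps, std::size_t& sink)
{
    auto const start = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i)
        sink += parse(input).size();
    auto const stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
}
```

With the functions above in scope, a call such as time_parser(mappify1, s, 100000, sink) followed by the same call for mappify2 gives two durations to compare. The repetition count is arbitrary, and an optimized build is needed for the comparison to mean much.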
Galik answered Nov 17 '25 18:11


This format is called "Tag-Value".

The most performance-critical place where such an encoding is used in the industry is probably the financial FIX protocol ('=' as the key-value separator and '\001' as the entry delimiter). So if you are on x86 hardware, your best bet would be to google 'SSE4 FIX protocol parser github' and reuse the open-sourced findings of HFT shops.
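
To make the FIX layout concrete, here is a minimal scalar sketch, assuming C++17. It is not a real FIX parser and uses no SSE4. The names fix_field and split_fix_fields are hypothetical. It splits a buffer on '\001', then on the first '=', and converts the tag to an integer with std::from_chars; fields without a numeric tag are silently skipped.

```cpp
#include <charconv>
#include <string_view>
#include <system_error>
#include <utility>
#include <vector>

// Hypothetical type: numeric tag plus a view of the value inside the buffer.
using fix_field = std::pair<int, std::string_view>;

// Hypothetical helper: the returned views point into `msg`, so the buffer
// behind `msg` must outlive the result.
std::vector<fix_field> split_fix_fields(std::string_view msg)
{
    constexpr char sep = '=';
    constexpr char dlm = '\001';

    std::vector<fix_field> fields;
    std::size_t pos = 0;
    while (pos < msg.size())
    {
        // A field runs up to the next delimiter or to the end of the buffer.
        std::size_t end = msg.find(dlm, pos);
        if (end == std::string_view::npos)
            end = msg.size();
        std::string_view field = msg.substr(pos, end - pos);
        pos = end + 1;

        std::size_t eq = field.find(sep);
        if (eq == std::string_view::npos)
            continue; // no separator: skip this field

        // The tag must consist entirely of decimal digits.
        int tag = 0;
        char const* first = field.data();
        char const* last = first + eq;
        auto [ptr, ec] = std::from_chars(first, last, tag);
        if (ec != std::errc{} || ptr != last)
            continue;

        fields.emplace_back(tag, field.substr(eq + 1));
    }
    return fields;
}
```

For an illustrative fragment such as "8=FIX.4.2\001" "35=D\001" (not a complete, valid FIX message), this yields the pairs (8, "FIX.4.2") and (35, "D"). A vector keeps the fields in message order and allows repeated tags, which a map would not.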

If you still want to delegate the vectorization part to the compiler and can spare a few nanoseconds for readability, then the most elegant solution is to store the result in a std::string (data) plus a boost::container::flat_map<boost::string_ref, boost::string_ref> (view). Parsing is a matter of taste: a while-loop or strtok would be easiest for the compiler to parse, while a Boost.Spirit-based parser would be easiest for a human (familiar with Boost.Spirit) to read.

C++ for-loop based solution

#include <boost/container/flat_map.hpp>
#include <boost/range/iterator_range.hpp>
#include <boost/range/iterator_range_io.hpp>

#include <algorithm>  // std::find
#include <iostream>
#include <stdexcept>  // std::runtime_error
#include <string>

// g++ -std=c++1z ~/aaa.cc
int main()
{
    using range_t = boost::iterator_range<std::string::const_iterator>;
    using map_t = boost::container::flat_map<range_t, range_t>;

    char const sep = ':';
    char const dlm = '\n';

    // this part can be reused for parsing multiple records
    map_t result;
    result.reserve(1024);

    std::string const input {"hello:world\n bye: world"};

    // this part is per-line/per-record
    result.clear();
    for (auto _beg = begin(input), _end = end(input), it = _beg; it != _end;)
    {
        auto sep_it = std::find(it, _end, sep);
        if (sep_it != _end)
        {
            auto dlm_it = std::find(sep_it + 1, _end, dlm);
            result.emplace(range_t {it, sep_it}, range_t {sep_it + 1, dlm_it});
            it = dlm_it + (dlm_it != _end);
        }
        else throw std::runtime_error("cannot parse");
    }

    for (auto& x: result)
        std::cout << x.first << " => " << x.second << '\n';

    return 0;
}
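
For readers without Boost, the same "data plus views" idea can be sketched with the standard library alone, assuming C++17. This is a variant, not the code above: parse_views is a hypothetical name, it uses std::map and std::string_view instead of flat_map and iterator_range, it trims spaces after the separator (as in "CA: ABCD"), and it skips records that have no separator instead of throwing.

```cpp
#include <map>
#include <string>
#include <string_view>

// Hypothetical helper: maps views of keys to views of values. The views
// point into `input`, so the underlying string must outlive the map.
std::map<std::string_view, std::string_view>
parse_views(std::string_view input, char sep = ':', char dlm = '\n')
{
    std::map<std::string_view, std::string_view> result;

    std::size_t pos = 0;
    while (pos < input.size())
    {
        // A record runs up to the next delimiter or to the end of the input.
        std::size_t end = input.find(dlm, pos);
        if (end == std::string_view::npos)
            end = input.size();

        std::string_view record = input.substr(pos, end - pos);
        std::size_t sep_pos = record.find(sep);
        if (sep_pos != std::string_view::npos)
        {
            std::string_view key = record.substr(0, sep_pos);
            std::string_view val = record.substr(sep_pos + 1);

            // Drop the spaces that follow the separator.
            std::size_t first = val.find_first_not_of(' ');
            val = (first == std::string_view::npos) ? std::string_view{}
                                                    : val.substr(first);
            result.emplace(key, val);
        }
        // Records without a separator are skipped, not reported.

        pos = end + 1;
    }
    return result;
}
```

No characters are copied, so the map is only valid while the original string is alive and unmodified. Passing a temporary std::string to parse_views would leave the views dangling.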
bobah answered Nov 17 '25 18:11


