I'm trying to parse LaTeX escape codes (e.g. \alpha) into their Unicode (Mathematical) characters (e.g. U+1D6FC).
Right now this means I am using this symbols parser (rule):
struct greek_lower_case_letters_ : x3::symbols<char32_t>
{
  greek_lower_case_letters_()
  {
    add("alpha",   U'\u03B1');
  }
} greek_lower_case_letter;
This works fine but means I'm getting a std::u32string as a result.
I'd like an elegant way to keep the Unicode code points in the code, both for maintenance and for possible future automation. Is there a way to get this kind of parser to parse into a UTF-8 std::string?
I thought of making the symbols struct parse to a std::string, but that would be highly inefficient (I know, premature optimization and all that).
I was hoping there was some elegant way instead of going through a bunch of hoops to get this working (symbols appending strings to the result).
I do fear, though, that using the code point values and wanting UTF-8 will incur a runtime cost for the conversion (or is a constexpr UTF-32 -> UTF-8 conversion possible?).
The JSON parser example at cierelabs shows an approach that uses semantic actions to append code points in UTF-8 encoding:
  auto push_utf8 = [](auto& ctx)
  {
     typedef std::back_insert_iterator<std::string> insert_iter;
     insert_iter out_iter(_val(ctx));
     boost::utf8_output_iterator<insert_iter> utf8_iter(out_iter);
     *utf8_iter++ = _attr(ctx);
  };
  // ...
  auto const escape =
         ('u' > hex4)           [push_utf8]
     |   char_("\"\\/bfnrt")    [push_esc]
     ;
This is used in their rule declaration:
  typedef x3::rule<unicode_string_class, std::string> unicode_string_type;
which, as you can see, builds the UTF-8 sequence into a std::string attribute.
For the full code, see: https://github.com/cierelabs/json_spirit/blob/x3_devel/ciere/json/parser/x3_grammar_def.hpp
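On the constexpr part of the question: encoding a single code point is plain bit arithmetic, so it can be written as a constexpr function. Below is a minimal sketch, assuming C++14 or later. The names `utf8_seq`, `encode_utf8`, and `append_utf8` are hypothetical and come from neither Boost nor the linked JSON grammar, and the encoder does no validation (surrogates and values above U+10FFFF are not rejected).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper type: the 1 to 4 UTF-8 code units of one code point.
struct utf8_seq {
    char bytes[4] = {};
    std::size_t size = 0;
};

// Encode one code point as UTF-8. Usable in constant expressions.
// Assumption: cp is a valid Unicode scalar value; nothing is checked here.
constexpr utf8_seq encode_utf8(char32_t cp) {
    utf8_seq s{};
    if (cp < 0x80) {                 // 1 byte: 0xxxxxxx
        s.bytes[0] = static_cast<char>(cp);
        s.size = 1;
    } else if (cp < 0x800) {         // 2 bytes: 110xxxxx 10xxxxxx
        s.bytes[0] = static_cast<char>(0xC0 | (cp >> 6));
        s.bytes[1] = static_cast<char>(0x80 | (cp & 0x3F));
        s.size = 2;
    } else if (cp < 0x10000) {       // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        s.bytes[0] = static_cast<char>(0xE0 | (cp >> 12));
        s.bytes[1] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s.bytes[2] = static_cast<char>(0x80 | (cp & 0x3F));
        s.size = 3;
    } else {                         // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        s.bytes[0] = static_cast<char>(0xF0 | (cp >> 18));
        s.bytes[1] = static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        s.bytes[2] = static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        s.bytes[3] = static_cast<char>(0x80 | (cp & 0x3F));
        s.size = 4;
    }
    return s;
}

// Runtime convenience: append the UTF-8 form of cp to an existing string.
inline void append_utf8(std::string& out, char32_t cp) {
    const utf8_seq s = encode_utf8(cp);
    out.append(s.bytes, s.size);
}

// The conversion can be evaluated entirely at compile time.
constexpr utf8_seq alpha = encode_utf8(U'\u03B1');
static_assert(alpha.size == 2, "U+03B1 encodes to two UTF-8 bytes");
```

With something like this, the symbol table could keep its char32_t code points, and a semantic action in the style of push_utf8 could call `append_utf8(_val(ctx), _attr(ctx))` instead of going through boost::utf8_output_iterator. That wiring is an untested assumption, not something taken from the linked grammar.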