
Convert a single-character string to its codepoint

Tags:

elixir

I know I can get the codepoint of a character using the ?a syntax.

iex> ?a
97

But what about when a is a binary, "a"? How can I get the codepoint in this case?

Adam Millerchip asked Sep 11 '25 17:09



1 Answer

Beware of the decomposed Unicode form. It is always safer to call String.normalize/2 on the input (passing :nfc as the second argument) before further processing.

One might expect

<<cp::utf8>> = "á"

to work, but it raises a MatchError, while

<<cp::utf8>> = "á"

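works fine. There is no typo above: the first example uses the decomposed form (the letter a followed by the combining acute accent U+0301), while the second uses the single precomposed codepoint U+00E1, so the two strings are different binaries.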

"á" == "á"
#⇒ false
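
Because the two forms render identically, the difference is easier to see when the strings are written with escape sequences. The following is a minimal sketch; the variable names decomposed and composed are only illustrative.

```elixir
# "a" followed by COMBINING ACUTE ACCENT (U+0301): two codepoints
decomposed = "a\u0301"
# LATIN SMALL LETTER A WITH ACUTE (U+00E1): one codepoint
composed = "\u00E1"

String.to_charlist(decomposed)
#⇒ [97, 769]

String.to_charlist(composed)
#⇒ [225]

# After NFC normalization the decomposed string equals the composed one.
String.normalize(decomposed, :nfc) == composed
#⇒ true
```

The decomposed string holds two codepoints, which is why a pattern that expects exactly one, such as <<cp::utf8>>, does not match it.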

To safely match both composed and decomposed input, explicitly normalize it to the composed form (NFC) up front.

with <<cp::utf8>> <- String.normalize("á", :nfc),
  do: cp
#⇒ 225

All the examples above are copy-pasteable. The same holds when going through a charlist:

"á"
|> String.normalize(:nfc)
|> String.to_charlist()
|> hd()
#⇒ 225

but without normalization the first codepoint is the bare letter a:

"á"
|> String.to_charlist()
|> hd()
#⇒ 97
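
The pieces above could be wrapped into a small helper, sketched here under some assumptions: the module name Codepoint and the function from_string/1 are hypothetical and not part of Elixir's standard library. A with expression that has no else clause returns the non-matching value unchanged, so a multi-character input would come back as a string instead of an integer; an explicit case with a tagged result avoids that.

```elixir
defmodule Codepoint do
  # Hypothetical helper, not part of Elixir's standard library.
  # Returns {:ok, codepoint} when the NFC-normalized input is exactly one
  # codepoint, and :error otherwise (empty string, several characters, or a
  # combining sequence that has no precomposed form).
  def from_string(string) when is_binary(string) do
    case String.normalize(string, :nfc) do
      <<cp::utf8>> -> {:ok, cp}
      _other -> :error
    end
  end
end

Codepoint.from_string("a")
#⇒ {:ok, 97}

# Decomposed input is composed to U+00E1 before matching.
Codepoint.from_string("a\u0301")
#⇒ {:ok, 225}

Codepoint.from_string("ab")
#⇒ :error
```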
Aleksei Matiushkin answered Sep 13 '25 05:09
