
Capture emoji in javascript

I have to write a module in JavaScript that can detect emoji and replace each of them with a div tag linking to an image.

(The emojis are text strings of the form :), :-), etc.)

The problem is that I have several hundred of them, and writing a regular expression by hand to capture all of them is not a good idea.

Is there a way to do this, given that I have a hashmap whose keys are the emoji strings and whose values are the hex values? (All the emojis are within a range.)

Thanks!

EDIT: Maybe the way I stated the problem was not clear. Imagine you have a dictionary of 100,000 words, each 4-5 characters long, and a stream of lines, each containing 100-150 characters. How would you find the dictionary words in the lines?
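For short emoticon-style keys like these, one approach (a sketch only; `emojiMap`, the hex values, and the div markup are assumed names, not from the question) is to build a single alternation regex from the hashmap's keys:

```javascript
// Hypothetical map: emoticon string -> hex codepoint value.
const emojiMap = { ":)": "1F60A", ":-)": "1F60A", ":(": "1F61E" };

// Escape regex metacharacters in each key, and sort longest-first so that
// a longer key like ":))" would be tried before its prefix ":)".
const escaped = Object.keys(emojiMap)
  .sort((a, b) => b.length - a.length)
  .map(k => k.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));

const emojiRegex = new RegExp(escaped.join("|"), "g");

// Replace every occurrence in a line with a div tag carrying the hex value.
function replaceEmoji(line) {
  return line.replace(emojiRegex, m =>
    `<div class="emoji" data-code="${emojiMap[m]}"></div>`);
}
```

This way the regex is generated from the hashmap rather than written by hand, so adding an emoticon only means adding a map entry.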

asked Feb 27 '26 16:02 by neutralino

1 Answer

JavaScript strings are, unfortunately, sequences of 16-bit unsigned integers (UTF-16 code units), normally representing the UTF-16 encoding of a Unicode string. Consequently, Unicode characters outside the BMP (codepoints starting at U+10000) are represented as surrogate pairs, each of which is two "characters" long. That's visible in regular expressions; if you want to match, for example, U+1F623 ("PERSEVERING FACE"), you need to match \uD83D\uDE23.

While annoying, this is not totally impractical. Ranges are still pretty easy to match. For example, assuming you take emoji to be the range U+1F300...U+1F64F, which covers most, but not all, of the characters listed in the emoji transcription data at http://www.unicode.org/Public/UNIDATA/EmojiSources.txt, then you could use the regex:

/\uD83C[\uDF00-\uDFFF]|\uD83D[\uDC00-\uDE4F]/

To compute those codes, you need to understand the mapping from a non-BMP Unicode codepoint to two surrogate characters. It's not that complicated :) First, you subtract U+10000 from the Unicode codepoint (the designers of UTF-16 chose to avoid wasting codespace on codepoints which already fit in 16 bits). That leaves you with a 20-bit number, since the largest valid Unicode codepoint is U+10FFFF. Now, you need to split that 20-bit number into two 10-bit chunks. The high-order 10 bits are added to U+D800 to form the first surrogate code, and the low-order 10 bits are added to U+DC00 to form the second surrogate.

Using the PERSEVERING FACE example:

U+1F623 => 0F623       (subtract 0x10000)
        => 0000 1111 0110 0010 0011  (in binary)
        => 00 0011 1101, 10 0010 0011 (two 10-bit chunks)
        =>  03D,  223  (back to hex)
        => D83D, DE23  (add D800 to first and DC00 to second) 
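The same arithmetic can be sketched as a small helper (a hypothetical function name; the bit operations mirror the worked example above):

```javascript
// Convert a non-BMP codepoint (U+10000..U+10FFFF) to its UTF-16 surrogate pair.
function toSurrogatePair(cp) {
  const v = cp - 0x10000;          // subtract U+10000, leaving a 20-bit number
  const hi = 0xD800 + (v >> 10);   // high-order 10 bits + D800
  const lo = 0xDC00 + (v & 0x3FF); // low-order 10 bits + DC00
  return [hi, lo];
}

// toSurrogatePair(0x1F623) gives [0xD83D, 0xDE23], matching the example.
```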

A "simple" way to get your computer to do these computations, if you have bash and the iconv utility, is:

printf $'\U1F623\U1F3A9' |
iconv -f utf8 -t utf16le | hexdump -e '8/2 "%04x " "\n"'

(I split that into two lines for display, but you can just type it as one line. You can put as many codes as you want into the string passed to printf.)
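If your JavaScript environment supports ES2015, the engine can also do the conversion for you via `String.fromCodePoint` and `codePointAt`, so the shell pipeline isn't strictly necessary:

```javascript
// Build the string for U+1F623; the engine emits the surrogate pair itself.
const s = String.fromCodePoint(0x1F623);

s.length;                      // 2 -- two UTF-16 code units
s.charCodeAt(0).toString(16);  // "d83d" -- high surrogate
s.charCodeAt(1).toString(16);  // "de23" -- low surrogate
s.codePointAt(0).toString(16); // "1f623" -- back to the codepoint
```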

answered Mar 01 '26 06:03 by rici


