Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

UTF16 hex to text

I have UTF-16 hex representation such as “0633064406270645” which is "سلام" in Arabic language.

I would like to convert it to its text equivalent. Is there a straight way to do that in PostgreSQL?

I can convert the UTF code point like below; unfortunately it seems UTF16 is not supported. Any ideas on how to do it in PostgreSQL, worst case I will write a function?

SELECT convert_from (decode (E'D8B3D984D8A7D985', 'hex'),'UTF8');

"سلام"

SELECT convert_from (decode (E'0633064406270645', 'hex'),'UTF16');

ERROR:  invalid source encoding name "UTF16"
********** Error **********
like image 228
Adam Avatar asked Sep 04 '25 17:09

Adam


1 Answers

PostgreSQL does not support UTF-16 natively. I suggest you to convert your data to UTF-8 before supplying it to the DB. If it's too late (wrong data already exists in your DB), you can use these maintenance functions to convert data from UTF-16 (logic copied from wikipedia):

-- convert from bytea, containing UTF-16-BE data
CREATE OR REPLACE FUNCTION convert_from_utf16be(utf16_data bytea, invalid_replacement text DEFAULT '?')
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
WITH source(unit) AS (
  SELECT (get_byte(utf16_data, i) << 8) | get_byte(utf16_data, i + 1)
  FROM generate_series(0, octet_length(utf16_data) - 2, 2) i
),
codes(lag, unit, lead) AS (
  SELECT lag(unit, 1) OVER (), unit, lead(unit, 1) OVER ()
  FROM source
)
SELECT string_agg(CASE
  WHEN unit >= 56320 AND unit <= 57343 THEN CASE
    WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed
    ELSE invalid_replacement
  END
  WHEN unit >= 55296 AND unit <= 56319 THEN CASE
    WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888)
    ELSE invalid_replacement
  END
  ELSE chr(unit)
END, '')
FROM codes
$function$;

-- convert from bytea, containing UTF-16-LE data
CREATE OR REPLACE FUNCTION convert_from_utf16le(utf16_data bytea, invalid_replacement text DEFAULT '?')
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
WITH source(unit) AS (
  SELECT get_byte(utf16_data, i) | (get_byte(utf16_data, i + 1) << 8)
  FROM generate_series(0, octet_length(utf16_data) - 2, 2) i
),
codes(lag, unit, lead) AS (
  SELECT lag(unit, 1) OVER (), unit, lead(unit, 1) OVER ()
  FROM source
)
SELECT string_agg(CASE
  WHEN unit >= 56320 AND unit <= 57343 THEN CASE
    WHEN lag >= 55296 AND lag <= 56319 THEN '' -- already processed
    ELSE invalid_replacement
  END
  WHEN unit >= 55296 AND unit <= 56319 THEN CASE
    WHEN lead >= 56320 AND lead <= 57343 THEN chr((unit << 10) + lead - 56613888)
    ELSE invalid_replacement
  END
  ELSE chr(unit)
END, '')
FROM codes
$function$;

-- convert from bytea, containing UTF-16 data (with or without BOM)
CREATE OR REPLACE FUNCTION convert_from_utf16(utf16_data bytea, invalid_replacement text DEFAULT '?')
  RETURNS text
  LANGUAGE sql
  IMMUTABLE
  STRICT
AS $function$
SELECT CASE COALESCE(octet_length(utf16_data), 0)
  WHEN 0 THEN ''
  WHEN 1 THEN invalid_replacement
  ELSE CASE substring(utf16_data FOR 2)
    WHEN E'\\xFFFE' THEN convert_from_utf16le(substring(utf16_data FROM 3), invalid_replacement)
    ELSE convert_from_utf16be(substring(utf16_data FROM 3), invalid_replacement)
  END
END
$function$;

With these functions, you can convert from all kind of UTF-16:

SELECT convert_from_utf16be(decode('0633064406270645D852DF62', 'hex')),
       convert_from_utf16le(decode('330644062706450652D862DF', 'hex')),
       convert_from_utf16(decode('FEFF0633064406270645D852DF62', 'hex')),
       convert_from_utf16(decode('FFFE330644062706450652D862DF', 'hex'));

-- convert_from_utf16be | convert_from_utf16le | convert_from_utf16 | convert_from_utf16
------------------------+----------------------+--------------------+-------------------
-- سلام𤭢                | سلام𤭢                | سلام𤭢              | سلام𤭢
like image 93
pozs Avatar answered Sep 07 '25 14:09

pozs