How can I differentiate between Arabic and Urdu with regex?

Question

I've been trying to find a way to match and distinguish between Urdu and Arabic purely in regex. I've found a few ways, but they aren't working for me. I don't know the languages well, but I know that the Urdu alphabet is partially derived from Arabic and uses some of its characters, so there has to be a way to distinguish between the two. If not with regex, is there another way to do so?

I'm creating a library in TypeScript that will detect three languages (English, Urdu, Arabic), and I'll use that information to apply different fonts to text depending on its language.

The second way I found to identify Arabic is /\p{IsArabic}/gu, but when I enter it into regexr or regex101, or use it in my code, I get an error saying it isn't a recognised Unicode category.

The following is the block of code I'm using to identify the language:

interface LanguageInterface {
  hasEnglish: boolean;
  hasUrdu: boolean;
  hasArabic: boolean;
}

function getLang(str: string): LanguageInterface {
  let hasEnglish: boolean = false;
  let hasUrdu: boolean = false;
  let hasArabic: boolean = false;

  // string has english characters
  if (str.match(/([\u0041-\u005A]|[\u0061-\u007E])+/g)) hasEnglish = true;

  // string has urdu words/characters
  if (str.match(/[\u0600-\u06ff]|[\u0750-\u077f]|[\ufd50-\ufd8f]|[\ufe70-\ufeff]/g)) hasUrdu = true;

  // string has arabic words/characters
  if (str.match(/\p{IsArabic}/gu)) hasArabic = true;

  return { hasEnglish, hasUrdu, hasArabic };
}

Destroy666 · Accepted Answer

JavaScript/ECMAScript uses this syntax for Unicode scripts:

\p{Script=Latin} for English text
\p{Script=Arabic} for Arabic/Urdu text
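As a minimal sketch of how these two escapes could be used from TypeScript: property escapes only work with the `u` (or `v`) flag. The names `detectScripts` and `ScriptFlags` are made up for this example and are not part of any library.

```typescript
// Built with the RegExp constructor so the sketch does not depend on the
// compiler's target setting. The literal form /\p{Script=Arabic}/u is
// equivalent when the compile target supports property escapes (ES2018+).
const LATIN = new RegExp("\\p{Script=Latin}", "u");
const ARABIC_SCRIPT = new RegExp("\\p{Script=Arabic}", "u");

// Hypothetical result shape for this example.
interface ScriptFlags {
  hasLatin: boolean;
  hasArabicScript: boolean; // Arabic *script*: Arabic, Urdu, Persian, ...
}

function detectScripts(text: string): ScriptFlags {
  return {
    hasLatin: LATIN.test(text),
    hasArabicScript: ARABIC_SCRIPT.test(text),
  };
}
```

Note that `hasArabicScript` only says the Arabic script is present; on its own it cannot tell Arabic from Urdu. Digits, spaces and most punctuation belong to neither script, so a string made only of those sets both flags to `false`.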

Now, to differentiate Arabic and Urdu, you would need to compare their ranges.

Arabic:

U+0600–U+06FF
U+0750–U+077F
U+0870–U+089F
U+08A0–U+08FF
U+FB50–U+FDFF
U+FE70–U+FEFF
U+10EC0–U+10EFF
U+1EE00–U+1EEFF

Urdu:

U+0600–U+06FF
U+0750–U+077F
U+FB50–U+FDFF
U+FE70–U+FEFF

As you can see, the Urdu ranges are just a subset of the Arabic script ranges, so once you know the text is in fact \p{Script=Arabic}, you can attempt to match it against these ranges.
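A rough sketch of that range check, assuming the Urdu ranges listed above. The function name `staysWithinUrduRanges` is hypothetical. Because the Urdu ranges sit inside the Arabic ones, a `true` result only means the text is compatible with Urdu, not that it is Urdu; a `false` result means at least one Arabic-script character falls outside the Urdu ranges.

```typescript
// Every Arabic-script character, matched one code point at a time.
const ARABIC_SCRIPT_CHARS = new RegExp("\\p{Script=Arabic}", "gu");

// The four Urdu ranges from the list above (all in the BMP, so no `u` flag needed).
const URDU_RANGES = /[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF]/;

// Returns null when the text has no Arabic-script characters at all,
// true when all of them fall inside the Urdu ranges, false otherwise.
function staysWithinUrduRanges(text: string): boolean | null {
  const chars = text.match(ARABIC_SCRIPT_CHARS);
  if (chars === null) return null;
  return chars.every((ch) => URDU_RANGES.test(ch));
}
```

For ordinary Arabic text this will usually return `true` as well, since everyday Arabic letters live in U+0600–U+06FF.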

However, many characters are shared between the languages written in Arabic script, and not every text you match will contain letters unique to one specific language. There is not much you can do about that with regex; you would need more advanced detection methods based on grammar, vocabulary, etc.
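When a text does happen to contain such letters, a character class can pick them up. The sketch below is only a heuristic under my own assumptions: the letter list is a small, incomplete sample of letters used in Urdu but not in standard Arabic, it is not taken from any library, and some of these letters also appear in other languages written in Arabic script. The name `hasUrduHintLetters` is hypothetical.

```typescript
// Assumed sample of Urdu-leaning letters (incomplete, illustrative only):
// U+0679 TTEH, U+0688 DDAL, U+0691 RREH, U+06BA NOON GHUNNA,
// U+06BE HEH DOACHASHMEE, U+06C1 HEH GOAL, U+06D2 YEH BARREE.
const URDU_HINT_LETTERS = /[\u0679\u0688\u0691\u06BA\u06BE\u06C1\u06D2]/;

// true  -> the text contains at least one of the sample letters
// false -> inconclusive: the text may still be Urdu written with shared letters
function hasUrduHintLetters(text: string): boolean {
  return URDU_HINT_LETTERS.test(text);
}
```

A `false` result should be read as "undetermined", which is exactly the case where grammar or vocabulary based detection would be needed.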

Tags:

javascript

regex

typescript

Opelcorsa2001
