How can I differentiate between Arabic and Urdu with regex?

I've been trying to find a way to match and distinguish between Urdu and Arabic purely in regex. I've found a few ways, but they aren't working for me. I don't know the languages well, but I know that the Urdu alphabet is partially derived from Arabic and uses some of its characters, so there has to be a way to distinguish between the two. If not with regex, is there another way to do so?

I'm creating a library in TypeScript that will detect three languages (English, Urdu, Arabic), and with that information I'll apply different fonts to the text depending on its language.

The first way I found was to match Arabic with this regex:

  /[\u0600-\u06ff]|[\u0750-\u077f]|[\ufb50-\ufbc1]|[\ufbd3-\ufd3f]|[\ufd50-\ufd8f]|[\ufd92-\ufdc7]|[\ufe70-\ufefc]|[\uFDF0-\uFDFD]/g

and Urdu with this one:

  /[\u0600-\u06ff]|[\u0750-\u077f]|[\ufd50-\ufd8f]|[\ufe70-\ufeff]/g

The issue is that the ranges used to identify Urdu are also part of what's used to identify Arabic, so the two patterns can't tell the languages apart.

The second way I found to identify Arabic is to use /\p{IsArabic}/gu, but when I enter this into regexr, regex101, or my code, I get an error saying it isn't a recognised Unicode category.

The following is the block of code I'm using to identify the language:

interface LanguageInterface {
  hasEnglish: boolean;
  hasUrdu: boolean;
  hasArabic: boolean;
}

function getLang(str: string): LanguageInterface {
  let hasEnglish: boolean = false;
  let hasUrdu: boolean = false;
  let hasArabic: boolean = false;

  // string has english characters (A-Z, a-z)
  if (str.match(/([\u0041-\u005A]|[\u0061-\u007A])+/g)) hasEnglish = true;

  // string has urdu words/characters
  if (str.match(/[\u0600-\u06ff]|[\u0750-\u077f]|[\ufd50-\ufd8f]|[\ufe70-\ufeff]/g)) hasUrdu = true;

  // string has arabic words/characters
  // (this is the line that raises the "not a recognised category" error)
  if (str.match(/\p{IsArabic}/gu)) hasArabic = true;

  return { hasEnglish, hasUrdu, hasArabic };
}

Opelcorsa2001 asked Oct 15 '25 09:10


1 Answer

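JavaScript/ECMAScript uses this syntax for Unicode scripts (the regex needs the u flag). \p{IsArabic} is the syntax of other regex flavours such as Java and .NET, which is why it is rejected here: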

  • \p{Script=Latin} for English text
  • \p{Script=Arabic} for Arabic/Urdu text
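As a minimal sketch of the two property escapes above, assuming a runtime and compile target of ES2018 or later: the names LATIN, ARABIC_SCRIPT, ScriptInfo and detectScripts are made up for illustration and are not part of the question's code.

```typescript
// Script property escapes are only valid with the `u` flag (ES2018+).
// No `g` flag here, so RegExp.prototype.test stays stateless between calls.
const LATIN = /\p{Script=Latin}/u;
const ARABIC_SCRIPT = /\p{Script=Arabic}/u;

interface ScriptInfo {
  hasLatin: boolean;
  hasArabicScript: boolean; // true for Arabic AND Urdu text alike
}

// Reports which of the two scripts occur anywhere in the text.
function detectScripts(text: string): ScriptInfo {
  return {
    hasLatin: LATIN.test(text),
    hasArabicScript: ARABIC_SCRIPT.test(text),
  };
}
```

Note that this only tells you the script, not the language: both an Arabic and an Urdu string set hasArabicScript.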

Now, to differentiate Arabic and Urdu, you would need to compare their ranges.

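Arabic:

  • U+0600–U+06FF (Arabic)
  • U+0750–U+077F (Arabic Supplement)
  • U+0870–U+089F (Arabic Extended-B)
  • U+08A0–U+08FF (Arabic Extended-A)
  • U+FB50–U+FDFF (Arabic Presentation Forms-A)
  • U+FE70–U+FEFF (Arabic Presentation Forms-B)
  • U+10EC0–U+10EFF (Arabic Extended-C)
  • U+1EE00–U+1EEFF (Arabic Mathematical Alphabetic Symbols)

Urdu:

  • U+0600–U+06FF
  • U+0750–U+077F
  • U+FB50–U+FDFF
  • U+FE70–U+FEFF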
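If you do want to match by range, the lists above could be written roughly as follows. This is a sketch only: ARABIC_BLOCKS, URDU_BLOCKS and the two helper functions are illustrative names. The one thing to watch is that code points above U+FFFF need the \u{...} escape form, which is only valid with the u flag.

```typescript
// All Arabic-script blocks listed above. The last two ranges lie outside
// the Basic Multilingual Plane, hence \u{...} and the `u` flag.
const ARABIC_BLOCKS =
  /[\u0600-\u06FF\u0750-\u077F\u0870-\u089F\u08A0-\u08FF\uFB50-\uFDFF\uFE70-\uFEFF\u{10EC0}-\u{10EFF}\u{1EE00}-\u{1EEFF}]/u;

// The narrower set of blocks listed for Urdu.
const URDU_BLOCKS = /[\u0600-\u06FF\u0750-\u077F\uFB50-\uFDFF\uFE70-\uFEFF]/u;

function inArabicBlocks(text: string): boolean {
  return ARABIC_BLOCKS.test(text);
}

function inUrduBlocks(text: string): boolean {
  return URDU_BLOCKS.test(text);
}
```

Ordinary Arabic and Urdu words pass both checks, so the two functions only disagree on characters from the extended blocks.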

Since the ranges used for Urdu are just a subset of those used for the Arabic script, as you can see, you can first confirm that the text is in fact \p{Script=Arabic} and then attempt to match the narrower ranges.

However, many characters are shared between the languages written in the Arabic script, and not every piece of text you match will contain letters unique to one specific language. There's not much you can do about that with regex alone; you would need more advanced detection methods, based on grammar, vocabulary, etc.
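For text that does contain language-specific letters, a rough heuristic might look like the sketch below. The letter lists are my own assumption, not an authoritative inventory: the first set holds letters that Urdu uses and the standard Arabic alphabet does not, the second holds code points that Arabic text normally uses where Urdu uses a different one. Several of the "Urdu" letters also occur in Persian, so this only makes sense if Arabic and Urdu are the only candidates, as in the question. All names (URDU_HINTS, ARABIC_HINTS, guessArabicScriptLanguage, and so on) are hypothetical.

```typescript
// Assumed Urdu-only letters: U+0679 tteh, U+067E peh, U+0686 tcheh,
// U+0688 ddal, U+0691 rreh, U+0698 jeh, U+06A9 keheh, U+06AF gaf,
// U+06BA noon ghunna, U+06BE heh doachashmee, U+06C1 heh goal,
// U+06CC Farsi yeh, U+06D2 yeh barree.
const URDU_HINTS =
  /[\u0679\u067E\u0686\u0688\u0691\u0698\u06A9\u06AF\u06BA\u06BE\u06C1\u06CC\u06D2]/gu;

// Assumed Arabic-typical code points: U+0629 teh marbuta, U+0643 kaf, U+064A yeh.
const ARABIC_HINTS = /[\u0629\u0643\u064A]/gu;

const ANY_ARABIC_SCRIPT = /\p{Script=Arabic}/u;

type ArabicScriptGuess = "urdu" | "arabic" | "ambiguous" | "none";

// With a global regex, String.prototype.match returns all matches or null.
function countMatches(text: string, pattern: RegExp): number {
  return (text.match(pattern) || []).length;
}

// Compares how many hint letters of each kind occur in the text.
function guessArabicScriptLanguage(text: string): ArabicScriptGuess {
  if (!ANY_ARABIC_SCRIPT.test(text)) return "none";
  const urdu = countMatches(text, URDU_HINTS);
  const arabic = countMatches(text, ARABIC_HINTS);
  if (urdu > arabic) return "urdu";
  if (arabic > urdu) return "arabic";
  return "ambiguous"; // only shared letters, or an even split
}
```

Short strings made only of shared letters come back as "ambiguous", which is exactly the case where you would need vocabulary or a proper language-detection library instead.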

Destroy666 answered Oct 16 '25 23:10


