
What is an efficient way to find a pattern in a big text?

Tags:

regex

text

I want to extract email addresses from a large text file. What is the best way to do it?

My idea is to find each '@' in the text and run a regex over a substring around it, starting (for example) 256 characters before that position and spanning 512 characters.

P.S.: Put simply, I want to know the most efficient way to find a pattern (such as email addresses) in a huge text.

salman asked Nov 23 '25 00:11



2 Answers

256 and 512 sound like arbitrary values.

  • You could indeed scan for the @ sign, but then you'd have to read forward and backward until you encounter a character that is not allowed in an email address (for example, another @ sign, whitespace, a backslash...)
  • Quoting Wikipedia:

The local-part of an e-mail address may be up to 64 characters long and the domain name may have a maximum of 255 characters.

So 64 and 255 would be better-founded limits.

Now combine both methods and voilà, you have your algorithm.
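A minimal Python sketch of that combination, for illustration only: find each @, then walk backward at most 64 characters and forward at most 255, stopping at the first character that is not allowed. The two character sets and the name `find_candidates` are assumptions of this sketch, not a full implementation of the address grammar, so the results are candidates that still need proper validation.

```python
import string

# Assumption: deliberately narrow sets of characters treated as legal here.
LOCAL_CHARS = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~.")
DOMAIN_CHARS = set(string.ascii_letters + string.digits + "-.")
MAX_LOCAL = 64    # local-part limit from the quote
MAX_DOMAIN = 255  # domain limit from the quote


def find_candidates(text):
    """Yield address-like substrings found by expanding around each '@'."""
    pos = text.find("@")
    while pos != -1:
        # Walk backward over local-part characters, at most MAX_LOCAL of them.
        start = pos
        while start > 0 and pos - start < MAX_LOCAL and text[start - 1] in LOCAL_CHARS:
            start -= 1
        # Walk forward over domain characters, at most MAX_DOMAIN of them.
        end = pos + 1
        while end < len(text) and end - pos - 1 < MAX_DOMAIN and text[end] in DOMAIN_CHARS:
            end += 1
        local = text[start:pos]
        # Drop trailing punctuation such as a sentence-ending full stop.
        domain = text[pos + 1:end].rstrip(".-")
        if local and "." in domain:
            yield local + "@" + domain
        pos = text.find("@", pos + 1)
```

Because `str.find` does the scanning, the per-character work only happens in the neighbourhood of each @, which is the point of searching for the @ sign first.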

Konerak answered Nov 25 '25 17:11



It depends on how many false positives and false negatives you can tolerate. Email addresses tend to be made up of letters, numbers, and certain symbols. However, while it is probably extremely rare to see characters outside that set in a real email address, the standard certainly allows them. So you need to decide how many real addresses you can afford to miss, and how many strings you will accept that match your regular expression but are not actually email addresses.

Here's one pattern that excludes many valid addresses and also probably matches too many strings that are not addresses:

[A-Za-z0-9!#$%&*+-=?^_~]{1,64}@[A-Za-z0-9-.]{1,255}\.[A-Z]{2,6}
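For a large file, one way to apply a pattern like this is to compile it once and stream the file line by line instead of loading it all into memory. The sketch below is an illustration under assumptions: it places each `-` last in its character class (written as `+-=`, the hyphen forms a range from `+` to `=`) and lets the final label match lowercase as well; `extract_emails` and the file name in the comment are hypothetical.

```python
import re

# Assumption: same shape as the pattern shown here, with '-' moved to the end
# of each character class so it is a literal, and [A-Za-z] for the last label.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9!#$%&*+=?^_~-]{1,64}@[A-Za-z0-9.-]{1,255}\.[A-Za-z]{2,6}"
)


def extract_emails(lines):
    """Yield every match of EMAIL_RE from an iterable of text lines."""
    for line in lines:
        for match in EMAIL_RE.finditer(line):
            yield match.group(0)


# Hypothetical usage; a file object yields one line at a time:
#     with open("big.txt", encoding="utf-8", errors="replace") as f:
#         for address in extract_emails(f):
#             print(address)
```

Reading by line assumes an address never spans a line break, which seems reasonable for this kind of extraction.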
Trey Hunner answered Nov 25 '25 15:11
