What is an efficient way to find a pattern in a large text?

Question

I want to extract email addresses from a large text file. What is the best way to do it?

My idea is to find each '@' in the text and then run a regex over a substring around it, for example one that starts 256 characters before that position and is 512 characters long.

P.S.: Put simply, I want to know the best and most efficient way to find a pattern (such as email addresses) in a huge text.

Konerak · Accepted Answer

256 and 512 sound like arbitrary values.

You could indeed scan for the @ sign, but then you'd have to read forward and backward until you encounter a character that is not allowed in an email address (for example, another @ sign, whitespace, a backslash...).
Quoting Wikipedia:

The local-part of an e-mail address may be up to 64 characters long and the domain name may have a maximum of 255 characters.

So 64 characters before the @ and 255 characters after it would be better limits.

Now combine both methods and voilà, you have your algorithm.
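The combined approach can be sketched as follows, in Python (the question does not name a language). This is a minimal sketch rather than a complete validator: `find_candidates` is a hypothetical name, and `STOP_CHARS` is an assumed, deliberately small set of terminating characters, not the full rules of the email standards. Each candidate it returns would still need to be checked with a regex or a proper parser.

```python
# Characters treated as ending an address. This set is an illustrative
# assumption, not the complete rule set from the email standards.
STOP_CHARS = set(' \t\r\n@\\<>()[],;:"')

MAX_LOCAL = 64    # maximum local-part length quoted above
MAX_DOMAIN = 255  # maximum domain length quoted above


def find_candidates(text):
    """Yield substrings around each '@' that could be email addresses."""
    pos = text.find("@")
    while pos != -1:
        # Walk backward over the local part, at most MAX_LOCAL characters.
        start = pos
        while (start > 0 and pos - start < MAX_LOCAL
               and text[start - 1] not in STOP_CHARS):
            start -= 1
        # Walk forward over the domain, at most MAX_DOMAIN characters.
        end = pos + 1
        while (end < len(text) and end - pos - 1 < MAX_DOMAIN
               and text[end] not in STOP_CHARS):
            end += 1
        # Drop trailing dots (usually sentence punctuation), then keep the
        # candidate only if something remains on both sides of the '@'.
        candidate = text[start:end].rstrip(".")
        if start < pos and not candidate.endswith("@"):
            yield candidate
        pos = text.find("@", pos + 1)
```

Because `str.find` jumps straight from one '@' to the next, only a small window around each hit is examined instead of running a regex over every character of the text.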

Trey Hunner · Answer

It depends on how many false positives and false negatives you want. Email addresses tend to be made up of letters, numbers, and certain symbols. However, while it is probably extremely rare to see characters outside that set in a real email address, the standard certainly allows it. So you really need to decide how many real addresses you are willing to miss and how many strings you are willing to accept that match your regular expression but are not actually email addresses.

Here's one pattern that excludes many valid cases and also probably includes too many:

[A-Za-z0-9.!#$%&*+=?^_~-]{1,64}@[A-Za-z0-9.-]{1,255}\.[A-Za-z]{2,6}
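For a large file, one way to apply a pattern like this is to compile it once and stream the file line by line, so the whole text never has to sit in memory. The sketch below uses Python (the question does not name a language). `iter_emails` and `emails_in_file` are hypothetical helper names. The pattern is written with each hyphen at the end of its character class so that it is read literally, with the dot listed explicitly in the local part, and with lowercase letters allowed in the final part. It is still an approximation with the trade-offs described above.

```python
import re

# Approximate email pattern: hyphens sit at the end of each character class
# (literal hyphen, not a range) and the last part accepts either case.
EMAIL_RE = re.compile(
    r"[A-Za-z0-9.!#$%&*+=?^_~-]{1,64}@[A-Za-z0-9.-]{1,255}\.[A-Za-z]{2,6}"
)


def iter_emails(lines):
    """Yield every match of EMAIL_RE from an iterable of text lines."""
    for line in lines:
        for match in EMAIL_RE.finditer(line):
            yield match.group(0)


def emails_in_file(path, encoding="utf-8"):
    """Stream a file line by line and yield the matches found in it."""
    # errors="replace" keeps the scan going past undecodable bytes.
    with open(path, encoding=encoding, errors="replace") as handle:
        yield from iter_emails(handle)
```

Reading by line works here because this pattern never matches across a line break. A text with extremely long lines would need fixed-size chunks with some overlap instead.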

Tags: regex, text

salman
