 

How to improve a regular expression to extract phone numbers?

I am trying to use a regular expression to extract phone numbers from web pages. The problem is that it also picks up unwanted IDs and other page elements. If anyone can suggest improvements, it would be really helpful. Below is the code and the regular expression I am using in Python:

import re
from urllib2 import urlopen as uReq  # Python 2 module

# url holds the address of the page to scan
uClient = uReq(url)
page_html = uClient.read()
print(re.findall(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?", page_html))

For most websites, the script returns the values of other page elements, and only sometimes the actual phone numbers. Please suggest some modifications to the expression:

re.findall(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?",page_html)

My output for different URLs looks like this:

http://www.fraitagengineering.com/index.html
['(877) 424-4752']
http://hunterhawk.com/
['1481240672', '1481240643', '1479852632', '1478013441', '1481054486', '1481054560', '1481054598', '1481054588', '1476820246', '1481054521', '1481054540', '1476819829', '1481240830', '1479855986', '1479855990', '1479855994', '1479855895', '1476819760', '1476741750', '1476741750', '1476820517', '1479862863', '1476982247', '1481058326', '1481240672', '1481240830', '1513106590', '1481240643', '1479855986', '1479855990', '1479855994', '1479855895', '1479852632', '1478013441', '1715282331', '1041873852', '1736722557', '1525761106', '1481054486', '1476819760', '1481054560', '1476741750', '1481054598', '1476741750', '1481054588', '1476820246', '1481054521', '1476820517', '1479862863', '1481054540', '1476982247', '1476819829', '1481058326', '(925) 798-4950', '2093796260']
http://www.lbjewelrydesign.com/
['213-629-1823', '213-629-1823']

I want only phone numbers in the format (000) 000-0000 (note that there is a space after the closing parenthesis), (000)-000-0000 or 000-000-0000. Any suggestions are appreciated. Please note that I have already referred to this link: Find phone numbers in python script

I need an improved regex for my specific needs.

D-hash-pirit asked Dec 10 '25 15:12


1 Answer

The following regular expression can be used to match the samples that you presented and other similar numbers:

(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}
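To see why this helps, it may be useful to run both patterns on the same input. In the question's pattern, \D{0,3} also accepts zero separator characters, so any run of ten digits counts as a match; the pattern above requires an explicit hyphen or a parenthesized area code. The sketch below is only an illustration: the sample string is made up, combining one ID and one phone number from the question's output.

```python
import re

# Pattern from the question: \D{0,3} allows zero separators between digit groups.
loose = re.compile(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?")
# Pattern from this answer: separators are mandatory.
strict = re.compile(r"(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}")

# Hypothetical page fragment (not taken from a real page).
sample = 'img?v=1481240672 Call (925) 798-4950'

loose_hits = loose.findall(sample)
# group(0) is the whole match; the capturing group alone holds only the prefix.
strict_hits = [m.group(0) for m in strict.finditer(sample)]

print(loose_hits)   # includes the 10-digit ID
print(strict_hits)  # only the phone number
```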

The following example script can be used to test positive and negative cases and to experiment with the regular expression:

import re

positiveExamples = [
    '(000) 000-0000',
    '(000)-000-0000',
    '(000)000-0000',
    '000-000-0000'
]
negativeExamples = [
    '000 000-0000',
    '000-000 0000',
    '000 000 0000',
    '000000-0000',
    '000-0000000',
    '0000000000'
]

reObj = re.compile(r"(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}")

# match() only anchors at the start of each example string
for example in positiveExamples:
    print('Asserting positive example: %s' % example)
    assert reObj.match(example)

for example in negativeExamples:
    print('Asserting negative example: %s' % example)
    assert reObj.match(example) is None
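
One caveat when moving from match() to re.findall() on a whole page: because the pattern contains a capturing group, findall() returns only that group (for example '(877) ') rather than the full number. A minimal sketch of one way around this, assuming a non-capturing group plus digit lookarounds is acceptable for your pages; the names PHONE_RE and extract_phone_numbers and the HTML fragment are hypothetical, made up for illustration:

```python
import re

# Same alternatives as above, but with a non-capturing group (?:...) so that
# findall() returns whole matches. The lookarounds (?<!\d) and (?!\d) reject
# candidates that are embedded in a longer run of digits.
PHONE_RE = re.compile(r"(?<!\d)(?:\(\d{3}\)[\s-]?|\d{3}-)\d{3}-\d{4}(?!\d)")

def extract_phone_numbers(text):
    """Return all phone-number-like substrings found in text."""
    return PHONE_RE.findall(text)

# Hypothetical HTML fragment containing an ID, two numbers and one near miss.
html = '<a href="/p?id=1481240672">Call (877) 424-4752 or 213-629-1823</a> 1213-629-18234'
print(extract_phone_numbers(html))
```

The lookarounds are a design choice rather than a requirement: without them, a string such as 1213-629-18234 would still yield a match on its middle digits.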

Eduardo answered Dec 13 '25 05:12