Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to speed up pandas drop() method?

I have a large excel file to clean around 200000 rows. So Im using pandas to drop unwanted rows if the conditions meet but it takes some time to run.

My current code looks like this

def cleanNumbers(number):  # checks number if it is a valid number
    vaild = True
    try:
        num = pn.parse('+' + str(number), None)
        if not pn.is_valid_number(num):
            vaild = False
    except:
        vaild = False
    return vaild

for UncleanNum in tqdm(TeleNum):
    valid = cleanNumbers(UncleanNum)  # calling cleanNumbers function
    if valid is False:
        df = df.drop(df[df.telephone == UncleanNum].index)  
        # dropping row if number is not a valid number

It takes around 30 min for this line of code to finish. Is there a more efficient way to drop rows with pandas? If not can I use numpy to have the same output?

Im not that aquainted with pandas or numpy so if you have any tips to share it would be helpful.

Edit:

Im using phonenumbers lib to check if the telephone number is valid. If its not a valid phonenumber i drop the row that number is on.

Example data

address     name    surname     telephone
Street St.  Bill    Billinson   7398673456897<--let say this is wrong
Street St.  Nick    Nick        324523452345
Street St.  Sam     Sammy       234523452345
Street St.  Bob     Bob         32452345234534<--and this too
Street St.  John    Greg        234523452345

Output

address     name    surname     telephone
Street St.  Nick    Nick        324523452345
Street St.  Sam     Sammy       234523452345
Street St.  John    Greg        234523452345

This is what my code does but it slow.

like image 864
John Zapanta Avatar asked Nov 17 '25 03:11

John Zapanta


1 Answers

In my opinion here main bootleneck is not drop, but custom function repeating for large number of values.

Create list of all valid numbers and then filter by boolean indexing with Series.isin:

v = [UncleanNum for UncleanNum in tqdm(TeleNum) if cleanNumbers(UncleanNum)]

df = df[df.telephone.isin(v)]

EDIT:

After some testing solution should be simplify, because function return boolean:

df1 = df[df['telephone'].apply(cleanNumbers)]
like image 52
jezrael Avatar answered Nov 19 '25 19:11

jezrael