
format phone number in csv using pandas

Python/pandas n00b. I have code that processes event data stored in csv files. Data from df["CONTACT PHONE NUMBER"] comes out with the phone number as `5555551212.0`. Obviously, the ".0" is a problem, but I imagine it's added because the column is read as a number?

Anyhoo, I decided that I should format the phone number for usability's sake.

The number comes from the csv file, unformatted. The number will always be ten digits: 5555551212, but I would like to display it as (555)555-1212.

import glob
import os
import pandas as pd
import sys

csvfiles = os.path.join(directory, '*.csv')
for csvfile in glob.glob(csvfiles):
    df = pd.read_csv(csvfile)
    #formatting the contact phone
    phone_nos = df["CONTACT PHONE NUMBER"]
    for phone_no in phone_nos:
        contactphone = "(%c%c%c)%c%c%c-%c%c%c%c" % tuple(map(ord,phone_no))

The last line gives me the following error: not enough arguments for format string

But maybe this isn't the pandas way of doing this. Since I'm iterating through an array, I also need to save the data in its existing column or rebuild that column after the phone numbers have been processed.

mattrweaver asked Jan 27 '26 18:01


2 Answers

I think phone numbers should be stored as a string.
When reading the csv you can ensure this column is read as a string:

pd.read_csv(filename, dtype={"CONTACT PHONE NUMBER": str})
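To see the difference, here is a minimal sketch that reads a small in-memory CSV both ways. The sample rows and the `EVENT` column are made up for illustration; only the `CONTACT PHONE NUMBER` column name comes from the question. A blank cell is one common reason pandas picks float64 for a column of digits, though I can't say for sure that is what happens with your files.

```python
import io

import pandas as pd

# Hypothetical sample data; the blank phone cell forces a float column by default.
csv_text = "EVENT,CONTACT PHONE NUMBER\nPicnic,5555551212\nConcert,\n"

# Default parsing: the blank cell becomes NaN, so the column is float64.
as_float = pd.read_csv(io.StringIO(csv_text))

# With dtype=str the digits are kept exactly as they appear in the file.
as_text = pd.read_csv(io.StringIO(csv_text), dtype={"CONTACT PHONE NUMBER": str})

first_float = as_float["CONTACT PHONE NUMBER"].iloc[0]
first_text = as_text["CONTACT PHONE NUMBER"].iloc[0]
```

Here `str(first_float)` carries the trailing ".0", while `first_text` is the plain ten-digit string. The missing value is still NaN in both cases.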

You can use the string methods, naively adding:

In [11]: s = pd.Series(['5554443333', '1114445555', np.nan, '123'])  # df["CONTACT PHONE NUMBER"]

# phone_nos = '(' + s.str[:3] + ')' + s.str[3:6] + '-' + s.str[6:10]

Edit: as Noah answers in a related question, you can do this more directly/efficiently using str.replace:

In [12]: phone_nos = s.str.replace(r'^(\d{3})(\d{3})(\d{4})$', r'(\1)\2-\3', regex=True)

In [13]: phone_nos
Out[13]:
0    (555)444-3333
1    (111)444-5555
2              NaN
3              123
dtype: object

But there is still a problem here: a malformed number (anything that is not precisely 10 digits) passes through unchanged, so you could NaN those:

In [14]: s.str.contains(r'^\d{10}$')  # note: NaN is truthy
Out[14]:
0     True
1     True
2      NaN
3    False
dtype: object

In [15]: phone_nos.where(s.str.contains(r'^\d{10}$'))
Out[15]:
0    (555)444-3333
1    (111)444-5555
2              NaN
3              NaN
dtype: object

Now, you might like to inspect the bad formats you have (maybe you have to change your output to encompass them, e.g. if they included a country code):

In [16]: s[~s.str.contains(r'^\d{10}$').astype(bool)]
Out[16]:
3    123
dtype: object
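Putting the steps above together, a rough sketch of a helper might look like this. The function name `format_phone_numbers` and the pattern constants are my own, not anything from pandas, and it assumes the column has already been read as strings.

```python
import numpy as np
import pandas as pd

# Hypothetical constants: one pattern to validate, one with groups to reformat.
TEN_DIGITS = r"^\d{10}$"
TEN_DIGITS_GROUPED = r"^(\d{3})(\d{3})(\d{4})$"


def format_phone_numbers(s):
    """Return (formatted, rejected) for a Series of phone-number strings."""
    # na=False makes missing values count as "not valid" and gives a bool mask.
    valid = s.str.contains(TEN_DIGITS, na=False)
    # regex=True is passed explicitly because the default differs across pandas versions.
    formatted = s.str.replace(TEN_DIGITS_GROUPED, r"(\1)\2-\3", regex=True).where(valid)
    # Rejected entries are present but malformed; missing values are left out.
    rejected = s[~valid & s.notna()]
    return formatted, rejected


s = pd.Series(["5554443333", "1114445555", np.nan, "123"])
formatted, rejected = format_phone_numbers(s)
```

Using `na=False` sidesteps the "NaN is truthy" wrinkle, at the cost of treating missing and malformed numbers the same in `formatted`. The `rejected` Series keeps only the malformed ones so you can inspect them.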
Andy Hayden answered Jan 29 '26 08:01


I think the problem is that the phone numbers are stored as float64, so a few small additions will fix your inner loop:

In [75]:

df['Phone_no']
Out[75]:
0    5554443333
1    1114445555
Name: Phone_no, dtype: float64
In [76]:

for phone_no in df['Phone_no']:
    contactphone = "(%c%c%c)%c%c%c-%c%c%c%c" % tuple(map(ord,list(str(phone_no)[:10])))
    print(contactphone)
(555)444-3333
(111)444-5555

However, I think it is easier just to have the phone numbers as strings. (@Andy_Hayden made a good point about missing values, so I made up the following dataset.)

In [121]:

print(df)
     Phone_no   Name
0  5554443333   John
1  1114445555   Jane
2         NaN  Betty

[3 rows x 2 columns]
In [122]:

df.dtypes
Out[122]:
Phone_no    float64
Name         object
dtype: object
# In [123]: There is no need to convert the entire DataFrame
# (e.g. df=df.astype('S4')); only the 'Phone_no' column needs to be converted.
In [124]:

df['PhoneNumber']=df['Phone_no'].astype(str).apply(lambda x: '('+x[:3]+')'+x[3:6]+'-'+x[6:10])
In [125]:

print(df)
       Phone_no   Name    PhoneNumber
0  5554443333.0   John  (555)444-3333
1  1114445555.0   Jane  (111)444-5555
2           NaN  Betty         (nan)-

[3 rows x 3 columns]

In [134]:
# Format only values that look like a number; anything else gets a placeholder.
df['PhoneNumber']=df['Phone_no'].astype(str).apply(
    lambda x: '('+x[:3]+')'+x[3:6]+'-'+x[6:10]
    if len(x)>=10 and set(x).issubset('.0123456789')
    else 'Phone number not in record')
In [135]:

print(df)
     Phone_no   Name                 PhoneNumber
0  5554443333   John               (555)444-3333
1  1114445555   Jane               (111)444-5555
2         NaN  Betty  Phone number not in record

[3 rows x 3 columns]
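As a possible alternative to the `apply` calls above, if the column is already float64 you can go through pandas' nullable integer dtype to get clean digit strings. This is only a sketch: it assumes every non-missing value is a whole number, and the intermediate name `digits` is just for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Phone_no": [5554443333.0, 1114445555.0, np.nan]})

# float64 -> nullable Int64 drops the ".0"; NaN becomes <NA> instead of raising.
digits = df["Phone_no"].astype("Int64").astype("string")

# Missing values stay missing through the .str methods and the concatenation.
df["PhoneNumber"] = "(" + digits.str[:3] + ")" + digits.str[3:6] + "-" + digits.str[6:10]
```

The missing row ends up as `<NA>` rather than a string like "(nan)-", so you can fill it with whatever placeholder you prefer via `fillna`.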
CT Zhu answered Jan 29 '26 08:01