I have a DataFrame with the column of file paths.
I want to change it to only the file name.
My DataFrame looks like:
df = pd.DataFrame({
    'Sr No': [18, 19, 20],
    'Email': ['[email protected]', '[email protected]', '[email protected]'],
    'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv',
                 r'C:/Users\Test1.csv']
})
| Sr No | filename | |
|---|---|---|
| 18 | [email protected] | C:/Users\Test.csv | 
| 19 | [email protected] | C:/Users\Test1.csv | 
| 20 | [email protected] | C:/Users\Test1.csv | 
In short, my output should look like:
df = pd.DataFrame({
    'Sr No': [18, 19, 20],
    'Email': ['[email protected]', '[email protected]', '[email protected]'],
    'filename': ['Test', 'Test1', 'Test1']
})
| Sr No | filename | |
|---|---|---|
| 18 | [email protected] | Test | 
| 19 | [email protected] | Test1 | 
| 20 | [email protected] | Test1 | 
I want to do it using python and pandas DataFrame.
I have 100 of rows in the 'filename' column.
I tried using:
import os
import glob
myfile = os.path.basename('C:/Users/Test.csv')
os.path.splitext(myfile)
print(os.path.splitext(myfile)[0])
But it is only useful for one path, how to apply it to entire column?
Use pandas.Series.apply to iterate through the column, and assign the result to new column.
df["filename"] = df["filename"].apply(os.path.basename)
or
df["filename"] = df["filename"].apply(lambda path: os.path.basename(path))
Example:
>>> df
   Sr No          Email            filename
0     18  [email protected]   C:/Users\Test.csv
1     19  [email protected]  C:/Users\Test1.csv
2     20  [email protected]  C:/Users\Test1.csv
>>> df["filename"] = df["filename"].apply(os.path.basename)
>>> df
   Sr No          Email   filename
0     18  [email protected]   Test.csv
1     19  [email protected]  Test1.csv
2     20  [email protected]  Test1.csv
There is also an option using Path('C:/Users\Test.csv').name from the pathlib module, but this is slower than os.path.basename because pathlib converts the string to a pathlib object.
Providing the slash prior to the file name is consistent, the fastest option is with pandas.Series.str.split (e.g. df['filename'].str.split('\\', expand=True).iloc[:, -1]).
Tested in python 3.11.2 and pandas 2.0.0
%timeit testingimport pandas as pd
import os
from pathlib import Path
# sample dataframe with 30000 rows
df = pd.DataFrame({'Sr No': [18, 19, 20],
                   'Email': ['[email protected]', '[email protected]', '[email protected]'],
                   'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv', r'C:/Users\Test1.csv']})
df = pd.concat([df] * 10000, ignore_index=True)
# timeit tests
%timeit df["filename"].apply(lambda path: Path(path).name)
%timeit df["filename"].apply(os.path.basename)
%timeit df["filename"].apply(lambda path: os.path.basename(path))
%timeit df['filename'].str.split('\\', expand=True).iloc[:, -1]
67.4 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
15.2 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With