How to get the base file name from a column of paths

Question

I have a DataFrame with the column of file paths.

I want to change it to only the file name.

My DataFrame looks like:

df = pd.DataFrame({
    'Sr No': [18, 19, 20],
    'Email': ['[email protected]', '[email protected]', '[email protected]'],
    'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv',
                 r'C:/Users\Test1.csv']
})

Sr No	Email	filename
18	[email protected]	C:/Users\Test.csv
19	[email protected]	C:/Users\Test1.csv
20	[email protected]	C:/Users\Test1.csv

filename should be only Test and Test1
Just need to write [email protected] at twice i.e. once for Test.csv and another for Test1.csv.

In short, my output should look like:

df = pd.DataFrame({
    'Sr No': [18, 19, 20],
    'Email': ['[email protected]', '[email protected]', '[email protected]'],
    'filename': ['Test', 'Test1', 'Test1']
})

Sr No	Email	filename
18	[email protected]	Test
19	[email protected]	Test1
20	[email protected]	Test1

I want to do it using python and pandas DataFrame.

I have 100 of rows in the 'filename' column.

I tried using:

import os

import glob

myfile = os.path.basename('C:/Users/Test.csv')
os.path.splitext(myfile)
print(os.path.splitext(myfile)[0])

But it is only useful for one path, how to apply it to entire column?

Pavithran Ramachandran · Accepted Answer

Use pandas.Series.apply to iterate through the column, and assign the result to new column.

df["filename"] = df["filename"].apply(os.path.basename)

or

df["filename"] = df["filename"].apply(lambda path: os.path.basename(path))

Example:

>>> df
   Sr No          Email            filename
0     18  [email protected]   C:/Users\Test.csv
1     19  [email protected]  C:/Users\Test1.csv
2     20  [email protected]  C:/Users\Test1.csv

>>> df["filename"] = df["filename"].apply(os.path.basename)
>>> df
   Sr No          Email   filename
0     18  [email protected]   Test.csv
1     19  [email protected]  Test1.csv
2     20  [email protected]  Test1.csv

There is also an option using Path('C:/Users\Test.csv').name from the pathlib module, but this is slower than os.path.basename because pathlib converts the string to a pathlib object.

Providing the slash prior to the file name is consistent, the fastest option is with pandas.Series.str.split (e.g. df['filename'].str.split('\', expand=True).iloc[:, -1]).

Tested in python 3.11.2 and pandas 2.0.0

`%timeit` testing

import pandas as pd
import os
from pathlib import Path

# sample dataframe with 30000 rows
df = pd.DataFrame({'Sr No': [18, 19, 20],
                   'Email': ['[email protected]', '[email protected]', '[email protected]'],
                   'filename': [r'C:/Users\Test.csv', r'C:/Users\Test1.csv', r'C:/Users\Test1.csv']})
df = pd.concat([df] * 10000, ignore_index=True)

# timeit tests
%timeit df["filename"].apply(lambda path: Path(path).name)
%timeit df["filename"].apply(os.path.basename)
%timeit df["filename"].apply(lambda path: os.path.basename(path))
%timeit df['filename'].str.split('\', expand=True).iloc[:, -1]

result

67.4 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.18 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
43 ms ± 1.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
15.2 ms ± 216 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

How to get the base file name from a column of paths

Tags:

python

path

pandas

filenames

ADS KUL

1 Answers

`%timeit` testing

result

Pavithran Ramachandran

Recent Activity

Donate For Us

How to get the base file name from a column of paths

Tags:

python

path

pandas

filenames

ADS KUL

1 Answers

%timeit testing

result

Pavithran Ramachandran

Related questions

Recent Activity

Donate For Us

`%timeit` testing