
In pandas, what would be the idiomatic way to select multiple columns based on different patterns?

Tags: python, pandas, r

I'm trying to replicate some of R4DS's dplyr exercises using Python's pandas, with the nycflights13.flights dataset. What I want to do is select, from that dataset:

  • The columns from year through day (inclusive);
  • All columns that end with "delay";
  • The distance and air_time columns.

In the book, Hadley uses the following syntax:

library("tidyverse")
library("nycflights13")

flights_sml <- select(flights,
   year:day,
   ends_with("delay"),
   distance,
   air_time
)

In pandas, I came up with the following "solution":

import pandas as pd
from nycflights13 import flights

flights_sml = pd.concat([
    flights.loc[:, 'year':'day'],
    flights.loc[:, flights.columns.str.endswith("delay")],
    flights.distance,
    flights.air_time,
], axis=1)

Another possible implementation:

flights_sml = flights.filter(regex='year|day|month|delay$|^distance$|^air_time$', axis=1)
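Two caveats about the regex version, sketched below on an empty toy frame: `filter(regex=...)` matches with `re.search`, so unanchored alternatives such as `day` also match substrings, and the result keeps the frame's original column order rather than the order of the pattern. The `weekday` column is hypothetical and is not part of the real flights dataset; it is there only to show the over-matching.

```python
import pandas as pd

# Empty toy frame; "weekday" is a hypothetical column added for illustration.
df = pd.DataFrame(
    columns=["year", "month", "day", "dep_delay", "air_time", "distance", "weekday"]
)

# Unanchored alternatives: "day" also matches "weekday".
loose = df.filter(regex="year|day|month|delay$|^distance$|^air_time$", axis=1)

# Anchored alternatives: only exact names, plus anything ending in "delay".
strict = df.filter(regex=r"^(year|month|day)$|delay$|^(distance|air_time)$", axis=1)

print(list(loose.columns))
print(list(strict.columns))
```

In both results `air_time` comes before `distance`, because that is their order in the frame.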

But I'm sure this is not the idiomatic way to write this kind of DataFrame operation. I dug around, but haven't found anything in the pandas API that fits this situation.

Pedro Vinícius asked Jan 21 '26 20:01


1 Answer

You are correct that the pd.concat approach is not idiomatic: it creates several intermediate DataFrames/Series and then concatenates them, which is a lot of extra work. Instead, you can build a list of the column names you want and select them in a single step.

For example (keeping the same column order):

cols = (
    ['year', 'month', 'day']
    + [col for col in flights.columns if col.endswith('delay')]
    + ['distance', 'air_time']
)
flights_sml = flights[cols]
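If you would rather not spell out the `year:day` range by hand, one possible variant is to build the same list from the column index itself. This is only a sketch: the small frame below is a made-up stand-in that reuses some of the flights column names with hypothetical values, and it assumes the column names are unique so that `Index.get_loc` returns an integer position.

```python
import pandas as pd

# Toy stand-in for nycflights13.flights (hypothetical values, real column names).
flights = pd.DataFrame({
    "year": [2013, 2013],
    "month": [1, 1],
    "day": [1, 1],
    "dep_time": [517, 533],
    "dep_delay": [2, 4],
    "arr_delay": [11, 20],
    "carrier": ["UA", "UA"],
    "air_time": [227, 227],
    "distance": [1400, 1416],
})

cols = flights.columns
selected = [
    # Positional slice equivalent to year:day, with the end made inclusive.
    *cols[cols.get_loc("year"): cols.get_loc("day") + 1],
    # Boolean mask over the column names, like ends_with("delay").
    *cols[cols.str.endswith("delay")],
    "distance",
    "air_time",
]

flights_sml = flights[selected]
print(list(flights_sml.columns))
```

Only column labels are manipulated until the final `flights[selected]`, so no intermediate DataFrames are built.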
Shaido answered Jan 23 '26 09:01