how to perform the equivalent of a correlated subquery in pandas

Question

I have a CSV file from the Kaggle Titanic competition as follows. The record format of this file is described by the following columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked. I want to analyze the data in this file and check whether passengers traveling in a group had a better survival rate. For this I assume that the value for Ticket will be the same for all passengers in a group.

I loaded the CSV in MS Access, and executed the following query to get the desired result set:

SELECT a.Ticket, a.PassengerId, a.Survived
FROM train a
WHERE 1 < (SELECT COUNT(*) FROM train b WHERE b.Ticket = a.Ticket)
ORDER BY a.Ticket

I am not being able to extract the same result set as above, without writing a loop.

Scott Boston · Accepted Answer

Let's see if this matches:

df.groupby(['Ticket']).filter(lambda x: x.Ticket.count()>1)[['Ticket','PassengerId','Survived']]

Or with Jezrael's suggestion:

df.groupby(['Ticket']).filter(lambda x: len(x)>1)[['Ticket','PassengerId','Survived']]

I am using groupby on Tickets then filtering my dataframe to those records where the count in that ticket group is greater than 1, using filter.

how to perform the equivalent of a correlated subquery in pandas

Tags:

python

pandas

Sudipto Chatterjee

1 Answers

Scott Boston

Recent Activity

Donate For Us

how to perform the equivalent of a correlated subquery in pandas

Tags:

python

pandas

Sudipto Chatterjee

1 Answers

Scott Boston

Related questions

Recent Activity

Donate For Us