I have a CSV file from the Kaggle Titanic competition as follows. The record format of this file is described by the following columns: PassengerId, Survived, Pclass, Name, Sex, Age, SibSp, Parch, Ticket, Fare, Cabin, Embarked. I want to analyze the data in this file and check whether passengers traveling in a group had a better survival rate. For this I assume that the value for Ticket will be the same for all passengers in a group.
I loaded the CSV in MS Access, and executed the following query to get the desired result set:
SELECT a.Ticket, a.PassengerId, a.Survived
FROM train a
WHERE 1 < (SELECT COUNT(*) FROM train b WHERE b.Ticket = a.Ticket)
ORDER BY a.Ticket
I am not being able to extract the same result set as above, without writing a loop.
Let's see if this matches:
df.groupby(['Ticket']).filter(lambda x: x.Ticket.count()>1)[['Ticket','PassengerId','Survived']]
Or with Jezrael's suggestion:
df.groupby(['Ticket']).filter(lambda x: len(x)>1)[['Ticket','PassengerId','Survived']]
I am using groupby on Tickets then filtering my dataframe to those records where the count in that ticket group is greater than 1, using filter.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With