How do I filter out rows in a pandas DataFrame where the first letter of a column value is 'Z' (or any other character)?
I have the following pandas DataFrame, df (which has more than 25,000 rows).
TIME_STAMP  Activity    Action  Quantity    EPIC    Price   Sub-activity    Venue
0   2017-08-30 08:00:05.000 Allocation  BUY 50  RRS 77.6    CPTY    066
1   2017-08-30 08:00:05.000 Allocation  BUY 50  RRS 77.6    CPTY    066
3   2017-08-30 08:00:09.000 Allocation  BUY 91  BATS    47.875  CPTY    PXINLN
4   2017-08-30 08:00:10.000 Allocation  BUY 43  PNN 8.07    CPTY    WCAPD
5   2017-08-30 08:00:10.000 Allocation  BUY 270 SGE 6.93    CPTY    PROBDMAD
I am trying to remove all the rows where the 1st letter of the Venue is 'Z'.
For example, my usual filter code would be something like this (filtering out all rows where the Venue is '066'):
df = df[df.Venue != '066']
I can see that the following list comprehension picks out the values I need, but I am not sure how to apply it as a filter on the DataFrame.
[k for k in df.Venue if 'Z' not in k]
Use str[0] to select the first character, or use startswith, or contains with the regex anchor ^ for the start of the string. To invert a boolean mask, use ~:
df1 = df[df.Venue.str[0] != 'Z']
df1 = df[~df.Venue.str.startswith('Z')]
df1 = df[~df.Venue.str.contains('^Z')]
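The masks above assume every Venue is a string. As a minimal sketch, assuming some rows may have a missing Venue (the sample data below is made up for illustration), the `na=False` argument of `str.startswith` and `str.contains` keeps the mask purely boolean so that `~` can invert it:

```python
import pandas as pd

# Hypothetical sample data; the None entry stands in for a missing Venue.
df = pd.DataFrame({"Venue": ["066", "ZXQ", "PXINLN", None, "WCAPD"]})

# na=False treats missing values as "does not start with Z",
# so the mask contains only True/False and ~ inverts it safely.
keep = ~df.Venue.str.startswith("Z", na=False)
df1 = df[keep]

# The regex variant accepts the same na argument.
keep_regex = ~df.Venue.str.contains("^Z", na=False)
df2 = df[keep_regex]

print(df1.index.tolist())  # [0, 2, 3, 4]
```

With this choice the row with the missing Venue is kept. If such rows should be dropped instead, one option is to call `df.dropna(subset=["Venue"])` first.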
If there are no NaN values, a list comprehension is faster:
df1 = df[[x[0] != 'Z' for x in df.Venue]]
df1 = df[[not x.startswith('Z') for x in df.Venue]]
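The list comprehensions above raise an error as soon as a value is not a string, because a missing value has no `startswith` method and cannot be indexed. A guarded variant could look like the sketch below; the sample data and the decision to keep rows with a missing Venue are assumptions made for illustration:

```python
import pandas as pd

# Hypothetical sample data with one missing Venue.
df = pd.DataFrame({"Venue": ["066", "ZXQ", None, "WCAPD"]})

# Drop a row only when the value is a string that starts with 'Z';
# non-string entries (such as missing values) are kept.
mask = [not (isinstance(x, str) and x.startswith("Z")) for x in df.Venue]
df1 = df[mask]

print(df1.index.tolist())  # [0, 2, 3]
```

The `isinstance` check adds a little work per element, so this is likely somewhat slower than the unguarded comprehension.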
For the case where you do not have NaN values, you can convert the NumPy representation of a series to type '<U1' and test equality:
df1 = df[df['A'].values.astype('<U1') != 'Z']
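To see why the '<U1' conversion works: casting to a one-character Unicode dtype truncates every string to its first character, so the comparison with 'Z' becomes a plain vectorised NumPy test. The sketch below uses made-up sample data, and it calls `to_numpy(dtype=object)` instead of `.values` as an assumption, so that the example starts from a plain NumPy object array regardless of the pandas string dtype in use:

```python
import pandas as pd

# Hypothetical sample data without missing values.
df = pd.DataFrame({"A": ["ZEBRA", "APPLE", "ZOO", "MANGO"]})

# Casting to '<U1' keeps only the first character of each string.
first_chars = df["A"].to_numpy(dtype=object).astype("<U1")
print(first_chars.tolist())  # ['Z', 'A', 'Z', 'M']

# Comparing the truncated array with 'Z' yields a boolean NumPy mask.
mask = first_chars != "Z"
df1 = df[mask]
print(df1["A"].tolist())  # ['APPLE', 'MANGO']
```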
Performance comparison on 100,000 random 10-character uppercase strings:
import pandas as pd
from string import ascii_uppercase
from random import choice
L = [''.join(choice(ascii_uppercase) for _ in range(10)) for i in range(100000)]
df = pd.DataFrame({'A': L})
%timeit df['A'].values.astype('<U1') != 'Z'       # 4.05 ms per loop
%timeit [x[0] != 'Z' for x in df['A']]            # 11.9 ms per loop
%timeit [not x.startswith('Z') for x in df['A']]  # 23.7 ms per loop
%timeit ~df['A'].str.startswith('Z')              # 53.6 ms per loop
%timeit df['A'].str[0] != 'Z'                     # 53.7 ms per loop
%timeit ~df['A'].str.contains('^Z')               # 127 ms per loop
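`%timeit` is an IPython magic, so the lines above only run in IPython or Jupyter. As a rough sketch of how a similar comparison could be run from a plain Python script with the standard `timeit` module (the smaller row count, the repeat settings, and the names `rows`, `candidates`, and `timings` are choices made here, not part of the original benchmark):

```python
import timeit
from random import choice
from string import ascii_uppercase

import pandas as pd

# Smaller than the original benchmark so the script finishes quickly.
rows = ["".join(choice(ascii_uppercase) for _ in range(10)) for _ in range(10_000)]
df = pd.DataFrame({"A": rows})

# Each candidate builds the boolean mask only; no filtering is timed.
candidates = {
    "str[0]": lambda: df["A"].str[0] != "Z",
    "startswith": lambda: ~df["A"].str.startswith("Z"),
    "list comprehension": lambda: [x[0] != "Z" for x in df["A"]],
}

# Best of 3 repeats, 5 calls per repeat, reported as seconds per call.
timings = {
    name: min(timeit.repeat(fn, number=5, repeat=3)) / 5
    for name, fn in candidates.items()
}

for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name}: {seconds * 1e3:.2f} ms per loop")
```

Absolute numbers depend on the machine and on the pandas and NumPy versions, so only the relative ordering is worth comparing with the figures above.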