Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Modify single value in pandas DataFrame with integer row and label column

I want to modify a single value in a DataFrame. The typical suggestion for doing this is to use df.at[] and reference the position as the index label and the column label, or to use df.iat[] and reference the position as the integer row and the integer column. But I want to reference the position as the integer row and the column label.

Assume this DataFrame:

dateindex apples oranges bananas
2021-01-01 14:00:01.384624 1 X 3
2021-01-05 13:43:26.203773 4 5 6
2021-01-31 08:23:29.837238 7 8 9
2021-02-08 10:23:09.095632 0 1 2
data = [{'apples':1, 'oranges':'X', 'bananas':3},
        {'apples':4, 'oranges':5,   'bananas':6},
        {'apples':7, 'oranges':8,   'bananas':9},
        {'apples':0, 'oranges':1,   'bananas':2}]
indexes = [pd.to_datetime('2021-01-01 14:00:01.384624'),
           pd.to_datetime('2021-01-05 13:43:26.203773'),
           pd.to_datetime('2021-01-31 08:23:29.837238'),
           pd.to_datetime('2021-02-08 10:23:09.095632')]
idx = pd.Index(indexes, name='dateindex')
df = pd.DataFrame(data, index=idx)

I want to change the value "X" to "2". I don't know the exact time; I just know that it's the first row. But I do know that I want to change the "oranges" column.

I want to do something like df.at[0,'oranges'], but I can't do that; I get a KeyError.

The best thing that I can figure out is to do df.at[df.index[0],'oranges'], but that seems so awkward when they've gone out of their way to provide both by-label and by-integer-offset interfaces. Is that the best thing?

like image 436
wfaulk Avatar asked Oct 29 '25 07:10

wfaulk


2 Answers

Wrt

The best thing that I can figure out is to do df.at[df.index[0],'oranges'], but that seems so awkward when they've gone out of their way to provide both by-label and by-integer-offset interfaces. Is that the best thing?

Yes, it is. And I agree, it is awkward. The old .ix used to support these mixed indexing cases better but its behaviour depended on the dtype of the axis, making it inconsistent. In the meanwhile...

The other options, which have been used in the other answers, can all issue the SettingWithCopy warning. It's not guaranteed to raise the issue but it might, based on what the indexing criteria are and how values are assigned.

Referencing Combining positional and label-based indexing and starting with this df, which has dateindex as the index:

                             apples oranges  bananas
dateindex
2021-01-01 14:00:01.384624        1      X         3
2021-01-05 13:43:26.203773        4      5         6
2021-01-31 08:23:29.837238        7      8         9
2021-02-08 10:23:09.095632        0      1         2

Using both options:

  1. with .loc or .at:
    df.at[df.index[0], 'oranges'] = -50
    
                                 apples oranges  bananas
    dateindex
    2021-01-01 14:00:01.384624        1     -50        3
    2021-01-05 13:43:26.203773        4      5         6
    2021-01-31 08:23:29.837238        7      8         9
    2021-02-08 10:23:09.095632        0      1         2
    
  2. with .iloc or .iat:
    df.iat[0, df.columns.get_loc('oranges')] = -20
    
                                 apples oranges  bananas
    dateindex
    2021-01-01 14:00:01.384624        1     -20        3
    2021-01-05 13:43:26.203773        4      5         6
    2021-01-31 08:23:29.837238        7      8         9
    2021-02-08 10:23:09.095632        0      1         2
    

FWIW, I find approach #1 more consistent since it can handle multiple row indexes without changing the functions/methods used: df.loc[df.index[[0, 2]], 'oranges'] but approach #2 needs a different column indexer when there are multiple columns: df.iloc[[0, 2], df.columns.get_indexer(['oranges', 'bananas'])].

like image 61
aneroid Avatar answered Oct 31 '25 05:10

aneroid


Solution with Series.iat

If it doesn't seem more awkward to you, you can use the iat method of pandas Series:

df["oranges"].iat[0] = 2

Time performance comparison with other methods

As this method doesn't raise any warning, it can be interesting to compare its time performance with other proposed solutions.

%%timeit
df.at[df.index[0], 'oranges'] = 2
# > 9.91 µs ± 47.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
df.iat[0, df.columns.get_loc('oranges')] = 2
# > 13.5 µs ± 74.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
df["oranges"].iat[0] = 2
# > 3.49 µs ± 16.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

The pandas.Series.iat method seems to be the most performant one (I took the median of three runs).

Let's try again with huge DataFrames

With a DatetimeIndex

# Generating random data
df_large = pd.DataFrame(np.random.randint(0, 50, (100000, 100000)))
df_large.columns = ["col_{}".format(i) for i in range(100000)]
df_large.index = pd.date_range(start=0, periods=100000)
# 2070-01-01 to 2243-10-16, a bit unrealistic

%%timeit
df_large.at[df_large.index[55555], 'col_55555'] = -2
# > 10.1 µs ± 85.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
df_large.iat[55555, df_large.columns.get_loc('col_55555')] = -2
# > 13.2 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
df_large["col_55555"].iat[55555] = -2
# > 3.31 µs ± 19 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

With a RangeIndex

# Generating random data
df_large = pd.DataFrame(np.random.randint(0, 50, (100000, 100000)))
df_large.columns = ["col_{}".format(i) for i in range(100000)]

%%timeit
df_large.at[df_large.index[55555], 'col_55555'] = 2
# > 4.5 µs ± 18.4 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
df_large.iat[55555, df_large.columns.get_loc('col_55555')] = 2
# > 13.5 µs ± 50.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

%%timeit
df_large["col_55555"].iat[55555] = 2
# > 3.49 µs ± 20.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Since it is a simple indexing with O(n) complexity, the size of the array doesn't change much the results, except when it comes to the "at + index" ; strangely enough, it shows worst performance with small dataframes. Thanks to the author wfaulk for spotting that using a RangeIndex decreases the access time of the "at + index" method. The time performance remains higher and constant when dealing with DatetimeIndex with pd.Series.iat.

like image 44
Whole Brain Avatar answered Oct 31 '25 05:10

Whole Brain



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!