Given a Series s and DataFrame df, how do I operate on each column of df with s?
df = pd.DataFrame( [[1, 2, 3], [4, 5, 6]], index=[0, 1], columns=['a', 'b', 'c'] ) s = pd.Series([3, 14], index=[0, 1]) When I attempt to add them, I get all np.nan
df + s a b c 0 1 0 NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN What I thought I should get is
a b c 0 4 5 6 1 18 19 20 I've seen this kind of question several times over and have seen many other questions that involve some element of this. Most recently, I had to spend a bit of time explaining this concept in comments while looking for an appropriate canonical Q&A. I did not find one and so I thought I'd write one.
These questions usually arises with respect to a specific operation, but equally applies to most arithmetic operations.
Series from every column in a DataFrame?Series from every column in a DataFrame?Series from every column in a DataFrame?Series from every column in a DataFrame?Use apply() to Apply Functions to Columns in Pandas The apply() method allows to apply a function for a whole DataFrame, either across columns or rows. We set the parameter axis as 0 for rows and 1 for columns. The new appended e column is the sum of data in column a and b .
To assign new columns to a DataFrame, use the Pandas assign() method. The assign() returns the new object with all original columns in addition to new ones. Existing columns that are re-assigned will be overwritten. The length of the newly assigned column must match the number of rows in the DataFrame.
Accessing Element from Series with Position In order to access the series element refers to the index number. Use the index operator [ ] to access an element in a series. The index must be an integer. In order to access multiple elements from a series, we use Slice operation.
It is helpful to create a mental model of what Series and DataFrame objects are.
Series A Series should be thought of as an enhanced dictionary. This isn't always a perfect analogy, but we'll start here. Also, there are other analogies that you can make, but I am targeting a dictionary in order to demonstrate the purpose of this post.
index These are the keys that we can reference to get at the corresponding values. When the elements of the index are unique, the comparison to a dictionary becomes very close.
values These are the corresponding values that are keyed by the index.
DataFrame A DataFrame should be thought of as a dictionary of Series or a Series of Series. In this case the keys are the column names and the values are the columns themselves as Series objects. Each Series agrees to share the same index which is the index of the DataFrame.
columns These are the keys that we can reference to get at the corresponding Series.
index This the the index that all of the Series values agree to share.
columns and index objects They are the same kind of things. A DataFrames index can be used as another DataFrames columns. In fact, this happens when you do df.T to get a transpose.
values This is a two-dimensional array that contains the data in a DataFrame. The reality is that values is not what is stored inside the DataFrame object. (Well, sometimes it is, but I'm not about to try to describe the block manager). The point is, it is better to think of this as access to a two-dimensional array of the data.
These are sample pandas.Index objects that can be used as the index of a Series or DataFrame or can be used as the columns of a DataFrame:
idx_lower = pd.Index([*'abcde'], name='lower') idx_range = pd.RangeIndex(5, name='range') These are sample pandas.Series objects that use the pandas.Index objects above:
s0 = pd.Series(range(10, 15), idx_lower) s1 = pd.Series(range(30, 40, 2), idx_lower) s2 = pd.Series(range(50, 10, -8), idx_range) These are sample pandas.DataFrame objects that use the pandas.Index objects above:
df0 = pd.DataFrame(100, index=idx_range, columns=idx_lower) df1 = pd.DataFrame( np.arange(np.product(df0.shape)).reshape(df0.shape), index=idx_range, columns=idx_lower ) Series on Series When operating on two Series, the alignment is obvious. You align the index of one Series with the index of the other.
s1 + s0 lower a 40 b 43 c 46 d 49 e 52 dtype: int64 Which is the same as when I randomly shuffle one before I operate. The indices will still align.
s1 + s0.sample(frac=1) lower a 40 b 43 c 46 d 49 e 52 dtype: int64 And is not the case when instead I operate with the values of the shuffled Series. In this case, Pandas doesn't have the index to align with and therefore operates from a positions.
s1 + s0.sample(frac=1).values lower a 42 b 42 c 47 d 50 e 49 dtype: int64 Add a scalar
s1 + 1 lower a 31 b 33 c 35 d 37 e 39 dtype: int64 DataFrame on DataFrame The similar is true when operating between two DataFrames. The alignment is obvious and does what we think it should do:
df0 + df1 lower a b c d e range 0 100 101 102 103 104 1 105 106 107 108 109 2 110 111 112 113 114 3 115 116 117 118 119 4 120 121 122 123 124 It shuffles the second DataFrame on both axes. The index and columns will still align and give us the same thing.
df0 + df1.sample(frac=1).sample(frac=1, axis=1) lower a b c d e range 0 100 101 102 103 104 1 105 106 107 108 109 2 110 111 112 113 114 3 115 116 117 118 119 4 120 121 122 123 124 It is the same shuffling, but it adds the array and not the DataFrame. It is no longer aligned and will get different results.
df0 + df1.sample(frac=1).sample(frac=1, axis=1).values lower a b c d e range 0 123 124 121 122 120 1 118 119 116 117 115 2 108 109 106 107 105 3 103 104 101 102 100 4 113 114 111 112 110 Add a one-dimensional array. It will align with columns and broadcast across rows.
df0 + [*range(2, df0.shape[1] + 2)] lower a b c d e range 0 102 103 104 105 106 1 102 103 104 105 106 2 102 103 104 105 106 3 102 103 104 105 106 4 102 103 104 105 106 Add a scalar. There isn't anything to align with, so broadcasts to everything:
df0 + 1 lower a b c d e range 0 101 101 101 101 101 1 101 101 101 101 101 2 101 101 101 101 101 3 101 101 101 101 101 4 101 101 101 101 101 DataFrame on Series If DataFrames are to be thought of as dictionaries of Series and Series are to be thought of as dictionaries of values, then it is natural that when operating between a DataFrame and Series that they should be aligned by their "keys".
s0: lower a b c d e 10 11 12 13 14 df0: lower a b c d e range 0 100 100 100 100 100 1 100 100 100 100 100 2 100 100 100 100 100 3 100 100 100 100 100 4 100 100 100 100 100 And when we operate, the 10 in s0['a'] gets added to the entire column of df0['a']:
df0 + s0 lower a b c d e range 0 110 111 112 113 114 1 110 111 112 113 114 2 110 111 112 113 114 3 110 111 112 113 114 4 110 111 112 113 114 What about if I want s2 and df0?
s2: df0: | lower a b c d e range | range 0 50 | 0 100 100 100 100 100 1 42 | 1 100 100 100 100 100 2 34 | 2 100 100 100 100 100 3 26 | 3 100 100 100 100 100 4 18 | 4 100 100 100 100 100 When I operate, I get the all np.nan as cited in the question:
df0 + s2 a b c d e 0 1 2 3 4 range 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 4 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN This does not produce what we wanted, because Pandas is aligning the index of s2 with the columns of df0. The columns of the result includes a union of the index of s2 and the columns of df0.
We could fake it out with a tricky transposition:
(df0.T + s2).T lower a b c d e range 0 150 150 150 150 150 1 142 142 142 142 142 2 134 134 134 134 134 3 126 126 126 126 126 4 118 118 118 118 118 But it turns out Pandas has a better solution. There are operation methods that allow us to pass an axis argument to specify the axis to align with.
- sub + add * mul / div ** pow
And so the answer is simply:
df0.add(s2, axis='index') lower a b c d e range 0 150 150 150 150 150 1 142 142 142 142 142 2 134 134 134 134 134 3 126 126 126 126 126 4 118 118 118 118 118 It turns out axis='index' is synonymous with axis=0. As is axis='columns' synonymous with axis=1:
df0.add(s2, axis=0) lower a b c d e range 0 150 150 150 150 150 1 142 142 142 142 142 2 134 134 134 134 134 3 126 126 126 126 126 4 118 118 118 118 118 df0.sub(s2, axis=0) lower a b c d e range 0 50 50 50 50 50 1 58 58 58 58 58 2 66 66 66 66 66 3 74 74 74 74 74 4 82 82 82 82 82 df0.mul(s2, axis=0) lower a b c d e range 0 5000 5000 5000 5000 5000 1 4200 4200 4200 4200 4200 2 3400 3400 3400 3400 3400 3 2600 2600 2600 2600 2600 4 1800 1800 1800 1800 1800 df0.div(s2, axis=0) lower a b c d e range 0 2.000000 2.000000 2.000000 2.000000 2.000000 1 2.380952 2.380952 2.380952 2.380952 2.380952 2 2.941176 2.941176 2.941176 2.941176 2.941176 3 3.846154 3.846154 3.846154 3.846154 3.846154 4 5.555556 5.555556 5.555556 5.555556 5.555556 df0.pow(1 / s2, axis=0) lower a b c d e range 0 1.096478 1.096478 1.096478 1.096478 1.096478 1 1.115884 1.115884 1.115884 1.115884 1.115884 2 1.145048 1.145048 1.145048 1.145048 1.145048 3 1.193777 1.193777 1.193777 1.193777 1.193777 4 1.291550 1.291550 1.291550 1.291550 1.291550 It's important to address some higher level concepts first. Since my motivation is to share knowledge and teach, I wanted to make this as clear as possible.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With