Why I cannot replace the len below with "count" and keep same output behavior?. I mean, it runs without error, but the output is very different, if I change the len to count, I get an empty dataset.
import pandas as pd
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]}
df = pd.DataFrame(data)
pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc=len)
print(pivot_table)
Results in:
Value 10 20 30 40 50 60 70 80 90
Category
A 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN NaN
B NaN 1.0 NaN NaN 1.0 NaN NaN 1.0 NaN
C NaN NaN 1.0 NaN NaN 1.0 NaN NaN 1.0
size produces same result as len:
pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc="size")
Even weirder, I can make "count" work like "size" (or len) if I do this strange workaround: create a copy of Value with a different "ValueCopy" name :
df["ValueCopy"] = df["Value"]
pivot_table = df.pivot_table(index='Category', columns='Value', values='ValueCopy', aggfunc="count")
But if I change the pivot_table call to use "count" without the extra dummy column workaround:
pivot_table = df.pivot_table(index='Category', columns='Value', values='Value', aggfunc="count")
it results in:
Empty DataFrame
Columns: []
Index: [A, B, C]
Questions:
"count" work if it does not call len? (I tried to read the pivot_table Pandas code, but I found it too convoluted)In Polars, "count" works perfectly fine as:
import polars as pl
# Create a DataFrame
data = {
'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]
}
df = pl.DataFrame(data)
# Pivot the DataFrame
pivot_table = df.pivot(index='Category', columns='Value', values='Value', aggregate_function="count")
# Print the pivot table
print(pivot_table)
results in:
Shape: (3, 10)
┌──────────┬──────┬──────┬──────┬───┬──────┬──────┬──────┬──────┐
│ Category ┆ 10 ┆ 20 ┆ 30 ┆ … ┆ 60 ┆ 70 ┆ 80 ┆ 90 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ u32 ┆ u32 ┆ ┆ u32 ┆ u32 ┆ u32 ┆ u32 │
╞══════════╪══════╪══════╪══════╪═══╪══════╪══════╪══════╪══════╡
│ A ┆ 1 ┆ null ┆ null ┆ … ┆ null ┆ 1 ┆ null ┆ null │
│ B ┆ null ┆ 1 ┆ null ┆ … ┆ null ┆ null ┆ 1 ┆ null │
│ C ┆ null ┆ null ┆ 1 ┆ … ┆ 1 ┆ null ┆ null ┆ 1 │
└──────────┴──────┴──────┴──────┴───┴──────┴──────┴──────┴──────┘
Likewise DuckDb SQL pivot count has no trouble counting:
import pandas as pd
import duckdb
# Your existing DataFrame
data = {'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
'Value': [10, 20, 30, 40, 50, 60, 70, 80, 90]}
df = pd.DataFrame(data)
# Create a DuckDB connection
con = duckdb.connect()
# Register the DataFrame with DuckDB
con.register('df_pivot', df)
# Perform the pivot operation in DuckDB
query = """
PIVOT df_pivot
ON Value
USING Count(Value)
GROUP BY Category
"""
pivot_table = con.execute(query).fetchdf()
print(pivot_table)
results in:
Category 10 20 30 40 50 60 70 80 90
0 A 1 0 0 1 0 0 1 0 0
1 B 0 1 0 0 1 0 0 1 0
2 C 0 0 1 0 0 1 0 0 1
In Pandas, pivot_table at some point calls groupby(index+columns).
In your example that means grouping on the only two existing columns, so there are no values to count.
When using len in this situation it returns the length of each group as a DataFrame (using DataFrame.__len__, which returns len(self.index)) -- which IMO seems more odd than returning nothing.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With