
PySpark: running the same operation on multiple columns in one go

My DataFrame table contains rows such as

['row1', 'col_1', 'col_2', 'col_3', ..., 'col_N', 'alpha']

N (the number of columns except the first and the last ones) is relatively large.

Now I need to create another DataFrame from this one by multiplying each of the col_i columns by the alpha column. Is there a smarter way than writing out the multiplication manually for each of these columns, as in:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

sqlc.sql('SELECT col_1 * alpha, col_2 * alpha, ..., col_N * alpha FROM table')

So I'd like to know whether it's possible to do the same operation on each column without specifically writing it for each one.

asked Jan 23 '26 17:01 by mar tin


1 Answer

Not sure how efficient this is, but I might do something like this:

col_names = df.columns
# Start at 1 to skip the row column; stop before the last entry to skip alpha.
for x in range(1, len(col_names) - 1):
    new_column_name = col_names[x] + "_x_alpha"  # descriptive name for the product column
    # Look up alpha by name rather than by a fixed position, so this works for any N.
    df = df.withColumn(new_column_name, df[col_names[x]] * df["alpha"])

This yields the DataFrame you had originally, plus one new column per col_* column holding that column's value multiplied by the value in alpha.

When I run df.show() on my trivial example, I get this output:

row col_1 col_2 alpha col_1_x_alpha col_2_x_alpha
1   2     3     4     8             12           
2   3     4     5     15            20           
3   4     5     6     24            30  

Then you could run a SQL query to select only the columns whose names match col_*_x_alpha.
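One way to get only the product columns in a single step is to build the SQL expressions as strings and hand them to selectExpr. This is a minimal sketch rather than a tested recipe: build_product_exprs is a hypothetical helper name, and it assumes the first column is called row and the multiplier column is called alpha, as in the trivial example above.

```python
def build_product_exprs(columns, factor="alpha", skip=("row",)):
    """Return one SQL expression string per data column, multiplying it by `factor`.

    `build_product_exprs`, `factor` and `skip` are illustrative names,
    not part of the PySpark API.
    """
    return [
        "{c} * {f} AS {c}_x_{f}".format(c=c, f=factor)
        for c in columns
        if c != factor and c not in skip
    ]


# Stand-in for df.columns from the trivial example.
columns = ["row", "col_1", "col_2", "alpha"]
exprs = build_product_exprs(columns)
print(exprs)

# With a real DataFrame `df`, the list could be passed on like this:
#     products = df.selectExpr(*build_product_exprs(df.columns))
```

Because the expressions are plain strings, the same list could also be joined with ", " and dropped into the SELECT clause of the query from the question.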

answered Jan 25 '26 07:01 by Katya Willard


