PySpark: running the same operation on multiple columns in one go

Question

My DataFrame table contains rows such as

['row1', 'col_1', 'col_2', 'col_3', ..., 'col_N', 'alpha']

N (the number of columns except the first and the last ones) is relatively large.

Now I need to create another DataFrame from this one by multiplying each of the columns named col_i by the column alpha. Is there a smarter way than writing out the multiplication for every column by hand, as in:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlc = SQLContext(sc)

sqlc.sql('SELECT col_1 * alpha, col_2 * alpha, ..., col_N * alpha FROM table')

So I'd like to know whether it's possible to do the same operation on each column without specifically writing it for each one.

Katya Willard · Accepted Answer

Not sure how efficient this is, but I might do something like this:

col_names = df.columns
# start at 1 to skip the row column, stop before the last column (alpha)
for x in range(1, len(col_names) - 1):
    new_column_name = col_names[x] + "_x_alpha"  # descriptive name for the product column
    # refer to alpha by name; a fixed index like col_names[3] is only right with exactly two col_* columns
    df = df.withColumn(new_column_name, df[col_names[x]] * df["alpha"])

This will yield the same DataFrame you had originally, plus one new column per col_* column, each holding that column's value multiplied by the value in alpha.
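If the per-column loop feels heavy, one alternative is to build all the product expressions first and pass them to a single select. This is only a sketch: the helper name add_alpha_products and its parameters factor and key are made up for illustration, and it assumes the identifier column is called row and the multiplier column is called alpha.

```python
def add_alpha_products(df, factor="alpha", key="row"):
    """Return df plus one <col>_x_<factor> column per remaining column.

    Hypothetical helper: `factor` and `key` are assumed column names.
    """
    # Every column except the identifier and the multiplier gets a product.
    targets = [c for c in df.columns if c not in (key, factor)]
    # Build all Column expressions up front; nothing is computed yet.
    products = [(df[c] * df[factor]).alias(f"{c}_x_{factor}") for c in targets]
    # "*" keeps the original columns; the products are appended after them.
    return df.select("*", *products)
```

Calling add_alpha_products(df) should give the same columns as the loop, but as one projection instead of one withColumn call per column.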

When I run df.show() on my trivial example, I get this output:

row col_1 col_2 alpha col_1_x_alpha col_2_x_alpha
1   2     3     4     8             12           
2   3     4     5     15            20           
3   4     5     6     24            30

Then you could run a SQL query to select only the columns named col_*_x_alpha.
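If you would rather stay with SQL as in the question, the SELECT list itself can be generated from the column names instead of typed out. A minimal sketch, assuming the identifier column is named row and the multiplier is named alpha; build_product_query is a hypothetical helper, and the table name is simply whatever you registered the DataFrame as:

```python
def build_product_query(columns, table, factor="alpha", key="row"):
    """Build a SELECT that multiplies every other column by `factor`.

    Hypothetical helper: `key` and `factor` are assumed column names.
    """
    # Skip the identifier column and the multiplier itself.
    targets = [c for c in columns if c not in (key, factor)]
    # One "col * alpha AS col_x_alpha" term per remaining column.
    select_list = ", ".join(f"{c} * {factor} AS {c}_x_{factor}" for c in targets)
    return f"SELECT {select_list} FROM {table}"


# Column names from the trivial example; "table" is the name used in the question.
query = build_product_query(["row", "col_1", "col_2", "alpha"], "table")
print(query)
```

The resulting string can be handed to sqlc.sql(query), with df.columns supplying the column list.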

Tags: python, sql, select, dataframe, pyspark

mar tin
