How do you set the display precision in PySpark when calling .show()?
Consider the following example:
from math import sqrt
import pyspark.sql.functions as f
data = zip(
    map(lambda x: sqrt(x), range(100, 105)),
    map(lambda x: sqrt(x), range(200, 205))
)
df = sqlCtx.createDataFrame(data, ["col1", "col2"])
df.select([f.avg(c).alias(c) for c in df.columns]).show()
Which outputs:
#+------------------+------------------+
#|              col1|              col2|
#+------------------+------------------+
#|10.099262230352151|14.212583322380274|
#+------------------+------------------+
How can I change it so that it only displays 3 digits after the decimal point?
Desired output:
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+
This is a PySpark version of this Scala question. I'm posting it here because I could not find an answer when searching for PySpark solutions, and I think it may help others in the future.
The easiest option is to use pyspark.sql.functions.round():
from pyspark.sql.functions import avg, round
df.select([round(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+
This will maintain the values as numeric types.
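As a rough illustration of why keeping a numeric type matters, here is a plain-Python analogy (not Spark code). Python's built-in round() is only a stand-in for pyspark.sql.functions.round, and the two may differ in how they break ties. The point is that a rounded number can still be used in arithmetic and comparisons:

```python
# Plain-Python analogy only: built-in round() stands in for Spark's round().
col1_avg = 10.099262230352151  # average of col1 from the example above
col2_avg = 14.212583322380274  # average of col2 from the example above

rounded = [round(v, 3) for v in (col1_avg, col2_avg)]

# The results are still floats, so numeric operations keep working.
all_floats = all(isinstance(v, float) for v in rounded)

print(rounded)                  # [10.099, 14.213]
print(all_floats)               # True
print(rounded[1] > rounded[0])  # True
```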
The functions are the same in Scala and Python; the only difference is the import.
You can use format_number to format a number to desired decimal places as stated in the official api document:
Formats numeric column x to a format like '#,###,###.##', rounded to d decimal places, and returns the result as a string column.
from pyspark.sql.functions import avg, format_number
df.select([format_number(avg(c), 3).alias(c) for c in df.columns]).show()
#+------+------+
#|  col1|  col2|
#+------+------+
#|10.099|14.213|
#+------+------+
The transformed columns are of StringType, and a comma is used as the thousands separator, which becomes visible with larger values:
#+-----------+--------------+
#|       col1|          col2|
#+-----------+--------------+
#|500,100.000|50,489,590.000|
#+-----------+--------------+
As stated in the Scala version of this answer, you can use regexp_replace to replace the , with any string you want:
Replace all substrings of the specified string value that match regexp with rep.
from pyspark.sql.functions import avg, format_number, regexp_replace
df.select(
    [regexp_replace(format_number(avg(c), 3), ",", "").alias(c) for c in df.columns]
).show()
#+----------+------------+
#|      col1|        col2|
#+----------+------------+
#|500100.000|50489590.000|
#+----------+------------+
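The trade-off between formatted strings and numbers can also be sketched in plain Python, without a Spark session. The ",.3f" format spec is used here only as a stand-in that produces the same layout as format_number(x, 3) for these values; it is not how Spark implements the function. The input values are simply the ones shown in the output above.

```python
# Plain-Python analogy: the ",.3f" format spec stands in for format_number(x, 3).
values = [500100.0, 50489590.0]  # the large values from the output above

# Strings with a comma as the thousands separator.
formatted = [format(v, ",.3f") for v in values]

# Removing the separator, comparable to regexp_replace(..., ",", "").
stripped = [s.replace(",", "") for s in formatted]

# Both results are strings; convert explicitly if numbers are needed again.
restored = [float(s) for s in stripped]

print(formatted)  # ['500,100.000', '50,489,590.000']
print(stripped)   # ['500100.000', '50489590.000']
print(restored)   # [500100.0, 50489590.0]
```

If the values are needed for further calculations, rounding as in the first answer avoids this round trip through strings.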