
New posts in apache-spark

How to calculate sum and count in a single groupBy?

How to create a udf in PySpark which returns an array of strings?

Why does starting StreamingContext fail with “IllegalArgumentException: requirement failed: No output operations registered, so nothing to execute”?

Rolling your own reduceByKey in Spark Dataset

In Apache Spark, why does RDD.union not preserve the partitioner?

PySpark and broadcast join example

Spark union column order

How to find Spark's installation directory?

java ubuntu apache-spark

Join two ordinary RDDs with/without Spark SQL

Multiple condition filter on dataframe

Left Anti join in Spark?

scala apache-spark

SQL query in Spark/Scala: Size exceeds Integer.MAX_VALUE

Why does Spark application fail with “ClassNotFoundException: Failed to find data source: kafka” as uber-jar with sbt assembly?

Is it possible to alias columns programmatically in spark sql?

How to add any new library like spark-csv in Apache Spark prebuilt version

PySpark: modify column values when another column value satisfies a condition

Environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON

How to write the resulting RDD to a CSV file in Spark (Python)

How to configure high performance BLAS/LAPACK for Breeze on Amazon EMR, EC2

How does Spark running on YARN account for Python memory usage?