I would like to do some DBSCAN on Spark. I have currently found 2 implementations:
I have tested the first one with the sbt configuration given in its github but:
functions in the jar are not the same as those in the doc or in the source on github. For example, I cannot find the train function in the jar
I manage to run a test with the fit function (found in the jar) but a bad configuration of epsilon (a little to big) put the code in an infinite loop.
code :
val model = DBSCAN.fit(eps, minPoints, values, parallelism)
Has someone managed to do someting with the first library?
Has someone tested the second one?
You can also consider using smile which provides an implementation of DBSCAN. You would have to use groupBy combined with either mapGroups or flatMapGroups in the most direct way and you would run dbscan there. Here's an example:
  import smile.clustering._
  val dataset: Array[Array[Double]] = Array(
    Array(100, 100),
    Array(101, 100),
    Array(100, 101),
    Array(100, 100),
    Array(101, 100),
    Array(100, 101),
    Array(0, 0),
    Array(1, 0),
    Array(1, 2),
    Array(1, 1)
  )
  val dbscanResult = dbscan(dataset, minPts = 3, radius = 5)
  println(dbscanResult)
  // output
  DBSCAN clusters of 10 data points:
  0     6 (60.0%)
  1     4 (40.0%)
  Noise     0 ( 0.0%)
You can also write a User Defined Aggregate Function (UDAF) if you need to eek out more performance.
I use this approach at work to do clustering of time-series data, so grouping using Spark's time window function and then being able to execute DBSCAN within each window allows us to parallelize the implementation.
I was inspired by the following article to do this
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With