I am running JanusGraph (0.1.0) with Spark (1.6.1) on a single machine. I did my configuration as described here. When accessing the graph on the gremlin-console with the SparkGraphComputer, it is always empty. I cannot find any error in the logfiles, it is just an empty graph.
Is anyone using JanusGraph with Spark and can share his configuration and properties?
Using a JanusGraph, I get the expected Output:
gremlin> graph=JanusGraphFactory.open('conf/test.properties')
==>standardjanusgraph[cassandrathrift:[127.0.0.1]]
gremlin> g=graph.traversal()
==>graphtraversalsource[standardjanusgraph[cassandrathrift:[127.0.0.1]], standard]
gremlin> g.V().count()
14:26:10 WARN  org.janusgraph.graphdb.transaction.StandardJanusGraphTx  - Query requires iterating over all vertices [()]. For better performance, use indexes
==>1000001
gremlin>
Using a HadoopGraph with Spark as GraphComputer, the graph is empty:
gremlin> graph=GraphFactory.open('conf/test.properties')
==>hadoopgraph[cassandrainputformat->gryooutputformat]
gremlin> g=graph.traversal().withComputer(SparkGraphComputer)
==>graphtraversalsource[hadoopgraph[cassandrainputformat->gryooutputformat], sparkgraphcomputer]
gremlin> g.V().count()
            ==>0==============================================>   (14 + 1) / 15]
My conf/test.properties:
#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.janusgraph.hadoop.formats.cassandra.CassandraInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat
gremlin.hadoop.memoryOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
gremlin.hadoop.deriveMemory=false
gremlin.hadoop.jarsInDistributedCache=true
gremlin.hadoop.inputLocation=none
gremlin.hadoop.outputLocation=output
#
# Titan Cassandra InputFormat configuration
#
janusgraphmr.ioformat.conf.storage.backend=cassandrathrift
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
janusgraphmr.ioformat.conf.storage.keyspace=janusgraph
storage.backend=cassandrathrift
storage.hostname=127.0.0.1
storage.keyspace=janusgraph
#
# Apache Cassandra InputFormat configuration
#
cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
cassandra.input.keyspace=janusgraph
cassandra.input.predicate=0c00020b0001000000000b000200000000020003000800047fffffff0000
cassandra.input.columnfamily=edgestore
cassandra.range.batch.size=2147483647
#
# SparkGraphComputer Configuration
#
spark.master=spark://127.0.0.1:7077
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.executor.memory=100g
gremlin.spark.persistContext=true
gremlin.hadoop.defaultGraphComputer=org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer
HDFS seems to be configured correctly as described here:
gremlin> hdfs
==>storage[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_178390072_1, ugi=cassandra (auth:SIMPLE)]]]
A tutorial for running JanusGraph on Cassandra and Elasticsearch using Elassandra, and then visualizing the graph using Graphexp. This will create files and potentially overwrite files on your box. Run with caution. I think only is a problem if you have an existing janus-graph-0.5.2 installation and it's at ~/lib/janusgraph-0.5.2 .
Running on-prem you may get better deals on high end servers, so in this case, you should consider running Spark and Cassandra in the same cluster for high performance computing. Regardless where you run your workloads, you have two approaches that you can use to integrate Spark and Cassandra.
In general, all nodes in a cluster have the same seed list. To change the configuration setting follow steps – open cassandra.yaml file. use the command ctrl+f to search in a file. Search for seeds. You will see the following property in file seeds: “127.0.0.1”. It is default setting for the cluster node.
A node in the cluster contains keyspaces, tables, schema information, etc. In Cassandra, cassandra.yaml is the main configuration file in which we can change the default setting as per requirements and after any changes in cassandra.yaml file you must remember to restart the node to take effect.
Try fixing these properties:
janusgraphmr.ioformat.conf.storage.keyspace=janusgraph
storage.keyspace=janusgraph
Replace with:
janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
storage.cassandra.keyspace=janusgraph
The default keyspace name is janusgraph, so despite the mistakes on the property names, I don't think you would have observed that problem unless you loaded your data using a different keyspace name.
The latter property is described in the Configuration Reference. Also, keep an eye on this open issue to improve the docs for Hadoop-Graph usage.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With