Commit dee1400

Merge pull request #1195 from pluradj/master

merge 0.2 fixes to master

2 parents 1fabde7 + 98e5b23

11 files changed: +238 additions, -80 deletions

.travis.yml

Lines changed: 4 additions & 2 deletions
@@ -91,8 +91,10 @@ install:
   echo "Building all modules for Coverity analysis";
   travis_retry travis_wait \
     mvn install -DskipTests=true -Dmaven.javadoc.skip=true --batch-mode --show-version;
- elif [ "${TRAVIS_BRANCH}" == "${COVERITY_BRANCH_NAME}" ] || [ -n "${COVERITY_ONLY:-}" ]; then
-   echo "Skipping module build on Coverity branch/job";
+ elif [ "${TRAVIS_BRANCH}" == "${COVERITY_BRANCH_NAME}" ] || ! [ -z "${COVERITY_ONLY:-}" ]; then
+   echo "Building all modules for test-compile coverage, but skipping Coverity upload";
+   travis_retry travis_wait \
+     mvn install -DskipTests=true -Dmaven.javadoc.skip=true --batch-mode --show-version;
  else
   echo "Building janusgraph-${MODULE} and dependencies";
   travis_retry travis_wait \
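The `-` and `+` sides of the `elif` above differ only in the style of the emptiness test: `[ -n "$x" ]` and `! [ -z "$x" ]` are equivalent for this check. A quick sketch (not part of the commit, function names are illustrative only) demonstrating the equivalence:

```shell
# Two forms of "is the variable non-empty?", as in the COVERITY_ONLY check above.
check_neg_z() { ! [ -z "${1:-}" ] && echo "set" || echo "unset"; }
check_n()     {   [ -n "${1:-}" ] && echo "set" || echo "unset"; }

check_neg_z ""     # -> unset
check_n ""         # -> unset
check_neg_z "yes"  # -> set
check_n "yes"      # -> set
```

Both forms also handle an unset variable safely here because of the `${VAR:-}` default-expansion.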

docs/basics.adoc

Lines changed: 2 additions & 0 deletions
@@ -942,6 +942,8 @@ By adding the `JanusGraphIoRegistry` to the `org.apache.tinkerpop.gremlin.driver
 
 It is possible to extend Gremlin Server with other means of communication by implementing the interfaces that it provides and leverage this with JanusGraph. See more details in the appropriate TinkerPop documentation.
 
+include::deploymentscenarios.adoc[]
+
 include::configuredgraphfactory.adoc[]
 
 include::multinodejanusgraphcluster.adoc[]

docs/deploymentscenarios.adoc

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+[[deployment-scenarios]]
+== Deployment Scenarios
+
+JanusGraph offers a wide choice of storage and index backends, which results in great flexibility in how it can be deployed. This chapter presents a few possible deployment scenarios to help with the complexity that comes with this flexibility.
+
+Before discussing the different deployment scenarios, it is important to understand the roles of JanusGraph itself and of the backends. First of all, applications only communicate directly with JanusGraph, mostly by sending Gremlin traversals for execution. JanusGraph then communicates with the configured backends to execute the received traversal. When JanusGraph is used in the form of JanusGraph Server, there is no such thing as a _master_ JanusGraph Server. Applications can therefore connect to any JanusGraph Server instance. They can also use a load balancer to distribute requests across the different instances. The JanusGraph Server instances themselves don't communicate with each other directly, which makes it easy to scale them when the need arises to process more traversals.
+
+NOTE: The scenarios presented in this chapter are only examples of how JanusGraph can be deployed. Each deployment needs to take into account the concrete use cases and production needs.
+
+[[getting-started-scenario]]
+=== Getting Started Scenario
+
+This is the scenario most users probably want to choose when they are just getting started with JanusGraph. It offers scalability and fault tolerance with a minimal number of servers required. JanusGraph Server runs together with an instance of the storage backend and optionally also an instance of the index backend on every server.
+
+image:getting-started-scenario.svg[Getting started deployment scenario diagram, 650]
+
+A setup like this can be extended by simply adding more servers of the same kind or by moving one of the components onto dedicated servers. The latter describes a growth path to transform the deployment into the <<advanced-scenario,Advanced Scenario>>.
+
+Any of the scalable storage backends can be used with this scenario. Note however that for Scylla http://docs.scylladb.com/getting-started/scylla_in_a_shared_environment/[some configuration is required when it is hosted co-located with other services], as in this scenario. If an index backend is used in this scenario, it also needs to be one that is scalable.
+
+[[advanced-scenario]]
+=== Advanced Scenario
+
+The advanced scenario is an evolution of the <<getting-started-scenario>>. Instead of hosting the JanusGraph Server instances together with the storage backend and optionally also the index backend, they are now separated on different servers.
+The advantage of hosting the different components (JanusGraph Server, storage/index backend) on different servers is that they can be scaled and managed independently of each other.
+This offers higher flexibility at the cost of having to maintain more servers.
+
+image:advanced-scenario.svg[Advanced deployment scenario diagram, 800]
+
+Since this scenario offers independent scalability of the different components, it naturally makes most sense to also use scalable backends.
+
+[[minimalist-scenario]]
+=== Minimalist Scenario
+
+It is also possible to host JanusGraph Server together with the backend(s) on just one server. This is especially attractive for testing purposes, or when JanusGraph supports just a single application, which can then also run on the same server.
+
+image:minimalist-scenario.svg[Minimalist deployment scenario diagram, 650]
+
+As opposed to the previous scenarios, for this scenario it makes most sense to use backends that are not scalable: the in-memory backend for testing purposes, or Berkeley DB for production, with Lucene as the optional index backend.
+
+[[embedded-janusgraph]]
+=== Embedded JanusGraph
+
+Instead of connecting to JanusGraph Server from an application, it is also possible to embed JanusGraph as a library inside a JVM-based application. While this reduces the administrative overhead, it makes it impossible to scale JanusGraph independently of the application.
+Embedded JanusGraph can be deployed as a variation of any of the other scenarios. JanusGraph just moves from the server(s) directly into the application, as it is now used as a library instead of an independent service.
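The Minimalist Scenario above (Berkeley DB plus Lucene as the optional index backend) could be expressed in a JanusGraph properties file along these lines. This is only a sketch, not part of the commit; the directory paths are assumptions:

```properties
# Minimalist scenario sketch: single-server Berkeley DB + Lucene
# (paths are example values)
storage.backend=berkeleyje
storage.directory=/var/lib/janusgraph/data
index.search.backend=lucene
index.search.directory=/var/lib/janusgraph/searchindex
```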

docs/hadoop.adoc

Lines changed: 176 additions & 76 deletions
@@ -1,20 +1,109 @@
 [[hadoop-tp3]]
 == JanusGraph with TinkerPop's Hadoop-Gremlin
 
-JanusGraph-Hadoop works with TinkerPop's hadoop-gremlin package for
-general-purpose OLAP.
+This chapter describes how to leverage https://hadoop.apache.org/[Apache Hadoop] and https://spark.apache.org/[Apache Spark] to configure JanusGraph for distributed graph processing. These steps provide an overview of how to get started with those projects, but please refer to those project communities to become more deeply familiar with them.
 
-Here's a three step example showing some basic integrated JanusGraph-TinkerPop functionality.
+JanusGraph-Hadoop works with TinkerPop's https://tinkerpop.apache.org/docs/$MAVEN{tinkerpop.version}/reference/#hadoop-gremlin[hadoop-gremlin] package for
+general-purpose OLAP.
 
-1. Manually define schema and then load the Grateful Dead graph from a TinkerPop Kryo-serialized binary file
-2. Run a VertexProgram to compute PageRanks, writing the derived graph to `output/~g`
-3. Read the derived graph vertices and their computed rank values
+For the scope of the example below, Apache Spark is the computing framework and Apache Cassandra is the storage backend. The directions can be followed with other packages with minor changes to the configuration properties.
 
 [NOTE]
-The examples in this chapter are based on running Spark in local mode. Additional configuration
-is required when using Spark in standalone mode or when running Spark on YARN or Mesos.
+The examples in this chapter are based on running Spark in local mode or standalone cluster mode. Additional configuration
+is required when using Spark on YARN or Mesos.
+
+=== Configuring Hadoop for Running OLAP
+For running OLAP queries from the Gremlin Console, a few prerequisites need to be fulfilled. You will need to add the Hadoop configuration directory to the `CLASSPATH`, and the configuration directory needs to point to a live Hadoop cluster.
+
+Hadoop provides a distributed, access-controlled file system. The Hadoop file system is used by Spark workers running on different machines to have a common source for file-based operations. The intermediate computations of various OLAP queries may be persisted on the Hadoop file system.
+
+For configuring a single-node Hadoop cluster, please refer to the official https://hadoop.apache.org/docs/r$MAVEN{hadoop2.version}/hadoop-project-dist/hadoop-common/SingleCluster.html[Apache Hadoop docs].
+
+Once you have a Hadoop cluster up and running, you will need to specify the Hadoop configuration files in the `CLASSPATH`. The instructions below expect those configuration files to be located under `/etc/hadoop/conf`.
+
+Once verified, follow the steps below to add the Hadoop configuration to the `CLASSPATH` and start the Gremlin Console, which will play the role of the Spark driver program.
+
+[source, shell]
+----
+export HADOOP_CONF_DIR=/etc/hadoop/conf
+export CLASSPATH=$HADOOP_CONF_DIR
+bin/gremlin.sh
+----
+
+Once the path to the Hadoop configuration has been added to the `CLASSPATH`, we can verify whether the Gremlin Console can access the Hadoop cluster as follows:
+
+[source, gremlin]
+----
+gremlin> hdfs
+==>storage[org.apache.hadoop.fs.LocalFileSystem@65bb9029] // BAD
+
+gremlin> hdfs
+==>storage[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1229457199_1, ugi=user (auth:SIMPLE)]]] // GOOD
+----
+
+=== OLAP Traversals
+
+JanusGraph-Hadoop works with TinkerPop's hadoop-gremlin package for general-purpose OLAP to traverse over the graph and parallelize queries by leveraging Apache Spark.
+
+==== OLAP Traversals with Spark Local
+
+The backend demonstrated in the OLAP example below is Cassandra. Additional configuration specific to that storage backend is needed. The configuration is specified by the `gremlin.hadoop.graphReader` property, which names the class used to read data from the storage backend.
+
+JanusGraph currently supports the following graphReader classes:
+
+* `Cassandra3InputFormat` for use with Cassandra 3
+* `CassandraInputFormat` for use with Cassandra 2
+* `HBaseInputFormat` and `HBaseSnapshotInputFormat` for use with HBase
+
+The following properties file can be used to connect to a JanusGraph instance in Cassandra so that it can be used with HadoopGraph to run OLAP queries.
+
+[source, properties]
+----
+# read-cassandra-3.properties
+#
+# Hadoop Graph Configuration
+#
+gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
+gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat
+gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
+
+gremlin.hadoop.jarsInDistributedCache=true
+gremlin.hadoop.inputLocation=none
+gremlin.hadoop.outputLocation=output
+gremlin.spark.persistContext=true
+
+#
+# JanusGraph Cassandra InputFormat configuration
+#
+# These properties define the connection properties used when writing data to JanusGraph.
+janusgraphmr.ioformat.conf.storage.backend=cassandra
+# This specifies the hostname & port for the Cassandra data store.
+janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
+janusgraphmr.ioformat.conf.storage.port=9160
+# This specifies the keyspace where data is stored.
+janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
+# This defines the indexing backend configuration used while writing data to JanusGraph.
+janusgraphmr.ioformat.conf.index.search.backend=elasticsearch
+janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
+# Use the appropriate properties for the backend when using a different storage backend (HBase) or indexing backend (Solr).
+
+#
+# Apache Cassandra InputFormat configuration
+#
+cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
 
-=== Defining defining schema and loading data
+#
+# SparkGraphComputer Configuration
+#
+spark.master=local[*]
+spark.executor.memory=1g
+spark.serializer=org.apache.spark.serializer.KryoSerializer
+spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
+----
+
+First create a properties file with the above configuration, and load it in the Gremlin Console to run OLAP queries as follows:
 
 [source, gremlin]
 ----
@@ -28,100 +117,111 @@ gremlin> :plugin use tinkerpop.hadoop
 ==>tinkerpop.hadoop activated
 gremlin> :plugin use tinkerpop.spark
 ==>tinkerpop.spark activated
-gremlin> :load data/grateful-dead-janusgraph-schema.groovy
-==>true
-==>true
-gremlin> graph = JanusGraphFactory.open('conf/janusgraph-cql.properties')
-==>standardjanusgraph[cql:[127.0.0.1]]
-gremlin> defineGratefulDeadSchema(graph)
-==>null
-gremlin> graph.close()
-==>null
-gremlin> if (!hdfs.exists('data/grateful-dead.kryo')) hdfs.copyFromLocal('data/grateful-dead.kryo','data/grateful-dead.kryo')
-==>null
-gremlin> graph = GraphFactory.open('conf/hadoop-graph/hadoop-load.properties')
-==>hadoopgraph[gryoinputformat->nulloutputformat]
-gremlin> blvp = BulkLoaderVertexProgram.build().writeGraph('conf/janusgraph-cql.properties').create(graph)
-==>BulkLoaderVertexProgram[bulkLoader=IncrementalBulkLoader,vertexIdProperty=bulkLoader.vertex.id,userSuppliedIds=false,keepOriginalIds=true,batchSize=0]
-gremlin> graph.compute(SparkGraphComputer).program(blvp).submit().get()
-...
-==>result[hadoopgraph[gryoinputformat->nulloutputformat],memory[size:0]]
-gremlin> graph.close()
-==>null
+gremlin> // 1. Open the graph for OLAP processing, reading in from Cassandra 3
 gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-cassandra-3.properties')
-==>hadoopgraph[cassandrainputformat->gryooutputformat]
+==>hadoopgraph[cassandra3inputformat->gryooutputformat]
+gremlin> // 2. Configure the traversal to run with Spark
 gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
-==>graphtraversalsource[hadoopgraph[cassandrainputformat->gryooutputformat], sparkgraphcomputer]
+==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
+gremlin> // 3. Run some OLAP traversals
 gremlin> g.V().count()
-...
+......
 ==>808
+gremlin> g.E().count()
+......
+==>8046
 ----
 
+==== OLAP Traversals with Spark Standalone Cluster
+
+The steps followed in the previous section can also be used with a Spark standalone cluster with only minor changes:
+
+* Update the `spark.master` property to point to the Spark master URL instead of local
+* Update `spark.executor.extraClassPath` to enable the Spark executors to find the JanusGraph dependency jars
+* Copy the JanusGraph dependency jars into the location specified in the previous step on each Spark executor machine
+
+[NOTE]
+All the jars under *janusgraph-distribution/lib* were copied into /opt/lib/janusgraph/, and the same directory structure was created across all workers, with the jars manually copied to each of them.
+
+The final properties file used for the OLAP traversal is as follows:
+
 [source, properties]
 ----
-# hadoop-load.properties
-
+# read-cassandra-3.properties
 #
 # Hadoop Graph Configuration
 #
 gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
-gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
-gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
-gremlin.hadoop.inputLocation=./data/grateful-dead.kryo
-gremlin.hadoop.outputLocation=output
+gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat
+gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
+
 gremlin.hadoop.jarsInDistributedCache=true
+gremlin.hadoop.inputLocation=none
+gremlin.hadoop.outputLocation=output
+gremlin.spark.persistContext=true
 
 #
-# GiraphGraphComputer Configuration
+# JanusGraph Cassandra InputFormat configuration
 #
-giraph.minWorkers=2
-giraph.maxWorkers=2
-giraph.useOutOfCoreGraph=true
-giraph.useOutOfCoreMessages=true
-mapred.map.child.java.opts=-Xmx1024m
-mapred.reduce.child.java.opts=-Xmx1024m
-giraph.numInputThreads=4
-giraph.numComputeThreads=4
-giraph.maxMessagesInMemory=100000
+# These properties define the connection properties used when writing data to JanusGraph.
+janusgraphmr.ioformat.conf.storage.backend=cassandra
+# This specifies the hostname & port for the Cassandra data store.
+janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
+janusgraphmr.ioformat.conf.storage.port=9160
+# This specifies the keyspace where data is stored.
+janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
+# This defines the indexing backend configuration used while writing data to JanusGraph.
+janusgraphmr.ioformat.conf.index.search.backend=elasticsearch
+janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
+# Use the appropriate properties for the backend when using a different storage backend (HBase) or indexing backend (Solr).
+
+#
+# Apache Cassandra InputFormat configuration
+#
+cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
 
 #
 # SparkGraphComputer Configuration
 #
-spark.master=local[*]
+spark.master=spark://127.0.0.1:7077
 spark.executor.memory=1g
+spark.executor.extraClassPath=/opt/lib/janusgraph/*
 spark.serializer=org.apache.spark.serializer.KryoSerializer
 spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
 ----
 
+Then use the properties file as follows from the Gremlin Console:
+
 [source, gremlin]
 ----
-// grateful-dead-janusgraph-schema.groovy
-
-def defineGratefulDeadSchema(janusGraph) {
-    m = janusGraph.openManagement()
-    // vertex labels
-    artist = m.makeVertexLabel("artist").make()
-    song = m.makeVertexLabel("song").make()
-    // edge labels
-    sungBy = m.makeEdgeLabel("sungBy").make()
-    writtenBy = m.makeEdgeLabel("writtenBy").make()
-    followedBy = m.makeEdgeLabel("followedBy").make()
-    // vertex and edge properties
-    blid = m.makePropertyKey("bulkLoader.vertex.id").dataType(Long.class).make()
-    name = m.makePropertyKey("name").dataType(String.class).make()
-    songType = m.makePropertyKey("songType").dataType(String.class).make()
-    performances = m.makePropertyKey("performances").dataType(Integer.class).make()
-    weight = m.makePropertyKey("weight").dataType(Integer.class).make()
-    // global indices
-    m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).buildCompositeIndex()
-    m.buildIndex("artistsByName", Vertex.class).addKey(name).indexOnly(artist).buildCompositeIndex()
-    m.buildIndex("songsByName", Vertex.class).addKey(name).indexOnly(song).buildCompositeIndex()
-    // vertex centric indices
-    m.buildEdgeIndex(followedBy, "followedByWeight", Direction.BOTH, Order.decr, weight)
-    m.commit()
-}
+bin/gremlin.sh
+
+         \,,,/
+         (o o)
+-----oOOo-(3)-oOOo-----
+plugin activated: janusgraph.imports
+gremlin> :plugin use tinkerpop.hadoop
+==>tinkerpop.hadoop activated
+gremlin> :plugin use tinkerpop.spark
+==>tinkerpop.spark activated
+gremlin> // 1. Open the graph for OLAP processing, reading in from Cassandra 3
+gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-cassandra-3.properties')
+==>hadoopgraph[cassandra3inputformat->gryooutputformat]
+gremlin> // 2. Configure the traversal to run with Spark
+gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
+==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
+gremlin> // 3. Run some OLAP traversals
+gremlin> g.V().count()
+......
+==>808
+gremlin> g.E().count()
+......
+==>8046
 ----
 
-=== Running PageRank
 
-A fully functional example of the http://tinkerpop.apache.org/docs/$MAVEN{tinkerpop.version}/reference#pagerankvertexprogram[PageRankVertexProgram] can be found in the http://tinkerpop.apache.org/docs/$MAVEN{tinkerpop.version}/reference#vertexprogram[VertexProgram] section of the TinkerPop docs.
+=== Other Vertex Programs
+
+Apache TinkerPop provides various vertex programs. A vertex program runs on each vertex until either a termination criterion is attained or a fixed number of iterations has been reached. Due to the parallel nature of vertex programs, they can leverage parallel computing frameworks like Spark or Giraph to improve their performance.
+
+Once you are familiar with how to configure JanusGraph to work with Spark, you can run all the other vertex programs provided by Apache TinkerPop, such as Page Rank, Bulk Loading, and Peer Pressure. See the http://tinkerpop.apache.org/docs/$MAVEN{tinkerpop.version}/reference/#vertexprogram[TinkerPop VertexProgram docs] for more details.
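The properties file above notes that "appropriate properties" should be used for a different storage backend such as HBase. As a hedged illustration only (not part of this commit; hostnames and the table name are assumed example values), an HBase-based read configuration might look roughly like this:

```properties
# Hypothetical sketch: reading from HBase instead of Cassandra
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
# HBase table under which the graph was stored
janusgraphmr.ioformat.conf.storage.hbase.table=janusgraph
```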
