Commit dee1400

Merge pull request #1195 from pluradj/master

merge 0.2 fixes to master

2 parents 1fabde7 + 98e5b23

11 files changed: +238 additions, -80 deletions

.travis.yml

Lines changed: 4 additions & 2 deletions
@@ -91,8 +91,10 @@ install:
   echo "Building all modules for Coverity analysis";
   travis_retry travis_wait \
     mvn install -DskipTests=true -Dmaven.javadoc.skip=true --batch-mode --show-version;
- elif [ "${TRAVIS_BRANCH}" == "${COVERITY_BRANCH_NAME}" ] || [ -n "${COVERITY_ONLY:-}" ]; then
-   echo "Skipping module build on Coverity branch/job";
+ elif [ "${TRAVIS_BRANCH}" == "${COVERITY_BRANCH_NAME}" ] || ! [ -z "${COVERITY_ONLY:-}" ]; then
+   echo "Building all modules for test-compile coverage, but skipping Coverity upload";
+   travis_retry travis_wait \
+     mvn install -DskipTests=true -Dmaven.javadoc.skip=true --batch-mode --show-version;
  else
   echo "Building janusgraph-${MODULE} and dependencies";
   travis_retry travis_wait \
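The `-` and `+` sides of the `elif` above differ only in the style of the emptiness test: `[ -n "$x" ]` and `! [ -z "$x" ]` are equivalent for this check. A quick sketch (not part of the commit, function names are illustrative only) demonstrating the equivalence:

```shell
# Two forms of "is the variable non-empty?", as in the COVERITY_ONLY check above.
check_neg_z() { ! [ -z "${1:-}" ] && echo "set" || echo "unset"; }
check_n()     {   [ -n "${1:-}" ] && echo "set" || echo "unset"; }

check_neg_z ""     # -> unset
check_n ""         # -> unset
check_neg_z "yes"  # -> set
check_n "yes"      # -> set
```

Both forms also handle an unset variable safely here because of the `${VAR:-}` default-expansion.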

docs/basics.adoc

Lines changed: 2 additions & 0 deletions
@@ -942,6 +942,8 @@ By adding the `JanusGraphIoRegistry` to the `org.apache.tinkerpop.gremlin.driver
 
 It is possible to extend Gremlin Server with other means of communication by implementing the interfaces that it provides and leverage this with JanusGraph. See more details in the appropriate TinkerPop documentation.
 
+include::deploymentscenarios.adoc[]
+
 include::configuredgraphfactory.adoc[]
 
 include::multinodejanusgraphcluster.adoc[]

docs/deploymentscenarios.adoc

Lines changed: 45 additions & 0 deletions
@@ -0,0 +1,45 @@
+[[deployment-scenarios]]
+== Deployment Scenarios
+
+JanusGraph offers a wide choice of storage and index backends, which results in great flexibility in how it can be deployed. This chapter presents a few possible deployment scenarios to help with the complexity that comes with this flexibility.
+
+Before discussing the different deployment scenarios, it is important to understand the roles of JanusGraph itself and of the backends. First of all, applications only communicate directly with JanusGraph, mostly by sending Gremlin traversals for execution. JanusGraph then communicates with the configured backends to execute the received traversal. When JanusGraph is used in the form of JanusGraph Server, there is no such thing as a _master_ JanusGraph Server. Applications can therefore connect to any JanusGraph Server instance. They can also use a load balancer to distribute requests across the different instances. The JanusGraph Server instances themselves don't communicate with each other directly, which makes it easy to scale them when the need arises to process more traversals.
+
+NOTE: The scenarios presented in this chapter are only examples of how JanusGraph can be deployed. Each deployment needs to take into account the concrete use cases and production needs.
+
+[[getting-started-scenario]]
+=== Getting Started Scenario
+
+This is the scenario most users probably want to choose when they are just getting started with JanusGraph. It offers scalability and fault tolerance with a minimal number of servers required. JanusGraph Server runs together with an instance of the storage backend and optionally also an instance of the index backend on every server.
+
+image:getting-started-scenario.svg[Getting started deployment scenario diagram, 650]
+
+A setup like this can be extended by simply adding more servers of the same kind or by moving one of the components onto dedicated servers. The latter describes a growth path to transform the deployment into the <<advanced-scenario,Advanced Scenario>>.
+
+Any of the scalable storage backends can be used with this scenario. Note however that for Scylla http://docs.scylladb.com/getting-started/scylla_in_a_shared_environment/[some configuration is required when it is hosted co-located with other services], as in this scenario. If an index backend is used in this scenario, it also needs to be one that is scalable.
+
+[[advanced-scenario]]
+=== Advanced Scenario
+
+The advanced scenario is an evolution of the <<getting-started-scenario>>. Instead of hosting the JanusGraph Server instances together with the storage backend and optionally also the index backend, they are now separated on different servers.
+The advantage of hosting the different components (JanusGraph Server, storage/index backend) on different servers is that they can be scaled and managed independently of each other.
+This offers higher flexibility at the cost of having to maintain more servers.
+
+image:advanced-scenario.svg[Advanced deployment scenario diagram, 800]
+
+Since this scenario offers independent scalability of the different components, it naturally makes most sense to also use scalable backends.
+
+[[minimalist-scenario]]
+=== Minimalist Scenario
+
+It is also possible to host JanusGraph Server together with the backend(s) on just one server. This is especially attractive for testing purposes, or when JanusGraph supports just a single application, which can then also run on the same server.
+
+image:minimalist-scenario.svg[Minimalist deployment scenario diagram, 650]
+
+As opposed to the previous scenarios, for this scenario it makes most sense to use backends that are not scalable: the in-memory backend for testing purposes, or Berkeley DB for production, with Lucene as the optional index backend.
+
+[[embedded-janusgraph]]
+=== Embedded JanusGraph
+
+Instead of connecting to JanusGraph Server from an application, it is also possible to embed JanusGraph as a library inside a JVM-based application. While this reduces the administrative overhead, it makes it impossible to scale JanusGraph independently of the application.
+Embedded JanusGraph can be deployed as a variation of any of the other scenarios. JanusGraph just moves from the server(s) directly into the application, as it is now used as a library instead of an independent service.
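The Minimalist Scenario above (Berkeley DB plus Lucene as the optional index backend) could be expressed in a JanusGraph properties file along these lines. This is only a sketch, not part of the commit; the directory paths are assumptions:

```properties
# Minimalist scenario sketch: single-server Berkeley DB + Lucene
# (paths are example values)
storage.backend=berkeleyje
storage.directory=/var/lib/janusgraph/data
index.search.backend=lucene
index.search.directory=/var/lib/janusgraph/searchindex
```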

docs/hadoop.adoc

Lines changed: 176 additions & 76 deletions
@@ -1,20 +1,109 @@
 [[hadoop-tp3]]
 == JanusGraph with TinkerPop's Hadoop-Gremlin
 
-JanusGraph-Hadoop works with TinkerPop's hadoop-gremlin package for
-general-purpose OLAP.
+This chapter describes how to leverage https://hadoop.apache.org/[Apache Hadoop] and https://spark.apache.org/[Apache Spark] to configure JanusGraph for distributed graph processing. These steps provide an overview of how to get started with those projects, but please refer to those project communities to become more deeply familiar with them.
 
-Here's a three step example showing some basic integrated JanusGraph-TinkerPop functionality.
+JanusGraph-Hadoop works with TinkerPop's https://tinkerpop.apache.org/docs/$MAVEN{tinkerpop.version}/reference/#hadoop-gremlin[hadoop-gremlin] package for
+general-purpose OLAP.
 
-1. Manually define schema and then load the Grateful Dead graph from a TinkerPop Kryo-serialized binary file
-2. Run a VertexProgram to compute PageRanks, writing the derived graph to `output/~g`
-3. Read the derived graph vertices and their computed rank values
+For the scope of the example below, Apache Spark is the computing framework and Apache Cassandra is the storage backend. The directions can be followed with other packages with minor changes to the configuration properties.
 
 [NOTE]
-The examples in this chapter are based on running Spark in local mode. Additional configuration
-is required when using Spark in standalone mode or when running Spark on YARN or Mesos.
+The examples in this chapter are based on running Spark in local mode or standalone cluster mode. Additional configuration
+is required when using Spark on YARN or Mesos.
+
+=== Configuring Hadoop for Running OLAP
+For running OLAP queries from the Gremlin Console, a few prerequisites need to be fulfilled. You will need to add the Hadoop configuration directory to the `CLASSPATH`, and the configuration directory needs to point to a live Hadoop cluster.
+
+Hadoop provides a distributed, access-controlled file system. The Hadoop file system is used by Spark workers running on different machines to have a common source for file-based operations. The intermediate computations of various OLAP queries may be persisted on the Hadoop file system.
+
+For configuring a single-node Hadoop cluster, please refer to the official https://hadoop.apache.org/docs/r$MAVEN{hadoop2.version}/hadoop-project-dist/hadoop-common/SingleCluster.html[Apache Hadoop docs].
+
+Once you have a Hadoop cluster up and running, you will need to specify the Hadoop configuration files in the `CLASSPATH`. The instructions below expect those configuration files to be located under `/etc/hadoop/conf`.
+
+Once verified, follow the steps below to add the Hadoop configuration to the `CLASSPATH` and start the Gremlin Console, which will play the role of the Spark driver program.
+
+[source, shell]
+----
+export HADOOP_CONF_DIR=/etc/hadoop/conf
+export CLASSPATH=$HADOOP_CONF_DIR
+bin/gremlin.sh
+----
+
+Once the path to the Hadoop configuration has been added to the `CLASSPATH`, we can verify whether the Gremlin Console can access the Hadoop cluster as follows:
+
+[source, gremlin]
+----
+gremlin> hdfs
+==>storage[org.apache.hadoop.fs.LocalFileSystem@65bb9029] // BAD
+
+gremlin> hdfs
+==>storage[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1229457199_1, ugi=user (auth:SIMPLE)]]] // GOOD
+----
+
+=== OLAP Traversals
+
+JanusGraph-Hadoop works with TinkerPop's hadoop-gremlin package for general-purpose OLAP to traverse over the graph and parallelize queries by leveraging Apache Spark.
+
+==== OLAP Traversals with Spark Local
+
+The backend demonstrated in the OLAP example below is Cassandra. Additional configuration specific to that storage backend is needed. The configuration is specified by the `gremlin.hadoop.graphReader` property, which names the class used to read data from the storage backend.
+
+JanusGraph currently supports the following graphReader classes:
+
+* `Cassandra3InputFormat` for use with Cassandra 3
+* `CassandraInputFormat` for use with Cassandra 2
+* `HBaseInputFormat` and `HBaseSnapshotInputFormat` for use with HBase
+
+The following properties file can be used to connect to a JanusGraph instance in Cassandra so that it can be used with HadoopGraph to run OLAP queries.
+
+[source, properties]
+----
+# read-cassandra-3.properties
+#
+# Hadoop Graph Configuration
+#
+gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
+gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat
+gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
+
+gremlin.hadoop.jarsInDistributedCache=true
+gremlin.hadoop.inputLocation=none
+gremlin.hadoop.outputLocation=output
+gremlin.spark.persistContext=true
+
+#
+# JanusGraph Cassandra InputFormat configuration
+#
+# These properties define the connection properties used when writing data to JanusGraph.
+janusgraphmr.ioformat.conf.storage.backend=cassandra
+# This specifies the hostname & port for the Cassandra data store.
+janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
+janusgraphmr.ioformat.conf.storage.port=9160
+# This specifies the keyspace where data is stored.
+janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
+# This defines the indexing backend configuration used while writing data to JanusGraph.
+janusgraphmr.ioformat.conf.index.search.backend=elasticsearch
+janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
+# Use the appropriate properties for the backend when using a different storage backend (HBase) or indexing backend (Solr).
+
+#
+# Apache Cassandra InputFormat configuration
+#
+cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
 
-=== Defining defining schema and loading data
+#
+# SparkGraphComputer Configuration
+#
+spark.master=local[*]
+spark.executor.memory=1g
+spark.serializer=org.apache.spark.serializer.KryoSerializer
+spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
+----
+
+First create a properties file with the above configuration, and load it in the Gremlin Console to run OLAP queries as follows:
 
 [source, gremlin]
 ----
@@ -28,100 +117,111 @@ gremlin> :plugin use tinkerpop.hadoop
 ==>tinkerpop.hadoop activated
 gremlin> :plugin use tinkerpop.spark
 ==>tinkerpop.spark activated
-gremlin> :load data/grateful-dead-janusgraph-schema.groovy
-==>true
-==>true
-gremlin> graph = JanusGraphFactory.open('conf/janusgraph-cql.properties')
-==>standardjanusgraph[cql:[127.0.0.1]]
-gremlin> defineGratefulDeadSchema(graph)
-==>null
-gremlin> graph.close()
-==>null
-gremlin> if (!hdfs.exists('data/grateful-dead.kryo')) hdfs.copyFromLocal('data/grateful-dead.kryo','data/grateful-dead.kryo')
-==>null
-gremlin> graph = GraphFactory.open('conf/hadoop-graph/hadoop-load.properties')
-==>hadoopgraph[gryoinputformat->nulloutputformat]
-gremlin> blvp = BulkLoaderVertexProgram.build().writeGraph('conf/janusgraph-cql.properties').create(graph)
-==>BulkLoaderVertexProgram[bulkLoader=IncrementalBulkLoader,vertexIdProperty=bulkLoader.vertex.id,userSuppliedIds=false,keepOriginalIds=true,batchSize=0]
-gremlin> graph.compute(SparkGraphComputer).program(blvp).submit().get()
-...
-==>result[hadoopgraph[gryoinputformat->nulloutputformat],memory[size:0]]
-gremlin> graph.close()
-==>null
+gremlin> // 1. Open the graph for OLAP processing, reading in from Cassandra 3
 gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-cassandra-3.properties')
-==>hadoopgraph[cassandrainputformat->gryooutputformat]
+==>hadoopgraph[cassandra3inputformat->gryooutputformat]
+gremlin> // 2. Configure the traversal to run with Spark
 gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
-==>graphtraversalsource[hadoopgraph[cassandrainputformat->gryooutputformat], sparkgraphcomputer]
+==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
+gremlin> // 3. Run some OLAP traversals
 gremlin> g.V().count()
-...
+......
 ==>808
+gremlin> g.E().count()
+......
+==>8046
 ----
 
+==== OLAP Traversals with Spark Standalone Cluster
+
+The steps followed in the previous section can also be used with a Spark standalone cluster with only minor changes:
+
+* Update the `spark.master` property to point to the Spark master URL instead of local
+* Update `spark.executor.extraClassPath` to enable the Spark executors to find the JanusGraph dependency jars
+* Copy the JanusGraph dependency jars into the location specified in the previous step on each Spark executor machine
+
+[NOTE]
+All the jars under *janusgraph-distribution/lib* were copied into /opt/lib/janusgraph/, and the same directory structure was created across all workers, with the jars manually copied to each of them.
+
+The final properties file used for the OLAP traversal is as follows:
+
 [source, properties]
 ----
-# hadoop-load.properties
-
+# read-cassandra-3.properties
 #
 # Hadoop Graph Configuration
 #
 gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
-gremlin.hadoop.graphReader=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoInputFormat
-gremlin.hadoop.graphWriter=org.apache.hadoop.mapreduce.lib.output.NullOutputFormat
-gremlin.hadoop.inputLocation=./data/grateful-dead.kryo
-gremlin.hadoop.outputLocation=output
+gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.cassandra.Cassandra3InputFormat
+gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
+
 gremlin.hadoop.jarsInDistributedCache=true
+gremlin.hadoop.inputLocation=none
+gremlin.hadoop.outputLocation=output
+gremlin.spark.persistContext=true
 
 #
-# GiraphGraphComputer Configuration
+# JanusGraph Cassandra InputFormat configuration
 #
-giraph.minWorkers=2
-giraph.maxWorkers=2
-giraph.useOutOfCoreGraph=true
-giraph.useOutOfCoreMessages=true
-mapred.map.child.java.opts=-Xmx1024m
-mapred.reduce.child.java.opts=-Xmx1024m
-giraph.numInputThreads=4
-giraph.numComputeThreads=4
-giraph.maxMessagesInMemory=100000
+# These properties define the connection properties used when writing data to JanusGraph.
+janusgraphmr.ioformat.conf.storage.backend=cassandra
+# This specifies the hostname & port for the Cassandra data store.
+janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
+janusgraphmr.ioformat.conf.storage.port=9160
+# This specifies the keyspace where data is stored.
+janusgraphmr.ioformat.conf.storage.cassandra.keyspace=janusgraph
+# This defines the indexing backend configuration used while writing data to JanusGraph.
+janusgraphmr.ioformat.conf.index.search.backend=elasticsearch
+janusgraphmr.ioformat.conf.index.search.hostname=127.0.0.1
+# Use the appropriate properties for the backend when using a different storage backend (HBase) or indexing backend (Solr).
+
+#
+# Apache Cassandra InputFormat configuration
+#
+cassandra.input.partitioner.class=org.apache.cassandra.dht.Murmur3Partitioner
 
 #
 # SparkGraphComputer Configuration
 #
-spark.master=local[*]
+spark.master=spark://127.0.0.1:7077
 spark.executor.memory=1g
+spark.executor.extraClassPath=/opt/lib/janusgraph/*
 spark.serializer=org.apache.spark.serializer.KryoSerializer
 spark.kryo.registrator=org.apache.tinkerpop.gremlin.spark.structure.io.gryo.GryoRegistrator
 ----
 
+Then use the properties file as follows from the Gremlin Console:
+
 [source, gremlin]
 ----
-// grateful-dead-janusgraph-schema.groovy
-
-def defineGratefulDeadSchema(janusGraph) {
-    m = janusGraph.openManagement()
-    // vertex labels
-    artist = m.makeVertexLabel("artist").make()
-    song = m.makeVertexLabel("song").make()
-    // edge labels
-    sungBy = m.makeEdgeLabel("sungBy").make()
-    writtenBy = m.makeEdgeLabel("writtenBy").make()
-    followedBy = m.makeEdgeLabel("followedBy").make()
-    // vertex and edge properties
-    blid = m.makePropertyKey("bulkLoader.vertex.id").dataType(Long.class).make()
-    name = m.makePropertyKey("name").dataType(String.class).make()
-    songType = m.makePropertyKey("songType").dataType(String.class).make()
-    performances = m.makePropertyKey("performances").dataType(Integer.class).make()
-    weight = m.makePropertyKey("weight").dataType(Integer.class).make()
-    // global indices
-    m.buildIndex("byBulkLoaderVertexId", Vertex.class).addKey(blid).buildCompositeIndex()
-    m.buildIndex("artistsByName", Vertex.class).addKey(name).indexOnly(artist).buildCompositeIndex()
-    m.buildIndex("songsByName", Vertex.class).addKey(name).indexOnly(song).buildCompositeIndex()
-    // vertex centric indices
-    m.buildEdgeIndex(followedBy, "followedByWeight", Direction.BOTH, Order.decr, weight)
-    m.commit()
-}
+bin/gremlin.sh
+
+         \,,,/
+         (o o)
+-----oOOo-(3)-oOOo-----
+plugin activated: janusgraph.imports
+gremlin> :plugin use tinkerpop.hadoop
+==>tinkerpop.hadoop activated
+gremlin> :plugin use tinkerpop.spark
+==>tinkerpop.spark activated
+gremlin> // 1. Open the graph for OLAP processing, reading in from Cassandra 3
+gremlin> graph = GraphFactory.open('conf/hadoop-graph/read-cassandra-3.properties')
+==>hadoopgraph[cassandra3inputformat->gryooutputformat]
+gremlin> // 2. Configure the traversal to run with Spark
+gremlin> g = graph.traversal().withComputer(SparkGraphComputer)
+==>graphtraversalsource[hadoopgraph[cassandra3inputformat->gryooutputformat], sparkgraphcomputer]
+gremlin> // 3. Run some OLAP traversals
+gremlin> g.V().count()
+......
+==>808
+gremlin> g.E().count()
+......
+==>8046
 ----
 
-=== Running PageRank
 
-A fully functional example of the http://tinkerpop.apache.org/docs/$MAVEN{tinkerpop.version}/reference#pagerankvertexprogram[PageRankVertexProgram] can be found in the http://tinkerpop.apache.org/docs/$MAVEN{tinkerpop.version}/reference#vertexprogram[VertexProgram] section of the TinkerPop docs.
+=== Other Vertex Programs
+
+Apache TinkerPop provides various vertex programs. A vertex program runs on each vertex until either a termination criterion is attained or a fixed number of iterations has been reached. Due to the parallel nature of vertex programs, they can leverage parallel computing frameworks like Spark or Giraph to improve their performance.
+
+Once you are familiar with how to configure JanusGraph to work with Spark, you can run all the other vertex programs provided by Apache TinkerPop, such as Page Rank, Bulk Loading, and Peer Pressure. See the http://tinkerpop.apache.org/docs/$MAVEN{tinkerpop.version}/reference/#vertexprogram[TinkerPop VertexProgram docs] for more details.
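The properties file above notes that "appropriate properties" should be used for a different storage backend such as HBase. As a hedged illustration only (not part of this commit; hostnames and the table name are assumed example values), an HBase-based read configuration might look roughly like this:

```properties
# Hypothetical sketch: reading from HBase instead of Cassandra
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphReader=org.janusgraph.hadoop.formats.hbase.HBaseInputFormat
gremlin.hadoop.graphWriter=org.apache.tinkerpop.gremlin.hadoop.structure.io.gryo.GryoOutputFormat
janusgraphmr.ioformat.conf.storage.backend=hbase
janusgraphmr.ioformat.conf.storage.hostname=127.0.0.1
# HBase table under which the graph was stored
janusgraphmr.ioformat.conf.storage.hbase.table=janusgraph
```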
