It is possible to extend Gremlin Server with other means of communication by implementing the interfaces that it provides, and to leverage this with JanusGraph. See the appropriate TinkerPop documentation for more details.
JanusGraph offers a wide choice of storage and index backends, which results in great flexibility in how it can be deployed. This chapter presents a few possible deployment scenarios to help with the complexity that comes with this flexibility.
Before discussing the different deployment scenarios, it is important to understand the roles of JanusGraph itself and of the backends. First of all, applications only communicate directly with JanusGraph, mostly by sending Gremlin traversals for execution. JanusGraph then communicates with the configured backends to execute the received traversal. When JanusGraph is used in the form of JanusGraph Server, there is no such thing as a _master_ JanusGraph Server. Applications can therefore connect to any JanusGraph Server instance. They can also use a load balancer to distribute requests across the different instances. The JanusGraph Server instances do not communicate with each other directly, which makes it easy to scale them when the need arises to process more traversals.
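For example, an application could connect from the Gremlin Console (or via the Gremlin driver) to any of the instances, or to a load balancer placed in front of them. A minimal sketch, assuming a remote configuration file that points at the chosen host:

[source, gremlin]
----
// remote-graph.properties points at any JanusGraph Server instance
// (or at a load balancer in front of them)
graph = EmptyGraph.instance()
g = graph.traversal().withRemote('conf/remote-graph.properties')
----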
NOTE: The scenarios presented in this chapter are only examples of how JanusGraph can be deployed. Each deployment needs to take into account the concrete use cases and production needs.
[[getting-started-scenario]]
=== Getting Started Scenario
This is the scenario most users probably want to choose when they are just getting started with JanusGraph. It offers scalability and fault tolerance with a minimal number of servers. JanusGraph Server runs together with an instance of the storage backend, and optionally also an instance of the index backend, on every server.
image:getting-started-scenario.svg[Getting started deployment scenario diagram, 650]
A setup like this can be extended by simply adding more servers of the same kind or by moving one of the components onto dedicated servers. The latter describes a growth path to transform the deployment into the <<advanced-scenario,Advanced Scenario>>.
Any of the scalable storage backends can be used with this scenario. Note however that for Scylla, http://docs.scylladb.com/getting-started/scylla_in_a_shared_environment/[some configuration is required when it is hosted co-located with other services], as in this scenario. If an index backend is used in this scenario, it also needs to be one that is scalable.
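As an illustration, the graph configuration of each JanusGraph Server instance in this scenario could look roughly like the following sketch, assuming Cassandra (via the CQL backend) and Elasticsearch running co-located on every server:

[source, properties]
----
# storage backend co-located on the same server
storage.backend=cql
storage.hostname=127.0.0.1
# optional index backend, also co-located
index.search.backend=elasticsearch
index.search.hostname=127.0.0.1
----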
[[advanced-scenario]]
=== Advanced Scenario
The advanced scenario is an evolution of the <<getting-started-scenario>>. Instead of hosting the JanusGraph Server instances together with the storage backend and optionally also the index backend, they are now separated on different servers.
The advantage of hosting the different components (JanusGraph Server, storage/index backend) on different servers is that they can be scaled and managed independently of each other.
This offers greater flexibility at the cost of having to maintain more servers.
Since this scenario offers independent scalability of the different components, it naturally makes most sense to also use scalable backends.
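As an illustration, the graph configuration could now point at the dedicated backend servers instead of localhost. A sketch with placeholder host names:

[source, properties]
----
storage.backend=cql
# dedicated storage backend servers
storage.hostname=storage-host-1,storage-host-2,storage-host-3
index.search.backend=elasticsearch
# dedicated index backend servers
index.search.hostname=index-host-1,index-host-2
----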
[[minimalist-scenario]]
=== Minimalist Scenario
It is also possible to host JanusGraph Server together with the backend(s) on just one server. This is especially attractive for testing purposes, or when JanusGraph supports just a single application which can then also run on the same server.
In contrast to the previous scenarios, it makes most sense here to use backends that are not scalable: the in-memory backend for testing purposes, or Berkeley DB for production, with Lucene as the optional index backend.
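A configuration for this scenario could look roughly like the following sketch (the directory paths are placeholders):

[source, properties]
----
storage.backend=berkeleyje
storage.directory=/var/lib/janusgraph/data
index.search.backend=lucene
index.search.directory=/var/lib/janusgraph/index
----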
[[embedded-janusgraph]]
=== Embedded JanusGraph
Instead of connecting to JanusGraph Server from an application, it is also possible to embed JanusGraph as a library inside a JVM-based application. While this reduces the administrative overhead, it makes it impossible to scale JanusGraph independently of the application.
Embedded JanusGraph can be deployed as a variation of any of the other scenarios. JanusGraph simply moves from the server(s) directly into the application, as it is now used as a library instead of an independent service.
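For example, an application could open an embedded graph directly through `JanusGraphFactory` instead of connecting to a server. A sketch in Gremlin Console syntax, assuming a Berkeley DB configuration file:

[source, gremlin]
----
// opens the graph in-process; no JanusGraph Server involved
graph = JanusGraphFactory.open('conf/janusgraph-berkeleyje.properties')
g = graph.traversal()
// ... issue traversals against g as usual ...
graph.close()
----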
This chapter describes how to leverage https://hadoop.apache.org/[Apache Hadoop] and https://spark.apache.org/[Apache Spark] to configure JanusGraph for distributed graph processing. These steps will provide an overview on how to get started with those projects, but please refer to those project communities to become more deeply familiar with them.
JanusGraph-Hadoop works with TinkerPop's https://tinkerpop.apache.org/docs/$MAVEN{tinkerpop.version}/reference/#hadoop-gremlin[hadoop-gremlin] package for general-purpose OLAP.
For the scope of the example below, Apache Spark is the computing framework and Apache Cassandra is the storage backend. The directions can be followed with other packages with minor changes to the configuration properties.
[NOTE]
The examples in this chapter are based on running Spark in local mode or standalone cluster mode. Additional configuration is required when using Spark on YARN or Mesos.
=== Configuring Hadoop for Running OLAP
For running OLAP queries from the Gremlin Console, a few prerequisites need to be fulfilled. You will need to add the Hadoop configuration directory into the `CLASSPATH`, and the configuration directory needs to point to a live Hadoop cluster.
Hadoop provides a distributed, access-controlled file system. The Hadoop file system is used by Spark workers running on different machines to have a common source for file-based operations. The intermediate computations of various OLAP queries may be persisted on the Hadoop file system.
For configuring a single-node Hadoop cluster, please refer to the official https://hadoop.apache.org/docs/r$MAVEN{hadoop2.version}/hadoop-project-dist/hadoop-common/SingleCluster.html[Apache Hadoop docs].
Once you have a Hadoop cluster up and running, you need to specify the Hadoop configuration files in the `CLASSPATH`. The instructions below expect those configuration files to be located under `/etc/hadoop/conf`.
Once verified, follow the steps below to add the Hadoop configuration to the `CLASSPATH` and start the Gremlin Console, which will play the role of the Spark driver program.
[source, shell]
----
export HADOOP_CONF_DIR=/etc/hadoop/conf
export CLASSPATH=$HADOOP_CONF_DIR
bin/gremlin.sh
----
Once the path to Hadoop configuration has been added to the `CLASSPATH`, we can verify whether the Gremlin Console can access the Hadoop cluster by following these quick steps:
[source, gremlin]
----
gremlin> hdfs
==>storage[org.apache.hadoop.fs.LocalFileSystem@65bb9029] // BAD
gremlin> hdfs
==>storage[DFS[DFSClient[clientName=DFSClient_NONMAPREDUCE_1229457199_1, ugi=user (auth:SIMPLE)]]] // GOOD
----
=== OLAP Traversals
JanusGraph-Hadoop works with TinkerPop's hadoop-gremlin package for general-purpose OLAP, traversing the graph and parallelizing queries by leveraging Apache Spark.
==== OLAP Traversals with Spark Local
Cassandra is the storage backend used for the OLAP example below. Additional configuration specific to that storage backend is needed. The configuration is specified by the `gremlin.hadoop.graphReader` property, which names the class used to read data from the storage backend.
JanusGraph currently supports the following graphReader classes:
* `Cassandra3InputFormat` for use with Cassandra 3
* `CassandraInputFormat` for use with Cassandra 2
* `HBaseInputFormat` and `HBaseSnapshotInputFormat` for use with HBase
The following properties file can be used to connect to a JanusGraph instance in Cassandra so that it can be used with HadoopGraph to run OLAP queries.
==== OLAP Traversals with Spark Standalone Cluster
The steps followed in the previous section can also be used with a Spark standalone cluster with only minor changes:
* Update the `spark.master` property to point to the Spark master URL instead of `local`
* Update the `spark.executor.extraClassPath` to enable the Spark executor to find the JanusGraph dependency jars
* Copy the JanusGraph dependency jars into the location specified in the previous step on each Spark executor machine
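As a sketch, the two changed properties could look like this, assuming a Spark master reachable at `spark://SPARK_MASTER_HOST:7077` (a placeholder) and the JanusGraph jars copied to /opt/lib/janusgraph/:

[source, properties]
----
spark.master=spark://SPARK_MASTER_HOST:7077
spark.executor.extraClassPath=/opt/lib/janusgraph/*
----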
[NOTE]
We have copied all the jars under *janusgraph-distribution/lib* into /opt/lib/janusgraph/ on each worker, manually creating the same directory structure across all workers.
The final properties file used for OLAP traversal is as follows:
A fully functional example of the http://tinkerpop.apache.org/docs/$MAVEN{tinkerpop.version}/reference#pagerankvertexprogram[PageRankVertexProgram] can be found in the http://tinkerpop.apache.org/docs/$MAVEN{tinkerpop.version}/reference#vertexprogram[VertexProgram] section of the TinkerPop docs.
=== Other Vertex Programs
Apache TinkerPop provides various vertex programs. A vertex program runs on each vertex until either a termination criterion is met or a fixed number of iterations has been reached. Due to their parallel nature, vertex programs can leverage parallel computing frameworks like Spark or Giraph to improve their performance.
Once you are familiar with how to configure JanusGraph to work with Spark, you can run all the other vertex programs provided by Apache TinkerPop, like Page Rank, Bulk Loading and Peer Pressure. See the http://tinkerpop.apache.org/docs/$MAVEN{tinkerpop.version}/reference/#vertexprogram[TinkerPop VertexProgram docs] for more details.
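For instance, once a `HadoopGraph` is opened with a Spark-enabled configuration like the one described above, PageRank can be run from the Gremlin Console roughly as follows (a sketch; the properties file name is a placeholder):

[source, gremlin]
----
graph = GraphFactory.open('conf/hadoop-graph/read-cassandra.properties')
g = graph.traversal().withComputer(SparkGraphComputer)
// runs the PageRankVertexProgram via Spark and reads the computed rank values
g.V().pageRank().by('pageRank').values('pageRank')
----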