(🚧🚜 WIP 👷🚧) Split fabric-specific code into individual back-ends #1617

Open

cchepelov wants to merge 44 commits into twitter:cascading3 from cchepelov:split-fabrics

Contributor

cchepelov commented Nov 4, 2016

(as is being discussed on scalding-dev)

Among the things I've left out for now (in addition to the list in the e-mail): scalding-hadoop-test (platform-specific tests on a real minicluster).
Perhaps it would make sense to have run / runHadoop (→ runFabricLocal?) / runOnMiniCluster in JobTest, along with the matching assortment of Mode implementations?

Cyrille Chépélov (TP12) and others added 30 commits

October 29, 2016 02:52


          Split-fabrics: initial WIP commit

e9e09da

  - create 4 empty fabric jars
  - break & split Mode across the "storage" and "execution tech" axis
  - intentionally retire some of the Mode-related type names to ensure things DO move
  - fix what breaks at compile time.

  TESTS NOT YET RUN AT THIS POINT


          Beginning of the great test-fixing exercise.

a11988d


          debug PartitionedDelimitedSourceTest

ecdfbfc


          fixed breakage in Executions

b5cdb0d


          fix Mode parse test: forgot --hdfs ;)

5fa042a


          fix autoCluster mode

d09112b


          not much, moving stuff around

644b9a3


          Rewrite away dependency to guava, using JDK 7 instead

f3212cb


          Manage more precisely the version of guava included (guava [16.0,) is toxic to hadoop-commons 2.6.0)

826ad0b


          carry over the configuration from the local cluster to upcoming test jobs

474dbed


          More guava

41133bc


          Fix JsonLineTest

54df9eb

PSA

0aee566


          fix repl: need to have cascading-hadoop around

23f16a8


          fix WeightedPageRank test

fd6ee73


          scalariform

3342ce9


          make the temporary file path (in HadoopTest mode) more explicit

0958edc


          better debugging temp file name clean-up

33807aa


          concentrate test-only code

c7efbf1


          (scalariform)

4e4d107


          reworking a bit Hadoop dependencies (compile-time, test-time, run-time)

1be785e


          Move LegacyHadoopMode & friends into scalding-fabric-hadoop

30a9211


          Implement modes for Hadoop2-mr1, Tez and Flink (not all work for now)

26bbcb3


          PartitionedDelimitedSourceTest: pass test (adapt to changing file patterns across fabrics)

04368cf


          fixing a List-to-String comparison (which should've failed much earlier)

492e2d2


          in generated setters, add the suppression against the PartialOption wart


          ensure test-mode doesn't accidentally return a LocalTest when autoCluster-test is requested

abd0d81


          WIP: splitting cluster-aware tests away from scalding-core

0874b91


          Fix unpossible comparison ("Step Number" ceased to be an Int)

ba4ba26


          bring in a fabric to test parquet

f16946a

Cyrille Chépélov (TP12) and others added 14 commits

November 2, 2016 15:34


          fix Reduce Operations tests; Tsv cannot be used for non-primitive types

0e45a2d


          make Tez use the Hadoop reducers specification if needed

4eec1a6


          OK, let's build Flink too. And run tests to ensure scriptValidator's happy

f0d28a1


          dip a toe: try enabling Tez on the untyped REPL

ce8e67b


          (scalariform)

66a3b60


          ruby typo

13c17a2


          more character escapes

4b55e31


          fixing PartitionedDelimitedSourceTest on Tez too

0f0542d


          testing the tez fabric on scala 2.10.6 too (or the script validation fails)

9f05dbe


          attempt to build & test repl against all fabrics

2e00633


          Counters: port to non-Hadoop fabrics


          Fabric-switching logic & stats: fix & clean various unfinished bits

40c792a


          early, very early baby steps of using Flink

ddb9efa


          minor Tez details + notes about non-working tezts

5c5127e

cchepelov changed the base branch from develop to cascading3

November 4, 2016 19:29

oscar-stripe suggested changes


oscar-stripe left a comment

Great work.

One question: is there any way we can reduce the size of the PR?

For instance, can we keep flink out and just do the existing fabrics? That will make it easier to review. Once we get it in, we can add more.

I like this approach and think it is the most sensible way to get the most out of cascading 3 without breaking users.

scalding-core/src/main/scala/com/twitter/scalding/Mode.scala

               import org.slf4j.LoggerFactory
+              import scala.annotation.meta.param
+              import scala.collection.{ Map, mutable }

oscar-stripe Nov 5, 2016

can we avoid scala.collection.Map and explicitly use Map for immutable map (standard) and mutable.Map when we need a mutable one?
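
A minimal sketch of the import style suggested above, assuming nothing about the rest of Mode.scala: only the `mutable` package is imported, so a bare `Map` keeps its standard immutable meaning. The object and method names here are hypothetical.

```scala
import scala.collection.mutable

object MapImports {
  // Bare `Map` is the Predef alias for scala.collection.immutable.Map.
  def parse(flags: List[String]): Map[String, List[String]] =
    flags.map(flag => flag -> List.empty[String]).toMap

  // Mutability stays visible at the use site through the `mutable.` prefix.
  def newCache(): mutable.Map[String, Int] = mutable.Map.empty[String, Int]
}
```

With this style, signatures such as `ArgsWithMode(argsMap: Map[String, List[String]], ...)` would not need a `scala.Predef.` qualifier.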

scalding-core/src/main/scala/com/twitter/scalding/Mode.scala

                  * but needs a Mode inside.
                  */
-                private class ArgsWithMode(argsMap: Map[String, List[String]], val mode: Mode) extends Args(argsMap) {
+                private class ArgsWithMode(argsMap: scala.Predef.Map[String, List[String]], val mode: Mode) extends Args(argsMap) {

oscar-stripe Nov 5, 2016

I'd rather not hide Map with an import or use scala.collection.immutable.Map here.

scalding-core/src/main/scala/com/twitter/scalding/Mode.scala

-                    Hdfs(strictSources, config)
-                  } else
-                    throw ArgsException("[ERROR] Mode must be one of --local, --hadoop1, --hadoop2-mr1, --hadoop2-tez or --hdfs, you provided none")
+                  lazy val autoMode = if (args.boolean("autoCluster")) {

oscar-stripe Nov 5, 2016

can we prefix new scalding options with scalding.?

scalding-core/src/main/scala/com/twitter/scalding/Mode.scala

+                  "hdfs" -> "com.twitter.scalding.LegacyHadoopMode",
+                  "flink" -> "com.twitter.scalding.FlinkMode")
+                val KnownTestModesMap = Seq(

oscar-stripe Nov 5, 2016

can we make these maps instead?
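
A sketch of what the `Map` form could look like. Only the two entries visible in the diff hunk are reproduced; the names `KnownModes`, `KnownModesMap` and `classNameFor` are assumptions made for illustration.

```scala
object KnownModes {
  // Hypothetical name; entries copied from the visible part of the diff.
  val KnownModesMap: Map[String, String] = Map(
    "hdfs" -> "com.twitter.scalding.LegacyHadoopMode",
    "flink" -> "com.twitter.scalding.FlinkMode")

  // A Map gives a direct Option-returning lookup instead of scanning a Seq of pairs.
  def classNameFor(flag: String): Option[String] = KnownModesMap.get(flag)
}
```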

scalding-core/src/main/scala/com/twitter/scalding/Mode.scala

+                  "flink-test" -> "com.twitter.scalding.FlinkTestMode")
+                // TODO: define hadoop2-mr1 (easy), tez and flink (less easy) classes.
+                private def getModeConstructor[M <: Mode](clazzName: String, types: Class[_]*) =

oscar-stripe Nov 5, 2016

I wonder if we should just have a contract that a Mode needs a constructor that takes exactly one Args and exactly one Configuration, or even, possibly Config which is immutable. Wouldn't that work? I don't want to add too much flexibility if we don't need it.
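
One way the fixed-constructor contract could be sketched. `Args`, `Config` and `Mode` below are simplified stand-ins so the example compiles on its own (they are not scalding's real classes), and `DemoMode` and `ModeLoader` are hypothetical.

```scala
// Stand-ins for scalding's Args and Config, reduced to the bare minimum.
final case class Args(m: Map[String, List[String]])
final case class Config(settings: Map[String, String])

trait Mode { def name: String }

// A hypothetical mode honouring the contract: one (Args, Config) constructor.
class DemoMode(val args: Args, val config: Config) extends Mode {
  def name: String = "demo"
}

object ModeLoader {
  // Because the signature is fixed, callers no longer pass parameter types around.
  def construct(className: String, args: Args, config: Config): Mode =
    Class.forName(className)
      .getConstructor(classOf[Args], classOf[Config])
      .newInstance(args, config)
      .asInstanceOf[Mode]
}
```

A call such as `ModeLoader.construct(classOf[DemoMode].getName, Args(Map.empty), Config(Map.empty))` would then replace the variadic `getModeConstructor(clazzName, types: _*)` lookup, at the cost of forbidding modes with other constructor shapes.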

scalding-core/src/main/scala/com/twitter/scalding/Mode.scala

+              trait ClusterMode extends Mode {
               }
-              trait HadoopMode extends Mode {

oscar-stripe Nov 5, 2016

we have user code that assumes this. Can we find a way not to break those folks?

scalding-core/src/main/scala/com/twitter/scalding/Mode.scala

+               * The "HadoopMode" is actually an alias for "a mode running on a fabric that ultimately runs using an execution
+               * engine compatible with some of the Hadoop technology stack (may or may not include Hadoop 1.x, YARN, etc.)
+               */
+              trait HadoopExecutionModeBase[ConfigType <: Configuration]

oscar-stripe Nov 5, 2016

maybe we actually want an abstract type here:

trait HadoopExecutionModeBase {
  type ConfigType <: Configuration
}

since then we can refer to it: mode.ConfigType. We might do this to put more of the cascading types and not use _ so much.
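
A self-contained sketch of the abstract-type-member variant. `Configuration` and `JobConf` are local stand-ins for the Hadoop classes, and `LegacyMode` and `ModeConf` are hypothetical names; here `jobConf` is also declared to return `ConfigType`.

```scala
// Stand-ins for org.apache.hadoop.conf.Configuration and its JobConf subclass.
class Configuration
class JobConf extends Configuration

trait HadoopExecutionModeBase {
  type ConfigType <: Configuration
  def jobConf: ConfigType
}

// A hypothetical concrete mode fixing the member to JobConf.
class LegacyMode extends HadoopExecutionModeBase {
  type ConfigType = JobConf
  val jobConf: JobConf = new JobConf
}

object ModeConf {
  // The path-dependent type mode.ConfigType names the config type without a wildcard.
  def confOf(mode: HadoopExecutionModeBase): mode.ConfigType = mode.jobConf
}
```

When the argument is statically a `LegacyMode`, `ModeConf.confOf` returns a `JobConf` with no cast at the call site.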

scalding-core/src/main/scala/com/twitter/scalding/Mode.scala

+               */
+              trait HadoopExecutionModeBase[ConfigType <: Configuration]
+                extends ExecutionMode {
                 def jobConf: Configuration

oscar-stripe Nov 5, 2016

should this not return ConfigType?

scalding-core/src/main/scala/com/twitter/scalding/Stats.scala

+                  val memo = scala.collection.mutable.Map[String, Constructor[CounterImpl]]()
+                  val ctor = memo.synchronized {
+                    memo.getOrElse(klassName,

oscar-stripe Nov 5, 2016

let's use getOrElseUpdate rather than the nested put.
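
A minimal sketch of the `getOrElseUpdate` form. To stay self-contained it memoizes `Class[_]` values rather than the `Constructor[CounterImpl]` of the diff; `CtorCache` and `lookup` are hypothetical names.

```scala
import scala.collection.mutable

object CtorCache {
  private val memo = mutable.Map.empty[String, Class[_]]

  // The second argument is by-name: it is evaluated only on a miss,
  // and its result is stored in the map before being returned.
  def lookup(className: String): Class[_] = memo.synchronized {
    memo.getOrElseUpdate(className, Class.forName(className))
  }
}
```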

scalding-core/src/main/scala/com/twitter/scalding/Stats.scala

+                }
+                private[scalding] def upcast[T <: FlowProcess[_]](fp: FlowProcess[_])(implicit ev: TypeTag[T]): T = fp match {
+                  case hfp: T @unchecked if (ev == typeTag[T]) => hfp // see below

oscar-stripe Nov 5, 2016

this typeTag[T] will always match ev no? I think so. Why not just explicitly cast in this method:

def downCast[T <: U](u: U): T = u.asInstanceOf[T]

Also, this is a downcast, no?
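
A sketch of the explicit cast, with `U` made a type parameter so it compiles standalone, next to a checked alternative based on `ClassTag`. Both helpers and the `Casts` object are illustrative assumptions, not code from the PR.

```scala
import scala.reflect.ClassTag

object Casts {
  // Unchecked downcast: T is erased inside the method, so a wrong type
  // only surfaces as a ClassCastException where the result is used as a T.
  def downCast[U, T <: U](u: U): T = u.asInstanceOf[T]

  // Checked variant: the ClassTag makes the `case t: T` pattern a real runtime test.
  def safeDownCast[T](u: Any)(implicit ct: ClassTag[T]): Option[T] = u match {
    case t: T => Some(t)
    case _ => None
  }
}
```

Note that the checked variant only tests the erased class, so it cannot distinguish `FlowProcess[A]` from `FlowProcess[B]`.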

CLAassistant commented Nov 16, 2019

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ cchepelov
❌ Cyrille Chépélov (TP12)

Cyrille Chépélov (TP12) seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.
