A Generic Way to Address Dependency Conflict for Bazel-built Scala Spark App
Recently I have been working on containerizing (Docker/Helm/Kubernetes) a Scala Spark app, and I ran into this issue:
```
Caused by: java.lang.RuntimeException: java.lang.ClassCastException: cannot assign instance of scala.concurrent.duration.FiniteDuration to field org.apache.spark.rpc.RpcTimeout.duration of type scala.concurrent.duration.FiniteDuration in instance of org.apache.spark.rpc.RpcTimeout
```
when I have to use
```
spark.driver.userClassPathFirst: true
spark.executor.userClassPathFirst: true
```
This is a well-solved problem if you build the jar with Maven and use a shaded jar. Unfortunately, we use Bazel for our builds, and we don't have a good way to shade a jar.
Before diving into the details of the problem, let me briefly introduce how we do the build in Bazel.
Manage external dependencies
We currently use `rules_jvm_external` to manage all Maven dependencies. We mark all Spark-related jars with `neverlink = True`, and we also mark `scala-library` as `neverlink`.
The `WORKSPACE` looks like this:
```
maven_install(
    name = "maven",
    artifacts = [
        maven.artifact(
            "org.apache.spark",
            "spark-core_2.11",
            "2.4.5",
            neverlink = True,
        ),
        maven.artifact(
            "org.scala-lang",
            "scala-library",
            "2.11.12",
            neverlink = True,
        ),
    ],
    excluded_artifacts = [],
    fetch_sources = False,
    repositories = MAVEN_REPOSITORIES,
    version_conflict_policy = "pinned",
)

maven_install(
    name = "maven_runtime",
    artifacts = [
        "org.scala-lang:scala-library:2.11.12",
        "org.scala-lang:scala-compiler:2.11.12",
        "org.scala-lang:scala-reflect:2.11.12",
    ],
    excluded_artifacts = [],
    fetch_sources = False,
    repositories = MAVEN_REPOSITORIES,
    version_conflict_policy = "pinned",
)
```
You might ask why we have two Maven repos: this is our current solution for mimicking Maven's `provided` scope.
You can think of the `maven` repo as providing compile-only dependencies (we mark anything we don't want to appear in the uber/fat jar as `neverlink`; we use "uber jar" and "fat jar" interchangeably), while `maven_runtime` provides runtime dependencies only. In an environment where the dependencies are provided (the jars are already on the classpath and don't need to be packaged into the fat jar), we don't have to do anything, since those jars are marked `neverlink` and stay out of the jar. We mostly use the `maven_runtime` deps in test scenarios, where we need to provide the jars ourselves (package them into the fat jar).
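For example, a test target might pull the runtime jars from `maven_runtime` like this. This is only a sketch: the target and source names are illustrative, and the exact label spellings depend on your `rules_scala`/`rules_jvm_external` setup.

```
scala_test(
    name = "spark-app-test",
    srcs = ["AppTest.scala"],  # illustrative source name
    deps = [
        ":scala-spark-app",
        # compile-time view of spark-core; marked neverlink in @maven,
        # so it never leaks into a deploy jar
        "@maven//:org_apache_spark_spark_core_2_11",
    ],
    runtime_deps = [
        # nothing "provides" scala-library at test time, so take it
        # from the runtime repo, where it is NOT neverlink
        "@maven_runtime//:org_scala_lang_scala_library",
    ],
)
```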
In the BUILD file, we mix Bazel's Scala and Java rules to do the build. The BUILD file looks like this:
```
scala_library(
    name = "scala-spark-app",
    srcs = [],
    deps = [
        ":java-lib1",
        "@maven//:org_scalaj_scalaj_http_2_11",
    ],
)

java_library(
    name = "java-lib1",
)

java_binary(
    name = "spark-app",
    create_executable = False,
    deps = [
        ":scala-spark-app",
    ],
)
```
For the `java_binary` target, we build the uber jar by building the implicit target `spark-app_deploy.jar` (e.g. `bazel build //:spark-app_deploy.jar`), which is how Bazel produces a fat jar.
Now I can finally talk about the problem.
We found that when the fat jar is built this way, unzipping it shows that it always contains `scala-library`. You will see entries like these:
```
inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$split$1.class
inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$takeSeq$1.class
inflated: scala/collection/parallel/IterableSplitter$Taken.class
inflated: scala/collection/parallel/IterableSplitter$Zipped$$anonfun$6.class
inflated: scala/collection/parallel/IterableSplitter$Zipped$$anonfun$split$3.class
inflated: scala/collection/parallel/IterableSplitter$Zipped.class
inflated: scala/collection/parallel/IterableSplitter$ZippedAll.class
inflated: scala/collection/parallel/IterableSplitter$class.class
inflated: scala/collection/parallel/IterableSplitter.class
inflated: scala/collection/parallel/ParIterable$.class
inflated: scala/collection/parallel/ParIterable$class.class
```
This is counter-intuitive: since we marked `org.scala-lang:scala-library:2.11.12` as `neverlink`, it shouldn't be included in the uber jar.
On second thought, it might make sense, as the purpose of an uber jar is to be a self-contained jar that can run without needing anything else.
It seems there is special handling for `scala-library` when building an uber jar for Scala: to make the jar runnable by itself, the fat jar probably needs to include `scala-library`.
This would be fine in most cases: if you don't set `userClassPathFirst` to `true`, the Spark app will always prefer the jars shipped with the Spark package in your environment. However, if you do have to use `userClassPathFirst`, you might run into jar conflicts.
If the conflicting jar is a plain Java jar, we can already solve the conflict with the `neverlink` setup described above. Things get weird when the conflict happens to be `scala-library`.
We use Spark 2.4.5 as the base container image. I checked `/opt/spark/jars` in the container file system, and it contains a jar named `scala-library-2.11.12.jar`, which is exactly the version we are using. My understanding was that in this case, even if we include `scala-library` in our jar, it should be fine. But it was not (if someone knows why, please let me know). Either way, I get the `java.lang.ClassCastException` error, which seems to happen only when our app jar includes a `scala-library`. So `scala-library` conflicts with itself even at the identical version.
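My working theory (hedged, since the source of the error isn't confirmed): with `userClassPathFirst`, Spark loads user classes through a child-first classloader, and on the JVM a class's runtime identity is the pair (class name, defining classloader). The `FiniteDuration` deserialized against our fat jar's copy is therefore a different runtime type from the one Spark's own loader got from `/opt/spark/jars`, even though the bytes are identical, so the field assignment fails. A rough Python analogy, with module loading standing in for classloading (everything here is illustrative):

```python
import importlib.util
import os
import tempfile

# Write a tiny module that plays the role of one compiled class file
# (think scala/concurrent/duration/FiniteDuration.class).
path = os.path.join(tempfile.mkdtemp(), "duration.py")
with open(path, "w") as f:
    f.write("class FiniteDuration:\n    pass\n")

def load_with(loader_name):
    # Each call plays the role of a separate JVM classloader
    # loading the exact same bytes from disk.
    spec = importlib.util.spec_from_file_location(loader_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

spark_loader = load_with("spark_jars")      # classes from /opt/spark/jars
user_loader = load_with("user_classpath")   # classes from the fat jar

d = user_loader.FiniteDuration()
# Identical source, but different "loaders" => different types.
print(isinstance(d, spark_loader.FiniteDuration))  # False
```

This is only an analogy, but it matches the symptom: the identical version still conflicts, because identity is decided by who loaded the class, not by the bytes.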
The path I didn't take is to NOT use `userClassPathFirst`. I gave that a quick try, and promptly fell into another rabbit hole of jar conflicts (`fasterxml` etc.). To be honest, I still prefer using `userClassPathFirst`, since it gives us the freedom to control which jars are packaged in our app and have them take priority; we don't need to chase changes to the provided jars in the Spark base images from time to time.
Given that we really want to stick with `userClassPathFirst`, the main intuition of the solution is to exclude `scala-library` from the deploy jar (the fat jar).
Normally, we could use either `neverlink` or `deploy_env` in Bazel for this purpose.
neverlink does not work in this case
My guess is that `neverlink` works for most Java/Scala deps but not for `scala-library`: since it is the standard library every Scala app needs in order to run, `neverlink` is not honored for it. (I'd need someone more familiar with Bazel's Scala build rules to confirm this.)
deploy_env comes to the rescue
So I had to try the other option, `deploy_env`. As the doc says:
A list of other java_binary targets which represent the deployment environment for this binary. Set this attribute when building a plugin which will be loaded by another java_binary. Setting this attribute excludes all dependencies from the runtime classpath (and the deploy jar) of this binary that are shared between this binary and the targets specified in deploy_env.
`deploy_env` gives you a way to subtract certain deps from your uber jar. To make use of it, I needed to make some changes to the BUILD file.
- every place I use a `scala_library`, I need to wrap it in a `java_import` and depend on that instead, as below, to cut the direct tie between the `java_binary` and the `scala_library`:
```
scala_library(
    name = "scala-spark-app",
    srcs = [],
    deps = [
        ":java-lib1",
        "@maven//:org_scalaj_scalaj_http_2_11",
    ],
)

java_import(
    name = "scala-app",
    jars = [":scala-spark-app"],
)

java_library(
    name = "java-lib1",
)

# make sure this rule only sees java rules.
java_binary(
    name = "spark-app",
    create_executable = False,
    deps = [
        ":scala-app",
        ":java-lib1",
    ],
)
```
I don't fully know why this works, but my finding is that if the `java_binary` depends on a `scala_library` directly, the uber jar will always include `scala-library`. With the `java_import` indirection in place, we can then use `deploy_env` to subtract whatever is conflicting. This can serve as a generic approach to addressing jar dependency conflicts in a Spark Scala app.
```
java_binary(
    name = "provided-deps",
    create_executable = False,
    runtime_deps = [
        "@maven//:org_apache_commons_commons_lang3",
        # has to be this jar from maven_runtime
        "@maven_runtime//:org_scala_lang_scala_library",
    ],
)

java_binary(
    name = "spark-app",
    create_executable = False,
    deploy_env = [
        ":provided-deps",
    ],
    deps = [
        ":scala-app",
        ":java-lib1",
    ],
)
```
We have to use `"@maven_runtime//:org_scala_lang_scala_library"` here: `"@maven//:org_scala_lang_scala_library"` has been marked `neverlink` in our `maven` repo setup, so it would have no effect if I used it to do the subtraction.
While our two-maven-repo-with-`neverlink` approach worked really well before I encountered this problem, `deploy_env` seems not bad at all. I need to spend more effort checking whether we can get rid of the two-maven-repo approach and cover all our `neverlink` needs with `deploy_env`.
Finally, the easiest way to validate the fix is to unzip your deploy jar; it should NOT contain any classes like these:
```
inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$split$1.class
inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$takeSeq$1.class
inflated: scala/collection/parallel/ParIterable$.class
inflated: scala/collection/parallel/ParIterable$class.class
```
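If you prefer to script that check, here is a small sketch using Python's `zipfile` (the in-memory jars are stand-ins so the snippet is self-contained; in practice you would pass the path to your real `spark-app_deploy.jar`):

```python
import io
import zipfile

def contains_scala_library(jar):
    """Return True if the jar ships scala-library class files."""
    with zipfile.ZipFile(jar) as zf:
        return any(
            name.startswith("scala/") and name.endswith(".class")
            for name in zf.namelist()
        )

# Stand-in for a deploy jar that still bundles scala-library.
bad_jar = io.BytesIO()
with zipfile.ZipFile(bad_jar, "w") as zf:
    zf.writestr("scala/collection/parallel/ParIterable$.class", b"")
    zf.writestr("com/example/App.class", b"")

# Stand-in for a deploy jar built with the deploy_env fix.
good_jar = io.BytesIO()
with zipfile.ZipFile(good_jar, "w") as zf:
    zf.writestr("com/example/App.class", b"")

print(contains_scala_library(bad_jar))   # True
print(contains_scala_library(good_jar))  # False
```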
I have to admit this finding and solution came entirely from trial and error. I would like to know if this can be done better, and to understand the details underneath. You can find me on Twitter.