A Generic Way to Address Dependency Conflicts for Bazel-built Scala Spark Apps
Introduction
Recently I have been working on container/helm/k8s-ing a Scala Spark app, and I ran into this issue:
Caused by: java.lang.RuntimeException: java.lang.ClassCastException:
cannot assign instance of scala.concurrent.duration.FiniteDuration
to field org.apache.spark.rpc.RpcTimeout.duration
of type scala.concurrent.duration.FiniteDuration
in instance of org.apache.spark.rpc.RpcTimeout
when I have to use
spark.driver.userClassPathFirst: true
spark.executor.userClassPathFirst: true
This is a well-solved problem if you build the jar with Maven and use a shaded jar. Unfortunately we use Bazel for the build, and we don't have a good way to shade a jar.
Before diving into the details of the problem, let me briefly introduce how we do the build in Bazel.
Manage external dependencies
We currently use rules_jvm_external to manage all Maven dependencies. We mark all Spark-related jars as neverlink = True, and we also mark scala-library as neverlink.
The WORKSPACE looks like this:
load("@rules_jvm_external//:defs.bzl", "maven_install")
load("@rules_jvm_external//:specs.bzl", "maven")

# MAVEN_REPOSITORIES is our list of repository URLs, defined elsewhere in the WORKSPACE.
maven_install(
    name = "maven",
    artifacts = [
        maven.artifact(
            "org.apache.spark",
            "spark-core_2.11",
            "2.4.5",
            neverlink = True,
        ),
        maven.artifact(
            "org.scala-lang",
            "scala-library",
            "2.11.12",
            neverlink = True,
        ),
    ],
    excluded_artifacts = [],
    fetch_sources = False,
    repositories = MAVEN_REPOSITORIES,
    version_conflict_policy = "pinned",
)

maven_install(
    name = "maven_runtime",
    artifacts = [
        "org.scala-lang:scala-library:2.11.12",
        "org.scala-lang:scala-compiler:2.11.12",
        "org.scala-lang:scala-reflect:2.11.12",
    ],
    excluded_artifacts = [],
    fetch_sources = False,
    repositories = MAVEN_REPOSITORIES,
    version_conflict_policy = "pinned",
)
You might ask why we have two Maven repos - this is our current solution to mimic Maven's provided scope.
You can think of the maven repo as providing compile-only dependencies (we mark everything we don't want in the uber/fat jar - we use "uber jar" and "fat jar" interchangeably - as neverlink), and the maven_runtime repo as providing runtime dependencies only. In an environment where the dependencies are provided (the jars are already on the classpath and don't need to be packaged into the fat jar), we don't have to do anything, because those jars are marked neverlink and are not in the jar. We mostly use the maven_runtime deps in test scenarios, when we do need to provide the jars ourselves (i.e. package them into the fat jar).
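To make that concrete, here is a rough sketch (the test target, source file, and class name are hypothetical, not from our real BUILD files) of how a test might pull runnable Scala jars from maven_runtime while the app still compiles against the neverlink deps in maven:

load("@io_bazel_rules_scala//scala:scala.bzl", "scala_library")

# Hypothetical test setup: the app compiles against the neverlink jars in
# @maven, so the test supplies runnable copies from @maven_runtime.
scala_library(
    name = "spark-app-test-lib",
    srcs = ["SparkAppTest.scala"],  # hypothetical test source
)

java_test(
    name = "spark-app-test",
    test_class = "com.example.SparkAppTest",
    runtime_deps = [
        ":spark-app-test-lib",
        "@maven_runtime//:org_scala_lang_scala_library",
        "@maven_runtime//:org_scala_lang_scala_reflect",
    ],
)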
Build Targets
In the BUILD file, we mix Bazel's Scala and Java rules to do the build.
The BUILD file looks like this:
load("@io_bazel_rules_scala//scala:scala.bzl", "scala_library")

scala_library(
    name = "scala-spark-app",
    srcs = [],  # sources omitted
    deps = [
        ":java-lib1",
        "@maven//:org_scalaj_scalaj_http_2_11",
    ],
)

java_library(
    name = "java-lib1",
)

java_binary(
    name = "spark-app",
    create_executable = False,
    deps = [
        ":scala-spark-app",
    ],
)
On the java_binary target, we build the uber jar by building the implicit target spark-app_deploy.jar, which is how Bazel produces the fat jar.
The Problem
Now I can finally talk about the problem.
We found that when the fat jar is built this way, unzipping it shows that it always contains the scala-library classes. You will see entries like these:
inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$split$1.class
inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$takeSeq$1.class
inflated: scala/collection/parallel/IterableSplitter$Taken.class
inflated: scala/collection/parallel/IterableSplitter$Zipped$$anonfun$6.class
inflated: scala/collection/parallel/IterableSplitter$Zipped$$anonfun$split$3.class
inflated: scala/collection/parallel/IterableSplitter$Zipped.class
inflated: scala/collection/parallel/IterableSplitter$ZippedAll.class
inflated: scala/collection/parallel/IterableSplitter$class.class
inflated: scala/collection/parallel/IterableSplitter.class
inflated: scala/collection/parallel/ParIterable$.class
inflated: scala/collection/parallel/ParIterable$class.class
This is counter-intuitive: we have marked org.scala-lang:scala-library:2.11.12 as neverlink, so it should not be included in the uber jar.
On second thought, it might make sense, as the purpose of an uber jar is to be self-contained and executable without anything else.
It seems there is special handling for the scala-library when building an uber jar for Scala: to make the jar runnable by itself, the fat jar probably needs to include the scala-library.
This would be fine in most cases: if you don't set userClassPathFirst to true, the Spark app will always use the jars shipped with the Spark distribution in your environment first.
However, if you have to set userClassPathFirst, then you might run into jar conflicts.
If the conflicting jar is a plain Java jar, we can already solve that with neverlink in rules_jvm_external.
Things get weird when the conflicting jar happens to be the scala-library.
We use the Spark 2.4.5 base container image, and when I checked /opt/spark/jars in the container file system, there is a jar named scala-library-2.11.12.jar, which is exactly the version we are using. My understanding was that in this case, even if we include the scala-library in our jar, it should be fine. But it was not. If someone knows why, please let me know. My best guess is that with userClassPathFirst the class gets loaded twice, once by Spark's classloader and once by our child-first user classloader, and to the JVM a class loaded by a different classloader is a different class, so even identical bytes fail the cast. Anyway, I am getting the java.lang.ClassCastException error, which seems to happen only when our app jar includes a scala-library. So the scala-library conflicts even when it is the identical version.
Rejected Alternative
The path I didn't take is to NOT use userClassPathFirst.
I gave it a quick try without userClassPathFirst, and I started to fall into another rabbit hole of other jar conflicts, with guava, fasterxml, etc. TBH, I still prefer using userClassPathFirst, since it gives us the freedom to control which jars are packaged in our app and to have them take priority. We also don't need to change the provided jars in the Spark base images from time to time.
Solution
Given that we really want to stick with userClassPathFirst, the main idea of the solution is to exclude the scala-library from the deploy jar (the fat jar).
Normally, we can use either neverlink or deploy_env in Bazel for this purpose.
neverlink does not work in this case
My guess is that neverlink works on most Java/Scala deps but not on the scala-library, because it is the standard library every Scala app needs at runtime, so neverlink is not honored for scala-library. (I need someone more familiar with Bazel's Scala build to confirm this.)
deploy_env comes to the rescue
So I tried the other option, deploy_env.
As the doc says:
A list of other java_binary targets which represent the deployment environment for this binary. Set this attribute when building a plugin which will be loaded by another java_binary. Setting this attribute excludes all dependencies from the runtime classpath (and the deploy jar) of this binary that are shared between this binary and the targets specified in deploy_env.
In other words, deploy_env gives you a way to subtract certain deps from your uber jar.
In order to make use of it, I need to make some changes to the BUILD file.
- Every place I use a scala_library, I need to wrap it in a java_import and depend on that instead, like below, to cut the tie from java_library or java_binary to the scala_library:
scala_library(
    name = "scala-spark-app",
    srcs = [],  # sources omitted
    deps = [
        ":java-lib1",
        "@maven//:org_scalaj_scalaj_http_2_11",
    ],
)

# wrap the scala_library's output jar so downstream Java rules only see a java_import
java_import(
    name = "scala-app",
    jars = [":scala-spark-app"],
)

java_library(
    name = "java-lib1",
)

# make sure this rule only sees java rules
java_binary(
    name = "spark-app",
    create_executable = False,
    deps = [
        ":scala-app",
        ":java-lib1",
    ],
)
I don't know why this works, but my finding is that if the java_binary depends on a scala_library directly, the uber jar built from it will always include the scala-library.
- Configure deploy_env to subtract whatever is conflicting. This can serve as a generic approach to address jar dependency conflicts in a Spark Scala app:
java_binary(
name = "provided-deps",
create_executable = False,
runtime_deps = [
"@maven//:org_apache_commons_commons_lang3",
"@maven_runtime//:org_scala_lang_scala_library", # has to be this jar from maven_runtime
],
)
java_binary(
name = "spark-app"
create_executable = False,
deploy_env = [
":provided-deps",
],
deps = [
":scala-app",
":java-lib1"
]
)
We have to use "@maven_runtime//:org_scala_lang_scala_library" here, because "@maven//:org_scala_lang_scala_library" has been marked neverlink in our Maven repo setup, so it would have no effect in the subtraction.
While our two-maven-repo-with-neverlink approach worked really well before I encountered this problem, deploy_env does not seem bad at all. I need to spend more effort to check whether we can get rid of the two-maven-repo approach and cover all our neverlink needs with deploy_env alone.
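As a rough, untested sketch of what that might look like (nothing below is from our current setup), a single maven_install without neverlink plus a provided-deps binary subtracted via deploy_env could cover the same need:

# WORKSPACE (hypothetical single-repo setup, no neverlink)
maven_install(
    name = "maven",
    artifacts = [
        "org.apache.spark:spark-core_2.11:2.4.5",
        "org.scala-lang:scala-library:2.11.12",
    ],
    fetch_sources = False,
    repositories = MAVEN_REPOSITORIES,
    version_conflict_policy = "pinned",
)

# BUILD (hypothetical): everything the Spark image already provides goes here...
java_binary(
    name = "provided-deps",
    create_executable = False,
    runtime_deps = [
        "@maven//:org_apache_spark_spark_core_2_11",
        "@maven//:org_scala_lang_scala_library",
    ],
)

# ...and gets subtracted from the deploy jar via deploy_env.
java_binary(
    name = "spark-app",
    create_executable = False,
    deploy_env = [":provided-deps"],
    deps = [":scala-app"],
)

The open question is whether compile-time behavior stays the same once those jars are no longer neverlink; I have not verified that yet.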
Lastly, the easiest way to validate the fix is to unzip your deploy jar and confirm it does NOT contain any classes like these:
inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$split$1.class
inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$takeSeq$1.class
inflated: scala/collection/parallel/ParIterable$.class
inflated: scala/collection/parallel/ParIterable$class.class
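If you want to automate this check, one option (a hypothetical sketch, not something we have in our repo, and it assumes unzip is available on the build machine) is a genrule in the same package that fails the build whenever the deploy jar still contains classes under scala/:

# Hypothetical guard (untested): break the build if the deploy jar
# still ships any class under scala/.
genrule(
    name = "check-no-scala-library",
    srcs = [":spark-app_deploy.jar"],
    outs = ["check-no-scala-library.txt"],
    cmd = "if unzip -l $(location :spark-app_deploy.jar) 'scala/*' >/dev/null 2>&1; then " +
          "echo 'scala-library classes found in the deploy jar' >&2; exit 1; " +
          "else echo ok > $@; fi",
)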
I have to admit this finding and solution came entirely from trial and error. I would like to know if this can be done better, and to understand the details underneath. You can find me on Twitter.