A Generic Way to Address Dependency Conflicts for Bazel-Built Scala Spark Apps

Introduction

Recently I have been working on containerizing/helm-ing/k8s-ing a Scala Spark app.
I kept running into errors like

Caused by: java.lang.RuntimeException: java.lang.ClassCastException: 
cannot assign instance of scala.concurrent.duration.FiniteDuration
to field org.apache.spark.rpc.RpcTimeout.duration
of type scala.concurrent.duration.FiniteDuration
in instance of org.apache.spark.rpc.RpcTimeout

when I have to use

spark.driver.userClassPathFirst: true 
spark.executor.userClassPathFirst: true
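
(For reference, the same settings can also be passed to spark-submit as --conf spark.driver.userClassPathFirst=true --conf spark.executor.userClassPathFirst=true; how exactly they are supplied depends on your deployment.)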

This is a well-solved problem if you are using Maven and can make use of a shaded jar.
Unfortunately we use Bazel to do the build, and we don't have a good solution for shading a jar.

Before diving into more details of the problem, let me briefly introduce how we do the build in Bazel.

Managing external dependencies

We currently make use of rules_jvm_external to manage all Maven dependencies. We mark all Spark-related jars as neverlink = True.
We also mark scala-library as neverlink.

The WORKSPACE looks like this:

load("@rules_jvm_external//:defs.bzl", "maven_install")
load("@rules_jvm_external//:specs.bzl", "maven")

maven_install(
    name = "maven",
    artifacts = [
        maven.artifact(
            "org.apache.spark",
            "spark-core_2.11",
            "2.4.5",
            neverlink = True,
        ),
        maven.artifact(
            "org.scala-lang",
            "scala-library",
            "2.11.12",
            neverlink = True,
        ),
    ],
    excluded_artifacts = [],
    fetch_sources = False,
    repositories = MAVEN_REPOSITORIES,  # defined elsewhere in the WORKSPACE
    version_conflict_policy = "pinned",
)

maven_install(
    name = "maven_runtime",
    artifacts = [
        "org.scala-lang:scala-library:2.11.12",
        "org.scala-lang:scala-compiler:2.11.12",
        "org.scala-lang:scala-reflect:2.11.12",
    ],
    excluded_artifacts = [],
    fetch_sources = False,
    repositories = MAVEN_REPOSITORIES,
    version_conflict_policy = "pinned",
)

To understand why we have two Maven repos: this is our current solution for mimicking Maven's provided scope.

You can think of the maven repository as providing compile-only dependencies (we mark the ones we don't want to link as neverlink), and the maven_runtime repository as providing runtime dependencies only. In an environment where the dependencies are provided, we don't have to do anything, as these jars have been marked as neverlink and are not in our jar. We mostly use the maven_runtime deps in test scenarios, when we need to provide the jars ourselves.
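
To make this concrete, here is a minimal sketch of how a test target might pull in the provided jars. The target names and test source are hypothetical, and the exact attributes depend on your rules_scala version:

# Hypothetical test target (sketch): the app library compiles against the
# neverlink deps in @maven, while @maven_runtime supplies the jars that the
# cluster would normally provide, so the test has a complete runtime classpath.
scala_test(
    name = "app-test",
    srcs = ["AppTest.scala"],
    deps = [
        ":app-lib",
    ],
    runtime_deps = [
        "@maven_runtime//:org_scala_lang_scala_library",
    ],
)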

Build Targets

In the BUILD file, we mix Bazel Scala and Java rules to do the build.

The BUILD file looks like this:

scala_library(
    name = "scala-scala-app",
    srcs = [],  # app sources omitted for brevity
    deps = [
        ":java-lib1",
        "@maven//:org_scalaj_scalaj_http_2_11",
    ],
)

java_library(
    name = "java-lib1",
)

java_binary(
    name = "spark-app",
    create_executable = False,
    deps = [
        ":scala-scala-app",
    ],
)

On the java_binary target, we build the uber jar by building the implicit target spark-app_deploy.jar.
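
For example, assuming the BUILD file above lives in a package called app (a placeholder; adjust the label to your actual package), the deploy jar can be built with:

bazel build //app:spark-app_deploy.jar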

The Problem

Now I can finally talk about the problem.

We found that by building the uber jar this way, if you unzip the uber jar, it always contains the scala-library classes.

You will see entries like these:

inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$split$1.class
inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$takeSeq$1.class
inflated: scala/collection/parallel/IterableSplitter$Taken.class
inflated: scala/collection/parallel/IterableSplitter$Zipped$$anonfun$6.class
inflated: scala/collection/parallel/IterableSplitter$Zipped$$anonfun$split$3.class
inflated: scala/collection/parallel/IterableSplitter$Zipped.class
inflated: scala/collection/parallel/IterableSplitter$ZippedAll.class
inflated: scala/collection/parallel/IterableSplitter$class.class
inflated: scala/collection/parallel/IterableSplitter.class
inflated: scala/collection/parallel/ParIterable$.class
inflated: scala/collection/parallel/ParIterable$class.class

This is counter-intuitive: since we have marked org.scala-lang:scala-library:2.11.12 as neverlink, it shouldn't be included in the uber jar.

On second thought, it might make sense, as the purpose of an uber jar is to be a self-contained jar that can execute without needing anything else.

It seems there is special handling for the scala-library when we build an uber jar for Scala. In order to make the jar runnable by itself, the scala-library probably needs to be included inside the jar.

This would be fine in most cases: if you don't set userClassPathFirst to true, the Spark app will always use the jars within the Spark package in your environment first.

However, if you have to use userClassPathFirst, then you might run into jar conflicts.

If the conflicting jar is a Java jar, we can already solve it with neverlink in rules_jvm_external.
Things get weird when it happens to be the scala-library.

We are using Spark 2.4.5 as the base container image, and I checked /opt/spark/jars in the container file system: it has a jar named scala-library-2.11.12.jar, which is exactly the version we are using. My understanding is that in this case, even if we include the scala-library in our jar, it should be fine. But it was not; if someone knows why, please let me know. I am getting java.lang.ClassCastException errors, which based on my research can only happen when our app jar bundles its own conflicting scala-library. It seems we still get a scala-library conflict even when it is the identical version.

Rejected Alternative

The path I didn't take is to drop userClassPathFirst entirely, since many people warn that using userClassPathFirst is dangerous.
I gave that a quick try, then started to fall into another rabbit hole with many other jar conflicts (guava, fasterxml, etc.). To be honest, I still prefer using userClassPathFirst, since it gives us the freedom to control which jars are packaged in our app and have them take priority. We don't need to frequently tailor the provided jars in the Spark base images.

Solution

Given that we stick with userClassPathFirst, the main intuition is to exclude the scala-library from the deploy jar (uber jar).

Normally, we can make use of either neverlink or deploy_env in Bazel for this purpose.

I am guessing that neverlink works for most Java/Scala deps but not for the scala-library, since it is the standard library every Scala app needs to run, so neverlink is not honored for scala-library. (I need someone more familiar with Bazel builds for Scala to confirm this.)

deploy_env comes to the rescue

So I had to try the other option, which is deploy_env.

As the doc says:

A list of other java_binary targets which represent the deployment environment for this binary. Set this attribute when building a plugin which will be loaded by another java_binary.
Setting this attribute excludes all dependencies from the runtime classpath (and the deploy jar) of this binary that are shared between this binary and the targets specified in deploy_env.

Think of deploy_env as a way to subtract certain deps from your uber jar.

In order to make use of it, I need to make some changes to the build file.

  1. Every place I use a scala_library, I need to wrap it in a java_import and then use the java_import instead,

like below, to cut the tie from the java_library or java_binary to the scala_library:

scala_library(
    name = "scala-spark-app",
    srcs = [],  # app sources omitted for brevity
    deps = [
        ":java-lib1",
        "@maven//:org_scalaj_scalaj_http_2_11",
    ],
)

java_import(
    name = "scala-app",
    jars = [":scala-spark-app"],
)

java_library(
    name = "java-lib1",
)

# make sure this rule only sees java rules.
java_binary(
    name = "spark-app",
    create_executable = False,
    deps = [
        ":scala-app",
        ":java-lib1",
    ],
)

Not knowing the internal details, my finding is that if the java_binary depends on a scala_library directly, the uber jar you build will always include the scala-library.

  2. Configure deploy_env to subtract whatever is conflicting. This can serve as a generic approach to address Spark Scala app jar dependency conflicts.

java_binary(
    name = "provided-deps",
    create_executable = False,
    runtime_deps = [
        "@maven//:org_apache_commons_commons_lang3",
        "@maven_runtime//:org_scala_lang_scala_library",  # has to be this jar from maven_runtime
    ],
)

java_binary(
    name = "spark-app",
    create_executable = False,
    deploy_env = [
        ":provided-deps",
    ],
    deps = [
        ":scala-app",
        ":java-lib1",
    ],
)

We have to use "@maven_runtime//:org_scala_lang_scala_library", because "@maven//:org_scala_lang_scala_library" has been marked neverlink in our maven repo setup, so it would have no effect if I used it to do the subtraction.

While our previous two-Maven-repo-with-neverlink approach worked well before I encountered this problem, deploy_env seems not bad at all. I need to spend more effort to check whether we can get rid of the two-Maven-repo approach and cover all our neverlink needs with deploy_env alone.

Finally, the easiest way to validate the fix is to unzip your deploy jar; it should NOT contain any classes like these:

inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$split$1.class
inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$takeSeq$1.class
inflated: scala/collection/parallel/ParIterable$.class
inflated: scala/collection/parallel/ParIterable$class.class
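
A quick way to check (the exact path to the deploy jar depends on your package layout, so treat this as an example) is:

unzip -l bazel-bin/app/spark-app_deploy.jar | grep ' scala/'

which should print nothing once the fix is in place.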

I have to admit this finding and solution came from trial and error. I would like to know if this can be done better, or the details underneath. You can find me on Twitter.