A Generic Way to Address Dependency Conflicts for a Bazel-built Scala Spark App

2021-04-01

6 minute read

Introduction

Recently I have been working on containerizing a Scala Spark app and deploying it to Kubernetes with Helm. I ran into this issue:

Caused by: java.lang.RuntimeException: java.lang.ClassCastException: 
cannot assign instance of scala.concurrent.duration.FiniteDuration 
to field org.apache.spark.rpc.RpcTimeout.duration 
of type scala.concurrent.duration.FiniteDuration 
in instance of org.apache.spark.rpc.RpcTimeout

It happens when I have to set

spark.driver.userClassPathFirst: true
spark.executor.userClassPathFirst: true

This is a well-solved problem if you build the jar with Maven and use a shaded jar. Unfortunately, we build with Bazel, and we don’t have a good way to shade a jar.

Before diving into details of the problem, let me briefly introduce how we are doing the build in bazel.

Manage external dependencies

We currently use rules_jvm_external to manage all Maven dependencies. We mark all Spark-related jars as neverlink = True. We also mark scala-library as neverlink.

The WORKSPACE looks like this:

load("@rules_jvm_external//:defs.bzl", "maven_install")
load("@rules_jvm_external//:specs.bzl", "maven")

# Compile-time dependencies. Artifacts marked neverlink stay out of the deploy jar.
# MAVEN_REPOSITORIES is defined elsewhere.
maven_install(
    name = "maven",
    artifacts = [
        maven.artifact(
            "org.apache.spark",
            "spark-core_2.11",
            "2.4.5",
            neverlink = True,
        ),
        maven.artifact(
            "org.scala-lang",
            "scala-library",
            "2.11.12",
            neverlink = True,
        ),
    ],
    excluded_artifacts = [
    ],
    fetch_sources = False,
    repositories = MAVEN_REPOSITORIES,
    version_conflict_policy = "pinned",
)

# Runtime dependencies, for when we have to package the jars ourselves.
maven_install(
    name = "maven_runtime",
    artifacts = [
        "org.scala-lang:scala-library:2.11.12",
        "org.scala-lang:scala-compiler:2.11.12",
        "org.scala-lang:scala-reflect:2.11.12",
    ],
    excluded_artifacts = [
    ],
    fetch_sources = False,
    repositories = MAVEN_REPOSITORIES,
    version_conflict_policy = "pinned",
)

You might ask why we have two Maven repos. This is our current solution to mimic Maven’s provided scope.

You can think of the maven repo as providing compile dependencies only (we mark as neverlink those we don’t want to appear in the uber/fat jar; we use "uber jar" and "fat jar" interchangeably), and of the maven_runtime repo as providing runtime dependencies only. In an environment where the dependencies are provided (the jars are already there and don’t need to be packaged into the fat jar), we don’t have to do anything, as these jars have been marked as neverlink and are not in the jar. We mostly use the maven_runtime deps in test scenarios, when we need to provide the jars ourselves (package them in the fat jar).

Build Targets

In the BUILD file, we mix Bazel’s Scala and Java rules to do the build.

The BUILD file looks like this:

# Assumes rules_scala is registered under its conventional name, io_bazel_rules_scala.
load("@io_bazel_rules_scala//scala:scala.bzl", "scala_library")

scala_library(
    name = "scala-spark-app",
    srcs = [],  # source files omitted here
    deps = [
        ":java-lib1",
        "@maven//:org_scalaj_scalaj_http_2_11",
    ],
)

java_library(
    name = "java-lib1",
)

java_binary(
    name = "spark-app",
    create_executable = False,
    deps = [
        ":scala-spark-app",
    ],
)

We build the uber jar by building the implicit target spark-app_deploy.jar of the java_binary target, which is how Bazel builds a fat jar.

The Problem

Now I can finally talk about the problem.

We found that a fat jar built this way always contains the scala-library.

If you unzip it, you will see entries like these:

 inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$split$1.class
 inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$takeSeq$1.class
 inflated: scala/collection/parallel/IterableSplitter$Taken.class
 inflated: scala/collection/parallel/IterableSplitter$Zipped$$anonfun$6.class
 inflated: scala/collection/parallel/IterableSplitter$Zipped$$anonfun$split$3.class
 inflated: scala/collection/parallel/IterableSplitter$Zipped.class
 inflated: scala/collection/parallel/IterableSplitter$ZippedAll.class
 inflated: scala/collection/parallel/IterableSplitter$class.class
 inflated: scala/collection/parallel/IterableSplitter.class
 inflated: scala/collection/parallel/ParIterable$.class
 inflated: scala/collection/parallel/ParIterable$class.class

This is counter-intuitive: we have marked org.scala-lang:scala-library:2.11.12 as neverlink, so it shouldn’t be included in the uber jar.

On second thought, it might make sense, as the purpose of an uber jar is to be a self-contained jar that can execute without anything else.

It seems there is special handling for the scala-library when we build an uber jar for Scala. For the jar to run by itself, the fat jar probably needs to include the scala-library.

This is fine in most cases: if you don’t set userClassPathFirst to true, the Spark app will always use the jars shipped with the Spark package in your environment first.

However, if you have to use userClassPathFirst, you might run into jar conflicts.
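The difference between the two settings can be sketched with a toy lookup model. This is only an illustration in Python, not Spark's actual classloader code: the function resolve, the dictionaries, and the class com.example.Job are all made up for the example.

```python
def resolve(class_name, spark_jars, user_jars, user_first):
    """Return the jar that supplies class_name in this simplified lookup model."""
    search_order = [user_jars, spark_jars] if user_first else [spark_jars, user_jars]
    for jars in search_order:
        if class_name in jars:
            return jars[class_name]
    raise LookupError(class_name)


# Hypothetical contents: class name -> jar that contains it.
spark_jars = {"scala.Option": "/opt/spark/jars/scala-library-2.11.12.jar"}
user_jars = {
    "scala.Option": "spark-app_deploy.jar",
    "com.example.Job": "spark-app_deploy.jar",  # made-up application class
}

for user_first in (False, True):
    print(user_first, resolve("scala.Option", spark_jars, user_jars, user_first))
```

In this model, a class that exists in both places comes from the Spark jars by default and from the app jar once userClassPathFirst is on, which is why the contents of the fat jar suddenly matter.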

If the conflicting jar is a Java jar, we can already solve it with neverlink in rules_jvm_external. Things get weird when it happens to be the scala-library.

We use Spark 2.4.5 as the base container image. I checked /opt/spark/jars in the container file system, and it has a jar named scala-library-2.11.12.jar, which is exactly the version we are using. My understanding was that in this case, even if we include the scala-library in our jar, it should be fine. But it was not. If someone knows why, please let me know. Anyway, I am getting the java.lang.ClassCastException error, which only seems to happen if our app jar includes a scala-library. It also seems the scala-library conflicts with itself even when the version is identical.
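A likely explanation, stated here as an assumption rather than something verified on this setup: the JVM identifies a class by its name together with the classloader that defined it. If FiniteDuration is loaded once from Spark's jars and once from the app jar, the two copies are different types even when the bytes are identical, which would match the odd "cannot assign FiniteDuration to FiniteDuration" message. The Python sketch below is only an analogy (load_class is a made-up helper, and Python is not the JVM): it defines the same class twice from the same source and shows that the two results are not interchangeable.

```python
SOURCE = "class FiniteDuration:\n    pass\n"


def load_class(source, name):
    """Define the class in a fresh namespace, standing in for a separate classloader."""
    namespace = {}
    exec(source, namespace)
    return namespace[name]


from_spark = load_class(SOURCE, "FiniteDuration")  # copy "loaded" from Spark's jars
from_app = load_class(SOURCE, "FiniteDuration")    # copy "loaded" from the app jar

value = from_app()
print(from_spark.__name__ == from_app.__name__)  # same name
print(from_spark is from_app)                    # but not the same class object
print(isinstance(value, from_spark))             # so the instance check fails
```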

Rejected Alternative

The path I didn’t take is to NOT use userClassPathFirst. I gave it a quick try, and then I started to fall into another rabbit hole with many other jar conflicts such as guava, fasterxml etc. TBH, I still prefer using userClassPathFirst, since it gives us the freedom to control which jars are packaged in our app and to give them priority. We don’t need to change the provided jars in the Spark base images from time to time.

Solution

Given that we really want to stick with userClassPathFirst, the main idea of the solution is to exclude the scala-library from the deploy jar (the fat jar).

Normally, we can make use of either neverlink or deploy_env with bazel for this purpose.

neverlink not working in this case

I am guessing that neverlink works on most Java/Scala deps but not on the scala-library, as it is the standard library every Scala app needs to run, so neverlink is not honored for scala-library. (I need someone who is more familiar with Bazel builds for Scala to confirm this.)

deploy_env comes to the rescue

So I had to try the other one, which is deploy_env.

As the doc says:

A list of other java_binary targets which represent the deployment environment for this binary. Set this attribute when building a plugin which will be loaded by another java_binary. Setting this attribute excludes all dependencies from the runtime classpath (and the deploy jar) of this binary that are shared between this binary and the targets specified in deploy_env.

Think of deploy_env as a way to subtract certain deps from your uber jar.
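The subtraction can be pictured as a set difference over runtime classpaths. The sketch below is a simplified Python model of the quoted documentation, not Bazel's implementation; the function deploy_jar_contents and the jar names are illustrative and are not read from a real build.

```python
def deploy_jar_contents(binary_runtime_classpath, deploy_env_classpaths):
    """Model: keep every jar of the binary that no deploy_env target also has."""
    provided = set()
    for classpath in deploy_env_classpaths:
        provided.update(classpath)
    return [jar for jar in binary_runtime_classpath if jar not in provided]


# Illustrative jar names only.
spark_app = [
    "scala-library-2.11.12.jar",
    "commons-lang3.jar",
    "scalaj-http_2.11.jar",
    "scala-spark-app.jar",
]
environment = ["commons-lang3.jar", "scala-library-2.11.12.jar"]

print(deploy_jar_contents(spark_app, [environment]))
```

Only jars that appear on both sides are removed, so a jar listed in the environment under a different label or version would not be subtracted in this model.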

To make use of it, I need to make some changes to the BUILD file.

Every place I use scala_library, I need to add a java_import and then use that from the Java rules,

like below, to cut the tie between scala_library and java_library or java_binary:

scala_library(
    name = "scala-spark-app",
    srcs = [],  # source files omitted here
    deps = [
        ":java-lib1",
        "@maven//:org_scalaj_scalaj_http_2_11",
    ],
)

# Wrap the Scala target so that the Java rules below never see a scala_library directly.
java_import(
    name = "scala-app",
    jars = [":scala-spark-app"],
)

java_library(
    name = "java-lib1",
)

# Make sure this rule only sees Java rules.
java_binary(
    name = "spark-app",
    create_executable = False,
    deps = [
        ":scala-app",
        ":java-lib1",
    ],
)

I don’t know why this works, but my finding is that if the java_binary depends on a scala_library directly, the uber jar will always include the scala-library.

Then configure deploy_env to subtract whatever is conflicting. This can serve as a generic approach to address jar dependency conflicts for a Scala Spark app.

# Represents the jars that the deployment environment already provides.
java_binary(
    name = "provided-deps",
    create_executable = False,
    runtime_deps = [
        "@maven//:org_apache_commons_commons_lang3",
        "@maven_runtime//:org_scala_lang_scala_library",  # has to be this jar from maven_runtime
    ],
)

java_binary(
    name = "spark-app",
    create_executable = False,
    deploy_env = [
        ":provided-deps",
    ],
    deps = [
        ":scala-app",
        ":java-lib1",
    ],
)

We have to use "@maven_runtime//:org_scala_lang_scala_library" because "@maven//:org_scala_lang_scala_library" has been marked neverlink in our Maven repo setup, so using it for the subtraction would have no effect.

Our two-Maven-repo-with-neverlink approach worked really well before I encountered this problem, but deploy_env seems not bad at all. I need to spend more effort checking whether we can get rid of the two-Maven-repo approach and cover all our neverlink needs with deploy_env alone.

Finally, the easiest way to validate the fix is to unzip your deploy jar; it should NOT contain any classes like these:

 inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$split$1.class
 inflated: scala/collection/parallel/IterableSplitter$Taken$$anonfun$takeSeq$1.class
 inflated: scala/collection/parallel/ParIterable$.class
 inflated: scala/collection/parallel/ParIterable$class.class
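The same check can be scripted instead of reading the unzip output by eye. Here is a minimal Python sketch using the standard zipfile module; the function name scala_library_entries is made up, and it assumes that anything under the scala/ directory comes from the Scala standard jars rather than from your own code.

```python
import zipfile


def scala_library_entries(jar_path, prefix="scala/"):
    """Return the .class entries in the jar whose path starts with prefix."""
    with zipfile.ZipFile(jar_path) as jar:
        return [
            name
            for name in jar.namelist()
            if name.startswith(prefix) and name.endswith(".class")
        ]
```

Calling scala_library_entries with the path to your spark-app_deploy.jar should return an empty list once the fix is in place. Packages such as scalaj/ are not matched, because the prefix includes the trailing slash.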

I have to admit this finding and solution are all trial and error. I would like to know if this can be done better, and to understand the details underneath. You can find me on twitter.