Recently I have been working on containerizing/helm-ing/k8s-ing a Scala app.
I am running into errors like `Caused by: java.lang.RuntimeException: java.lang.ClassCastException:` when I have to use `userClassPathFirst`.
This is a well-solved problem if you are using Maven and can build a shaded jar. Unfortunately, we use Bazel to do the build, and we don't have a good solution for shading a jar.
Before diving into the details of the problem, let me briefly introduce how we do the build in Bazel.
We currently use `rules_jvm_external` to manage all Maven dependencies. We mark all Spark-related jars as `neverlink = True`, and we also mark `scala-library` as `neverlink`.
The WORKSPACE looks like this
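A minimal sketch of the shape, assuming `rules_jvm_external`'s spec API (the artifact lists and versions here are illustrative, not our exact setup):

```starlark
load("@rules_jvm_external//:defs.bzl", "maven_install")
load("@rules_jvm_external//:specs.bzl", "maven")

# Compile-time repo: Spark and scala-library are marked neverlink,
# so they are on the compile classpath but never linked into outputs.
maven_install(
    name = "maven",
    artifacts = [
        maven.artifact("org.apache.spark", "spark-core_2.11", "2.4.5", neverlink = True),
        maven.artifact("org.scala-lang", "scala-library", "2.11.12", neverlink = True),
    ],
    repositories = ["https://repo1.maven.org/maven2"],
)

# Runtime repo: the same artifacts without neverlink, for the cases
# (mostly tests) where we need to provide the jars ourselves.
maven_install(
    name = "maven_runtime",
    artifacts = [
        "org.apache.spark:spark-core_2.11:2.4.5",
        "org.scala-lang:scala-library:2.11.12",
    ],
    repositories = ["https://repo1.maven.org/maven2"],
)
```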
To understand why we have two Maven repos: this is our current solution for mimicking Maven's `provided` scope. You can think of the `maven` repo as providing compile-only dependencies (we mark the ones we don't want to link as `neverlink`), while `maven_runtime` provides runtime dependencies only. In an environment where the dependencies are provided, we don't have to do anything, since these jars are marked `neverlink` and are not in our jar. We mostly use the `maven_runtime` deps in test scenarios, when we need to provide the jars ourselves.
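As an illustration, a test target can pull in the runtime copies explicitly (a sketch; target names and file paths are assumptions):

```starlark
load("@io_bazel_rules_scala//scala:scala.bzl", "scala_test")

scala_test(
    name = "app_test",
    srcs = ["src/test/scala/com/example/AppTest.scala"],
    deps = [":app_lib"],
    runtime_deps = [
        # Provide the jars that are neverlink in @maven.
        "@maven_runtime//:org_apache_spark_spark_core_2_11",
        "@maven_runtime//:org_scala_lang_scala_library",
    ],
)
```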
In the BUILD file, we mix Bazel's Scala and Java rules to do the build. The BUILD file looks something like this.
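A minimal sketch of the shape (target names, source paths, and the main class are assumptions):

```starlark
load("@io_bazel_rules_scala//scala:scala.bzl", "scala_library")

scala_library(
    name = "app_lib",
    srcs = glob(["src/main/scala/**/*.scala"]),
    deps = [
        # neverlink in @maven: compile-only, not packaged.
        "@maven//:org_apache_spark_spark_core_2_11",
    ],
)

java_binary(
    name = "app",
    main_class = "com.example.Main",
    runtime_deps = [":app_lib"],
)
```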
On the `java_binary` target, we build the uber jar by building the implicit `<name>_deploy.jar` target.
Now I can finally talk about the problem.
We found that when building the uber jar this way, if you unzip the uber jar, it always contains the scala-library classes (entries under `scala/`). This is counter-intuitive: we have marked `org.scala-lang:scala-library:2.11.12` as `neverlink`, so it shouldn't be included in the uber jar.
On second thought, it might make sense, as the purpose of an uber jar is to be a self-contained jar that can execute without needing anything else. It seems there is special handling for the scala-library when building an uber jar for Scala: to make the jar runnable by itself, the scala-library probably needs to be included.
This would be fine in most cases: if you don't set `userClassPathFirst` to true, the Spark app will always use the jars within the Spark package in your environment first. However, if you happen to have to use `userClassPathFirst`, then you might run into jar conflicts.
If the conflicting jar is a plain Java jar, we can already solve it with the `neverlink` approach above. Things go weird when the conflict happens to be the scala-library.
We are using Spark 2.4.5 as the base container image. I checked `/opt/spark/jars` in the container file system, and it has a jar named `scala-library-2.11.12.jar`, which is exactly the version we are using. My understanding is that in this case, even if we include the scala-library in our jar, it should be fine. But it was not; if someone knows why, please let me know. I am getting `java.lang.ClassCastException` errors, which based on my research can only happen if our app jar has a conflicting scala-library. It seems we still get scala-library conflicts even when it is the identical version.
The path I didn't take is to stop using `userClassPathFirst`, since people warn that using it is dangerous. I gave that a quick try, but then I started to fall into another rabbit hole of jar conflicts with `fasterxml` (Jackson) and others. To be honest, I still prefer using `userClassPathFirst`, since it gives us the freedom to control which jars are packaged in our app and to use them with priority; we don't need to frequently tailor the provided jars in the Spark base images.
Given that we stick with `userClassPathFirst`, the main idea is to exclude the scala-library from the deploy jar (uber jar). Normally, we can use either `neverlink` or `deploy_env` in Bazel for this purpose.
I am guessing that `neverlink` works for most Java/Scala deps but not for the scala-library: it is the standard library that every Scala app needs to run, so `neverlink` is not honored for it. (I need someone more familiar with Bazel's Scala rules to confirm this.)
Then I have to try the other one, which is `deploy_env`. As the doc says:
A list of other java_binary targets which represent the deployment environment for this binary. Set this attribute when building a plugin which will be loaded by another java_binary.
Setting this attribute excludes all dependencies from the runtime classpath (and the deploy jar) of this binary that are shared between this binary and the targets specified in deploy_env.
`deploy_env` gives you a way to subtract certain deps from your uber jar.
In order to make use of it, I need to make some changes to the BUILD file:
- every place I use a `scala_library`, I need to do a `java_import` of it and depend on that instead, to cut the direct tie to the `scala_library` target.
Not knowing the full details, my finding is that if the `java_binary` depends on a `scala_library` directly, any uber jar you build from it will always include the scala-library. With the `java_import` in between, I can then use `deploy_env` to subtract whatever is conflicting. This can serve as a generic approach to addressing jar dependency conflicts in Spark Scala apps.
We have to use the scala-library from the runtime repo for the subtraction, because `"@maven//:org_scala_lang_scala_library"` has been marked `neverlink` in our Maven repo setup, so using `"@maven//:org_scala_lang_scala_library"` to do the subtraction would have no effect.
While our previous two-Maven-repo `neverlink` approach worked well before I ran into this problem, `deploy_env` seems not bad at all. I need to spend more effort checking whether we can get rid of the two-Maven-repo approach and solve all our `neverlink` needs with `deploy_env`.
At last, the easiest way to validate the fix is to unzip your deploy jar (e.g. `unzip -l app_deploy.jar | grep 'scala/'`): it should NOT contain any scala-library classes such as `scala/Predef$.class`.
I have to admit this finding and solution came from trial and error. I would like to know if this can be done better, or what the underlying details are. You can find me on Twitter.