Adding to Apache Hadoop’s Classpath

2016-08-06

One of the big pain points of administering Apache Hadoop is safely and efficiently adding to the classpath. The original design in Hadoop gave users a single way to add jars: the HADOOP_CLASSPATH environment variable. This is a bit of a problem for end users, admins, and any 3rd party applications that may all want to use that single entry point. As the ecosystem grows and new users become power users, a single tunable is simply no longer viable.

Apache Hadoop 3.x introduces many significant changes to the scripts. Some of those changes make managing the classpath easier and much more flexible for many use cases.

Per-User Changes

Apache Hadoop now recognizes a file called ${HOME}/.hadooprc. This file lets you use the Unix Shell API to manipulate all sorts of things. One of those API calls is hadoop_add_classpath. This shell function does exactly what it says: it adds a single file or directory to the classpath.

A simple .hadooprc, with a placeholder path standing in for wherever mycoolfeature.jar actually lives:
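```bash
# .hadooprc is sourced by the Hadoop shell scripts at startup, so the
# full Unix Shell API is available; the path below is a placeholder
hadoop_add_classpath /some/path/mycoolfeature.jar
```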

This simply adds mycoolfeature.jar to every local invocation of Java that the Hadoop scripts launch for you. You can also add an entire directory (again, the path is a placeholder):
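```bash
# a directory of classes and resources; a trailing '*' wildcard
# (e.g., /some/path/lib/*) would pull in every jar there instead
hadoop_add_classpath /some/path/mycoolclasses
```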

By default, hadoop_add_classpath always appends. If we want our new entry to appear near the beginning of the classpath instead, we can pass the before flag:
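```bash
# "before" prepends the entry rather than appending it
hadoop_add_classpath /some/path/mycoolfeature.jar before
```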

But this only works for each user separately. What about the cases where a classpath needs to get added for all daemons?

Hadoop Tools

A typical use case of modifying HADOOP_CLASSPATH for all processes is to add portions of the tools directory, e.g., enabling S3 support globally. Another new feature in Apache Hadoop 3.x is the HADOOP_OPTIONAL_TOOLS environment variable. In hadoop-env.sh, there is an entry that should look something like this:
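```bash
# as shipped in hadoop-env.sh (commented out); the exact list of
# available tools varies by release
# export HADOOP_OPTIONAL_TOOLS="hadoop-aws,hadoop-azure,hadoop-kafka,hadoop-openstack"
```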

This line lists all the various tools that may be added. To enable S3 support, simply update hadoop-env.sh on all nodes and restart all the Hadoop daemons:
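```bash
# in hadoop-env.sh on every node; hadoop-aws provides the S3 support
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
```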

This line tells the scripts to make sure the S3 libraries and their dependencies are in the classpath. With this method, there is no need to worry about these features falling out of the classpath; they will always be available.

Custom Classpath Cluster-wide

Adding custom jars to the daemon classpath can also be done without modifying HADOOP_CLASSPATH. The shellprofile.d feature allows one to create bash snippets similar to what one would find in /etc/profile.d, but with more structure.

First, create the ${HADOOP_CONF_DIR}/shellprofile.d directory if it doesn’t exist. Then create a file in that directory called sitepath.sh along these lines (the /opt/site/lib directory below is a placeholder):
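```bash
# sitepath.sh: registers a site-wide shell profile
# (/opt/site/lib is a placeholder; point it at your real directory)
hadoop_add_profile sitepath

function _sitepath_hadoop_classpath
{
  hadoop_add_classpath /opt/site/lib
}
```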

Copy it to all nodes and fire away!

The hadoop_add_profile line registers the shell profile with the system, which allows it to call functions by name at the proper time. For classpath work, Hadoop expects the function to take the form _(registered name)_hadoop_(utility), where the “registered name” is the parameter to the hadoop_add_profile command and the “utility” is a predefined name that provides specific functionality. In this example, we’ve registered the shell profile as “sitepath”, and the particular feature we want is “classpath”. Therefore, we define a function called _sitepath_hadoop_classpath that executes the hadoop_add_classpath routine.

Again, every execution of the Hadoop scripts will now add this directory while building the classpath. One significant advantage of this method is that end users cannot easily override it, making it perfect for 3rd party utilities.

Something Broke!

Classpath problems are a tricky business, especially when working with something as complex as Hadoop; it’s a bit of a black box on the inside. Luckily, there is a new --debug flag which can offer a lot of help. It sends output to stderr marked with ‘DEBUG:’. It generates a lot of text (since it helps debug much more than just classpaths), and that may be pretty daunting. But filtering out all the non-classpath bits leaves a much more manageable trace.
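One quick way to do that filtering (the grep is merely a convenience, not part of Hadoop):

```bash
# DEBUG: lines go to stderr, so merge the streams before
# narrowing the output down to the classpath activity
hadoop --debug classpath 2>&1 | grep -i classpath
```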

The trace first confirms that we passed the classpath command. After a bit of other work, we see the classpath being initialized and content starting to flow in. Next, the Apache Hadoop subprojects’ profiles get executed and add their content to the classpath. One entry, however, is particularly interesting. The new code in 3.x attempts to sanitize the classpath by checking for duplicates and for whether those paths exist, and here we see that the mapred profile tried to add a directory that doesn’t exist. That’s likely the sign of a bug or an installation problem! A bit more processing and finally we get to the end and see the output.

Whither HADOOP_CLASSPATH?

In the end, it may seem like HADOOP_CLASSPATH is deprecated. Not quite! HADOOP_CLASSPATH is still used for things like the dynamic class loaders. But these new capabilities do free it up for end users, totally and completely.
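Since end users now have HADOOP_CLASSPATH all to themselves, a one-off addition (with a placeholder jar path) can be as simple as:

```bash
# per-invocation addition; 'hadoop classpath' prints the effective
# classpath, making it easy to verify the entry landed
HADOOP_CLASSPATH=/tmp/myextras.jar hadoop classpath
```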
