Building Your Own Apache Hadoop Distribution

By Allen Wittenauer | 2016-08-16


While BUILDING.txt includes a lot of hints about the various options for building Apache Hadoop, turning those directions into something that can actually be deployed is daunting. In fact, many regular contributors to the project don’t even know how the Apache Software Foundation builds a release!

Inside the dev-support/bin directory, Apache Hadoop ships a helper utility called create-release that removes the guesswork. Although it was built to help project release managers simplify their tasks, you can take advantage of it to make your own release process easier, too.

Default Behavior

By default, just running create-release from any directory in the source repository will do a few things:

  • Verify the repository passes ASF license requirements
  • Build the Java components without any native (read: C/C++ compiled) parts
  • Build the website and all the documentation, including generating the release notes from the ASF JIRA system
  • Provide MD5 checksums to verify that transfers match
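
For example, a minimal run looks like this (a sketch; it assumes the source tree is checked out in a directory named hadoop):

$ cd hadoop
$ dev-support/bin/create-release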

After finishing (which will take a while!), the built artifacts will be in the target/artifacts directory.


$ ls -1 hadoop/target/artifacts
CHANGES.md
CHANGES.md.md5
RELEASENOTES.md
RELEASENOTES.md.md5
hadoop-3.0.0-alpha2-SNAPSHOT-rat.txt
hadoop-3.0.0-alpha2-SNAPSHOT-rat.txt.md5
hadoop-3.0.0-alpha2-SNAPSHOT-site.tar.gz
hadoop-3.0.0-alpha2-SNAPSHOT-site.tar.gz.md5
hadoop-3.0.0-alpha2-SNAPSHOT-src.tar.gz
hadoop-3.0.0-alpha2-SNAPSHOT-src.tar.gz.md5
hadoop-3.0.0-alpha2-SNAPSHOT.tar.gz
hadoop-3.0.0-alpha2-SNAPSHOT.tar.gz.md5

The artifacts should work everywhere Java works. create-release, however, can do fancier builds with more features and some optimizations. Using the --help flag, we see the following:


$ ./create-release --help
--artifactsdir=[path] Path to use to store release bits
--asfrelease Make an ASF release
--docker Use Hadoop's Dockerfile for guaranteed environment
--dockercache Use a Docker-private maven cache
--logdir=[path] Path to store logs
--mvncache=[path] Path to the maven cache to use
--native Also build the native components
--rc-label=[label] Add this label to the builds
--sign Use .gnupg dir to sign the artifacts and jars
--version=[version] Use an alternative version string

If something goes wrong, check the logs stored in the patchprocess directory:

$ ls -1 patchprocess/*log
patchprocess/mvn_apache_rat.log
patchprocess/mvn_clean.log
patchprocess/mvn_install.log
patchprocess/mvn_site.log

Building the Native Components

Adding the OS-specific code to your release is usually the first priority. There are several components, but the two big ones are libhadoop.so (libhadoop.dylib on Mac OS X) and the container-executor. The former is a JNI library that enables a lot of extra functionality as well as faster versions of some features that are also implemented in Java. The latter enables the LinuxContainerExecutor functionality that, despite the name, actually works on most Unix operating systems to provide significantly better security.

For a successful compile of the native components, you’ll need to make sure your build environment has all the necessary prerequisites (see BUILDING.txt for details). If you are building on Linux, however, the Docker support described later can make this much easier.
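
With the prerequisites in place, enabling the native bits is just one more flag. As a sketch, once the resulting tarball has been deployed somewhere, the stock hadoop checknative command will report which native libraries actually made it into the build:

$ dev-support/bin/create-release --native

# after deploying the resulting tarball:
$ hadoop checknative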

Changing Locations of Things

You can tell create-release to use different directories for certain operations. The --logdir and --artifactsdir options are fairly self-explanatory. But what is the --mvncache option?

Apache Hadoop uses maven as its build tool. Maven downloads Java dependencies as it compiles and stores them in a local cache. This local cache, however, has a huge gotcha: there is no locking, so multiple maven executions may collide while sharing it. The --mvncache option lets you point create-release at a different directory for its cache, making the tool safe for concurrent maven runs.
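
For instance, two builds running side by side in separate checkouts could each get a private cache (the directory names here are purely illustrative):

# in the first checkout
$ dev-support/bin/create-release --mvncache=/tmp/m2-build1

# in a second checkout, running at the same time
$ dev-support/bin/create-release --mvncache=/tmp/m2-build2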

Signing Your Build

If gpg and gpg-agent are available, the --sign option will also sign the jars and artifacts. This is especially useful if you follow up the build with a mvn deploy to upload it to something like Artifactory or Nexus.
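
As a sketch, assuming your signing key already lives under ~/.gnupg (the exact gpg-agent setup varies by gpg version):

$ eval $(gpg-agent --daemon)   # start the agent so signing can happen non-interactively
$ dev-support/bin/create-release --sign --native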

Changing Versions

Out of the box, Apache Hadoop typically encodes the version as x.y.z-SNAPSHOT, where x.y.z is the release currently under development in that particular branch. For an actual deployment, that is rarely what you want. Passing the --version flag allows you to override that string with something else; the new string is also compiled into the build so that hadoop version will report it. The --rc-label option changes the names of the tarballs so that they carry an extra suffix. Putting these together, create-release --version=3.0.0-EM --rc-label=-RC1 will create Hadoop v3.0.0-EM stored in a tarball called hadoop-3.0.0-EM-RC1.tar.gz.
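
In other words:

$ dev-support/bin/create-release --version=3.0.0-EM --rc-label=-RC1
# result: target/artifacts/hadoop-3.0.0-EM-RC1.tar.gz, reporting itself as 3.0.0-EM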

Docker Support

Getting all the prerequisites for building Hadoop’s OS-native features is a time-consuming process. Luckily, Apache Hadoop has shipped with a Dockerfile for over a year now that is kept up to date with everything you need to build those features. create-release can take advantage of that file via the --docker option, which runs create-release in a Docker container built from that Dockerfile. Any options that reference other directory paths, e.g., --logdir, will get mounted inside the container so that they work as expected.
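
For example, a Dockerized native build that keeps its logs on the host might look like this (the log path is just illustrative):

$ dev-support/bin/create-release --docker --native --logdir=/tmp/hadoop-build-logs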

With the help of Docker, create-release can also guarantee a “fresh” cache. This is a useful exercise to guarantee that all dependencies are downloadable. The --dockercache option forces using a fresh maven cache directory.
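
Combining the two gives a sketch of a fully isolated build:

$ dev-support/bin/create-release --docker --dockercache --native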

Apache Hadoop Release Management

One of the key reasons create-release exists is to help ASF Apache Hadoop Release Managers offer a consistent build environment over time, which includes setting the proper flags so that create-release itself behaves consistently. The --asfrelease flag means ASF RMs don’t have to research or memorize exactly how to make a release. It also adds some extra capabilities, such as verifying that a signed release has a valid public key in the ASF master repositories. So while the --asfrelease flag might be tempting, unless you are a committer for the Apache Hadoop project it probably won’t work for you. 🙂

Summary

create-release is an easy way to build Apache Hadoop. It provides tooling to guarantee a consistent environment and consistent build parameters. While it was built for ASF usage, you can also use it to get the most out of your local installation.

2 thoughts on “Building Your Own Apache Hadoop Distribution”

  1. david serafini

    I’m guessing that the docker option only works if you are using an Intel x86 compatible CPU, so users of ARM and Power CPUs are on their own. Am I right?

    1. Allen Wittenauer (post author)

      Using the built-in Dockerfile, at least as of today, it only works on x86. But one could replace the Dockerfile prior to running create-release and it should work just fine. (It’s unfortunate that it isn’t really possible to make a multi-platform Dockerfile due to limitations in the file format.)
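
      Roughly, assuming the stock Dockerfile lives at dev-support/docker/Dockerfile in the source tree, it would be something like:

      $ cp /path/to/your-arm-Dockerfile dev-support/docker/Dockerfile
      $ dev-support/bin/create-release --docker --native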

