One of the features in Apache Hadoop 3.x is the ability to replace how the shell scripts work without having to support a code fork. This feature makes adding support for site- or OS-specific features, such as resource controls around daemons, really easy.
To keep things simple, let’s add Linux cgexec support for non-secure daemons. Doing this work for other environments/commands (Linux numactl, FreeBSD jails, etc.) should require only relatively minor edits to what is presented here (there is a brief numactl sketch at the end of this post). Secure daemons work similarly but have a bit more complexity for a variety of reasons. Maybe I’ll cover that in the future….
Let’s get started!
Code Location
To make things easy, let’s work with the NameNode. We know that to start the NameNode we use the hdfs --daemon start namenode command. Time to dig into the hdfs code. Nothing looks particularly interesting until the end:
if [[ "${HADOOP_SUBCMD_SUPPORTDAEMONIZATION}" = true ]]; then
  if [[ "${HADOOP_SUBCMD_SECURESERVICE}" = true ]]; then
    hadoop_secure_daemon_handler \
      "${HADOOP_DAEMON_MODE}" \
      "${HADOOP_SUBCMD}" \
      "${HADOOP_CLASSNAME}" \
      "${daemon_pidfile}" \
      "${daemon_outfile}" \
      "${priv_pidfile}" \
      "${priv_outfile}" \
      "${priv_errfile}" \
      "${HADOOP_SUBCMD_ARGS[@]}"
  else
    hadoop_daemon_handler \
      "${HADOOP_DAEMON_MODE}" \
      "${HADOOP_SUBCMD}" \
      "${HADOOP_CLASSNAME}" \
      "${daemon_pidfile}" \
      "${daemon_outfile}" \
      "${HADOOP_SUBCMD_ARGS[@]}"
  fi
  exit $?
else
  # shellcheck disable=SC2086
  hadoop_java_exec "${HADOOP_SUBCMD}" "${HADOOP_CLASSNAME}" "${HADOOP_SUBCMD_ARGS[@]}"
fi
So there are two functions here to look at since we’re concentrating on daemons (i.e., HADOOP_SUBCMD_SUPPORTDAEMONIZATION will be set to true).
* hadoop_secure_daemon_handler
* hadoop_daemon_handler
We know from the Unix Shell Guide that these are found in Hadoop’s function library.
Looking at the function names and the accompanying documentation to confirm, we know that hadoop_secure_daemon_handler is for secure daemons, so we’ll ignore that one. That leaves us with hadoop_daemon_handler. Looking at that code, it does a bunch of setup work and then calls one of two other functions:
if [[ "$daemonmode" = "default" ]]; then
  hadoop_start_daemon "${daemonname}" "${class}" "${daemon_pidfile}" "$@"
else
  hadoop_start_daemon_wrapper "${daemonname}" \
    "${class}" "${daemon_pidfile}" "${daemon_outfile}" "$@"
fi
hadoop_start_daemon_wrapper is just as its name implies: a wrapper around hadoop_start_daemon… which means the function we want to target is hadoop_start_daemon.
Analyzing hadoop_start_daemon
Here’s what the bundled function looks like, after stripping out the comments:
function hadoop_start_daemon
{
  local command=$1
  local class=$2
  local pidfile=$3
  shift 3

  hadoop_debug "Final CLASSPATH: ${CLASSPATH}"
  hadoop_debug "Final HADOOP_OPTS: ${HADOOP_OPTS}"
  hadoop_debug "Final JAVA_HOME: ${JAVA_HOME}"
  hadoop_debug "java: ${JAVA}"
  hadoop_debug "Class name: ${class}"
  hadoop_debug "Command line options: $*"

  echo $$ > "${pidfile}" 2>/dev/null
  if [[ $? -gt 0 ]]; then
    hadoop_error "ERROR: Cannot write ${command} pid ${pidfile}."
  fi

  export CLASSPATH
  exec "${JAVA}" "-Dproc_${command}" ${HADOOP_OPTS} "${class}" "$@"
}
We have a bunch of debug statements so that using the --debug flag prints useful information before the daemon launches. We’ve got the writing of the pid file, since doing that in Java is painful. The CLASSPATH environment variable is exported so that the JVM knows where to find everything. Finally, we have the exec of Java itself. That last line is the one we want to change.
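As an aside, those hadoop_debug lines are exactly what the --debug flag mentioned above turns on, so while experimenting you can watch what the script is doing with something like:

hdfs --debug --daemon start namenode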
Preliminary Work: Getting Ready to Replace
Let’s copy this function without changes to make sure we can replace it. Create a file in HADOOP_CONF_DIR called hadoop-user-functions.sh and give it permissions 0755. Inside, we need to put the proper bang path incantation. Let’s also create a fake hadoop_start_daemon to verify that the replacement works:
#!/usr/bin/env bash
#
#

function hadoop_start_daemon
{
  echo "The power of the elephant compels you!"
  exit 1
}
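If HADOOP_CONF_DIR is set in your environment, setting the mode mentioned above is simply:

chmod 0755 "${HADOOP_CONF_DIR}/hadoop-user-functions.sh"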
Running hdfs namenode shows that it works!
$ hdfs namenode
The power of the elephant compels you!
Hooray! Instead of firing off Java, it printed our message. That worked, but it isn’t very useful yet. Let’s get our hands dirty.
cgexec Setup
In order to use cgexec, we need to have a cgroup configured. Let’s configure a simple one that we can use for HDFS. One thing we can do is prevent those processes from swapping:
cgcreate -t hdfs:hdfs -a hdfs:hdfs -g memory:hdfs
echo 0 > /sys/fs/cgroup/memory/hdfs/memory.swappiness
Now that we have an hdfs cgroup, we have something to use later.
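As a quick sanity check that the cgroup exists and the setting stuck, re-reading the file we just wrote should print 0:

cat /sys/fs/cgroup/memory/hdfs/memory.swappiness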
Temporary Replacement
OK, now that we know we can replace successfully and have a cgroup to use, let’s change the code in hadoop-user-functions.sh to match what ships with Hadoop, since it is a good starting point for our changes. I’m going to strip out the comments to make this snippet smaller. You’ll want to keep them and add to them as we go along. Right?
#!/usr/bin/env bash
#
#

function hadoop_start_daemon
{
  local command=$1
  local class=$2
  local pidfile=$3
  shift 3

  hadoop_debug "Final CLASSPATH: ${CLASSPATH}"
  hadoop_debug "Final HADOOP_OPTS: ${HADOOP_OPTS}"
  hadoop_debug "Final JAVA_HOME: ${JAVA_HOME}"
  hadoop_debug "java: ${JAVA}"
  hadoop_debug "Class name: ${class}"
  hadoop_debug "Command line options: $*"

  echo $$ > "${pidfile}" 2>/dev/null
  if [[ $? -gt 0 ]]; then
    hadoop_error "ERROR: Cannot write ${command} pid ${pidfile}."
  fi

  export CLASSPATH
  exec "${JAVA}" "-Dproc_${command}" ${HADOOP_OPTS} "${class}" "$@"
}
For basic cgexec support, we need to replace that exec line. Let’s do something simple for now so that we know it works:
exec cgexec -g memory:hdfs "${JAVA}" "-Dproc_${command}" ${HADOOP_OPTS} "${class}" "$@"
Running hdfs --daemon start namenode should fire up the NameNode, but in our new cgroup. Let’s verify it. We can use jps to figure out the NameNode’s pid. Using that pid, we can then get the cgroup information from /proc.
$ jps -l | grep -i namenode
16351 org.apache.hadoop.hdfs.server.namenode.NameNode
$ cat /proc/16351/cgroup
12:hugetlb:/
11:net_prio:/
10:perf_event:/
9:net_cls:/
8:freezer:/
7:devices:/
6:memory:/hdfs
5:blkio:/
4:cpuacct:/
3:cpu:/
2:cpuset:/
1:name=systemd:/user/1000.user/7.session
Success! From here we can see that, yes, our NameNode started in the hdfs cgroup!
Real World Replacement
That’s great, but hard-coding the cgroup isn’t particularly interesting. Let’s make this configurable so that we can control cgexec per daemon. Going back to hadoop_start_daemon, we can see that one of the parameters it takes is the command. That’s incredibly useful because we can use it to target specific daemons.
First, let’s add a line that takes the command variable and builds up a new variable for us to use.
function hadoop_start_daemon
{
  local command=$1
  local class=$2
  local pidfile=$3
  shift 3

  local cgvar="HADOOP_${command}_CGEXEC_OPTS"

  hadoop_debug "Final CLASSPATH: ${CLASSPATH}"
Now that we have a variable we can use, let’s see if it’s defined and if so, call cgexec with the parameters that are inside it:
    hadoop_error "ERROR: Cannot write ${command} pid ${pidfile}."
  fi

  export CLASSPATH
  if [[ -n "${!cgvar}" ]]; then
    exec cgexec ${!cgvar} "${JAVA}" "-Dproc_${command}" ${HADOOP_OPTS} "${class}" "$@"
  else
    exec "${JAVA}" "-Dproc_${command}" ${HADOOP_OPTS} "${class}" "$@"
  fi
}
Some explanation might be required around line 5, the new if test. The ${!cgvar} expression is an indirect reference: cgvar holds the name of the variable we actually want to read. Ultimately, we’re checking whether HADOOP_command_CGEXEC_OPTS is defined. If it is, we call our exec cgexec version of the java command. If it isn’t defined, we call the regular exec java like normal.
On line 6, the cgexec line, note that ${!cgvar} isn’t quoted. This allows any spaces in the value to expand into separate parameters. You’ll notice that HADOOP_OPTS is handled the same way.
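If bash indirection and the quoting distinction are new to you, here is a tiny standalone illustration (plain bash, not Hadoop code) of both behaviors:

HADOOP_namenode_CGEXEC_OPTS="-g memory:hdfs"
cgvar="HADOOP_namenode_CGEXEC_OPTS"

# Quoted indirect expansion: a single word, "-g memory:hdfs"
echo "${!cgvar}"

# Unquoted indirect expansion: word-split into "-g" and "memory:hdfs",
# which is exactly what cgexec needs to see as two separate arguments
printf '%s\n' ${!cgvar}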
Now let’s test this out. If we run the NameNode again, it shouldn’t be in a cgroup:
$ hdfs --daemon start namenode
$ jps -l | grep -i namenode
16817 org.apache.hadoop.hdfs.server.namenode.NameNode
$ cat /proc/16817/cgroup
12:hugetlb:/
11:net_prio:/
10:perf_event:/
9:net_cls:/
8:freezer:/
7:devices:/
6:memory:/
5:blkio:/
4:cpuacct:/
3:cpu:/
2:cpuset:/
1:name=systemd:/user/1000.user/7.session
Let’s put it back in our cgroup. In hadoop-env.sh, add this line:
HADOOP_namenode_CGEXEC_OPTS="-g memory:hdfs"
Now restart the NameNode and see what happens:
$ hdfs --daemon stop namenode
$ hdfs --daemon start namenode
$ jps -l | grep -i namenode
16980 org.apache.hadoop.hdfs.server.namenode.NameNode
$ cat /proc/16980/cgroup
12:hugetlb:/
11:net_prio:/
10:perf_event:/
9:net_cls:/
8:freezer:/
7:devices:/
6:memory:/hdfs
5:blkio:/
4:cpuacct:/
3:cpu:/
2:cpuset:/
1:name=systemd:/user/1000.user/7.session
Awesome! Since no other HADOOP_command_CGEXEC_OPTS variables are defined, other daemons won’t be affected. We can verify this by starting up another daemon:
$ yarn --daemon start nodemanager
$ jps -l | grep -i nodemanager
17181 org.apache.hadoop.yarn.server.nodemanager.NodeManager
$ cat /proc/17181/cgroup
12:hugetlb:/
11:net_prio:/
10:perf_event:/
9:net_cls:/
8:freezer:/
7:devices:/
6:memory:/
5:blkio:/
4:cpuacct:/
3:cpu:/
2:cpuset:/
1:name=systemd:/user/1000.user/7.session
As we can see, the memory line in the cgroup output is empty; the NodeManager is not in the hdfs cgroup. Adding a HADOOP_nodemanager_CGEXEC_OPTS variable with appropriate settings for YARN would work as expected: that daemon would get run with cgexec with the contents of that variable as the parameters.
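For example, assuming you had created a separate memory:yarn cgroup the same way we created memory:hdfs (the cgroup name here is my own choice, not anything Hadoop defines), the hadoop-env.sh entry would be:

# hypothetical cgroup; create it with cgcreate first, as we did for memory:hdfs
HADOOP_nodemanager_CGEXEC_OPTS="-g memory:yarn"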
Conclusion
It’s easy to see how this functionality can be used to gain a greater degree of control over how daemons in the Apache Hadoop environment run.
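As a closing illustration of the numactl point from the introduction, the same pattern only requires swapping the wrapper on the exec line. A sketch, assuming a HADOOP_${command}_NUMACTL_OPTS naming convention of my own invention (not something Hadoop defines), might look like this inside hadoop_start_daemon:

  local numavar="HADOOP_${command}_NUMACTL_OPTS"

  if [[ -n "${!numavar}" ]]; then
    # e.g. HADOOP_namenode_NUMACTL_OPTS="--interleave=all" in hadoop-env.sh
    exec numactl ${!numavar} "${JAVA}" "-Dproc_${command}" ${HADOOP_OPTS} "${class}" "$@"
  else
    exec "${JAVA}" "-Dproc_${command}" ${HADOOP_OPTS} "${class}" "$@"
  fi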