There’s been a lot of chatter in the Apache Hadoop universe lately over the role of the HDFS audit log. This includes a lot of supposition on the purpose of the original design.
This is my recollection of the events that led to its creation. Unfortunately, humans do not come with ECC RAM, so apologies for any mistakes. I am going to leave out names to protect the guilty.
Backstory
Back in the summer/fall of 2007, the Grid Operations team at Yahoo! was starting down a path of upgrading clusters on a weekly basis. The clusters were running some cut-from-trunk version of 0.13, then ever newer builds. Upgrading a cluster back then was extremely painful. It would take around 4-5 hours, with most of that time spent praying that the NameNode wouldn’t tip over. In those earlier versions of Hadoop, certain operations (such as block report processing and block conversions) took significant amounts of time, something people today don’t even really consider. The fewer blocks on the system, the faster the cluster would come back up. We were always under a space crunch, even with 1500-2000 node clusters. That left us with only one option: find and delete as much useless data as we could before doing an upgrade.
But we faced an interesting problem. Being conscientious system administrators, we didn’t want to drop data that was still valuable, so it would have been good to know whether the data someone had dumped in some random directory was actually being used and, in particular, when a user had last referenced a given file. This is generally called the ‘access time’, aka ‘atime’.
Then, as now, memory usage on the NameNode was a critical metric. The development team was extremely focused on shrinking the amount of Java heap space used as much as they possibly could. So when your operations team asks for something as memory- and computationally expensive as atime… well, the reaction is obvious: a quick and swift no.
Additionally, we didn’t yet have quotas or other fancy things like owners on files. Data would change magically, files would appear out of nowhere, and other bizarre things would happen, and we had no idea what was going on. It wasn’t uncommon for a user to say “I had this file and now it’s gone! HDFS must have a bug and deleted it!” Our answer usually resembled this:
Persistence, however, is not just a virtue of data.
Several months later (and after countless requests for help in these particular areas), an interesting question was asked: do we know how much of the data we have stored in HDFS is getting read, and with what frequency?
“No. If only we had some way to see the last time a file was accessed…”
The Plan
Circa 0.17, the HDFS team decided that instead of giving us atime, we’d get an access log… an audit log of all the times a file system object (directory or file) was manipulated in certain ways. This would give us data for both the positive and negative deletion problems as well as other insight into usage patterns. After a bit of brainstorming, we keyed in on the following properties for said log:
- Easily machine parse-able by anything (especially scripting languages)
- Only read, create, and related operations that would directly impact the accessibility of the file (chown, delete, etc.) would get logged
- Log only successes, since MapReduce (and users) will try to do a lot of things that legitimately fail; this also reduced the size of the log and the delays it caused on the NameNode
- grep-able for quick command line answers
It’s also important to remember that, back then, Apache Hadoop had zero security. It had progressed to the point that objects had permissions and owners and whatnot, but the system was trivially fooled (e.g., -Dhadoop.job.ugi=hdfs,hdfs, a door that was closed and then later re-opened via HADOOP_USER_NAME). This made security a secondary consideration. We still logged the who to give us some coverage, but it wouldn’t stand up in a court of law if it came to that.
Inspiration
One of the things we did a lot of in the early days was look for inspiration at what other teams were doing that was similar to what we wanted to accomplish. In this case: what other file systems (particularly distributed ones) did this type of logging? Considering there were a lot of ex-Sun Microsystems people around, the answer naturally swung around to BSM (Sun’s Basic Security Module) and, in particular, how NFS implemented its BSM support.
Even though we used that for inspiration, we didn’t end up with a Java implementation of BSM. But we did do one thing that gives away its heritage: it’s fixed-field… just as if you had printed out a C structure. Given the formatting, many people assume it is a key=value format, but that’s not true… every field is always present in a fixed order; it’s just rare that one of the values is empty. This was to make it easy to parse even for junior-level system administrators without a lot of in-depth programming experience. After all, they are likely the ones who are going to be working with this file more than anyone else.
The Result
The format that was eventually picked looked like this:
public static final String AUDIT_FORMAT =
    "ugi=%s\t" +   // ugi
    "ip=%s\t" +    // remote IP
    "cmd=%s\t" +   // command
    "src=%s\t" +   // src path
    "dst=%s\t" +   // dst path (optional)
    "perm=%s";     // permissions (optional)
Introduced officially in Apache Hadoop 0.18, this log provided answers to a lot of the questions we had about what was happening on the file system. It was also easy to parse and could be searched from the command line when we needed quick answers (e.g., did we just delete all of /user?).
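To give a sense of how little code that parsing takes, here’s a minimal sketch of my own (not anything shipped with Hadoop) that splits one such entry apart; the sample line and its values are hypothetical, but the tab-separated fixed fields follow the format string above:

import java.util.LinkedHashMap;
import java.util.Map;

public class AuditLineSketch {
  public static void main(String[] args) {
    // Hypothetical audit entry; real entries follow the same tab-separated fixed fields.
    String line = "ugi=alice\tip=/10.10.10.10\tcmd=delete\tsrc=/user/alice/tmp\tdst=null\tperm=null";

    Map<String, String> fields = new LinkedHashMap<>();
    for (String field : line.split("\t")) {
      // Every field is present in a fixed order; split each one on its first '='.
      int eq = field.indexOf('=');
      fields.put(field.substring(0, eq), field.substring(eq + 1));
    }

    // Prints e.g. "delete /user/alice/tmp"
    System.out.println(fields.get("cmd") + " " + fields.get("src"));
  }
}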
Aftermath
Eight years after its initial commit, the log file has undergone changes. Some good, some bad, and almost all of them mistimed. But in the end, this log file has turned into what may be the single most important log in Apache Hadoop. And for that, I’m proud to have been a small part of it.
Hmm, did you guys bring the term “Grid operations” (and hence Grid dev) from Yahoo to LinkedIn?
Yes. I actually cover the “why grid computing” in the “Deploying Grid Services” talk. I’m not sure who came up with the original name. It might have been Marco.