Image by Cory Denton
So you started building some batch processing jobs for Hadoop and Spark and now you need to run them every day. No problem, just use Cron right?
It seems like an innocent enough choice (this is exactly what Cron is designed for after all), but it could land you in a world of trouble. Let me try and persuade you to invest a bit more time upfront in a real workflow management solution.
Reason 1: There are no protections for running multiple versions of the same job
There are plenty of ways this can happen:
- You could screw up your crontab entry (which is easy to do), and have jobs run every hour instead of every day
- Your jobs might run longer than you think, and so they could easily overlap one another
- You might go to run a job manually when there’s a version already started by Cron (or vice versa)
Reason 2: Chaining multiple jobs together is messy
A pretty common scenario is to have a workflow of multiple jobs with dependencies, so Job B needs Job A to finish before running. Without building a bunch of custom logic into your application code most folks will do something like this with Cron:
I’m guilty! I’ve done this!
Chaining jobs with Cron is fine when everything works, but when things start to go wrong this can cause serious issues - missing data, failing jobs, erroneous messages, accidentally deleted data. And things WILL go wrong - accidentally break Job A or increase the amount of work that it does then every other job in the chain will suffer, but Cron doesn’t know any of that.
Problems get worse as you keep adding job after job to the crontab in an ever increasing stack of loosely dependent software.
Reason 3: Poor transparency for teammates
Which jobs are running right now? Which are going to run today? How long do these jobs take? How do I schedule my job? What machine should I schedule it on? These are all questions that are impossible to answer without building custom orchestration around your Cron process - time you’d be better off spending on building a better system.
Reason 4: Error recovery is not really a thing
In the above example of Job A and Job B, what happens if Job A fails? Nothing will re-run it or try to clean up any mess it left behind. Forget about automatically reverting to a historical dataset or deploying an older version of the job code. Well you can do all these things, if you build the capabilities yourself :-).
Reason 5: Poor monitoring capabilities
Like error recovery monitoring, Cron jobs require you to set up your own monitoring and alerting systems. You’ll have to work out who to alert and when, and how to detect partial failures in multiple parts of your workflows (what if 3 jobs partially fail - what effect did that have on the rest?)
Reason 6: One machine to rule them all
By default you’d likely deploy your batch Cron jobs to a single machine, but if that machine suffers a failure, then what? You’ll be dealing with all the problems outlined above plus the need to spin up a new machine and redeploy all the job code.
This also poses a memory and CPU problem. If your jobs are CPU or memory intensive this box will have to be big enough to accomodate all of them simultaneously. Alternatively you could have a system to manage many individual crontabs across several machines, but that just exacerbates the transparency and debugging issues I’ve already mentioned.
You are not the first person to run into this issue. Ideally you need a tool that can schedule complex jobs with interwoven dependencies, error recovery needs, monitoring, and a central view of what just happened, is happening, and will happen across the system. You’ll also want to be easily able to integrate with Hadoop, Spark, RDBMS systems, and external services.
Typical solutions that offer these capabilities are called workflow management systems, ETL management systems, pipeline management systems, or even batch job management systems. Many of them come with direct support for Hadoop and many are specially built for Hadoop-based workflows.
One word of warning: DO NOT TRY TO BUILD YOUR OWN WORKFLOW SYSTEM. They seem simple in concept, but before you know it you’ll be shaving a yak. Don’t say I didn’t warn you.
Quality Workflow Systems
Luigi by Spotify is one of my favorite options. It is a lightweight Python framework for building batch workflows. It works great with Hadoop and allows you to use real Python so you can do cool things like generate jobs or do some simple post-processing. Importantly it has been battle-tested in production by a number of companies. When I worked at Foursquare we would use Luigi to run 1000+ jobs on a daily basis.
If you’re feeling in an XML mood you could check out Apache Oozie. It is very powerful and has been embraced and battle-tested by the likes of Yahoo, Cloudera, and Hortonworks. That said I found the interface very confusing and hard to get started with and it has some edge cases that make jobs hard to debug.
Azkaban was built by LinkedIn to manage their Hadoop workflows. It has a simple file-based job specification and a distributed execution architecture. Again like the other two tools it has been tested in production by big organizations (mostly LinkedIn) and is designed to work well with all versions of Hadoop.
Chronos was built by AirBnB as a Cron replacement (!), it runs on Mesos which is a little non-standard for the Hadoop ecosystem, but it has a gorgeous UI and provides workflow dependency management capabilities. That said you have to build and run all jobs from the web UI or REST API. AirBnB also made AirFlow, and while Chronos seems like Cron+, AirFlow is more similar to Oozie or Luigi.
Build one yourself. No, stop! It’s a trap!