Is Airflow the right choice for your data pipelines?

Monday, June 22nd, 2020

Apache Airflow is a powerful tool for managing data pipelines

Airflow has become the default choice for running data pipelines

Access to reliable, timely data is critical for businesses to operate well, and data pipelines play a central role in delivering it.

Today, engineering teams often turn to Airflow to run their data pipelines. Airflow is a popular, powerful open-source tool with a large community of users available to offer help, tips and support.

But users should be aware of its downsides

There are many “gotchas” in the Airflow experience to be aware of:

  • Set-up: Airflow has many pieces and dependencies to install. The core Airflow package is required, of course, but you’ll need to do some research to figure out which others you need, depending on your requirements. For example:
    • Database or AWS integrations need their own separate packages.
    • User logins need the “password” package (plus bcrypt).
    • Secrets management requires the “crypto” package (and dependencies).
    • You’ll want to use MySQL or Postgres instead of the default SQLite database. This database has to be set up and configured too.
    • Consider using a message queue for a more resilient and scalable Airflow experience. Again, this has to be set up and configured.
    • For email alerts, you’ll need to sign up for an email service like Sendgrid.
    • And of course, all this has to be hosted somewhere. So, you’ll have to set up and configure that. Don't forget to ensure it's all networked appropriately.
  • Maintenance / upgrades: Upgrading any piece of the Airflow system (core packages, dependencies) is time-consuming and has the potential to halt your data pipelines. The underlying infrastructure / hardware needs monitoring too.
  • Learning curve: For those new to Airflow, there’s a lot to learn. Action, sensor, and transfer operators. Hooks. DAGs and SubDAGs. Workers. Schedulers. A Google search for “Airflow tutorial” reveals an extensive list of getting-started guides, and enterprising folks on Udemy have created courses on how to use Airflow.
  • DAG composition: Airflow defines workflows using code-based DAGs. On the one hand, this code-based approach can provide a lot of power. But it can also be hard to visualize, and can make workflow creation slow and painful (see the sketch after this list). With our point & click editor, you can create Workflows in seconds.
  • Support: Airflow has a vibrant community of users who provide help and support. But there is no accountability for fixing bugs or introducing new features.
  • Capacity: Airflow instances must be provisioned with enough capacity for peak processing requirements. This leads to cycles of capacity monitoring and tweaking, as well as paying for excess capacity during quieter periods.
  • Dependency management: Airflow workers must install all dependencies required across all tasks. This can cause version conflicts when different tasks need different versions of the same library.
  • Reliability and scalability: The Airflow scheduler is a single point of failure. It can and does fail, and it's a difficult issue to get around (http://site.clairvoyantsoft.com/making-apache-airflow-highly-available/).
Learning curve: a lot has been written about how to get started with Airflow.
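To make the DAG composition and learning curve points concrete, here is a minimal sketch of a two-step pipeline defined as an Airflow DAG. The DAG id, task ids, schedule and file path are illustrative only, and the import paths assume the Airflow 1.x package layout that is current at the time of writing.

```python
# A minimal sketch of a two-step Airflow pipeline (Airflow 1.x import paths).
# The DAG id, task ids, schedule and command below are purely illustrative.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def transform_and_load():
    # Placeholder for your transformation / load logic.
    print("transforming and loading extracted data")


default_args = {
    "owner": "data-eng",
    "retries": 2,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="example_pipeline",
    default_args=default_args,
    start_date=datetime(2020, 6, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="python /opt/pipeline/extract.py",
    )
    load = PythonOperator(
        task_id="transform_and_load",
        python_callable=transform_and_load,
    )

    extract >> load  # run "extract" before "transform_and_load"
```

Even this toy example assumes you already understand operators, default arguments, scheduling semantics and the dependency syntax, and there is no visual way to compose or review it.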

CloudReactor offers Airflow-like capabilities

CloudReactor makes it much easier for engineers to deploy, orchestrate, monitor and manage data pipelines in the cloud.

  • Deploy: Deploy code from local development environments to AWS ECS with a single command. All dependencies are copied from the local environment into the cloud.
  • Orchestrate: Link Tasks together to create sophisticated Workflows (pipelines). This takes seconds thanks to an easy-to-use, point & click Workflow editor.
  • Monitor: See the current status of all Tasks. Has anything failed — and if so, why? View any Task or Workflow’s historical run data (run durations, start / end times, exit codes & status, etc.). Set automated failure alerts.
  • Manage: Start, stop, retry, or schedule any Task or Workflow via an easy-to-use dashboard or API.
CloudReactor lets you track and manage Tasks and Workflows easily

CloudReactor is easy to set up and requires no maintenance

With CloudReactor, engineers free themselves from the burden of set-up, ongoing maintenance, the learning curve and other issues that Airflow presents.

  • Zero maintenance and DevOps: Our web-based dashboard is hosted, and your pipeline is deployed to AWS ECS Fargate, i.e. it runs serverless. So there are no servers to set up, monitor or maintain. You don’t need to worry about capacity, as everything is run and billed on demand. And you get upgrades to our service with zero downtime, as you would with any SaaS product.
  • No need to learn anything new: Write code in your language of choice. Code and dependencies are wrapped up into a Docker container and deployed to ECS (see the sketch after this list). No need to learn about operators, hooks, or anything else you don’t already know. Each container contains only the dependencies its own Tasks require, avoiding version conflicts.
  • Build Workflows in seconds: Workflows can be created in seconds thanks to an intuitive point-and-click UI.
  • Support: We provide full support for our service. We're excited to hear from customers about how we can help make their lives easier.
Link Tasks into a Workflow in seconds with our intuitive point-and-click editor
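To illustrate the contrast with the DAG sketch above, here is a hypothetical stand-alone Python script that could run as a CloudReactor Task. It has no workflow-framework imports, and its one third-party dependency (requests, used here purely as an example) would be installed in this Task’s own container image, so it cannot conflict with libraries used by other Tasks. Scheduling, retries and failure alerts are handled by the service rather than by code in the script.

```python
# A hypothetical stand-alone Task: plain Python, no workflow-framework imports.
# Its only third-party dependency (requests) lives in this Task's own container
# image, so it cannot conflict with libraries used by other Tasks.
import sys

import requests


def main() -> int:
    # Fetch a source dataset (illustrative URL only).
    response = requests.get("https://example.com/api/daily-export", timeout=30)
    response.raise_for_status()

    records = response.json()
    print(f"Fetched {len(records)} records")

    # ... transform and load the records here ...

    return 0


if __name__ == "__main__":
    sys.exit(main())
```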

Sound interesting?

CloudReactor offers engineering teams a compelling blend of power and simplicity. Users can deploy their first Tasks and Workflows from scratch within 30 minutes.

Get started with CloudReactor today, with 25 Tasks managed for free.

Create free account