Apache Airflow (incubating) NL HUG Meetup 2016-07-19

Post on 16-Apr-2017



2

https://wepayinc.app.box.com/s/hf1chwmthuet29ux2a83f5quc8o5q18k

Airflow @ING

3

ING

Multinational banking and financial services corporation headquartered in Amsterdam.

Its primary businesses are retail banking, direct banking, wholesale banking, investment banking, asset management, and insurance services.

4

Why Apache Airflow (incubating)?

• Cron replacement

• Fault tolerant

• No XML (looking at you, Oozie!)

• Testable

• Python code

• Extendable

• Now Apache (incubating)

• Scale out

• Complex dependency rules

• Pools

• CLI & Web UI

5

Growing community

6

Airflow Operational Design

[Diagram components: Airflow Webserver, Database, Airflow Scheduler, Airflow Executor (local/celery/mesos worker), Airflow Tasks, Auth Backend; edge label: "Talks to"]

7

Choose an executor that fits your environment

SequentialExecutor

• Use case: Mainly testing

• Scalability: -na-

• Complexity: Low

• DAG: Local

• Configuration: [core] executor = SequentialExecutor

LocalExecutor

• Use case: Production (~50% of installed base)

• Scalability: Vertical

• Complexity: Medium

• DAG: Local

• Configuration: [core] executor = LocalExecutor, parallelism = 32

CeleryExecutor

• Use case: Production (~50% of installed base)

• Scalability: Horizontal and vertical

• Complexity: Medium/High

• DAG: Needs sync / pickle

• Configuration: [core] executor = CeleryExecutor; [celery] celeryd_concurrency = 32, broker_url = rabbitmq, celery_result_backend, default_queue =

Remark: Don’t use num_runs

8

UTC everywhere

“Engineers here respond in UTC if you ask them what time it is” (Max)

• Airflow assumes every server / worker runs in UTC

• Airflow does not manage time zones (correctly) (to be fixed)

• UTC has no daylight saving time
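A small sketch of why running everything in UTC matters, using only the Python standard library. The fixed offsets standing in for Amsterdam summer and winter time, and the 08:00 job time, are illustrative assumptions and do not come from the talk.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets standing in for Amsterdam local time (assumption for illustration).
CEST = timezone(timedelta(hours=2), "CEST")  # summer time, UTC+2
CET = timezone(timedelta(hours=1), "CET")    # winter time, UTC+1


def to_utc(local_dt):
    """Convert a timezone-aware datetime to UTC."""
    return local_dt.astimezone(timezone.utc)


# The same local wall-clock time maps to different UTC hours across the year.
summer = to_utc(datetime(2016, 7, 19, 8, 0, tzinfo=CEST))
winter = to_utc(datetime(2016, 12, 19, 8, 0, tzinfo=CET))

print(summer.hour, winter.hour)
```

A schedule expressed in UTC stays fixed all year, so a job meant to follow local time drifts by an hour when the clocks change.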

9

Tasks run at the end of the period not at the start

• First run will be at 2016-06-01 22:00 UTC

• Execution date will be 2016-06-01 21:00 UTC
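The relationship between the two timestamps above can be sketched in plain Python. The hourly interval and the 21:00 start date are assumptions inferred from the two values on the slide; the helper `run_time` is hypothetical and not part of Airflow.

```python
from datetime import datetime, timedelta


def run_time(execution_date, interval):
    """A run covers [execution_date, execution_date + interval) and starts when that period ends."""
    return execution_date + interval


start_date = datetime(2016, 6, 1, 21, 0)  # assumed start_date, in UTC
interval = timedelta(hours=1)             # assumed hourly schedule

# The first execution date equals the start date; the run fires one interval later.
first_execution_date = start_date
first_run = run_time(first_execution_date, interval)

print(first_execution_date, "->", first_run)
```

The execution date labels the start of the period being processed, which is why it lags the wall-clock time of the run by one interval.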

10

How to stop/kill a task?

11

How to force running a task?

Celery only (for now)

12

“An idempotent operation is one that has no additional effect if it is called more than once with the same input parameters.”

Make your tasks and DAGs idempotent

• DAGs and tasks receive an execution date

• on_retry_callback can be used to do a cleanup before a retry
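One way to get idempotent behaviour is to derive the output location from the execution date and overwrite it on every run. The sketch below uses only the standard library; the function `write_partition`, the file layout, and the sample rows are hypothetical and only illustrate the idea.

```python
import os
import tempfile
from datetime import datetime


def write_partition(base_dir, execution_date, rows):
    """Write the rows for one execution date, replacing any earlier output."""
    path = os.path.join(base_dir, execution_date.strftime("%Y-%m-%d") + ".csv")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as handle:
        for row in rows:
            handle.write(row + "\n")
    # Swap the finished file into place: a rerun overwrites instead of appending.
    os.replace(tmp_path, path)
    return path


base_dir = tempfile.mkdtemp()
day = datetime(2016, 6, 1)
first = write_partition(base_dir, day, ["a", "b"])
second = write_partition(base_dir, day, ["a", "b"])  # retry with the same input

with open(second) as handle:
    content = handle.read()
```

Running the task twice for the same execution date leaves exactly one file with the same content, so retries and backfills are safe.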

13

Generate your tasks programmatically

• List file names on HDFS

• Loop over the file names

• Create a task per file

• Assign upstream / downstream
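The loop above can be sketched without an Airflow installation. The `Task` class here is a hypothetical stand-in that only records dependencies; in a real DAG file the loop would create Airflow operators, which expose `set_upstream` and `set_downstream`. The file names are hard-coded assumptions in place of an HDFS listing.

```python
class Task:
    """Hypothetical stand-in for an Airflow operator; it only tracks dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.upstream = []

    def set_upstream(self, other):
        self.upstream.append(other.task_id)


def build_tasks(file_names):
    """Create one processing task per file, all feeding a single final task."""
    done = Task("all_files_done")
    tasks = []
    for name in file_names:
        # Derive a task id from the file name.
        task = Task("process_" + name.replace(".", "_"))
        done.set_upstream(task)
        tasks.append(task)
    return tasks, done


# In a real DAG the names would come from an HDFS listing; here they are hard-coded.
tasks, done = build_tasks(["a.csv", "b.csv"])
print([t.task_id for t in tasks], done.upstream)
```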

14

When using ExternalTaskSensor, make sure to manually raise the priority of the tasks it is waiting for

• Otherwise scheduling can get deadlocked as the sensors take up all the slots in the scheduler

• Another way to circumvent this issue is to have a separate pool for sensors
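A toy model of the deadlock, not Airflow's actual scheduling code: slots go to the queued tasks with the highest `priority_weight` (the name matches the Airflow operator argument; everything else, including `pick_running` and the task names, is hypothetical).

```python
def pick_running(queued, slots):
    """Give the available slots to the queued tasks with the highest priority_weight."""
    # sorted() is stable, so tasks with equal weight keep their queue order.
    ranked = sorted(queued, key=lambda task: task["priority_weight"], reverse=True)
    return [task["name"] for task in ranked[:slots]]


# Two sensors wait for load_data; all tasks share the same priority.
queue = [
    {"name": "sensor_a", "priority_weight": 1},
    {"name": "sensor_b", "priority_weight": 1},
    {"name": "load_data", "priority_weight": 1},
]
deadlocked = pick_running(queue, slots=2)

# Raising the priority of the awaited task lets it claim a slot first.
queue[2]["priority_weight"] = 10
fixed = pick_running(queue, slots=2)

print(deadlocked, fixed)
```

In the first case both slots are held by sensors that can never succeed, because the task they wait for never gets to run.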

15

Some last bits

• Do you have longer-running tasks? Increase the scheduler heartbeat interval to decrease load

• Smaller tasks make for easier debugging and retrying

• Choose your start date properly: the scheduler will fill gaps

• Changing the schedule requires changing the dag_id

• Backfills add runs for periods the scheduler has already passed
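To see why the start date matters, this sketch lists every execution date whose period has already ended, which is roughly the set of runs a scheduler that fills gaps would create. The function `due_execution_dates` and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def due_execution_dates(start_date, interval, now):
    """List execution dates whose period has fully ended by `now`."""
    dates = []
    current = start_date
    while current + interval <= now:
        dates.append(current)
        current += interval
    return dates


dates = due_execution_dates(
    start_date=datetime(2016, 7, 1),
    interval=timedelta(days=1),
    now=datetime(2016, 7, 4, 12, 0),
)
print(len(dates), dates[0], dates[-1])
```

A start date far in the past with a short interval produces a long list, so every one of those runs would be scheduled when the DAG is first enabled.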

16

Use case

[Diagram: data sources Transactions, Risk, Products, External; components HDFS, Spark, Tez, Postgres, Flume, XFB, Sqoop]

17

Wait for files to arrive (Sensor)

18

Copy & clean up

19

Model creation

• Run Spark

• Tez

Sharding

20

Sqooping to DB

21

Draft Roadmap

• Apache Release

• Allow auto aligned start_date

• Backfills to use Dag Runs

• Improve pooling

• DAG Parsing Isolation

• Rest API

• Further Kerberos Integration

• Schedule Backfill Dag Runs

• Isolation

• DAG syncing across workers

• No direct imports for operators from __init__

• Event Driven Scheduler

• Make tasks not need the database

• Roles / principals

[Four items on the slide are labeled "In progress"]

22

Aspiring committer? Contributor? User?

http://gitter.im/apache/incubator-airflow/

https://github.com/apache/incubator-airflow/

http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/

23