Apache Airflow (incubating) NL HUG Meetup 2016-07-19

https://wepayinc.app.box.com/s/hf1chwmthuet29ux2a83f5quc8o5q18k

Airflow @ING

ING

Multinational banking and financial services corporation headquartered in Amsterdam.

Its primary businesses are retail banking, direct banking, wholesale banking, investment banking, asset management, and insurance services.

Why Apache Airflow (incubating)?

• Cron replacement
• Fault tolerant
• No XML (looking at you, Oozie!)
• Testable
• Python code
• Extendable
• Now Apache (incubating)
• Scale out
• Complex dependency rules
• Pools
• CLI & Web UI

Growing community

Airflow Operational Design

(Diagram; the components below are linked by arrows labelled "Talks to")

• Airflow Webserver
• Auth Backend
• Database
• Airflow Scheduler
• Airflow Executor (local / celery / mesos worker)
• Airflow Tasks

Choose an executor that fits your environment

SequentialExecutor
• Use case: mainly testing
• Scalability: n/a
• Complexity: low
• DAG: local
• Configuration: [core] executor = SequentialExecutor

LocalExecutor
• Use case: production (~50% of installed base)
• Scalability: vertical
• Complexity: medium
• DAG: local
• Configuration: [core] executor = LocalExecutor, parallelism = 32

CeleryExecutor
• Use case: production (~50% of installed base)
• Scalability: horizontal and vertical
• Complexity: medium/high
• DAG: needs sync / pickle
• Configuration: [core] executor = CeleryExecutor; [celery] celeryd_concurrency = 32, broker_url = rabbitmq, celery_result_backend, default_queue =

Remark: Don’t use num_runs
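The executor settings on this slide live in airflow.cfg, an INI-style file. As a minimal sketch (not from the original slides), Python's standard configparser can read such a fragment. The sample text below only reuses the LocalExecutor values from the slide, and the helper name read_executor_settings is a hypothetical one chosen for illustration.

```python
import configparser

# Sample fragment mirroring the LocalExecutor settings on the slide.
SAMPLE_CFG = """
[core]
executor = LocalExecutor
parallelism = 32
"""


def read_executor_settings(cfg_text):
    """Return (executor name, parallelism) from airflow.cfg-style text."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    executor = parser.get("core", "executor")
    parallelism = parser.getint("core", "parallelism")
    return executor, parallelism


print(read_executor_settings(SAMPLE_CFG))
```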

UTC everywhere

"Engineers here respond in UTC if you ask them what time it is." (Max)

• Airflow assumes every server / worker runs in UTC
• Airflow does not manage time zones correctly (to be fixed)
• UTC has no daylight saving time

Tasks run at the end of the period, not at the start

• First run will be at 2016-06-01 22:00 UTC
• Execution date will be 2016-06-01 21:00 UTC
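The relation between execution date and actual run time can be sketched with plain datetime arithmetic. This assumes an hourly schedule and a start date of 2016-06-01 21:00 UTC, which is consistent with the two times on the slide but is not stated in it. The function first_run is a hypothetical helper, not an Airflow API.

```python
from datetime import datetime, timedelta


def first_run(start_date, interval):
    """Return (execution_date, time at which the run is actually triggered).

    The execution date marks the start of the period; the run is triggered
    once that period has ended.
    """
    execution_date = start_date
    run_time = execution_date + interval
    return execution_date, run_time


start = datetime(2016, 6, 1, 21, 0)  # assumed start date (UTC)
hourly = timedelta(hours=1)          # assumed schedule interval
execution_date, run_time = first_run(start, hourly)
print(execution_date, run_time)
```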

How to stop/kill a task?

How to force running a task?

Celery only (for now)

Make your tasks and DAGs idempotent

“An idempotent operation is one that has no additional effect if it is called more than once with the same input parameters.”

• DAGs and tasks receive an execution date
• on_retry_callback can be used to clean up before a retry
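One way to get idempotency is to derive the output location from the execution date and replace it on every run. The sketch below uses plain Python and a temporary directory instead of Airflow and HDFS. The function names and the directory layout are assumptions for illustration; cleanup shows the kind of logic that could be called from an on_retry_callback.

```python
import os
import shutil
import tempfile
from datetime import datetime


def output_dir(base_dir, execution_date):
    """Directory that holds the output of exactly one schedule period."""
    return os.path.join(base_dir, execution_date.strftime("%Y-%m-%dT%H"))


def cleanup(base_dir, execution_date):
    """Remove (partial) output of a period; safe to call when nothing exists."""
    shutil.rmtree(output_dir(base_dir, execution_date), ignore_errors=True)


def run_task(base_dir, execution_date, rows):
    """Write rows for one period, replacing any earlier output for it."""
    cleanup(base_dir, execution_date)
    target = output_dir(base_dir, execution_date)
    os.makedirs(target)
    with open(os.path.join(target, "part-0.txt"), "w") as handle:
        handle.write("\n".join(rows))
    return target


base = tempfile.mkdtemp()
when = datetime(2016, 6, 1, 21)
run_task(base, when, ["a", "b"])
run_task(base, when, ["a", "b"])  # rerun: same single file, same content
print(os.listdir(output_dir(base, when)))
```

Running the task twice for the same execution date leaves the same result as running it once, which is the property the slide asks for.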

Generate your tasks programmatically

• List file names on HDFS
• Loop over the file names
• Create a task per file
• Assign upstream / downstream
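The pattern on this slide can be sketched as follows. To stay runnable without Airflow, the block uses a small stand-in Task class that only records upstream task ids; in a real DAG file the loop would create Airflow operators instead and wire them with their set_upstream / set_downstream methods. The file names are hard-coded here as an assumption, where the slide lists them from HDFS, and build_tasks and the start / done tasks are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """Stand-in for an operator; only tracks the ids of upstream tasks."""
    task_id: str
    upstream: list = field(default_factory=list)

    def set_upstream(self, other):
        self.upstream.append(other.task_id)


def build_tasks(file_names):
    """Create one task per file, fanned out from 'start' and into 'done'."""
    start = Task("start")
    done = Task("done")
    tasks = {"start": start, "done": done}
    for name in file_names:
        task_id = "process_" + name.replace(".", "_")
        task = Task(task_id)
        task.set_upstream(start)   # start >> process_x
        done.set_upstream(task)    # process_x >> done
        tasks[task_id] = task
    return tasks


# Assumed listing result; the slide obtains this from HDFS.
tasks = build_tasks(["a.csv", "b.csv"])
for task in tasks.values():
    print(task.task_id, "<-", task.upstream)
```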

When using ExternalTaskSensor, make sure to manually raise the priority of the tasks it is waiting for

• Otherwise scheduling can deadlock, as the sensors take up all the slots in the scheduler
• Another way to circumvent this issue is to use a separate pool for sensors
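The deadlock described on this slide can be illustrated with a toy model. This is not Airflow's scheduler: simulate, the task dictionaries, and the tick loop are all simplifications assumed for the example. Tasks are started in priority order until the slots are full, and a sensor keeps its slot until the task it waits for has finished.

```python
def simulate(slots, tasks, max_ticks=10):
    """Return the names of the tasks that finish within max_ticks.

    Each task is a dict with 'name', 'priority' and 'waits_for'
    (the name of another task, or None for ordinary work).
    """
    queued = sorted(tasks, key=lambda t: -t["priority"])
    running, finished = [], set()
    for _ in range(max_ticks):
        # Fill free slots, highest priority first.
        while queued and len(running) < slots:
            running.append(queued.pop(0))
        still_running = []
        for task in running:
            target = task["waits_for"]
            if target is None or target in finished:
                finished.add(task["name"])
            else:
                still_running.append(task)  # sensor keeps occupying a slot
        running = still_running
    return finished


def make_tasks(load_priority):
    return [
        {"name": "sensor_a", "priority": 1, "waits_for": "load"},
        {"name": "sensor_b", "priority": 1, "waits_for": "load"},
        {"name": "load", "priority": load_priority, "waits_for": None},
    ]


stuck = simulate(slots=2, tasks=make_tasks(load_priority=1))
ok = simulate(slots=2, tasks=make_tasks(load_priority=10))
print(sorted(stuck), sorted(ok))
```

With equal priorities the two sensors occupy both slots and the task they wait for never starts. Raising its priority lets it run first, after which the sensors complete.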

Some last bits

• Do you have longer running tasks? Increase the scheduler heartbeat interval to decrease load
• Smaller tasks make for easier debugging and retrying
• Choose your start date carefully: the scheduler will fill gaps
• Changing the schedule requires changing the dag_id
• Backfills add runs for periods the scheduler has already passed
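Why the start date matters can be sketched with a small calculation: every period between the start date and now that has fully elapsed gets a run. The helper below is a simplified model, assuming a fixed interval and no runs recorded yet; pending_execution_dates is a hypothetical name, not an Airflow function.

```python
from datetime import datetime, timedelta


def pending_execution_dates(start_date, interval, now):
    """List execution dates whose period has fully elapsed by `now`."""
    dates = []
    current = start_date
    while current + interval <= now:
        dates.append(current)
        current += interval
    return dates


runs = pending_execution_dates(
    start_date=datetime(2016, 7, 1),
    interval=timedelta(days=1),
    now=datetime(2016, 7, 4, 12, 0),
)
print(runs)
```

A start date far in the past therefore produces a correspondingly long list of runs to catch up on.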

Use case

(Architecture diagram; labels: Transactions, Risk, Products, External; HDFS, SPARK, TEZ, POSTGRES, FLUME, XFB, SQOOP, SQOOP)

Wait for files to arrive (Sensor)

Copy & clean up

Model creation

• Run Spark
• Tez

Sharding

Sqooping to DB

Draft Roadmap

• Apache release
• Allow auto aligned start_date
• Backfills to use Dag Runs
• Improve pooling
• DAG parsing isolation
• REST API
• Further Kerberos integration
• Schedule backfill Dag Runs
• Isolation
• DAG syncing across workers
• No direct imports for operators from __init__
• Event driven scheduler
• Make tasks not need the database
• Roles / principals

(Four items on the slide are marked "In progress".)

Aspiring committer? Contributor? User?

http://gitter.im/apache/incubator-airflow/

https://github.com/apache/incubator-airflow/

http://mail-archives.apache.org/mod_mbox/incubator-airflow-dev/
