Runtime Task Scheduling using Imitation Learning for ... - arXiv

1

Runtime Task Scheduling using Imitation Learningfor Heterogeneous Many-Core Systems

Anish Krishnakumar, Samet E. Arda, A. Alper Goksoy, Sumit K. Mandal,Umit Y. Ogras, Anderson L. Sartor, Radu Marculescu

Abstract— Domain-specific systems-on-chip, a class ofheterogeneous many-core systems, are recognized as a keyapproach to narrow down the performance and energy-efficiencygap between custom hardware accelerators and programmableprocessors. Reaching the full potential of these architecturesdepends critically on optimally scheduling the applicationsto available resources at runtime. Existing optimization-basedtechniques cannot achieve this objective at runtime due tothe combinatorial nature of the task scheduling problem. Asthe main theoretical contribution, this paper poses schedulingas a classification problem and proposes a hierarchicalimitation learning (IL)-based scheduler that learns from anOracle to maximize the performance of multiple domain-specific applications. Extensive evaluations with six streamingapplications from wireless communications and radar domainsshow that the proposed IL-based scheduler approximatesan offline Oracle policy with more than 99% accuracyfor performance- and energy-based optimization objectives.Furthermore, it achieves almost identical performance to theOracle with a low runtime overhead and successfully adapts tonew applications, many-core system configurations, and runtimevariations in application characteristics.

Index Terms—Imitation learning, domain-specific SoC,scheduling, heterogeneous computing, many-core architectures.

Manuscript received April 17, 2020; revised June 12, 2020; accepted July6, 2020. Date of current version July 16, 2020. This article was presentedin the International Conference on Hardware/Software Codesign and SystemSynthesis 2020 and appears as part of the ESWEEK-TCAD special issue.(Corresponding author: Anish Krishnakumar.)

This material is based on research sponsored by Air Force ResearchLaboratory (AFRL) and Defense Advanced Research Projects Agency(DARPA) under agreement number FA8650-18-2-7860. The U.S. Governmentis authorized to reproduce and distribute reprints for Governmental purposesnotwithstanding any copyright notation thereon. The views and conclusioncontained herein are those of the authors and should not be interpreted asnecessarily representing the official policies or endorsements, either expressedor implied, of Air Force Research Laboratory (AFRL) and Defense AdvancedResearch Projects Agency (DARPA) or the U.S. Government.

A. Krishnakumar, S. E. Arda, A. A. Goksoy, S.K. Mandal and U. Y. Ograsare with the School of Electrical, Computer and Energy Engineering, ArizonaState University, Tempe, AZ 85287 USA.E-mail: {anishnk, sarda1, aagoksoy, skmandal, umit}@asu.edu

A. L. Sartor is with the Electrical and Computer Engineering Dept.,Carnegie Mellon University, Pittsburgh, PA 15213 USA.E-mail: {asartor}@cmu.edu

R. Marculescu is with the Electrical and Computer Engineering Dept., TheUniversity of Texas at Austin, Austin, TX 78712 USA.E-mail: {radum}@utexas.edu

Digital Object Identifier 10.1109/TCAD.2020.3012861© 2020 IEEE. Personal use of this material is permitted. Permission fromIEEE must be obtained for all other uses, in any current or future media,including reprinting/republishing this material for advertising or promotionalpurposes, creating new collective works, for resale or redistribution to serversor lists, or reuse of any copyrighted component of this work in other works.

I. INTRODUCTION

HOMOGENEOUS multi-core architectures havesuccessfully exploited thread- and data-level parallelism

to achieve performance and energy efficiency beyond thelimits of single-core processors. While general-purposecomputing achieves programming flexibility, it suffers fromsignificant performance and energy efficiency gap whencompared to special-purpose solutions. Domain-specificarchitectures, such as graphics processing units (GPUs) andneural network processors, are recognized as some of themost promising solutions to reduce this gap [13]. Domain-specific systems-on-chip (DSSoCs), a concrete instance ofthis new architecture, judiciously combine general-purposecores, special-purpose processors, and hardware accelerators.DSSoCs approach the efficacy of fixed-function solutions fora specific domain while maintaining programming flexibilityfor other domains [11].The success of DSSoCs depends critically on satisfying

two intertwined requirements. First, the available processingelements (PEs) must be utilized optimally, at runtime, toexecute the incoming tasks. For instance, scheduling all tasksto general-purpose cores may work, but diminishes the benefitsof the special-purpose PEs. Likewise, a static task-to-PEmapping could unnecessarily stall the parallel instances ofthe same task. Second, acceleration of the domain-specificapplications needs to be oblivious to the application developersto make DSSoCs practical. This paper addresses these tworequirements simultaneously.The task scheduling problem involves assigning tasks to

processing elements and ordering their execution to achievethe optimization goals, e.g., minimizing execution time, powerdissipation, or energy consumption. To this end, applicationsare abstracted using mathematical models, such as directedacyclic graph (DAG) and synchronous data graphs (SDG)that capture both the attributes of individual tasks (e.g.,expected execution time) and the dependencies among thetasks [6, 28, 33]. Scheduling these tasks to the available PEsis a well-known NP-complete problem [9, 34]. An optimalstatic schedule can be found for small problem sizes usingoptimization techniques, such as mixed-integer programming(MIP) [10] and constraint programming (CP) [27]. Theseapproaches are not applicable to runtime scheduling for twofundamental reasons. First, statically computed schedules loserelevance in a dynamic environment where tasks from multipleapplications stream in parallel, and PE utilizations changedynamically. Second, the execution time of these algorithms,

arX

iv:2

007.

0936

1v2

[cs

.AR

] 6

Aug

202

0

2

hence their overhead, can be prohibitive even for smallproblem sizes with few tens of tasks. Therefore, a variety ofheuristic schedulers, such as shortest job first (SJF) [35] andcomplete fair schedulers (CFS) [24], are used in practice forhomogeneous systems. These algorithms trade off the qualityof scheduling decisions and computational overhead.

To improve this state of affairs, this paper addressesthe following challenging proposition: Can we achieve ascheduler performance close to that of optimal MIP and CPschedulers, while using minimal runtime overhead comparedto commonly used heuristics? Furthermore, we investigatethis problem in the context of heterogeneous PEs. We notethat much of the scheduling in heterogeneous many-coresystems is tuned manually, even to date [3]. For example,OpenCL, a widely-used programming model for heterogeneouscores, leaves the scheduling problem to the programmers.Experts manually optimize the task to resource mappingbased on their knowledge of application(s), characteristics ofthe heterogeneous clusters, data transfer costs, and platformarchitecture. However, manual optimization suffers fromscalability for two reasons. First, optimizations do not scalewell for all applications. Second, extensive engineering effortsare required to adapt the solutions to different platformarchitectures and varying levels of concurrency in applications.Hence, there is a critical need for a methodology to provideoptimized scheduling solutions applicable to a variety ofapplications at runtime in heterogeneous many-core systems.

Scheduling has traditionally been considered as anoptimization problem [10]. We change this perspective byformulating runtime scheduling for heterogeneous many-coreplatforms as a classification problem. This perspective and thefollowing key insights enable us to employ machine learning(ML) techniques to solve this problem:Key insight 1: One can use an optimal (or near-optimal)scheduling algorithm offline without being limited bycomputational time and other runtime overheads. Then, theinputs to this scheduler and its decisions can be recorded alongwith relevant features to construct an Oracle.Key insight 2: One can design a policy that approximates theOracle with minimum overhead and use this policy at runtime.Key insight 3: One can exploit the effectiveness of ML tolearn from Oracle with different objectives, which includesminimizing execution time, energy consumption, etc.

Realizing this vision requires addressing several challenges.First, we need to construct an Oracle in a dynamic environmentwhere tasks from multiple applications can overlap arbitrarily,and each incoming application instance observes a differentsystem state. Finding optimal schedules is challenging evenoffline, since the underlying problem is NP-complete. Weaddress this challenge by constructing Oracles using bothCP and a computationally expensive heuristic, called earliesttask first (ETF) [14]. ML uses informative properties of thesystem (features) to predict the category in a classificationproblem. The second challenge is identifying the minimalset of relevant features that can lead to high accuracy withminimal overhead. We store a small set of 45 relevant featuresfor a many-core platform with 16 processing elements alongwith the Oracle to minimize the runtime overhead. This

enables us to represent a complex scheduling decision as aset of features and then predict the best processing elementfor task execution. The final challenge is approximating theOracle accurately with a minimum implementation overhead.Since runtime task scheduling is a sequential decision-makingproblem, supervised learning methodologies, such as linearregression and regression tree, may not generalize for unseenstates at runtime. Reinforcement learning (RL) and imitationlearning (IL) are more effective for sequential decision-making problems [19, 29, 31]. Indeed, RL has shown promisewhen applied to the scheduling problem [20, 21, 37], butit suffers from slow convergence and sensitivity of thereward function [15, 18]. In contrast, IL takes advantage ofthe expert’s inherent knowledge and produces policies thatimitate the expert decisions [30]. Hence, we propose anIL-based framework that schedules incoming applications toheterogeneous multi-core systems.The proposed IL framework is formulated to facilitate

generalization, i.e. it can be adapted to learn from any Oraclethat optimizes a specific objective, such as performance andenergy efficiency, of an arbitrary DSSoC. We evaluate theproposed framework with six domain-specific applicationsfrom wireless communications and radar systems. Theproposed IL policies successfully approximate the Oracle withmore than 99% accuracy, achieving fast convergence andgeneralizing to unseen applications. In addition, the schedulingdecisions are made within 1.1�s (on an Arm A53 core), whichis better than CFS performance (1.2�s). To the best of ourknowledge, this is the first imitation learning-based schedulingframework for heterogeneous many-core systems capable ofhandling multiple applications exhibiting streaming behavior.The main contributions of this paper are as follows:∙ An imitation learning framework to construct policies fortask scheduling in heterogeneous many-core platforms;

∙ Oracle design using both optimal and heuristic schedulersfor performance- and energy- based optimization objectives;

∙ Extensive experimental evaluation of the proposed ILpolicies along with latency and storage overhead analysis;

∙ Performance comparison of IL policies againstreinforcement learning and optimal schedules obtainedby constraint programming.The rest of the paper is organized as follows. We review the

related work in Section II. Section III provides backgroundinformation on DAG scheduling and imitation learning. InSection IV, we discuss the proposed methodology, followed byrelevant experimental results in Section V. Section VI presentsthe conclusions and possible future research for this work.

II. RELATED WORK AND NOVEL CONTRIBUTIONSCurrent many-core systems use runtime heuristics to enable

scheduling with low overheads. For example, the completelyfair scheduler (CFS) [24], widely used in Linux systems,aims to provide fairness for all processes in the system. CFSmaintains two queues (active and expired) to manage taskscheduling. In addition, CFS gives a fixed time quantumfor each process. Tasks are swapped between active andexpired queues based on activation and expiration of the

3

2 3

4 5 6

1

7

Clusters and PEs Supported Tasks

High-performance (big) general-purpose 1, 2, 3, 4, 5, 6, 7

Low-power (LITTLE)general-purpose 1, 2, 3, 4, 5, 6, 7

Hardware accelerator-1 (Acc-1) 3, 5

Hardware accelerator-2 (Acc-2) 2, 5, 6 LITTLE

big

Acc-1 3

Time

1

4

Acc-2 2

6

5

7

(a) (b) (c)

Fig. 1: (a) An example DAG consisting of 7 tasks (b) A heterogeneous computing platform with 4 processing elements and list of tasks inDAG supported by each PE (c) A sample schedule of the DAG on the heterogeneous many-core system.

time quantum. However, complex heuristics are requiredto manage such queues. CFS also does not generalize tooptimization objectives apart from performance and fairness.More importantly, CFS scheduling is limited to general-purpose cores and lacks support for specialized cores andhardware accelerators [7]. With the same limitations, shortestjob first (SJF) [35] scheduler estimates the task’s CPUprocessing time and assigns the first available resource to thetask with the shortest execution time.

List scheduling techniques [16, 28] for DAGs [4, 8,33] prioritize various objectives, such as energy [6, 32],fairness [41], security [40]. In general, this technique placesthe nodes (tasks) of a DAG in a list and provides a PEassignment and order at design time. Heterogeneous earliestfinish time (HEFT) [33] is one example, in which an upwardrank is computed to perform the scheduling decisions. Theauthors in [8] use a lookahead algorithm as an enhancementto the HEFT scheduler to improve the execution time, butsuffers from fourth order complexity O(n4) on the numberof tasks (n). Another recent technique shows improvement inperformance with quadratic complexity [4]. However, thesealgorithms suffer from the time complexity problem and aretailored to particular objectives and fail to generalize to acombination of objectives and choice of applications.

Machine learning (ML)-based schedulers show promisein eliminating the drawbacks of list scheduling andruntime heuristic techniques. ML-based schedulers possessthe capabilities to be further tuned at runtime [20]. Arecent support vector machine (SVM)-based scheduler forOpenCL kernels assigns kernels (tasks) between CPUs andGPUs [38]. In contrast to schedulers that use supervisedlearning, authors in [22] uses reinforcement learning (RL) toschedule Tensorflow device placement, but lacks the abilityof scheduling streaming jobs. DeepRM [20] uses deep neuralnetworks with RL for scheduling at an application granularityas opposed to using the notion of DAGs. On the other hand,Decima [21] uses a combination of graph neural networksand RL to perform coarse-grained processing-cluster levelscheduling for streaming DAGs.

RL-based scheduling techniques have two major drawbacks.First, they require a significant number of episodes toconverge. For example, the technique proposed in [21] takes50k episodes, with 1.5 seconds each, to converge to a solutionthat is equivalent to 21 hours of simulation in Nvidia Tesla

P100 GPU. Second, the efficiency of an RL-based techniquepredominantly depends on the choice of the reward function.Usually, the reward function is hand-tuned, depending on theproblem under consideration.To overcome these difficulties, we propose an IL-based

scheduling methodology. Since IL uses an Oracle to constructa policy, it does not suffer from slow convergence, as seenin RL. IL-based policies were initially used in robotics toshow their fast convergence property [30]. Recently, the useof imitation learning to intelligently manage power and energyconsumption in SoCs has been demonstrated [15, 18]. Tothe best of our knowledge, this is the first approach thatapplies IL for multi-application streaming task scheduling inheterogeneous many-core platforms.

III. BACKGROUND AND OVERVIEWThe runtime scheduling problem addressed in this paper

is illustrated in Fig. 1. We consider streaming applicationsthat can be modeled using directed acyclic graphs, such asthe one shown in Fig. 1(a). These applications process dataframes that arrive at a varying rate over time. For example,a WiFi-transmitter, one of our domain applications, receivesand encodes raw data frames before they are transmitted overthe air. Data frames from a single application or multiplesimultaneous applications can overlap in time as they gothrough the tasks that compose the application. For instance,Task-1 in Fig. 1(a) can start processing a new frame, whileother tasks continue processing earlier frames. Processing of aframe is said to be completed after the terminal task withoutany successor (Task-7 in Fig. 1(a)) is executed. We define theapplication formally to facilitate description of the schedulers.Definition 1: An application graph GApp( , ) is a directedacyclic graph, where each node Ti ∈ represents the tasksthat compose the application. Directed edge eij ∈ from taskTi to Tj shows that Tj cannot start processing a new framebefore the output of Ti reaches Tj for all Ti, Tj ∈ . vij foreach edge eij ∈ denotes the communication volume overthis edge. It is used to account for the communication latency.Each task in a given application graph GApp can execute

on different processing elements in the target DSSoC. Weformally define the target DSSoC as follows:Definition 2: An architecture graph GArcℎ( ,) is a directedgraph, where each node Pi ∈ represents processing

4

Scheduling Algorithms

Simulator

Applications

Platform Config

Offline Online – Runtime Evaluation

IL Policies

Impr

ove

Polic

y us

ing

DA

gger

(Sec

. IV

-C)

Oracle Generation (Sec. IV-B)

Cluster PE

Cluster PE

IL-policy Generation (Sec. IV-C)Cluster-2 (n2 PEs)

. . PE-1 PE-2 PE-n2

Cluster-1 (n1 PEs)

. . .PE-1 PE-2 PE-n1

Cluster-n (np PEs)

. .PE-1 PE-2 PE-np

. .

Cluster-2 (n2 PEs)

. . PE-1 PE-2 PE-n2

Cluster-1 (n1 PEs)

. . .PE-1 PE-2 PE-n1

Cluster-n (np PEs)

. .PE-1 PE-2 PE-np

. .

Platform Configuration and Task Profiling

Fig. 2: An overview of the proposed imitation learning framework for task scheduling in heterogeneous many-core systems. The frameworkintegrates the system configuration, profiling information, scheduling algorithms and applications to construct Oracle, and train IL policiesfor task scheduling. The IL policies, that are improved using DAgger, are then evaluated on the heterogeneous many-core system at runtime.

TABLE I: Summary of the notations used in this paperTj Task-j Set of TasksPi PE-i Set of PEsc Cluster-c Set of clustersLij

Communication linksbetween Pi to Pj Set of

communication linkstexe(Pi, Tj )

Execution time oftask Tj on PE Pi

tcomm(Lij )Communicationlatency from Pi to Pj

s State-s Set of statesvjk

Communication volumefrom task Tj to Tk Set of actions

S Static features D Dynamic features�C (s)

Apply cluster policyfor state s �P ,c (s)

Apply PE policyin cluster-c for state s

� Policy �∗ Oracle policy�G Policy for many-core

platform configuration G �∗G Oracle for many-coreplatform configuration G

elements, and Lij ∈ represents the communication linksbetween Pi and Pj in the target SoC. The nodes and linkshave the following quantities associated with them:

∙ texe(Pi, Tj) is the execution time of task Tj on PE Pi ∈ ,if Pi can execute (i.e., it supports) Tj .

∙ tcomm(Lij) is the communication latency from Pi to Pjfor all Pi, Pj ∈ .∙ C(Pi) ∈ is the PE cluster Pi ∈ belongs to.

The DSSoC example in Fig. 1(b) assumes one big core cluster,one LITTLE core cluster, and two hardware acceleratorseach with a single PE in them for simplicity. The low-power (LITTLE) and high-performance (big) general-purposeclusters can support the execution of all tasks, as shown inthe supported tasks column in Fig. 1(b). In contrast, hardwareaccelerators (Acc-1 and Acc-2) support only a subset of tasks.A particular instance of the scheduling problem is illustrated

in Fig. 1(c). Task-6 is scheduled to big core (although itexecutes faster on Acc-2) since Acc-2 is not available at thetime of decision making. Similarly, Task-4 is scheduled tothe LITTLE core (even if it executes faster on big) becausethe big core is utilized when Task-4 is ready to execute. Ingeneral, scheduling complex DAGs in heterogeneous many-core platforms present a multitude of choices making the

runtime scheduling problem highly complex. The complexityincreases further with: (1) overlapping DAGs at runtime,(2) executing multiple applications simultaneously, and (3)optimizing for objectives such as performance, energy, etc.The proposed solution leverages imitation learning, and

is outlined in Fig. 2. It is also referred to as learning bydemonstration, and is an adaption of supervised learning forsequential decision making problems. The decision-makingspace is segmented into distinct decision epochs, calledstates (). There exists a finite set of actions for everystate s ∈ . IL uses policies that map each state (s) to acorresponding action.Definition 3: Oracle Policy (expert) �∗(s) ∶ → maps a given system state to the optimal action. In ourruntime scheduling problem, the state includes the set ofready tasks and actions that correspond to assignment oftasks to processing elements . Given the Oracle �∗,the goal with imitation learning is to learn a runtime policythat can approximate it. We construct an Oracle offline andapproximate it using a hierarchical policy with two levels.Consider a generic heterogeneous many-core platform with aset of clusters , as illustrated in Fig. 2. At the first level, anIL policy chooses one cluster (among n clusters) for a task tobe executed in.The first-level policy assigns the ready tasks to one of the

clusters in , since each PE within the same cluster has thesame static parameters. Then, a cluster-level policy assignsthe tasks to a specific PE within that cluster. The details ofstate representation, Oracle generation, and hierarchical policydesign are presented in the next section.

IV. PROPOSED METHODOLOGY AND APPROACHThis section first introduces the system state representation,

including the features used by the IL policies. Then, itpresents the Oracle generation process, and the design of thehierarchical IL policies. Table I details the notations that willbe used hereafter.

A. System State RepresentationOffline scheduling algorithms are NP-complete even though

they rely on static features, such as average execution times.The complexity of runtime decisions is further exacerbated

5

as the system schedules multiple applications that exhibitstreaming behavior. In the streaming scenario, incomingframes do not observe an empty system with idle processors.In strong contrast, PEs have different utilization, and theremay be an arbitrary number of partially processed frames inthe wait queues of the PEs. Since our goal is to learn a set ofpolicies that generalize to all applications and all streamingintensities, the ability to learn the scheduling decisionscritically depends on the effectiveness of state representation.The system state should encompass both static and dynamicaspects of the set of tasks, applications, and the target platform.Naive representations of DAGs include adjacency matrix andadjacency list. However, these representations suffer fromdrawbacks such as large storage requirements, highly sparsematrices which complicates the training of supervised learningtechniques, and scalability for multiple streaming applications.In contrast, we carefully study the factors that influence taskscheduling in a streaming scenario and construct features thataccurately represent the system state. We broadly categorizethe features that make up the state as follows:

∙ Task features: This set includes the attributes ofindividual tasks. They can be both static, such as averageexecution time of a task on a given PE (texe(Pi, Tj)), anddynamic, such as the relative order of a task in the queue.

∙ Application features: This set describes the characteristicsof the entire application. They are static features, such asthe number of tasks in the application and the precedenceconstraints between them.

∙ PE features: This set describes the dynamic state ofthe processing elements. Examples include the earliestavailable times (readiness) of the PEs to execute tasks.

The static features are determined at the design time, whereasthe dynamic features can only be computed at runtime. Thestatic features aid in exploiting design time behavior. Forexample, texe(Pi, Tj) helps the scheduler compare the expectedperformance of different PEs. Dynamic features, on the otherhand, present the runtime dependencies between tasks andjobs and also the busy states of the processing elements. Forexample, the expected time when cluster c becomes availablefor processing adds invaluable information, which is onlyavailable at runtime.In summary, the features of a task comprehensively

represent the task itself and the state of the processingelements in the system to effectively learn the decisions fromthe Oracle policy. The specific types of features used in thiswork to represent the state and their categories are listed inTable II. The static and dynamic features are denoted as Sand D, respectively. Then, we define the systems state at agiven time instant k using the features in Table II as:

sk = S,k ∪ D,k (1)

where S,k and D,k denote the static and dynamic featuresrespectively at a given time instant k. For an SoC with 16processing elements grouped as 5 clusters, we obtain a set of45 features for the proposed IL technique.

TABLE II: Types of features employed for state representation frompoint of view of task TjFeature Type Feature Description Feature Categories

Static(S )

ID of task-j in the DAG TaskExecution time of a task Tjin PE Pi (texe(Pi, Tj ))

TaskPE

Downward depth of task Tjin the DAGTask

ApplicationIDs of predecessor tasksof task Tj

TaskApplication

Application ID ApplicationPower consumption of task Tjin PE Pi

TaskPE

Dynamic(D)

Relative order of task Tj inthe ready queue TaskEarliest time when PEsin a cluster-c are readyfor task execution

PE

Clusters in which predecessortasks of task Tj executed TaskCommunication volume from taskTj to task Tk(vjk) Task

B. Oracle GenerationThe goal of this work is to develop generalized scheduling

models for streaming applications of multiple types to beexecuted on heterogeneous many-core systems. The generalityof the IL-based scheduling framework enables using IL withany Oracle. The Oracle can be any scheduling algorithm thatoptimizes an arbitrary metric, such as execution time, powerconsumption, and total SoC energy.To generate the training dataset, we implemented both

optimal schedulers using CP and heuristics. These schedulersare integrated into a SoC simulation framework, as explainedunder experimental results. Suppose a new task Tj becomesready at time k. The Oracle is called to schedule the task toa PE. The Oracle policy for this action task with system statesk can be expressed as:

�∗(sk) = Pi, (2)where Pi ∈ is the PE Tj scheduled to and sk is the systemstate defined in Equation 1. After each scheduling action, theparticular task that is scheduled (Tj), the system state sk ∈ ,and the scheduling decision are added to the training data. Toenable the Oracle policies to generalize for different workloadconditions, we constructed workload mixes using the targetapplications at different data rates, as detailed in Section V-A.

C. IL-based Scheduling FrameworkThis section presents the hierarchical IL-based scheduler

for runtime task scheduling in heterogeneous many-coreplatforms. A hierarchical structure is more scalable sinceit breaks a complex scheduling problem down into simplerproblems. Furthermore, it achieves a significantly higherclassification accuracy compared to a flat classifier (>93%versus 55%), as detailed in Section V-D.

6

Algorithm 1: Hierarchical imitation learning Framework1 for task T ∈ do2 s = Get current state for task T

/* Level-1 IL policy to assign cluster */3 c = �C (s)

/* Level-2 IL policy to assign PE */4 p = �P ,c (s)

/* Assign T to the predicted PE */5 end

Algorithm 2: Methodology to aggregate data in a hierarchical imitationlearning framework1 for task T ∈ do2 s = Get current state for task T3 if �C (s) == �∗C (s) then4 if �P ,c (s) != �∗P ,c (s) then5 Aggregate state s and label �∗P ,c (s) to the dataset6 end7 end8 else9 Aggregate state s and label �∗C (s) to the dataset

10 c∗ = �∗C (s)11 if �P ,c∗ (s) != �∗P ,c∗ (s) then12 Aggregate state s and label �∗P ,c (s) to the dataset13 end14 end

/* Assign T to the predicted PE */15 end

Our hierarchical IL-based scheduler policies approximatethe Oracle with two levels, as outlined in Algorithm 1. Thefirst level policy �C (s) ∶ → is a coarse-grained schedulerthat assigns tasks into clusters. This is a natural choicesince individual PEs within a cluster have identical staticparameters, i.e., they differ only in terms of their dynamicstates. The second level (i.e., fine-grained scheduling) consistsof one dedicated policy �P ,c(s) ∶ → for each clusterc ∈ . These policies assign the input task to a PE withinits own cluster, i.e., �P ,c(s) ∈ c , ∀c ∈ . We leverageoff-the-shelf machine learning techniques, such as regressiontrees and neural networks, to construct the IL policies. Theapplication of these policies approximates the correspondingOracle policies constructed offline.

IL policies suffer from error propagation as the state-actionpairs in the Oracle are not necessarily i.i.d. (independent andidentically distributed). Specifically, if the decision taken bythe IL policies at a particular decision epoch is differentfrom the Oracle, then the resultant state for the next epochis also different with respect to the Oracle. Therefore, theerror further accumulates at each decision epoch. This canoccur during runtime task scheduling when the policies areapplied to applications that the policies did not train with.This problem is addressed by the data aggregation algorithm(DAgger), proposed to improve IL policies [26]. DAgger addsthe system state and the Oracle decision to the training datawhenever the IL policy makes a wrong decision. Then, thepolicies are retrained after the execution of the workload.

DAgger is not readily applicable to the runtime schedulingproblem since the number of states is unbounded as ascheduling decision at time t for state s (st) can result inany possible resultant state, st+1. In other words, the feature

space is continuous, and hence, it is infeasible to generatean exhaustive Oracle offline. We overcome this challengeby generating an Oracle on-the-fly. More specifically, weincorporate the proposed framework into a simulator. Theoffline scheduler used as the Oracle is called dynamically foreach new task. Then, we augment the training data with allthe features, Oracle actions, as well as the results of the ILpolicy under construction. Hence, the data aggregation processis performed as part of the dynamic simulation.The hierarchical nature of the proposed IL framework

introduces one more complexity to data aggregation. Thecluster policy’s output may be correct, while the PE clusterreaches a wrong decision (or vice versa). If the clusterprediction is correct, we use this prediction to select the PEpolicy of that cluster, as outlined in Algorithm 2. Then, ifthe PE prediction is also correct, the execution continues;otherwise, the PE data is aggregated in the dataset. However,if the cluster prediction does not align with the Oracle, inaddition to aggregating the cluster data, an on-the-fly Oracleis invoked to select the PE policy, then the PE prediction iscompared to the Oracle, and the PE data is aggregated in caseof a wrong prediction.

V. EXPERIMENTAL RESULTSSection V-A presents the experimental methodology and

setup. Section V-B explores different machine learningclassifiers for IL. The significance of the proposed featuresis studied using a regression tree classifier in Section V-C.Section V-D presents the evaluation of the proposed IL-scheduler. Section V-E analyzes the generalization capabilitiesof IL-scheduler. The performance analysis with multipleworkloads is presented in Section V-F. We demonstrate theevaluation of the proposed IL technique to energy-basedoptimization objectives in Section V-G. Section V-H presentscomparisons with RL-based scheduler and Section V-Ianalyzes the complexity of the proposed approach.

A. Experimental Methodology and SetupDomain Applications: The proposed IL schedulingmethodology is evaluated using applications from wireless

TABLE III: Characteristics of applications used in this study andthe number of frames of each application in the workload

App # ofTasks

ExecutionTime (�s)

SupportedClusters

Representationin workload

#frames #tasks

WiFi-TX 27 301 big, LITTLE, FFT 69 1863WiFi-RX 34 71 big, LITTLE,

FFT, Viterbi 111 3774RangeDet 7 177 big, LITTLE, FFT 64 448

SC-TX 8 56 big, LITTLE 64 512SC-RX 8 154 big, LITTLE,

Viterbi 91 728

TempMit 10 81 big, LITTLE,Matrix mult. 101 1010

TOTAL 500 8335

7

Big ClusterGeneral-Purpose (4 PEs)

LITTLE ClusterGeneral-Purpose

(4 PEs)

Matrix MultiplicationAccelerator (2 PEs)

Fast Fourier TransformAccelerator (4 PEs)

Viterbi DecoderAccelerator (2 PEs)

Fig. 3: Configuration of the heterogeneous many-core platformcomprising 16 processing elements, used for scheduler evaluations.

communication and radar processing domains. We employWiFi-transmitter (WiFi-TX), WiFi-receiver (WiFi-RX), rangedetection (RangeDet), single-carrier transmitter (SC-TX),single-carrier receiver (SC-RX) and temporal mitigation(TempMit) applications, as summarized in Table III. Weconstruct workload mixes using these applications and runthem in parallel.Heterogeneous DSSoC Configuration: Considering the natureof applications, we employ a DSSoC with 16 PEs, includingaccelerators for the most computationally intensive tasks; theyare divided into five clusters with multiple homogeneousPEs, as illustrated in Fig. 3. To enable power-performancetrade-off while using general-purpose cores, we include abig cluster with four Arm A57 cores and a LITTLE clusterwith four Arm A53 cores. In addition, the DSSoC integratesaccelerator clusters for matrix multiplication, FFT, and Viterbidecoder to address the computing requirements of the targetdomain applications summarized in Table III. The acceleratorinterfaces are adapted from [17]. The number of acceleratorinstances in each cluster is selected based on how much thetarget applications use them. For example, three out of thesix reference applications involve FFT, while range detectionapplication alone has three FFT operations. Therefore, weemploy four instances of FFT hardware accelerators and twoinstances of Viterbi and matrix multiplication accelerators, asshown in Fig. 3.Simulation Framework: We evaluate the proposed ILscheduler using the discrete event-based simulationframework [5], which is validated against two commercialSoCs: Odroid-XU3 [12] and Zynq Ultrascale+ ZCU102 [1].This framework enables simulations of the target applicationsmodeled as DAGs under different scheduling algorithms.More specifically, a new instance of a DAG arrives followinga specified inter-arrival time rate and distribution, such asan exponential distribution. After the arrival of each DAGinstance, called a frame, the simulator calls the schedulerunder study. Then, the scheduler uses the information in theDAG and the current system state to assign the ready tasksto the waiting queues of the PEs. The simulator facilitatesstoring this information and the scheduling decision toconstruct the Oracle, as described in Section IV-B.

The execution times and power consumption for the tasksin our domain applications are profiled on Odroid-XU3 andZynq ZCU102 SoCs. The simulator uses these profiling resultsto determine the execution time and power consumptionof each task. After all the tasks that belong to the same

W i F i - T X W i F i - R XR a n g e D e t

S C - T X S C - R X T e m p M i tA v e r a g e1 0 - 1

1 0 01 0 11 0 21 0 31 0 4

0 . 8 s

0 . 1 s

0 . 3m s

1 . 8 s 3 . 4 s

0 . 0 9 s0 . 4 2 s0 . 3 4 s

8 . 8 2 s

0 . 2m s

0 . 4m s

0 . 3m s

0 . 3m s

0 . 3m s

2 . 2 s

0 . 3m sAv

g. Tim

e per

Sche

duling

Decis

ion (m

s)

C P 5 - m i n C P 1 - m i n E T F1 1 . 1 s

Fig. 4: A comparison of average runtime per scheduling decision foreach application with CP5−min, CP1−min and ETF schedulers.

frame are executed, the processing of the corresponding framecompletes. The simulator keeps track of the execution time andenergy consumed for each frame. These end-to-end values arewithin 3%, on average, of the measurements on Odroid-XU3and Zynq ZCU102 SoCs.Scheduling Algorithms used for Oracle and Comparisons:We developed a CP formulation using IBM ILOG CPLEXOptimization Studio [2] to obtain the optimal scheduleswhenever the problem size allows. After the arrival of eachframe, the simulator calls the CP solver to find the scheduledynamically as a function of the current system state. Sincethe CP solver takes hours for large inputs (∼100 tasks), weimplemented two versions with one minute (CP1−min) and fiveminutes (CP5−min) time-out per scheduling decision. Whenthe model fails to find an optimal schedule, we use the bestsolution found within the time limit. Fig. 4 shows that theaverage time of the CP solver per scheduling decision for thebenchmark applications is about 0.8 seconds and 3.5 seconds,respectively, based on the time limit. Consequently, one entiresimulation can take up to 2 days, even with a time-out.We also implemented the ETF heuristic scheduler, which

goes over all tasks and possible assignments to find the earliestfinish time considering communication overheads. Its averageexecution time is close to 0.3 ms, which is still prohibitive fora runtime scheduler, as shown in Fig. 4. However, we observedthat it performs better than CP1−min and marginally worse thanCP5−min, as we detail in Section V-D.Oracle generation with the CP formulation is not practical

for two reasons. First, it is possible that for small input sizes(e.g., less than ten tasks), there might be multiple (incumbent)optimal solutions, and CP would choose one of them randomly.The other reason is that for large input sizes, CP terminatesat the time limit providing the best solution found so far,which is sub-optimal. The sub-optimal solutions produced byCP vary based on the problem size and the limit. In contrast,ETF is easier to imitate at runtime and its results are within8.2% of CP5−min results. Therefore, we use ETF as the Oraclepolicy in our experiments and use the results of CP schedulersas reference points. We train IL policies for this Oracle inSection V-B and evaluate their performance in Section V-D.B. Exploring Different Machine Learning Classifiers for ILWe explore various ML classifiers within the IL

methodology to approximate the Oracle policy. One of

8

TABLE IV: Classification accuracies of trained IL policies withdifferent machine learning classifiers.

Classifier ClusterPolicy

LITTLEPolicy

bigPolicy

MatMultPolicy

FFTPolicy

ViterbiPolicy

RT 99.6 93.8 95.1 99.9 99.5 100SVC 95.0 85.4 89.9 97.8 97.5 98.0LR 89.9 79.1 72.0 98.7 98.2 98.0NN 97.7 93.3 93.6 99.3 98.9 98.1

TABLE V: Execution time and storage overheads per IL policy forregression tree and neural network classifiers.

Classifier Latency (�s) Storage (KB)Odroid-XU3(Arm A15)

Zynq Ultrascale+ ZCU102(Arm A53)

RT 1.1 1.1 19.3NN 14.4 37 16.9

the key metrics that drive the choice of machine learningtechniques is the classification accuracy of the IL policies.At the same time, the policy should also have a low storageand execution time overheads. We evaluate the followingalgorithms for classification accuracy and implementationefficiency: regression tree (RT), support vector classifier(SVC), logistic regression (LR), and a multi-layer perceptronneural network (NN) with 4 hidden layers and 32 neurons ineach hidden layer.

The classification accuracy of ML algorithms under studyare listed in Table IV. In general, all classifiers achieve ahigh accuracy to choose the cluster (the first column). At thesecond level, they choose the correct PE with high accuracy(>97%) within the hardware accelerator clusters. However,they have lower accuracy and larger variation for the LITTLEand big clusters (highlighted columns). This is intuitive as theLITTLE and big clusters can execute all types of tasks inthe applications, whereas accelerators execute fewer tasks. Instrong contrast, a flat policy, which directly predicts the PE,results in training accuracy with 55% at best. Therefore, wefocus on the proposed hierarchical IL methodology.

Regression trees (RT) trained with a maximum depth of12 produce the best accuracy for the cluster and PE policies,with more than 99.5% accuracy for the cluster and hardwareacceleration policies. RT also produces an accuracy of 93.8%and 95.1% to predict PEs within the LITTLE and big clusters,respectively, which is the highest among all the evaluatedclassifiers. The classification accuracy of NN policies arecomparable to RT, with a slightly lower cluster predictionaccuracy of 97.7%. In contrast, support vector classifiers(SVC) and logistic regression (LR) are not preferred due tolower accuracy of less than 90% and 80%, respectively, topredict PEs within LITTLE and big clusters.

We choose regression trees and neural networks to analyzethe latency and storage overheads (due to their superiorperformance). The latency of RT is 1.1�s on Arm Cortex-A15in Odroid-XU3 and on Arm Cortex-A53 in Zynq ZCU102, asshown Table V. In comparison, the scheduling overhead ofCFS, the default Linux scheduler, on Zynq ZCU102 runningLinux Kernel 4.9 is 1.2�s, which is slightly larger than oursolution. The storage overhead of an RT policy is 19.33 KB.

TABLE VI: Training accuracy of IL policies with subsets of theproposed feature set

Features Excludedfrom Training

ClusterPolicy

LITTLEPolicy

bigPolicy

MatMulPolicy

FFTPolicy

ViterbiPolicy

None 99.6 93.8 95.1 99.9 99.5 100Static features 87.3 93.8 92.7 99.9 99.5 100Dynamic features 88.7 52.1 57.6 94.2 70.5 98PE availability times 92.2 51.1 61.5 94.1 66.7 98.1Task ID, depth, app. ID 90.9 93.6 95.3 99.9 99.5 100

The NN policies incur an overhead of 14.4�s on the ArmCortex-A15 cluster in Odroid-XU3 and 37�s on Arm Cortex-A53 in Zynq, with a storage overhead of 16.89 KB. NNs arepreferable for use in an online environment as their weightscan be incrementally updated using the back-propagationalgorithm. However, due to competitive classification accuracyand lower latency overheads of RTs over NNs, we choose RTfor the rest of the experiments.

C. Feature Space Exploration with Regression Tree ClassifierThis section explores the significance of the features chosen

to represent the state. For this analysis, we assess the impact ofthe input features on the training accuracy with RT classifierand average execution time following a systematic approach.The training accuracy with subsets of features and the

corresponding scheduler performance is shown in Table VIand Fig. 5 respectively. First, we exclude all static featuresfrom the training dataset. The training accuracy for theprediction of the cluster significantly drops by 10%. Since weuse hierarchical IL policies, an incorrect first-level decisionresults in a significant penalty for the decisions at the nextlevel. Second, we exclude all dynamic features from training.This results in a similar impact for the cluster policy (10%) butsignificantly affects the policies constructed for the LITTLE,big, and FFT. Next, a similar trend is observed when PEavailability times are excluded from the feature set. Theaccuracy is marginally higher since the other dynamic featurescontribute to learning the scheduling decisions. Finally, weremove a few task related features such as the downward depth,task, and application identifier. In this case, the impact is tothe cluster policy accuracy since these features describe thenode in the DAG and influence the cluster mapping.

0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 01 5 02 2 53 0 03 7 54 5 05 2 5

Avg.

Exec

ution

Time

(µs)

N o r m a l i z e d T h r o u g h p u t

O r a c l e I L ( P r o p o s e d ) S t a t i c F e a t u r e s E x c l . D y n a m i c F e a t u r e s E x c l .

P E A v a i l . T i m e s E x c l . T a s k F e a t u r e s E x c l .

Fig. 5: Average execution time comparison of the applications withOracle, IL (Proposed) and IL policies with subsets of features. Asshown, the average execution time with IL closely follows the Oracle.

9

As observed in Fig. 5, the average execution time of theworkload significantly degrades when all features are notincluded. Hence, the chosen features help to construct effectiveIL policies, approximating the Oracle with over 99% accuracyin execution time.

D. IL-Scheduler Performance EvaluationThis section compares the performance of the proposed

policy to the ETF Oracle, CP1−min, and CP5−min. Sinceheterogeneous many-core systems are capable of runningmultiple applications simultaneously, we stream the frames inour application mix (see Table III) with increasing injectionrates. For example, a normalized throughput of 1.0 in Fig. 6corresponds to 19.78 frames/ms. Since the frames are injectedfaster than they can be processed, there are many overlappingframes at any given time.

First, we train the IL policies with all six referenceapplications and refer to this as the baseline-IL scheduler.IL policies suffer from error propagation due to the noni.i.d. nature of training data. To overcome this limitation, weuse a data aggregation technique adapted for a hierarchicalIL framework (IL-DAgger), as discussed in Section IV-C. ADAgger iteration involves executing the entire workload. Weexecute ten DAgger iterations and choose the best iterationwith performance within 2% of the Oracle. If we fail to achievethe target, we continue to perform more iterations.

Fig. 6 shows that the proposed IL-DAgger schedulerperforms almost identical to the Oracle; the mean averagepercentage difference between them is 1%. More notably, thegap between the proposed IL-DAgger policy and the optimalCP5−min solution is only 9.22%. We emphasize that CP5−minis included only as a reference point, but it has six ordersof magnitude larger execution time overhead and cannot beused at runtime. Furthermore, the proposed approach performsbetter than CP1−min, which is not able to find a good schedulewithin the one-minute time limit per decision. Finally, wenote that the baseline IL can approach the performance of theproposed policy. This is intuitive since both policies are testedon known applications in this experiment. This is in contrastto the leave one out experiments presented in Section V-E.

0 . 0 0 . 1 0 . 2 0 . 3 0 . 4 0 . 5 0 . 6 0 . 7 0 . 8 0 . 9 1 . 01 5 01 7 52 0 02 2 52 5 02 7 5

Avg.

Exec

ution

Time

(µs)


C P 1 - m i n C P 5 - m i n O r a c l e B a s e l i n e - I L ( b e f o r e D A g g e r ) I L - D A g g e r ( P r o p o s e d )

Fig. 6: Comparison of average job execution time between Oracle,CP solutions, and imitation learning policies to schedule a workloadcomprising a mix of six streaming applications. IL scheduler policieswith baseline-IL (before DAgger) and with IL-DAgger (Proposed) areshown in the comparison.

Pulse Doppler Application Case Study: We demonstratethe applicability of the proposed IL-scheduling technique incomplex scenarios using a pulse Doppler application. It is areal-world radar application, which computes the velocity ofa moving target object. This application is significantly morecomplex, with 13-64× more tasks than the other applications.Specifically, it consists of 449 tasks comprising 192 FFT tasks,128 inverse-FFT tasks, and 129 other computations. The FFTand inverse-FFT operations can execute on the general-purposecores and hardware accelerators. In contrast, the other taskscan execute only on the general-purpose cores.The proposed IL policies achieve an average execution

time within 2% of the Oracle. The 2% error is acceptable,considering that the application saturates the computingplatform quickly due to its high complexity. Moreover, theCP-based approach does not produce a viable solution eitherwith 1-minute or 5-minute time limits due to the large problemsize. For this reason, this application is not included in ourworkload mixes and the rest of the comparisons.

E. Illustration of Generalization with IL for UnseenApplications, Runtime Variations and PlatformsThis section analyzes the generalization of the proposed

IL-based scheduling approach to unseen applications, runtimevariations, and many-core platform configurations.IL-Scheduler Generalization to Unseen Applications usingLeave-one-out Experiments: IL, being an adaptation ofsupervised learning for sequential decision making, suffersfrom lack of generalization to unseen applications. To analyzethe effects of unseen applications, we train IL policies,excluding applications one each at a time from the trainingdataset [36].To compare the performances of two schedulers S1 and S2,we use the job slowdown metric slowdownS1,S2 = TS1∕TS2 .

SlowdownS1,S2 > 1 when TS1 > TS2 [20]. The averageslowdown of scheduler S1 with respect to scheduler S2is computed as the average slowdown for all jobs at allinjection rates. The results present an interesting and intuitiveexplanation of the average job slowdown in execution timesfor each of the leave-one-out experiments.Fig. 7 shows the average slowdown of the baseline IL

(IL-LOO) and proposed policy with DAgger iterations (IL-

W i F i - T XW i F i - R X

R a n g e - D e tS C - T X

S C - R XT e m p - M i t0 . 0

0 . 51 . 01 . 52 . 02 . 5

Avera

ge Jo

b Slow

down

I L - L O O ( b e f o r e D A g g e r ) I L - L O O - D A g g e r ( P r o p o s e d )

Fig. 7: Average slowdown of IL policies in comparison withOracle for leave-one-out (LOO) experiments before and after DAgger(Proposed).

10

0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 01 6 01 8 02 0 02 2 02 4 02 6 02 8 0

( f )( e )( d )

( c )( b )

Avg.

Exec

. Tim

e (µs

) C P 1 - m i n C P 5 - m i n O r a c l e I L - D A g g e r ( a l l a p p s i n c l u d e d i n t r a i n i n g ) I L - L O O I L - L O O - D A g g e r ( P r o p o s e d )

( a ) 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 01 5 02 2 53 0 03 7 54 5 05 2 56 0 0


Avg.

Exec

. Tim

e (µs

)

N o r m a l i z e d T h r o u g h p u t0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 01 5 0

1 7 52 0 02 2 52 5 02 7 53 0 03 2 5

Avg.

Exec

. Tim

e (µs

)


0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 01 6 01 8 02 0 02 2 02 4 02 6 02 8 0

Avg.

Exec

. Tim

e (µs

)

N o r m a l i z e d T h r o u g h p u t0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 0

1 5 02 2 53 0 03 7 54 5 05 2 56 0 0

Avg.

Exec

. Tim

e (µs

)

N o r m a l i z e d T h r o u g h p u t0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 01 2 0

1 6 02 0 02 4 02 8 03 2 03 6 04 0 0

Avg.

Exec

. Tim

e (µs

)

N o r m a l i z e d T h r o u g h p u tFig. 8: Average execution time with Oracle, IL-DAgger (all applications are included for training), IL with one application excludedfrom training (IL-LOO) and finally, the leave-one-out policy improved with DAgger (Proposed IL-LOO-DAgger) technique. The excludedapplications are: (a) WiFi-TX, (b) WiFi-RX, (c) Range Detection (d) Single-Carrier TX (e) Single-Carrier RX and (f) Temporal Mitigation.

LOO-DAgger) with respect to the Oracle. We observe thatthe proposed policy outperforms the baseline IL for allapplications, with the most significant gains obtained for WiFi-RX and SC-RX applications. These two applications consistof a Viterbi decoder operation, which is very expensive tocompute on general-purpose cores and highly efficient tocompute on hardware accelerators. When these applicationsare excluded, the IL policies are not exposed to thecorresponding states in the training dataset and make incorrectdecisions. The erroneous PE assignments lead to an averageslowdown of more than 2× for the receiver applications. Theslowdown when the transmitter applications (WiFi-TX and SC-TX) are excluded from training is approximately 1.13×. Rangedetection and temporal mitigation applications experience aslowdown of 1.25× and 1.54×, respectively, for leave-one-outexperiments. The extent of the slowdown in each scenariodepends on the application excluded from training and itsexecution time profile in the different processing clusters.In summary, the average slowdown of all leave-one-out ILpolicies after DAgger (IL-LOO-DAgger) improves to ~1.01×in comparison with the Oracle, as shown in Fig. 7.

Fig. 8(a)-(f) show the average job execution times forthe Oracle (ETF), baseline-IL, IL with leave-one-out andDAgger for IL with leave-one-out policies for each of theapplications. The highest number of DAgger iterations neededwas 8 for SC-RX application, and the lowest was 2 for rangedetection application. If the DAgger criterion is relaxed toachieving a slowdown of 1.02×, all applications achieve thesame in less than 5 iterations. A drastic improvement in theaccuracy of the IL policies with few iterations shows that thepolicies generalize quickly and well to unseen applications,thus making them suitable for applicability at runtime.IL-Scheduler Generalization with Runtime Variations:

Tasks experience runtime variations due to variations in

TABLE VII: Standard deviation (in percentage of execution time)profiling of applications in Odroid-XU3 and Zynq ZCU-102.Application WiFi-TX WiFi-RX RangeDet SC-TX SC-RX TempMit

Zynq ZCU-102 0.34 0.56 0.66 1.15 1.80 0.63Odroid-XU3 6.43 5.04 5.43 6.76 7.14 3.14

system workload, memory, and congestion. Hence, it is crucialto analyze the performance of the proposed approach whentasks experience such variations, rather than observing onlytheir static profiles. Our simulator accounts for variationsby using a Gaussian distribution to generate variations inexecution time [39]. To allow evaluation in a realistic scenario,all tasks in every application are profiled on big and LITTLEcores of Odroid-XU3, and, on Cortex-A53 cores and hardwareaccelerators on Zynq for variations in execution time.We present the average standard deviation as a ratio of

execution time for the tasks in Table VII. The maximumstandard deviation is less than 2% of the execution time forthe Zynq platform, and less than 8% on the Odroid-XU3. Toaccount for variations in runtime, we add a noise of 1%, 5%,10%, and 15% in task execution time during simulation. TheIL policies achieve average slowdowns of less than 1.01× in allcases of runtime variations. Although IL policies are trainedwith static execution time profiles, the aforementioned resultsdemonstrate that the IL policies adapt well to execution time

TABLE VIII: Configuration of many-core platforms.PlatformConfig.

LITTLEPEs

bigPEs

MatMulAcc. PEs

FFTAcc. PEs

DecoderAcc. PEs

G1 (Baseline) 4 4 2 4 2G2 2 2 2 2 2G3 1 1 1 1 1G4 4 4 1 1 1G5 4 4 0 0 0

11

0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 01 0 02 2 53 5 04 7 56 0 07 2 58 5 0

Avg.

Exec

ution

Time

(µs)


π* G 1 π* G 2 π* G 3 π* G 4 π* G 5

πG 1 πG 2 πG 3 πG 4 πG 5

Fig. 9: IL policy evaluation with multiple many-core platformconfigurations. IL policies are trained with only configuration G1.

variations at runtime. Similarly, the policies also generalize tovariations in communication time and power consumption.IL-Scheduler Generalization with Platform Configuration:This section presents a detailed analysis of the IL policiesby varying the configuration i.e., the number of clusters,general-purpose cores, and hardware accelerators. To thisend, we choose five different SoC configurations presentedin Table VIII. The Oracle policy for a configuration G1 isdenoted by �∗G1. An IL policy evaluated on configurationG1 is denoted as �G1. G1 is the baseline configuration thatis used for extensive evaluation. Between configurations G1–G4, we vary the number of PEs within each cluster. We alsoconsider a degenerate case that comprises only LITTLE andbig clusters (configuration G5). We train IL policies with onlyconfiguration G1. The average execution times of �G1, �G2,and �G3 are within 1%, �G4 performs within 2%, and �G5performs within 3%, of their respective Oracles. The accuracyof �G5 with respect to the corresponding Oracle (�∗G5) isslightly lower (97%) as the platform saturates the computingresources very quickly, as shown in Fig. 9.

Based on these experiments, we observe that the ILpolicies generalize well for the different many-core platformconfigurations. The change in system configuration isaccurately captured in the features (in execution times, PEavailability times, etc.), which enables us to generalize well tonew platform configurations. When the cluster configurationin the many-core platform changes, the IL policies generalizewell (within 3%) but can also be improved by using DAggerto obtain improved performance (within 1% of the Oracle).

F. Performance Analysis with Multiple WorkloadsTo demonstrate the generalization capability of the IL

policies trained and aggregated on one workload (IL-DAgger),we evaluate the performance of the same policies on 50different workloads consisting of different combinations ofapplication mixes at varying injection rates, and each of theseworkloads contains 500 frames. For this extensive evaluation,we consider workloads each of which are intensive on oneof WiFi-TX, WiFi-RX, range detection, SC-TX, SC-RX, andtemporal mitigation. Finally, we also consider workloads inwhich all applications are distributed similarly.

Fig. 10 presents the average slowdown for each of the50 different workloads (represented as W-1, W-2 and so

on). While W-22 observes a slowdown of 1.01× against theOracle, all other workloads experience an average slowdownof less than 1.01× (within 1% of Oracle). Independent ofthe distribution of the applications in the workloads, theIL policies approximate the Oracle well. On average, theslowdown is less than 1.01×, demonstrating the IL policiesgeneralize to different workloads and streaming intensities.

G. Evaluation with Energy and Energy-Delay ObjectivesAverage execution time is crucial in configuring computing

systems for meeting application latency requirements anduser experience. Another critical metric in modern computingsystems, especially battery-powered platforms, is energyconsumption [23, 25]. Hence, this section presents theproposed IL-based approach with the following objectives:performance, energy, energy-delay product (EDP), and energy-delay2 product (ED2P). We adapt ETF to generate Oracles foreach objective. Then, the different Oracles are used to trainIL policies for the corresponding objectives. The schedulingdecisions are significantly more complex for these Oracles.Hence, we use an RT of depth 16 (execution time uses RT ofdepth 12) to learn the decisions accurately. The average latencyper scheduling decision remains similar for RT of depth 16(~1.1�s) on Cortex-A53.Fig. 11(a) and Fig. 11(b) present the average execution time

and average energy consumption, respectively, for IL policieswith different objectives. The lowest energy is achieved by theenergy Oracle, while it increases as more emphasis is added toperformance (EDP→ ED2P→ performance), as expected. Theaverage execution time and energy consumption in all casesare within 1% of the corresponding Oracles. This demonstratesthe proposed IL scheduling approach is powerful as it learnsfrom Oracles that optimize for any objective.

H. Comparison with Reinforcement LearningSince the state-of-the-art machine learning techniques [20,

21] do not target streaming DAG scheduling in heterogeneousmany-core platforms, we implemented a policy-gradient basedreinforcement learning technique using a deep neural network(multi-layer perceptron with 4 hidden layers with 32 neurons ineach hidden layer) to compare with the proposed IL-based taskscheduling technique. For the RL implementation, we vary theexploration rate between 0.01 to 0.99 and learning rate from0.001 to 0.01. The reward function is adapted from [21]. RLstarts with random weights and then updates them based on theextent of exploration, exploitation, learning rate, and rewardfunction. These factors affect convergence and quality of thelearned RL models.Fewer than 20% of the experiments with RL converge to a

stable policy and less than 10% of them provide competitiveperformance compared to the proposed IL-scheduler. Wechoose the RL solution that performs best to comparewith the IL-scheduler. The Oracle generation and trainingparts of the proposed technique take 5.6 minutes and 4.5minutes, respectively, when running on an Intel Xeon E5-2680processor at 2.40 GHz. In contrast, an RL-based schedulingpolicy that uses the policy gradient method converges in 300

12

W-1

W-2

W-3

W-4

W-5

W-6

W-7

W-8

W-9

W-10

W-11

W-12

W-13

W-14

W-15

W-16

W-17

W-18

W-19

W-20

W-21

W-22

W-23

W-24

W-25

W-26

W-27

W-28

W-29

W-30

W-31

W-32

W-33

W-34

W-35

W-36

W-37

W-38

W-39

W-40

W-41

W-42

W-43

W-44

W-45

W-46

W-47

W-48

W-49

W-50

0 . 0 0

0 . 9 9

1 . 0 0

1 . 0 1

A l l a p p s s i m i l a r l y d i s t r i b u t e d

T e m p M i ti n t e n s i v e

S C - R Xi n t e n s i v e

S C - T Xi n t e n s i v e

R a n g e D e t i n t e n s i v e

W i F i - R Xi n t e n s i v e

Avera

ge Jo

b Slow

down

W i F i - T Xi n t e n s i v e

Fig. 10: Comparison of average job slowdown normalized with IL-DAgger (Proposed) policies against the Oracle for 50 different workloads.The slowdown of IL-DAgger policies are shown for workloads with different intensities of each application in the benchmark suite.

0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 01 5 02 2 53 0 03 7 54 5 0

( b )

Avg.

Exec

ution

Time

(µs)


O r a c l e ( P e r f ) O r a c l e ( E n e r g y ) O r a c l e ( E D P ) O r a c l e ( E D 2 P ) I L ( P e r f ) I L ( E n e r g y ) I L ( E D P ) I L ( E D 2 P )

( a ) 0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 00 . 10 . 20 . 30 . 40 . 5

Avg.

Energ

y (J)

N o r m a l i z e d T h r o u g h p u tFig. 11: (a) Average execution time and (b) average energy consumption of the workload with Oracles and IL policies for performance,energy, energy-delay product (EDP) and energy-delay2 product (ED2P) objectives.

0 . 0 0 . 2 0 . 4 0 . 6 0 . 8 1 . 01 5 01 7 52 0 02 2 52 5 02 7 53 0 0

Avg.

Exec

ution

Time

(µs)


O r a c l e I L ( P r o p o s e d ) R L

Fig. 12: Comparison of average execution time between Oracle, IL,and RL policies to schedule a workload comprising a mix of sixstreaming real-world applications.

minutes on the same machine. Hence, the proposed techniqueis 30× faster than RL. As shown in Fig. 12, the RL schedulerperforms within 11% of the Oracle, whereas the IL schedulerpresents average execution time that is within 1% of the Oracle.

In general, RL-based schedulers suffer from the followingdrawbacks: (1) need for excessive fine-tuning of the parameters(learning rate, exploration rate, and NN structure), (2) rewardfunction design, and (3) slow convergence for complexproblems. In strong contrast, IL policies are guided by strongsupervision eliminating the slow convergence problem and theneed for a reward function.

I. Complexity Analysis of the Proposed ApproachIn this section, we compare the complexity of our proposed

IL-based task scheduling approach with ETF, which is usedto construct the Oracle policies. The complexity of ETF isO(n2m) [14], where n is the number of tasks and m is thenumber of PEs in the system. While ETF is suitable for use inOracle generation (offline), it is not efficient for online use dueto the quadratic complexity on the number of tasks. However,the proposed IL-policy which uses regression tree has thecomplexity of O(n). Since the complexity of the proposedIL-based policies is linear, it is practical to implement inheterogeneous many-core systems.

VI. CONCLUSION AND FUTURE WORKEfficient task scheduling in heterogeneous many-core

platforms is crucial to improve the system performance, butis very challenging due to its NP-hardness. In this work,we have presented an imitation learning based approach fortask scheduling in many-core platforms executing streamingapplications from wireless communications and radar systems.We have presented a hierarchical imitation learning frameworkthat learns from an Oracle to develop task scheduling policiesto minimize the execution time of applications. The frameworkhas been evaluated comprehensively with six domain-specificapplications and analyzed the storage and latency overheadsof the IL policies. We have shown that the IL policiesapproximate the Oracle better than 99%. The overhead of the

13

policies are significantly low at 1.1�s latency per schedulingdecision and lower than the completely fair scheduler (1.2�s).Our IL policies achieve application execution times within9.3% of optimal schedules obtained offline using constraintprogramming.

Preliminary experiments in which we have used IL tobootstrap RL for task scheduling in heterogeneous many-coreplatforms, show much faster convergence to optimal policies.We envision this work to initiate a new direction in schedulingstudies with optimal Oracle generation and evaluation withapplications from various domains.

REFERENCES[1] “Zynq ZCU102 Evaluation Kit,” https://www.xilinx.com/products/boards-

and-kits/ek-u1-zcu102-g.html, Accessed 10 April 2020.[2] “V12.8: User’s Manual for CPLEX,” International Business Machines

Corporation, vol. 46, no. 53, p. 157, 2009.[3] A. M. Aji, A. J. Peña, P. Balaji, and W.-c. Feng, “MultiCL: Enabling

Automatic Scheduling for Task-Parallel Workloads in OpenCL,” ParallelComputing, vol. 58, pp. 37–55, 2016.

[4] H. Arabnejad and J. G. Barbosa, “List Scheduling Algorithm forHeterogeneous Systems by an Optimistic Cost Table,” IEEE Trans. onParallel and Distributed Systems, vol. 25, no. 3, pp. 682–694, 2013.

[5] S. E. Arda et al., “DS3: A System-Level Domain-Specific System-on-Chip Simulation Framework,” IEEE Transactions on Computers, vol. 69,no. 8, pp. 1248–1262, 2020.

[6] S. Baskiyar and R. Abdel-Kader, “Energy Aware DAG Scheduling onHeterogeneous Systems,” Cluster Computing, vol. 13, no. 4, pp. 373–383, 2010.

[7] T. Beisel, T. Wiersema, C. Plessl, and A. Brinkmann, “CooperativeMultitasking for Heterogeneous Accelerators in the Linux CompletelyFair Scheduler,” in IEEE International Conference on Application-Specific Systems, Architectures and Processors, 2011, pp. 223–226.

[8] L. F. Bittencourt, R. Sakellariou, and E. R. Madeira, “DAG SchedulingUsing a Lookahead Variant of the Heterogeneous Earliest Finish TimeAlgorithm,” in Euromicro Conference on Parallel, Distributed andNetwork-based Processing. IEEE, 2010, pp. 27–34.

[9] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guideto the Theory of NP-Completeness. USA: W. H. Freeman & Co., 1979.

[10] V. Goel, M. Slusky, W.-J. van Hoeve, K. C. Furman, and Y. Shao,“Constraint Programming for LNG Ship Scheduling and InventoryManagement,” European Journal of Operational Research, vol. 241,no. 3, pp. 662–673, 2015.

[11] D. Green et al., “Heterogeneous Integration at DARPA: Pathfinding andProgress in Assembly Approaches,” ECTC, May, 2018.

[12] Hardkernel, “ODROID-XU3,” https://wiki.odroid.com/old_product/odroid-xu3/odroid-xu3 Accessed 20 Mar. 2020, 2014.

[13] J. L. Hennessy and D. A. Patterson, “A New Golden Age for ComputerArchitecture,” Commun. of the ACM, vol. 62, no. 2, pp. 48–60, 2019.

[14] J.-J. Hwang, Y.-C. Chow, F. D. Anger, and C.-Y. Lee, “SchedulingPrecedence Graphs in Systems with Interprocessor CommunicationTimes,” SIAM Journal on Computing, vol. 18, no. 2, pp. 244–257, 1989.

[15] R. G. Kim et al., “Imitation Learning for Dynamic VFI Control in Large-Scale Manycore Systems,” IEEE Trans. on Very Large Scale Integration(VLSI) Systems, vol. 25, no. 9, pp. 2458–2471, 2017.

[16] Y.-K. Kwok and I. Ahmad, “Dynamic Critical-Path Scheduling: AnEffective Technique for Allocating Task Graphs to Multiprocessors,”IEEE Trans. Parallel Distrib. Syst., vol. 7, no. 5, pp. 506–521, 1996.

[17] J. Mack, N. Kumbhare, U. Y. Ogras, and A. Akoglu, “User-SpaceEmulation Framework for Domain-Specific SoC Design,” arXiv preprintarXiv:2004.01636, 2020.

[18] S. K. Mandal, G. Bhat, J. R. Doppa, P. P. Pande, and U. Y. Ogras, “AnEnergy-Aware Online Learning Framework for Resource Managementin Heterogeneous Platforms,” ACM Transactions on Design Automationof Electronic Systems (TODAES), vol. 25, no. 3, pp. 1–26, 2020.

[19] S. K. Mandal, G. Bhat, C. A. Patil, J. R. Doppa, P. P. Pande, andU. Y. Ogras, “Dynamic Resource Management of Heterogeneous MobilePlatforms via Imitation Learning,” IEEE Trans. on Very Large ScaleIntegration (VLSI) Systems, 2019.

[20] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, “ResourceManagement with Deep Reinforcement Learning,” in ACM Workshopon Hot Topics in Networks, 2016, pp. 50–56.

[21] H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, andM. Alizadeh, “Learning Scheduling Algorithms for Data ProcessingClusters,” in ACM Special Interest Group on Data Communication,2019, pp. 270–288.

[22] A. Mirhoseini et al., “Device Placement Optimization withReinforcement Learning,” in International Conference on MachineLearning-Volume 70. JMLR. org, 2017, pp. 2430–2439.

[23] K. Moazzemi, B. Maity, S. Yi, A. M. Rahmani, and N. Dutt, “HESSLE-FREE: Heterogeneous Systems Leveraging Fuzzy Control for RuntimeResource Management,” ACM Transactions on Embedded ComputingSystems (TECS), vol. 18, no. 5s, pp. 1–19, 2019.

[24] C. S. Pabla, “Completely Fair Scheduler,” Linux Journal, no. 184, 2009.[25] B. K. Reddy, A. K. Singh, D. Biswas, G. V. Merrett, and

B. M. Al-Hashimi, “Inter-Cluster Thread-to-Core Mapping and DVFSon Heterogeneous Multi-Cores,” IEEE Transactions on Multi-ScaleComputing Systems, vol. 4, no. 3, pp. 369–382, 2017.

[26] S. Ross, G. Gordon, and D. Bagnell, “A Reduction of Imitation Learningand Structured Prediction To No-Regret Online Learning,” in Proc. ofthe Int. Conf. on Art. Intel. and Stat., 2011, pp. 627–635.

[27] F. Rossi, P. Van Beek, and T. Walsh, Handbook of ConstraintProgramming. Elsevier, 2006.

[28] R. Sakellariou and H. Zhao, “A Hybrid Heuristic for DAG Schedulingon Heterogeneous Systems,” in Int. Parallel and Distributed ProcessingSymposium. IEEE, 2004, p. 111.

[29] A. L. Sartor, A. Krishnakumar, S. E. Arda, U. Y. Ogras, andR. Marculescu, “HiLITE: Hierarchical and Lightweight ImitationLearning for Power Management of Embedded SoCs,” IEEE ComputerArchitecture Letters, vol. 19, no. 1, pp. 63–67, 2020.

[30] S. Schaal, “Is Imitation Learning the Route To Humanoid Robots?”Trends in Cognitive Sciences, vol. 3, no. 6, pp. 233–242, 1999.

[31] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour,“Policy Gradient Methods for Reinforcement Learning with FunctionApproximation,” in Advances in Neural Information Processing Systems,2000, pp. 1057–1063.

[32] V. Swaminathan and K. Chakrabarty, “Real-Time Task Scheduling forEnergy-Aware Embedded Systems,” Journal of the Franklin Institute,vol. 338, no. 6, pp. 729–750, 2001.

[33] H. Topcuoglu, S. Hariri, and M.-Y. Wu, “Performance-Effective andLow-Complexity Task Scheduling for Heterogeneous Computing,” IEEETrans. on Parallel and Distrib. Syst., vol. 13, no. 3, pp. 260–274, 2002.

[34] J. Ullman, “NP-Complete Scheduling Problems,” Journal of Computerand System Sciences, vol. 10, no. 3, pp. 384 – 393, 1975.

[35] M.-A. Vasile, F. Pop, R.-I. Tutueanu, V. Cristea, and J. Kołodziej,“Resource-Aware Hybrid Scheduling Algorithm in HeterogeneousDistributed Computing,” Future Generation Computer Systems, vol. 51,pp. 61–71, 2015.

[36] A. Vehtari, A. Gelman, and J. Gabry, “Practical Bayesian ModelEvaluation using Leave-one-out Cross-validation and WAIC,” Statisticsand Computing, vol. 27, no. 5, pp. 1413–1432, 2017.

[37] Y. Wang et al., “Multi-Objective Workflow Scheduling with Deep-Q-Network-based Multi-Agent Reinforcement Learning,” IEEE Access,vol. 7, pp. 39 974–39 982, 2019.

[38] Y. Wen, Z. Wang, and M. F. O’Boyle, “Smart Multi-Task Schedulingfor OpenCL Programs on CPU/GPU Heterogeneous Platforms,” inInternational Conf. on High Performance Computing, 2014, pp. 1–10.

[39] C. Xian, Y.-H. Lu, and Z. Li, “Dynamic Voltage Scaling for MultitaskingReal-time Systems with Uncertain Execution Time,” IEEE Transactionson Computer-Aided Design of Integrated Circuits and Systems, vol. 27,no. 8, pp. 1467–1478, 2008.

[40] T. Xiaoyong, K. Li, Z. Zeng, and B. Veeravalli, “A Novel Security-Driven Scheduling Algorithm for Precedence-Constrained Tasks inHeterogeneous Distributed Systems,” IEEE Transactions on Computers,vol. 60, no. 7, pp. 1017–1029, 2010.

[41] G. Xie, G. Zeng, L. Liu, R. Li, and K. Li, “Mixed Real-Time Schedulingof Multiple DAGs-based Applications on Heterogeneous Multi-coreProcessors,” Microprocessors and Microsystems, vol. 47, pp. 93–103,2016.

https://wiki.odroid.com/old_product/odroid-xu3/odroid-xu3

https://wiki.odroid.com/old_product/odroid-xu3/odroid-xu3

14

Anish Krishnakumar received his B.Tech degreein Electrical and Electronics Engineering fromNational Institute of Technology, Tiruchirappalli,India and Masters degree from Birla Instituteof Technology, Pilani, India. Anish worked as aPhysical Design Engineer at Qualcomm and as aResearch Scientist at Intel Labs in India. He iscurrently pursuing Ph.D in Electrical Engineering atArizona State University.

Samet E. Arda received his M.S. and Ph.D.degrees in Electrical Engineering from ArizonaState University in 2013 and 2016. He iscurrently an Assistant Research Scientist at ASU.His Ph.D. thesis focused on development ofdynamic models for small modular reactors. Hiscurrent research interests include system-level designand optimization techniques for scheduling inheterogeneous SoCs.

A. Alper Goksoy received his B.S. degree inElectrical and Electronics Engineering fromBogazici University, Istanbul, Turkey. He iscurrently pursuing his Ph.D. in ElectricalEngineering at Arizona State University.

Sumit K. Mandal received his dual degree (BS+ MS) from Indian Institute of Technology (IIT),Kha- ragpur. Currently, he is pursuing his Ph.D.in Arizona State University. His research interestincludes analysis and design of NoC architecture,power management of multicore processors and AIhardware.

Umit Y. Ogras received his Ph.D. degree inElectrical and Computer Engineering from CarnegieMellon University, Pittsburgh, PA, in 2007. From2008 to 2013, he worked as a research scientist atthe Strategic CAD Laboratories, Intel Corporation.He is an Associate Professor at the School ofElectrical, Computer and Energy Engineering, andthe Associate Director of WISCA Center.

Anderson L. Sartor received his B.Sc. andPh.D. in Computer Engineering from UniversidadeFederal do Rio Grande do Sul (UFRGS), Brazil,in 2013 and 2018, respectively. He is currentlya Postdoctoral Researcher at Carnegie MellonUniversity (CMU). His primary research interestsinclude embedded systems design, machine learningfor energy optimization, SoC modeling, and adaptiveprocessors.

Radu Marculescu is the Laura Jennings TurnerChair in Engineering and Professor in the Electricaland Computer Engineering department at TheUniversity of Texas at Austin. He received his Ph.D.in Electrical Engineering from the University ofSouthern California in 1998. Radu’s current researchfocuses on developing ML/AI methods and tools formodeling and optimization of embedded systems,cyber-physical systems, and social networks.

Runtime Task Scheduling using Imitation Learning for ... - arXiv

Documents

Transcript of Runtime Task Scheduling using Imitation Learning for ... - arXiv