Machine Teaching: A New Paradigm for Building Machine Learning Systems

Patrice Y. Simard, Amershi, M. Chickering, Edelman Pelton, Ghorashi, Meek, Ramos, Suh, Verwey, Wang, Wernsing

Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA

Abstract

The current processes for building machine learning systems require practitioners with deep knowledge of machine learning. This significantly limits the number of machine learning systems that can be created and has led to a mismatch between the demand for machine learning systems and the ability for organizations to build them. We believe that in order to meet this growing demand for machine learning systems we must significantly increase the number of individuals that can teach machines. We postulate that we can achieve this goal by making the process of teaching machines easy, fast and above all, universally accessible.

While machine learning focuses on creating new algorithms and improving the accuracy of “learners”, the machine teaching discipline focuses on the efficacy of the “teachers”. Machine teaching as a discipline is a paradigm shift that follows and extends principles of software engineering and programming languages. We put a strong emphasis on the teacher and the teacher’s interaction with data, as well as crucial components such as techniques and design principles of interaction and visualization.

In this paper, we present our position regarding the discipline of machine teaching and articulate fundamental machine teaching principles. We also describe how, by decoupling knowledge about machine learning algorithms from the process of teaching, we can accelerate innovation and empower millions of new uses for machine learning models.

1. Introduction

The demand for machine learning (ML) models far exceeds the supply of “machine teachers” that can build those models. Categories of common-sense understanding tasks that we would like to automate with computers include interpreting commands, customer support, or agents that perform tasks on our behalf. The combination of categories, domains, and tasks leads to millions of opportunities for building specialized, high-accuracy machine learning models. For example, we might be interested in building a model to understand voice commands for controlling a television or building an agent for making restaurant reservations. The key to opening up the large space of solutions is to increase the number of machine teachers by making the process of teaching machines easy, fast and universally accessible.

A large fraction of the machine learning community is focused on creating new algorithms to improve the accuracy of the “learners” (machine learning algorithms) on given labeled data sets. The machine teaching (MT) discipline is focused on the efficacy of the teachers given the learners. The metrics of machine teaching measure performance relative to human costs, such as productivity, interpretability, robustness, and scaling with the complexity of the problem or the number of contributors.

arXiv:1707.06742v3 [cs.LG] 11 Aug 2017

Many problems that affect model building productivity are not addressed by traditional machine learning. One such problem is concept evolution, a process in which the teacher’s underlying notion of the target class is defined and refined over time (Kulesza et al., 2014). Label noise or inconsistencies can be detrimental to traditional machine learning because it assumes that the target concept is fixed and is defined by the labels. In practice, concept definitions, schemas, and labels can change as new sets of rare positives are discovered or when teachers simply change their minds.

Consider a binary classification task for gardening web pages where the machine learner and feature set are fixed. The teacher may initially label botanical garden web pages as positive examples for the gardening concept, but then later decide that these are negative examples. Relabeling the examples when the target concept evolves is a huge burden on the teacher. From a teacher’s perspective, concepts should be decomposable into sub-concepts and the manipulation of the relationship between sub-concepts should be easy, interpretable, and reversible. At the outset, the teacher could decompose gardening into sub-concepts (that include botanical gardens) and label the web page according to this concept schema.

In this scenario, labeling for sub-concepts has no benefits to the machine learning algorithm, but it benefits the teacher by enabling concept manipulation. Manipulation of sub-concepts can be done in constant time (i.e., not dependent on the number of labels), and the teacher’s semantic decisions can be documented for communication and collaboration. Addressing concept evolution is but one example of where the emphasis on the teacher’s perspective can make a large difference in model building productivity.
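The gardening scenario can be illustrated with a minimal sketch. It assumes a hypothetical representation in which each page is labeled once with a sub-concept and a separate schema states which sub-concepts count as positive; the names `labels`, `schema`, and `resolve` are illustrative and do not come from the paper.

```python
# Hypothetical sub-concept labels: each page is labeled once, at the sub-concept level.
labels = {
    "page1": "vegetable_gardening",
    "page2": "botanical_garden",
    "page3": "botanical_garden",
    "page4": "car_repair",
}

# Hypothetical schema: which sub-concepts currently count as positive for "gardening".
schema = {
    "vegetable_gardening": True,
    "botanical_garden": True,
    "car_repair": False,
}


def resolve(labels, schema):
    """Derive binary 'gardening' labels from sub-concept labels and the schema."""
    return {page: schema[sub] for page, sub in labels.items()}


before = resolve(labels, schema)

# The teacher changes their mind about botanical gardens. The teacher's action is
# a single schema edit, no matter how many pages carry that sub-concept label.
schema["botanical_garden"] = False
after = resolve(labels, schema)
```

In this sketch the system still recomputes the derived labels, but the teacher's own effort is one edit, and the edit is reversible by flipping the schema entry back.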

Machine teaching is a paradigm shift away from machine learning, akin to how other fields in programming languages have shifted from optimizing performance to optimizing productivity with the notions of functional programming, programming interfaces, version control, etc. The discipline of machine teaching follows and extends principles of software engineering and languages that are fundamental to software productivity. Machine teaching places a strong emphasis on the teacher and the teacher’s interaction with data, and techniques and design principles of interaction and visualization are crucial components. Machine teaching is also directly connected to machine learning fundamentals as it defines abstractions and interfaces between the underlying algorithm and the teaching language. Therefore, machine teaching lives at the intersection of the human-computer interaction, machine learning, visualization, systems and engineering fields. The goal of this paper is to explore machine learning model building from the teacher’s perspective.

2. The need for a new discipline

In 2016, at one of Microsoft’s internal conferences (TechFest) during a panel titled “How Do We Build and Maintain Machine Learning Systems?”, the host started the discussion by asking the audience “What is your worst nightmare?” in the context of building machine learning models for production. A woman raised her hand and gave the first answer:

“[...] Manage versions. Manage data versions. Being able to reproduce the models. What if, you know, the data disappears, the person disappears, the model disappears... And we cannot reproduce this. I have seen this hundreds of times in Bing. I have seen it every day. Like... Oh yeah, we had a good model. Ok, I need to tweak it. I need to understand it. And then... Now we cannot reproduce it. That is my biggest nightmare!”

To put context to this testimony, we review what building a machine learning model may look like in a product group:

1. A problem owner collects data, writes labeling guidelines, and optionally contributes some labels.

2. The problem owner outsources the task of labeling a large portion of the data (e.g., 50,000 examples).

3. The problem owner examines the labels and may discover that the guidelines are incorrect or that the sampled examples are inappropriate or inadequate for the problem. When that happens, GOTO step 1.

4. An ML expert is consulted to select the algorithm (e.g., deep neural network), the architecture (e.g., number of layers, units per layer, etc.), the objective function, the regularizers, the cross-validation sets, etc.

5. Engineers adjust existing features or create new features to improve performance. Models are trained and deployed on a fraction of traffic for testing.

6. If the system does not perform well on test traffic, GOTO step 1.

7. The model is deployed on full traffic. Performance of the model is monitored, and if that performance goes below a critical level, the model is modified by returning to step 1.

An iteration through steps 1 to 6 typically takes weeks. The system can be stable at step 7 for months. When it eventually breaks, it can be for a variety of reasons: the data distribution has changed, the competition has improved and the requirements have increased, new features are available and some old features are no longer available, the definition of the problem has changed, or a security update or other change has broken the code. At various steps, the problem owner, the machine learning expert, or the key engineer may have moved on to another group or another company. The features or the labels were not versioned or documented. No one understands how the data was collected because it was done in an ad hoc and organic fashion. Because multiple players with different expertise are involved, it takes a significant amount of effort and coordination to understand why the model does not perform as well as expected after being retrained. In the worst case, the model is operating but no one can tell if it is performing as expected, and no one wants the responsibility of turning it off. Machine learning “litter” starts accumulating everywhere. These problems are not new to machine learning in practice (Sculley et al., 2014).

The example above illustrates the fact that building a machine learning model involves more than just collecting data and applying learning algorithms, and that the management process of building machine learning solutions can be fraught with inefficiencies. There are other forms of inefficiencies that are deeply embedded in the current machine learning paradigm. For instance, machine learning projects typically consist of a single monolithic model trained on a large labeled data set. If the model’s summary performance metrics (e.g., accuracy, F1 score) were the only requirements and the performance remained unchanged, adding examples would not be a problem even if the new model errs on the examples that were previously predicted correctly. However, for many problems for which predictability and quality control are important, any negative progress on the model quality leads to laborious testing of the entire model and incurs high maintenance cost. A single monolithic model lacks the modularity required for most people to isolate and address the root cause of a regression problem.

2.1. Definitions of machine learning and machine teaching

It is difficult to argue that the challenges discussed above are given a high priority in the world’s best machine learning conferences. These problems and inefficiencies do not stem from the machine learning algorithm, which is the central topic of the machine learning field; they come from the processes that use machine learning, from the interaction between people and machine learning algorithms, and from people’s own limitations.

To give more weight to this assertion, we will define the machine learning research field narrowly as:

Definition 2.1 (Machine learning research) Machine learning research aims at making the learner better by improving ML algorithms.

This field covers, for instance, any new variations or breakthroughs in deep learning, unsupervised learning, recurrent networks, convex optimization, and so on.

Conversely, we see version control, concept decomposition, semantic data exploration, expressiveness of the teaching language, interpretability of the model, and productivity as having more in common with programming and human-computer interaction than with machine learning. These “machine teaching” concepts, however, are extraordinarily important to any practitioners of machine learning. Hence, we define a discipline aimed at improving these concepts as:

Definition 2.2 (Machine teaching research) Machine teaching research aims at making the teacher more productive at building machine learning models.

We have chosen these definitions to minimize the intersection between the two fields and thus provide clarity and scoping. The two disciplines are complementary and can evolve independently. Of course, like any generalization, there are limitations. Curriculum learning (Bengio et al., 2009), for instance, could be seen as belonging squarely in the intersection because it involves both a learning algorithm and teacher behavior. Nevertheless, we have found these definitions useful to decide what to work on and what not to work on.

2.2. Decoupling machine teaching from machine learning

Machine teaching solutions require one or more machine learning algorithms to produce models throughout the teaching process (and for the final output). This requirement can make things complex for teachers. Different deployment environments may support different runtime functions, depending on what resources are available (e.g., DSPs, GPUs, FPGAs, tight memory or CPU constraints) or what has been implemented and green-lighted for deployment. Machine learning algorithms can be understood as “compilers” that convert the teaching information to an instance of the set of functions available at runtime. For example, each such instance might be characterized by the weights in a neural network, the “means” in K-means, the support vectors in SVMs, or the decisions in decision trees. For each set of runtime functions, different machine learning compilers may be available (e.g., LBFGS, stochastic gradient descent), each with its own set of parameters (e.g., history size, regularizers, k-folds, learning rate schedule, batch size, etc.).

Machine teaching aims at shielding the teacher from both the variability of the runtime and the complexity of the optimization. This has a performance cost: optimizing for a target runtime with expert control of the optimization parameters will always outperform generic parameter-less optimization. It is akin to inlining assembly code. But like high-level programming languages, our goal with machine teaching is to reduce the human cost in terms of both maintenance time and required expertise. The teaching language should be “write once, compile anywhere”, following the ISO C++ philosophy.

Using well-defined interfaces describing the inputs (feature values) and outputs (label value predictions) of machine learning algorithms, the teaching solution can leverage any machine learning algorithms that support these interfaces. We impose three additional system requirements:

1. The featuring language available to the teacher should be expressive enough to enable examples to be distinguished in meaningful ways (a hash of a text document has distinguishing power, but it is not considered meaningful). This enables the teacher to remove feature blindness without necessarily increasing concept complexity.

2. The complexity (VC dimension) of the set of functions that the system can return increases with the dimension of the feature space. This enables the teacher to decrease the approximation error by adding features.

3. The available ML algorithms must satisfy the classical definition of learning consistency (Vapnik, 2013). This enables the teacher to decrease the estimation error by adding labeled examples.

The aim of these requirements is to enable teachers to create and debug any concept function to an arbitrary level of accuracy without being required to understand the runtime function space, learning algorithms, or optimization.
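The decoupling described in this section can be sketched as a small interface. This is only an illustration under stated assumptions: the `Learner` protocol, the `teach` helper, and the two toy learners are hypothetical names introduced here, and the toy learners are not claimed to satisfy the three requirements above. The point is that the teaching side passes feature values and labels through one interface and never touches the learner's internals.

```python
from typing import Protocol, Sequence


class Learner(Protocol):
    """Hypothetical interface: feature values in, label value predictions out."""

    def fit(self, X: Sequence[Sequence[float]], y: Sequence[int]) -> None: ...

    def predict(self, x: Sequence[float]) -> int: ...


class MajorityLearner:
    """Toy learner that always predicts the most frequent training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label = max(sorted(set(labels)), key=labels.count)

    def predict(self, x):
        return self.label


class NearestNeighborLearner:
    """Toy learner that predicts the label of the closest training example."""

    def fit(self, X, y):
        self.X, self.y = [list(row) for row in X], list(y)

    def predict(self, x):
        dists = [sum((a - b) ** 2 for a, b in zip(x, row)) for row in self.X]
        return self.y[dists.index(min(dists))]


def teach(learner: Learner, X, y) -> Learner:
    """The teaching side depends only on the interface, not on the algorithm."""
    learner.fit(X, y)
    return learner


# The same teaching information (features X, labels y) is "compiled" by two learners.
X = [[0.0], [0.9], [1.0]]
y = [0, 1, 1]
models = [teach(MajorityLearner(), X, y), teach(NearestNeighborLearner(), X, y)]
predictions = [m.predict([0.1]) for m in models]
```

Swapping one learner for another changes the predictions but requires no change to how the teacher expresses features and labels.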

3. Analogy to programming

In this section, we argue that teaching machines is a form of programming. We first describe what machine teaching and programming have in common. Next, we highlight several tools developed to support software development that we argue are likely to provide valuable guidance and inspiration to the machine teaching discipline. We conclude this section with a discussion of the history of the discipline of programming and how it might be predictive of the trajectory of the discipline of machine teaching.

3.1. Commonalities and differences between programming and teaching

Assume that a software engineer needs to create a stateless target function (e.g., as in functional programming) that returns value Y given input X. While not strictly sequential, we can describe the programming process as a set of steps as follows:

1. The target function needs to be specified

2. The target function can be decomposed into sub-functions

3. Functions (including sub-functions) need to be tested and debugged

4. Functions can be documented

5. Functions can be shared

6. Functions can be deployed

7. Functions need to be maintained (scheduled and unscheduled debug cycles)

Further assume that a teacher wants to build a target classification function that returns class Y given input X. The process for machine teaching presented in the previous section is similar to the set of programming steps above. While there are strong similarities, there are also significant differences, especially in the debugging step (Table 1).

Table 1. Comparison of debugging steps in programming and machine teaching

Debugging in programming      Debugging in machine teaching
(3) Repeat:                   (3) Repeat:
    (a) Inspect                   (a) Inspect
    (b) Edit code                 (b) Edit/add knowledge (e.g., labels, features, ...)
    (c) Compile                   (c) Train
    (d) Test                      (d) Test

In order to strengthen the analogy between teaching and programming, we need a machine teaching language that lets us express these steps in the context of a machine learning model building task. For programming, the examples of languages include C++, Python, JavaScript, etc. which can be compiled into machine language for execution. For teaching, the language is a means of expressing teacher knowledge into a form that a machine learning algorithm can leverage for training. Teacher knowledge does not need to be limited to providing labels but can be a combination of schema constraints (e.g., mutually exclusive labels for classification, state transition constraints in entity extraction¹), labeled examples, and features. Just as new programming languages are being developed to address current limitations, we expect that new teaching languages will be developed that allow the teacher to communicate different types of knowledge and to communicate knowledge more effectively.
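As a rough illustration of a schema constraint expressed as teacher knowledge, the sketch below checks the address example from the footnote: a zip code should appear after the state. The function name `zip_after_state` and the exact formalization (a "zip" label is only allowed once a "state" label has been seen) are assumptions made for this sketch, not definitions from the paper.

```python
def zip_after_state(token_labels):
    """Return True if no 'zip' label occurs before a 'state' label has been seen.

    This is one hypothetical way to formalize the state transition constraint
    'the zip code appears after the state' for an address recognizer.
    """
    seen_state = False
    for label in token_labels:
        if label == "state":
            seen_state = True
        elif label == "zip" and not seen_state:
            return False
    return True


# Candidate labelings of the tokens in an address, in document order.
valid = ["street", "street", "state", "zip"]
invalid = ["street", "zip", "state"]

valid_ok = zip_after_state(valid)
invalid_ok = zip_after_state(invalid)
```

A constraint like this could be used to reject or flag candidate labelings, independently of which learning algorithm produced them.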

3.2. Programming paving the way forward

As we have illustrated in the previous sections, current machine learning processes require multiple people of different expertise and strong knowledge dependency among them, there are no standards or tooling for versioning of data and models, and there is a strong co-dependency between problem formulation, training and the underlying machine learning algorithms. Fortunately, the emerging discipline of machine teaching can leverage lessons learned from the programming, software engineering and related disciplines. These disciplines have developed over the last half century and addressed many analogous problems that machine teaching aims to solve. This is not surprising given their strong commonalities. In this section, we highlight several lessons and relate them to machine teaching.

3.2.1. Solving complex problems

The programming discipline has developed and improved a set of tools, techniques and principles that allow software engineers to solve complex problems in ways that allow for efficient, maintainable and understandable solutions. These principles include problem decomposition, encapsulation, abstraction, and design patterns. Rather than discussing each of these, we contrast the differing expectations between software engineers solving a complex problem and machine teachers solving a complex problem. One of the most powerful concepts that allowed software engineers to write systems that solve complex problems is that of decomposition. The next anecdote illustrates its importance and power.

¹ In an address recognizer, we might want to require that the zip code appears after the state.

We asked dozens of software engineers the following:

1. Can you write a program that correctly implements the game Tetris?

2. Can you do it in a month?

The answer to the first question is universally “yes”. The answer to the second question varies from “I think so” to “why would it take more than 2 days?”. The first question is arguably related to the Church-Turing thesis which states that all computable functions are computable by a Turing machine. If a human can compute the function, there exists a program that can perform the same computation on a Turing machine. In other words, given that there is an algorithm to implement the Tetris game, most respectable software engineers believe they can also implement the game on whatever machine they have access to and in whatever programming language they are familiar with. The answer to the second question is more puzzling. The state space in a Tetris game (informally the number of configurations of the pieces on the screen) is very large, in fact, far larger than can be examined by the software engineer. Indeed, one might expect that the complexity of the program should grow exponentially with the size of the representation of a state in the state space. Yet, the software engineers seem confident that they can implement the game in under a month. The most likely explanation is that they consider the complexity of the implementation and the debugging to be polynomial in both the representation of the state space and the input.

Let us examine how machine learning experts react to similar questions asked about teaching a complex problem:

1. Can you teach a machine to recognize kitchen utensils in an image as well as you do?

2. Can you do it in a month?

When these questions were asked of another handful of machine learning experts, the answers were quite varied. While one person answered “yes” to both questions without hesitation, most machine learning experts were less confident about both questions with answers including “probably”, “I think so”, “I am not sure”, and “probably not”. Implementing the Tetris game and recognizing simple non-deformable objects seem like fairly basic functions in either field, thus it is surprising that the answers to both sets of questions are so different.

The goal of both programming and teaching is to create a function. In that respect, the two activities have far more in common than they have differences. In both cases we are writing functions, so there is no reason to think that the Church-Turing thesis is not true for teaching. Despite the similarities, the expectations of success for creating, debugging, and maintaining such functions differ widely between software engineers and teachers. While the programming languages and teaching languages are different, the answers to the questions were the same for all software engineers regardless of the programming languages. Yet, most machine learning experts did not give upper bounds on how long it would take to solve a teaching problem, even when they thought the problem was solvable.

Software engineers have the confidence of being able to complete the task in a reasonable time because they have learned to decompose problems into smaller problems. Each smaller problem can be further decomposed until coding and debugging can be done in constant or polynomial time. For instance, to code Tetris, one can create a state module, a state transformation module, an input module, a scoring module, a shape display module, an animation module, and so on. Each of these modules can be further decomposed into smaller modules. The smaller modules can then be composed and debugged in polynomial time. Given that each module can be built efficiently, software engineers have confidence that they can code Tetris in less than a month’s time.

It is interesting to observe that the ability to decompose a problem is a learned skill and is not easy to learn. A smart student could understand and learn all the functions of a programming language (variables, arrays, conditional statements, for loops, etc.) in a week or two. If the same student was asked to code Tetris after two weeks, they would not know where to start. After 6 to 12 months of learning how to program, most software engineers would be able to accomplish the task of programming the Tetris game in under a month.

Akin to how decomposition brings confidence to software engineers² and an upper bound to solving complex problems, machine teachers can learn to decompose complex machine learning problems with the right tools and experiences, and the machine teaching discipline can bring the expectations of success for teaching a machine to a level comparable to that of programming.

3.2.2. Scaling to multiple contributors

The complexity of the problems that software engineers can solve has increased significantly over the past half century, but there are limits to the scale of problems that one software engineer can solve. To address this, many tools and techniques have been developed to enable multiple engineers to contribute to the solution of a problem. In this section, we focus on three concepts: programming languages, interfaces (APIs), and version control.

One of the key developments that enables scaling with the number of contributors is the creation of standardized programming languages. The use of a standardized programming language along with design patterns and documentation enables other collaborators to read, understand and maintain the software. The analog to programming languages for machine teaching is the expression of a teacher’s domain knowledge, which includes labels, features and schemas. Currently, there is no standardization of the programming languages for machine teaching.

Another key development that enables scaling with the number of contributors is the use of componentization and interfaces, which are closely related to the idea of problem decomposition discussed above. Componentization allows for a separation of concerns that reduces development complexity, and clear interfaces allow for independent development and innovation. For instance, a software engineer does not need to consider the details of the hardware upon which the solution will run. For machine teaching, the development of clear interfaces for services required for teaching, such as training, sampling and featuring, would enable independent teaching. In addition, having clear interfaces for models, features, labels, and schemas enables composing these constituent parts to solve more complex problems, thus allowing for their use in problem decomposition.

² For similar reasons, the ability to decompose also brings confidence to professional instructors and animal trainers.

The final development that enables scaling with the number of contributors is the development of version control systems. Modern version control systems support merging contributions by multiple software engineers, speculative development, isolation of bug fixes and independent feature development, and rolling back to previous versions among many other benefits. The primary role of a version control system is to track and manage changes to the source code rather than keeping track of the compiled binaries. Similarly, in machine teaching, a version control system could support managing the changes of the labels, features, schemas, and learners used for building the model and enable reproducibility and branching for experimentation while providing documentation and transparency necessary for collaboration.
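One simple way to picture versioning of teaching artifacts is a content-derived identifier over the "source" (labels, features, schema, learner configuration) rather than over the trained model. The sketch below is an assumption-laden illustration, not a system described in the paper: `snapshot_id` and the example data are hypothetical, and the approach shown is plain SHA-256 hashing of a canonical JSON serialization.

```python
import hashlib
import json


def snapshot_id(labels, features, schema, learner_config):
    """Return a deterministic identifier for one state of the teaching artifacts.

    The identifier depends only on the content of the inputs, so identical
    teaching states map to the same id and any edit yields a different one.
    """
    payload = json.dumps(
        {
            "labels": labels,
            "features": features,
            "schema": schema,
            "learner": learner_config,
        },
        sort_keys=True,           # make dict key order irrelevant
        separators=(",", ":"),    # fixed, whitespace-free serialization
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Hypothetical teaching state.
features = ["has_word:recipe"]
schema = {"recipe": ["Is", "Is Not", "Undecided"]}
learner_config = {"algorithm": "sgd"}

v1 = snapshot_id({"page1": "Is", "page2": "Is Not"}, features, schema, learner_config)
v1_reordered = snapshot_id({"page2": "Is Not", "page1": "Is"}, features, schema, learner_config)
v2 = snapshot_id({"page1": "Is", "page2": "Is"}, features, schema, learner_config)
```

Recording such an identifier next to each trained model would let a team tell which labels, features, and schema produced it, in the same way a commit hash identifies the source behind a binary.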

3.2.3. Supporting the development of problem solutions

In the past few decades, there has been an explosion of tools and processes aimed at increasing programming productivity. These include the development of high-level programming languages, innovations in integrated development environments, and the creation of development processes. Some of these tools and processes have a direct analog in machine teaching, and some are yet to be developed and adapted. Table 2 presents a mapping of many of these tools and concepts to machine teaching.

3.3. The trajectory of the machine teaching discipline

We conclude this section with a brief review of the history of programming and how that might inform the trajectory of the machine teaching discipline. The history of programming is inextricably linked to the development of computers. Programming started with scientific and engineering tasks (1950s) with few programs and programming languages like FORTRAN that focused on compute performance. In the 1960s, the range of problems expanded to include management information systems and the range of programming languages expanded to target specific application domains (e.g., COBOL). The explosion of the number of software engineers led to the realization that scaling with contributors was difficult (Brooks Jr, 1995). In the 1980s, the scope of problems to which programming was applied exploded with the advent of the personal computer as did the number of software engineers solving the problems (e.g., with Basic). Finally, in the 1990s, another explosive round of growth began with the advent of web programming and programming languages like JavaScript and Java. As of writing this paper, the number of software engineers in the world is approaching 20 million!

Machine teaching is undergoing a similar explosion. Currently, much of the machine teaching effort is undertaken by experts in machine learning and statistics. Like the story of programming, the range of problems to which machine learning has been applied has been expanding. With the deep-learning breakthroughs in perceptual tasks in the 2010s (e.g., speech, vision, self-driving cars), there has been an incredible effort to broaden the range of problems addressed by teaching machines to solve the problems. Similar to the expanding population of software engineers, the advent of services like LUIS.ai³ and Wit.ai⁴ has enabled domain experts to build their own machine learning models with no machine learning knowledge. The discipline of machine teaching is young and in its formative stages. One can only expect that this growth will continue at an even quicker pace. In fact, machine teaching might be the path to bringing machine learning to the masses.

4. The role of teachers

The role of the teacher is to transfer knowledge to the learning machine so that it can generate a useful model that can approximate a concept. Let’s define what we mean by this.

Definition 4.1 (Concept) A concept is a mapping from any example to a label value.

For example, the concept of a recipe web page can be represented by a function that returns zero or one, based on whether a web page contains a cooking recipe. In another example, an address concept can be represented by a function that, given a document, returns a list of token ranges, each labeled “address”, “street”, “zip”, “state”, etc. Label values for a binary concept could be “Is” and “Is Not”. We may also allow an “Undecided” label which allows a teacher to postpone labeling decisions or ignore ambiguous examples. Postponing a decision is important because the concept may be evolving in the teacher’s head. An example of this is in (Kulesza et al., 2014).

Definition 4.2 (Feature) A feature is a concept that assigns each example a scalar value.

³ https://www.luis.ai/
⁴ https://wit.ai/

Table 2. Mapping between programming and machine teaching

Programming                               Machine teaching
Compiler                                  ML Algorithms (Neural Networks, SVMs)
Operating System/Services/IDEs            Training, Sampling, Featuring Services, etc.
Frameworks                                ImageNet, word2vec, etc.
Programming Languages                     Labels, Features, Schemas, etc.
  (Fortran, Python, C#)
Programming Expertise                     Teaching Expertise
Version Control                           Version Control
Development Processes (specifications,    Teaching Processes (data collection,
  unit testing, deployment,                 testing, publishing, etc.)
  monitoring, etc.)

We usually use feature to denote a concept when emphasizing its use in a machine learning model. For example, the concept corresponding to the presence or absence of the word “recipe” in text examples might be a useful feature when teaching the recipe concept.
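Definitions 4.1 and 4.2 can be read as function types. The following minimal sketch assumes text examples represented as Python strings; the type aliases `Concept` and `Feature` and the helper `has_word` are illustrative names, not part of the paper.

```python
from typing import Callable

# A concept maps any example to a label value (here: 0 or 1 for a binary concept).
Concept = Callable[[str], int]

# A feature is a concept that assigns each example a scalar value.
Feature = Callable[[str], float]


def has_word(word: str) -> Feature:
    """Build a feature that is 1.0 when `word` occurs as a whitespace-separated token.

    Matching is case-insensitive and deliberately naive: punctuation attached to a
    token (e.g. 'recipe.') is not stripped in this sketch.
    """
    target = word.lower()

    def feature(text: str) -> float:
        return 1.0 if target in text.lower().split() else 0.0

    return feature


contains_recipe = has_word("recipe")

recipe_page_value = contains_recipe("A simple recipe for sourdough bread")
garden_page_value = contains_recipe("Botanical garden opening hours")
```

A learner would consume the scalar outputs of features such as `contains_recipe`, while the recipe concept itself remains the function the teacher is trying to approximate.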

Definition 4.3 (Teacher) A teacher is the personwho transfers concept knowledge to a learning ma-chine.

To clarify this definition of a teacher, the methods ofknowledge transfer need to be defined. At this point,they include a) example selection (biased), b) label-ing, c) schema definition (relationship between labels),d) featuring, and e) concept decomposition (wherefeatures are recursively defined as sub-models). Theteachers are expected to make mistakes in all the formsof knowledge transfer. These teaching “bugs” are com-mon occurrences.

Figure 1 illustrates how concepts, labels, features, and teachers are related. We assume that every concept is a computable function of a representation of examples. The representation is assumed to include all available information about each example. The horizontal axis represents the (infinite) space of examples. The vertical axis represents the (infinite) space of programs or concepts. In computer science theory, programs and examples can be represented as (long) integers. Using that convention, each integer number on the vertical axis could be interpreted as a program, and each integer number on the horizontal axis could be interpreted as an example. We ignore the programs that do not compile and the examples that are nonsensical. We now use Figure 1 to refer to the different ways a teacher can pass information to a learning system.

Definition 4.4 (Selection) Selection is the process by which teachers gain access to an example that exemplifies useful aspects of a concept.

Teachers can select specific examples by filtering the set of unlabeled examples. By choosing these filters deliberately, they can systematically explore the space and discover information relevant to concepts. For example, a teacher may discover insect recipes while building a recipe classifier by issuing a query on “source of proteins”. We note that uniform sampling and uncertainty sampling, which have no explicit input from a teacher, are likely of little use for discovering rare clusters of positive examples. Combinations of semantic filters involving trained models are even more powerful (e.g., “nutrition proteins” and low score with current classifier). This ability to find examples containing useful aspects of a concept enables the teacher to find useful features and provide the labels to train them. Furthermore, the selection choices themselves can be valuable documentation of the teaching process.
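The combined filter mentioned above (“nutrition proteins” and low score with the current classifier) can be sketched as follows. This is an illustration under stated assumptions: `select_examples`, `current_classifier`, the 0.5 threshold and the four-document pool are all hypothetical, and a real tool would use a search index rather than substring matching.

```python
from typing import Callable, List


def select_examples(
    unlabeled: List[str],
    query_terms: List[str],
    score: Callable[[str], float],
    max_score: float = 0.5,
) -> List[str]:
    """Return examples that contain every query term and that the current
    classifier scores at or below max_score (likely misses)."""
    selected = []
    for text in unlabeled:
        lowered = text.lower()
        matches_query = all(term in lowered for term in query_terms)
        if matches_query and score(text) <= max_score:
            selected.append(text)
    return selected


def current_classifier(text: str) -> float:
    """Hypothetical current model: it only recognizes the word 'recipe'."""
    return 0.9 if "recipe" in text.lower() else 0.1


pool = [
    "Easy pancake recipe with eggs",
    "Roasted crickets: a nutrition-packed source of proteins",
    "Proteins and nutrition in a classic chicken recipe",
    "Quarterly earnings report",
]
candidates = select_examples(pool, ["nutrition", "proteins"], current_classifier)
```

In this toy pool, the filter surfaces only the cricket page: it matches the semantic query but is scored low by the current classifier, which is the kind of rare positive that uniform sampling would be unlikely to reach.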

Definition 4.5 (Label) A label is an (example, concept value) pair created by a teacher in relation to a concept.

Figure 1. Representation of examples and concepts. Each column represents an example and contains all concept values for that example. A teacher looks in that direction to “divine” a label. The teacher has access to feature concepts not available to the training set (it is part of the teaching power). However, the teacher does not know his/her own program. Each row represents a concept and contains the value of that concept for all examples. A teacher looks in that direction to “divine” the usefulness of a feature concept. A teacher can guess the values over the space of examples (it is part of the teaching power). Features selected by the teacher looking horizontally are immune to over-training.

Teachers can provide labels by “looking at a column” in Figure 1. It is important to realize that the teachers do not know which programs are running in their heads when they evaluate the target concept values. If they knew the programs, they would transfer their knowledge in programmatic form to the machine and would not need machine learning. Teachers instead look at the available data of an example and “divine” its label. They do this by unconsciously evaluating sub-features and combining them to make labeling decisions. The feature spaces and the combination functions available to the teachers are beyond what is available through the training sets. This power is what makes the teachers valuable for the purpose of creating labels.

Definition 4.6 (Schema) A schema is a relationship graph between concepts.

When multiple concepts are involved, a teacher can express relationships between them. For instance, the teacher could express that the concepts “Tennis” and “Soccer” are mutually exclusive, or that the concept “Tennis” implies the concept “Sport”. These concept constraints are relationships between lines on the diagram (true across all examples). Separating knowledge captured by the schema from the knowledge captured by the labels allows information to be conveyed and edited at a high level. The implied labels can be changed simply by changing the concept relationship. For instance, “Golf” could be moved from being a sub-concept of “Sport” to being mutually exclusive with it, or vice versa. Teachers can understand and change the semantics of a concept by reviewing its schema. Semantic decisions can be reversed without editing individual labels.
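The separation between schema knowledge and label knowledge can be illustrated with a small sketch. It assumes a schema stored as two edge lists (“implies” and “mutually exclusive”); the function `implied_labels` and this representation are hypothetical, not the format used by any particular tool.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def implied_labels(positive: str, implies: List[Edge], exclusive: List[Edge]) -> Dict[str, bool]:
    """Expand one positive label into all labels that the schema implies."""
    labels: Dict[str, bool] = {positive: True}
    # Follow "child implies parent" edges transitively.
    frontier = [positive]
    while frontier:
        concept = frontier.pop()
        for child, parent in implies:
            if child == concept and parent not in labels:
                labels[parent] = True
                frontier.append(parent)
    # A concept that is mutually exclusive with a positive concept is negative.
    for a, b in exclusive:
        if labels.get(a) is True and b not in labels:
            labels[b] = False
        if labels.get(b) is True and a not in labels:
            labels[a] = False
    return labels


# Schema version 1: "Golf" is a sub-concept of "Sport".
implies_v1 = [("Tennis", "Sport"), ("Soccer", "Sport"), ("Golf", "Sport")]
exclusive_v1 = [("Tennis", "Soccer")]

# Schema version 2: "Golf" is moved to be mutually exclusive with "Sport".
implies_v2 = [("Tennis", "Sport"), ("Soccer", "Sport")]
exclusive_v2 = [("Tennis", "Soccer"), ("Golf", "Sport")]

golf_v1 = implied_labels("Golf", implies_v1, exclusive_v1)
golf_v2 = implied_labels("Golf", implies_v2, exclusive_v2)
```

The single teacher-provided label (“Golf” is positive) is untouched between the two versions; only the schema edit changes whether “Sport” is implied to be true or false.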

Definition 4.7 (Generic feature) A generic feature is a set of related feature functions.

Generic features are created by engineers in parametrizable form, and teachers instantiate individual features by providing useful and semantic parameters. For instance, a generic feature could be “log(1 + number of instances of words in list X in a document)”, and an instantiation would be setting X to a list of car brands (useful for an automotive classifier).
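The generic feature quoted above can be sketched as a factory function: the engineer writes the parametrized form once, and the teacher instantiates it with a word list. The name `make_word_list_feature`, the tokenization rule and the three-brand list are assumptions made for illustration.

```python
import math
import re
from typing import Callable, Iterable


def make_word_list_feature(words: Iterable[str]) -> Callable[[str], float]:
    """Generic feature: log(1 + number of tokens of the document found in the list)."""
    vocabulary = {word.lower() for word in words}

    def feature(document: str) -> float:
        # Simple lowercase alphanumeric tokenization (an assumption of this sketch).
        tokens = re.findall(r"[a-z0-9]+", document.lower())
        count = sum(1 for token in tokens if token in vocabulary)
        return math.log(1 + count)

    return feature


# Instantiation by the teacher: X is set to a (toy) list of car brands.
car_brand_feature = make_word_list_feature(["toyota", "ford", "honda"])
```

A document with three brand mentions yields log(4), and a document with none yields log(1) = 0, so the feature is a scalar-valued concept in the sense of Definition 4.2.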

Given a set of generic features, teachers have the ability to evaluate different (instantiated) features by looking along the corresponding horizontal lines in Figure 1. Given two features, the teachers can “divine” that one is better than the other on a large unlabeled set. For instance, a teacher may choose a feature that measures the presence of the word “recipe” over a feature that measures the presence of the word “the”, even though the latter feature might yield better results on the training set. This ability to estimate the value of a feature over estimated distributions of the test set is essential to feature engineering, and is probably the most useful capability of a teacher. Features selected by the teacher in this manner are immune to over-training because they are created independently of the training set. Note the contrast to “automatic feature selection”, which only looks at the training set and concept-independent statistics and is susceptible to over-training.

Definition 4.8 (Decomposition) Decomposition is the act of using simpler concepts to express more complex ones.

Whereas teachers do not have direct access to the program implementing their concept, they sometimes can infer how these programs work. Socrates used to teach by asking the right questions. The “right question” is akin to providing a useful sub-concept, whose value makes evaluating the overall concept easier. In other words, Socrates was teaching by decomposition rather than by examples. This ability is not equally available to teachers. It is learned. It is essential to scaling with complexity and with the number of teachers. It is the same ability that helps software engineers decompose functions into sub-functions. Software engineers also acquire this ability with experience. As in programming, teaching decompositions are not unique (in software engineering, switching from one decomposition to another is called refactoring).

The knowledge provided by the teacher through concept decomposition is high level and modular. Each concept implementation might provide its own example selection, labels, schema, and features. These can be viewed as documentation of interfaces and contracts. Each concept implementation may be a black box, but the concept hierarchy is transparent and interpretable. Concept decomposition is the highest form of knowledge provided by the teacher.

Now that we have defined some of the key roles of the (machine) teacher, we turn to the question of how we meet the demand for them.

Meeting the demand for teachers

We postulate that the right solution to satisfy the increasing demand for machine learning models is to increase the number of people that can teach machines these models. But how do we do that and who are they?

The current ML-focused workflows put the machine learning scientist or data scientist in the driver’s seat. While training more scientists is a way to increase the number of teachers, we believe that this is not the right path to follow. For starters, machine learning and data scientists are a scarce and expensive resource. Secondly, machine learning scientists can serve a better purpose inventing and optimizing learning algorithms. In the same way, data scientists are indispensable in applying their expertise to make sense of data and transform it into a usable form.

The machine teaching process that we envision does not require the skills of an ML expert or data scientist. Machine teachers use their domain knowledge to pick the right examples and counterexamples for a concept and explain why they differ. They do this through an interactive information exchange with a learning system. It is within the ranks of the domain experts where we will find the large population of machine teachers that will increase, by orders of magnitude, the number of ML models used to solve problems. We can transform domain experts by making a machine teaching language universally accessible.

A key characteristic of domain experts is that they understand the semantics of a problem. To this point, we argue that if a problem’s data does not need to be interpreted by a person to be useful, machine teaching is not needed. Examples are problems for which labeled data is abundant or practically limitless, e.g., Computer Vision, Speech Understanding, Genomics Analysis, Click-Prediction, and Financial Forecasting. For these, powerful learning algorithms or hardware may be the better strategy to arrive at an effective solution. In other problems like the above, feature selection using cross-validation can be used to arrive at a good solution without the need for a machine teacher.

There is, nonetheless, an ever-growing set of problems for which machine teaching is the right approach: problems where unlabeled data is plentiful and domain knowledge to articulate a concept is essential. Examples of these include controlling Internet-of-Things appliances through spoken dialogs and the environment’s context, routing customer feedback for a brand-new product of a start-up to the right department, building a one-time assistant to help a paralegal sift through hundreds of thousands of briefs, etc.

We aim at reaching the same number of machine teachers as there are software engineers, a set counted in the tens of millions. Table 3 illustrates the differences in numbers between machine learning scientists, data scientists, and domain experts. By enabling domain experts to teach, we will enable them to apply their knowledge to directly solve millions of meaningful, personal, shared, “one-off” and recurrent problems at a scale that we have never seen.

5. Teaching process

A teaching or programming language can be applied in many different ways, some more effective than others. We propose the following principles for the language and process of machine teaching:

Universal teaching language We do not rely on the power of specific machine learning algorithms. The teaching interface is the same for all algorithms. If a machine learning algorithm is swapped for another one, more teaching may be necessary, but the teaching language and the model building experience are not changed. Machine learning algorithms should be interchangeable. Conversely, the teaching language should be simple and easy to learn given the domain (e.g., text, signal, images). Ideally, we aim at designing an ANSI or ISO standard per domain. Teachers that speak the same language should be interchangeable.

Feature completeness (or realizability) We assume that all the target concepts that a teacher may want to implement are “realizable” through a recursive composition of models and existing features. This implies a property on the feature set, which we call “feature completeness”. Feature completeness is the responsibility of the teaching tool, not the teachers. Teachers achieve realizability through the following actions:

1. Add missing features: If a teacher can distinguish two documents belonging to two different classes in a meaningful way, there must be a (corresponding) feature expressible in the system that can make an equivalent semantically meaningful distinction. By adding such a feature, the teacher can correct feature blindness errors. If no such feature exists, the language is not feature complete for distinguishing the desired classes.

2. Create features through decomposition: If the concept function cannot be learned from the existing set of features due to limitations of the model class, the teacher can circumvent this problem by creating features that are themselves models; we call this process “model decomposition”. To illustrate the point, suppose there are two binary features A and B, and the teacher would like to produce a model for A ⊕ B (where ⊕ stands for XOR) using logistic regression. Because of the capacity limitations of logistic regression, it is impossible to represent A ⊕ B without additional features. If the teacher adds a third AND feature A ∧ B, however, logistic regression can work. Note that A ∧ B is itself learnable via logistic regression in the A and B feature space.

3. Explicitly ignore ambiguous patterns: Ambiguous patterns can be marked as “don’t care” to avoid wasting features, labels, and the teacher’s time on difficult examples. Areas of “don’t care” are used as a coping mechanism to keep the realizability assumption despite the Bayes error rate. This action does not constrain the feature set.
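The XOR illustration in action 2 can be checked numerically. The sketch below is not a training procedure: the weights for the three-feature model are hand-picked (an assumption), and the search over two-feature models is a coarse grid of integer weights rather than a proof. The underlying argument is that a linear decision function over A and B alone would need b < 0, w_A + b > 0, w_B + b > 0 and w_A + w_B + b < 0, and adding the two middle inequalities contradicts the other two.

```python
import math
from itertools import product
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(a: int, b: int, weights: List[float], bias: float) -> int:
    """Logistic regression over [A, B] or, when three weights are given, [A, B, A AND B]."""
    features = [a, b, a & b][: len(weights)]
    z = bias + sum(w * x for w, x in zip(weights, features))
    return int(sigmoid(z) > 0.5)


INPUTS = list(product([0, 1], repeat=2))

# Hand-picked (hypothetical) weights and bias for the features [A, B, A AND B].
WITH_AND = ([10.0, 10.0, -20.0], -5.0)
xor_table = {(a, b): predict(a, b, *WITH_AND) for a, b in INPUTS}

# Coarse grid search over integer weights and biases for the features [A, B] only.
GRID = range(-5, 6)
two_feature_fit_exists = any(
    all(predict(a, b, [wa, wb], bias) == (a ^ b) for a, b in INPUTS)
    for wa, wb, bias in product(GRID, repeat=3)
)
```

With the AND feature, the logits are -5, 5, 5 and -5 for the four inputs, which reproduces XOR; without it, no setting in the grid fits all four inputs, in line with the inequality argument.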

Feature completeness of a teaching language does not imply that the language can be used to efficiently teach concepts. If a feature complete language is not very expressive, realizability can require a large number of model compositions. If a feature complete language is too expressive (e.g., features can be specified as programs), the teachers have to become engineers.

Rich and diverse sampling set We call the set of unlabeled documents accessible to the teacher when building models the “sampling distribution”. We call the set of documents for which models are built the “deployment distribution”. The rich-and-diverse-sampling-set principle is that the sampling distribution captures the richness and diversity of examples in the deployment distribution. The sampling distribution and the deployment distribution are preferably similar, but they do not have to be perfectly matched. The most important requirement of the sampling distribution is that all important types of documents be represented (rich and diverse). If important documents are missing from the sampling distribution, performance could be impacted in unpredictable ways. As a rule of thumb, unlabeled data should be collected indiscriminately because the cost of storing data is negligible compared to the cost of teaching; we view selectively collecting only the data that is meant to be labeled as both risky and limiting⁵. A rich and diverse data set allows the teacher to explore it to express knowledge through selection. It also allows the teacher to find examples that can be used to train sub-concepts that are more specific than the original concept. For instance, a teacher could decide to build classifiers for bonsai gardening (sub-concept) and botanical gardening (excluded concept) to be used as features to a gardening classifier. The sampling set needs to be rich enough to contain sufficient examples to successfully learn the sub-concepts. The sampling distribution can be updated or re-collected. Examples that have been labeled by teachers, however, are kept forever because labels always retain some semantic value.

⁵ For example, the collected set may not contain important examples that would otherwise be found via the machine teaching process.

Table 3. Where to find machine teachers

Potential teacher | Quantities | Characteristics
Machine learning experts | Tens of thousands | Has profound understanding of machine learning. Can modify a machine learning algorithm or architecture to improve performance.
Data Scientist / Analyst | Hundreds of thousands | Can analyze big data, detect trends and correlations using machine learning. Can train machine learning models on existing values to extract value for a business.
Domain expert | Tens of millions | Understands the semantics of a problem. Can provide examples and counterexamples, and explain the difference between them.

Distribution robustness The assumption that the training distribution matches the sampling or deployment distribution is unrealistic in practice. The role of the teacher is to create a model that is correct for any example, regardless of the deployment distribution. Given our assumption of feature completeness and a rich and diverse sampling set, the result of a successful teaching process should be robust to not knowing the deployment distributions. Imagine programming a “Sort” function. We expect “Sort” to work regardless of the distribution of the data it is sorting. Thanks to realizability, we have the same correctness expectation for teaching. Because the training data is discovered and labeled for the training set in an ad hoc way using filtering, distribution robustness is a critical assumption and we therefore favor machine learning algorithms that are robust to covariate shifts. Warning: having a mismatch between train and sampling (or deployment) distributions complicates evaluation.

Modular development Decomposition is a central principle of both programming and machine teaching. The machine teaching process should support the modular development of concept implementation. This includes the decomposition of concepts into sub-concepts, and the use of models as features for other models. We can achieve this by standardizing model and feature interfaces. Similar to a programming integrated development environment (IDE), within our teaching IDE, concept implementation is done through “projects” that are grouped into “solutions”. Projects in a solution are trained together because their retraining can affect each other. Dependencies across different solutions are treated as versioned packages, which means that retraining a project in one solution does not affect a project in a different solution (the teacher must update the package reference to incorporate such changes). The modular development principle encourages the sharing of explicit concept implementations.

Version control All teacher actions (e.g., labels, features, label constraints, schema and dependency graph, and even programming code if necessary) are equivalent to a concept “program”. They are saved in the same “commit”. Like programming code, the teacher’s actions relevant to a concept are saved in a version control system. Different types of actions are kept in different files to facilitate merge operations between contributions from different teachers.

Algorithm 1: A machine teaching process

    repeat
        while training set is realizable do
            if quality criteria are met then
                exit
            end
            // Actively and semantically explore sampling set using concept-based filters.
            Find a test error (i.e., an incorrectly predicted (example, label) pair);
            Add example to training set;
        end
        // Fix training set error
        if training error is caused by labeling error(s) then
            Correct labeling error(s);
        else
            // Fix feature blindness. This may entail one or more of the following actions:
            Add or edit basic features;
            Create a new concept/project for a new feature (decomposition);
            Change label constraints or schema (high level knowledge);
        end
    until forever;

The combination of these principles suggests a teaching process that is different from the standard teaching process. The universal teaching language implies that the machine learning expert can be left out of the teaching loop. The feature completeness principle implies that the engineers can be left out of the teaching loop as well. The teaching tool should provide the teacher with all that is needed to build models effectively. The engineers can update the data pipeline and the programming language, but neither is concept-dependent, so the engineer is out of the teaching loop. These two principles imply that a single person with domain and teaching knowledge can own the whole process. The availability of a rich and diverse sampling set means that the traditional step of collecting data for labeling is not part of the concept teaching process. The distribution robustness principle allows the teacher to explore and label freely throughout the process without worrying about balancing classes or example types. Concept modularity and version control guarantee that a function created in a project is reproducible provided that (1) all of its features are deterministic and (2) training is deterministic. The concept modularity principle enables interpretability and scaling with complexity. The interpretability comes from being able to explain what each sub-concept does by looking at the labels, features, or schema. Even if each sub-concept is a black box inside, their interfaces are transparent. The merge functionality in version control enables easy collaboration between multiple teachers.

Based on the above, we propose a skeleton for a teaching process in Algorithm 1. Note that this process is not unique.
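As an illustration only, the loop of Algorithm 1 can be simulated on a toy problem. Everything below is an assumption of the sketch rather than the process used in any real tool: the learner is a lookup table over keyword-presence features, the teacher is an oracle that never mislabels (so the “correct labeling errors” branch is omitted), the quality criterion is “no error left in the pool”, and the teacher adds candidate keywords in a fixed order instead of inspecting the conflicting examples.

```python
from typing import Callable, Dict, List, Optional, Tuple

Vector = Tuple[int, ...]
Table = Dict[Vector, bool]


def featurize(text: str, keywords: List[str]) -> Vector:
    lowered = text.lower()
    return tuple(int(keyword in lowered) for keyword in keywords)


def fit(training: Dict[str, bool], keywords: List[str]) -> Optional[Table]:
    """Lookup-table learner. Returns None when two training examples share a
    feature vector but disagree on the label (training set not realizable)."""
    table: Table = {}
    for text, label in training.items():
        if table.setdefault(featurize(text, keywords), label) != label:
            return None
    return table


def predict(table: Table, keywords: List[str], text: str) -> bool:
    # Unseen feature vectors default to the negative class.
    return table.get(featurize(text, keywords), False)


def teach(pool: List[str], teacher_label: Callable[[str], bool],
          candidate_keywords: List[str], max_rounds: int = 50):
    keywords: List[str] = []
    training: Dict[str, bool] = {}
    remaining = list(candidate_keywords)
    for _ in range(max_rounds):
        table = fit(training, keywords)
        if table is None:
            # Fix feature blindness by adding a feature.
            if not remaining:
                break  # the toy feature language is not feature complete
            keywords.append(remaining.pop(0))
            continue
        # Find a test error: an example the current model predicts incorrectly.
        error = next((t for t in pool
                      if predict(table, keywords, t) != teacher_label(t)), None)
        if error is None:
            break  # quality criterion met
        training[error] = teacher_label(error)
    return keywords, training


POOL = [
    "Pancake recipe: mix flour and eggs",
    "Recipe for disaster: quarterly earnings fall",
    "Fried crickets, a crunchy source of proteins: ingredients and steps",
    "Weather report for Monday",
]
RECIPES = {POOL[0], POOL[2]}


def teacher_label(text: str) -> bool:
    return text in RECIPES


keywords, training = teach(POOL, teacher_label, ["recipe", "earnings", "ingredients"])
```

In this run the teacher alternates between adding misclassified examples and adding features whenever two labeled examples become indistinguishable, which mirrors the inner and outer parts of Algorithm 1. Capacity grows only when a conflict forces it.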

Evaluating the quality criteria in a distribution-robust setting is difficult and beyond the scope of this paper. A simple criterion could be to pause when the teacher’s cost or time invested reaches a given limit. Finding test errors effectively is also difficult and beyond the scope of this paper. The idea is to query over the large sample set by leveraging query-specific teacher-created concepts and sub-concepts. The art is to maximize the semantic expressiveness of querying and the diversity of results. Uncertainty sampling is a trivial and uninteresting case (ambiguous examples are not useful for coming up with new decomposition concepts).

There are a few striking differences between the teaching process above and the standard model building process. The most important aspect is that it can be done by a single actor operating on the true distribution. Knowledge transfer from teacher to learner has multiple modalities (selection, labels, features, constraints, schema). The process is a never-ending loop reminiscent of Tom Mitchell’s NELL (Carlson et al., 2010). Capacity is increased on demand, so there is no need for traditional regularization because the teacher controls the capacity of the learning system by adding features only when necessary.

6. Conclusion

Over the past two decades, the machine learning field has devoted most of its energy to developing and improving learning algorithms. For problems in which data is plentiful and statistical guarantees are sufficient, this approach has paid off handsomely. The field is now evolving toward addressing a larger set of simpler and more ephemeral problems. While the demand to solve these problems effectively grows, the access to teachers that can build corresponding solutions is limited by their scarcity and cost. To truly meet this demand, we need to advance the discipline of machine teaching. This shift is identical to the shift in the programming field in the 1980s and 1990s. This parallel yields a wealth of benefits. This paper takes inspiration from three lessons from the history of programming. The first one is problem decomposition and modularity. They have allowed programming to scale with complexity. We argue that a similar approach has the same benefits for machine teaching. The second lesson is the standardization of programming languages: write once, run everywhere. This paper is not proposing a standard machine teaching language, but we enumerated the most important machine-learning-agnostic knowledge channels available to the teacher. The final lesson is the process discipline, which includes separation of concerns and the building of standard tools and libraries. This addresses the same limitations to productivity and scaling with the number of contributors that plagued programming (as described in “The Mythical Man-Month” (Brooks Jr, 1995)). We have proposed a set of principles that lead to a better teaching process discipline. Some of the tools of programming, such as version control, can be used as is. Some of these principles have been successfully applied in services such as LUIS.ai and by product groups inside Microsoft such as Bing Local. We are in the early stages of building a teaching interactive development environment.

On a more philosophical note, large monolithic systems, as epitomized by deep learning, are a popular trend in artificial intelligence. We see this as a form of machine learning behaviorism. It is the idea that complex concepts can be learned from a large set of (input, output) pairs. With the aid of regularizers and/or deep representations computed using unsupervised or semi-supervised learning, the monolithic learning approach has yielded impressive results. This has been the case in several fields where labeled data is abundant (speech, vision, machine translation). The monolithic approach, however, has limitations when labeled data is hard to come by. Deep representations built from unlabeled data optimize where the data is. Rare misspellings that are domain specific are likely to be ignored or misinterpreted if they appear more frequently in a different domain context. Corner cases with little or no labels for autonomous driving may be ignored at great peril. Large (amorphous) models are hard to interpret. These limitations can be overcome by injecting semantic knowledge via active teaching (e.g., labels, features, structure). For this reason, we believe that both large monolithic systems and systems more actively supervised by teaching have important roles to play in machine learning. As a bonus, they can easily be combined to complement each other.

7. Acknowledgements

We thank Jason Williams for his support and contributions to Machine Teaching. Jason is the creator of the LUIS (Language Understanding Internet Services) project, a service for building language understanding models, based on the principles mentioned in this paper. We thank Riham Mansour and the Microsoft Cairo team for co-building, maintaining, and improving the www.LUIS.ai service. Finally, we thank Matthew Hurst and his team for building high-performing web page entity extractors leveraging our machine teaching tools. These entity extractors are deployed in Bing Local.

References

Yoshua Bengio, Jerome Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum Learning. In Proceedings of the 26th Annual International Conference on Machine Learning (ICML ’09). ACM, New York, NY, USA, 41–48. https://doi.org/10.1145/1553374.1553380

Frederick P Brooks Jr. 1995. The Mythical Man-Month: Essays on Software Engineering, Anniversary Edition, 2/E. Pearson Education India.

Andrew Carlson, Justin Betteridge, Bryan Kisiel, Burr Settles, Estevam R. Hruschka, Jr., and Tom M. Mitchell. 2010. Toward an Architecture for Never-ending Language Learning. In Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence (AAAI’10). AAAI Press, 1306–1313. http://dl.acm.org/citation.cfm?id=2898607.2898816

Todd Kulesza, Saleema Amershi, Rich Caruana, Danyel Fisher, and Denis Charles. 2014. Structured labeling for facilitating concept evolution in machine learning. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems. ACM, 3075–3084.

D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. 2014. Machine Learning: The High Interest Credit Card of Technical Debt. In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).

Vladimir Vapnik. 2013. The Nature of Statistical Learning Theory. Springer Science & Business Media.
