SHARE: a System for Hierarchical Assistive Recipe Editing


SHARE: a System for Hierarchical Assistive Recipe Editing

Shuyang Li
UC San Diego
San Diego, California
[email protected]

Yufei Li
UT Dallas
Dallas, Texas
[email protected]

Jianmo Ni
UC San Diego
San Diego, California
[email protected]

Julian McAuley
UC San Diego
San Diego, California
[email protected]

[Figure 1 content:
Base recipe: Alfredo Sauce (Butter, Flour, Cream, Garlic, Parmesan, Parsley). 1) Melt butter in saucepan. 2) Add flour and stir. 3) Add garlic and saute. 4) Add cream and simmer. 5) Add cheese and melt...
Edited recipe: Dairy-Free “Alfredo Sauce” (Olive Oil, Cashews, Nutritional Yeast, Flour, Garlic, Parsley). 1) Soak cashews overnight. 2) Heat olive oil in saucepan. 3) Whisk in flour to make roux. 4) Add roux, cashews, garlic, and yeast to blender...
The user provides a dietary constraint (Dairy-Free) and a base recipe; the system makes ingredient substitutions (+Olive Oil, +Cashews, +Nutritional Yeast), re-writes the recipe directions, and assists users with key ingredient ratios (2 : 1, Cashews : Nutritional Yeast).]

Figure 1: We investigate the task of controllable recipe editing: edit a base recipe to satisfy a given dietary restriction.

ABSTRACT

We introduce SHARE: a System for Hierarchical Assistive Recipe Editing to assist home cooks with dietary restrictions—a population under-served by existing cooking resources. Our hierarchical recipe editor makes necessary substitutions to a recipe’s ingredients list and re-writes the directions to make use of the new ingredients. We introduce the novel RecipePairs dataset of 84K pairs of similar recipes in which one recipe satisfies one of seven dietary constraints, allowing for supervised training of such recipe editing models. Experiments on this dataset demonstrate that our system produces convincing, coherent recipes that are appropriate for a target dietary constraint (contain no prohibited ingredients). We show that this is a challenging task that cannot be adequately solved with human-written ingredient substitution rules or straightforward adaptation of state-of-the-art models for recipe generation. We further demonstrate through human evaluations and real-world cooking trials that recipes edited by our system can be easily followed by home cooks to create delicious and satisfactory dishes.¹

CCS CONCEPTS

• Computing methodologies → Natural language processing; • Applied computing → Consumer health.

KEYWORDS

recipe editing, text generation, health, personalization

¹ We will release code and data upon publication.

1 INTRODUCTION

Cooking has played an integral role in human civilization and evolution for over 1.8 million years [34]. A growing population follows some form of dietary restriction [9], with motivations ranging from socioeconomic to medical [6, 30]. Home cooks browsing through recipes on online recipe aggregators may often encounter an interesting recipe that does not fit their dietary needs (e.g. vegetarianism), and would benefit from a way to edit that recipe to accommodate their needs. These restrictions are easy for users to specify, but existing online recipe websites offer few options for users with dietary constraints—even those with common restrictions like dairy or gluten intolerance (Table 1). We see an opportunity to build an automatic system to tailor recipes to fit the preferences and dietary constraints of users. Such systems for controllable recipe editing can help home cooks of any experience level find diverse ways to satisfy their palates and dietary needs. Controllable recipe editing can also help develop interactive, personalized products including meal kits and ready-to-eat food offerings to accommodate individual dietary restrictions and preferences.

Our System for Hierarchical Assistive Recipe Editing (SHARE) edits an ingredient list to satisfy a user-specified dietary restriction, and then re-writes the instructions to make use of those new ingredients (Figure 1). We show that recipe editing is a challenging, complicated task where ingredient substitutions entail not only nutritional, but also structural and flavor-related impacts on a recipe. This makes rule-based substitution methods and straightforward application of existing recipe generators inadequate. We conduct

arXiv:2105.08185v1 [cs.CL] 17 May 2021

Li, et al. 2021

experiments on a novel dataset of 84K recipe pairs from 459K user-reviewed recipes on a popular recipe website, Food.com. Each pair is associated with a dietary restriction and contains a base recipe and a similar target recipe that satisfies the constraint. We investigate multi-task and end-to-end methods for adapting state-of-the-art recipe generation models for our task, and develop a hierarchical framework for our controllable editing system. We also develop a model for predicting the ratio of important ingredient pairs in an edited recipe. We evaluate edited recipes when compared to existing constraint-satisfying recipes from our corpus via automatic metrics to demonstrate that our approach produces the most diverse and highest-fidelity recipes. We then survey 672 home cooks to assess the quality of our edited recipes compared to human-written recipes, finding that SHARE consistently produced the highest quality edits—in the process discovering several stylistic and structural aspects that cooks evaluate when searching for recipes. Finally, we recruit seven home cooks to cook 21 recipes generated from SHARE and evaluate both the cooking process and final product, showing that such recipes are delicious, easy to make, and appropriate for target dietary restrictions.

We summarize our main contributions as follows:

• We propose a system to assist an under-served population of home cooks by editing recipes to satisfy their dietary constraints;

• We investigate three approaches to train such a system for controllable recipe editing and evaluate each on the novel RecipePairs dataset containing 84K pairs of recipes and ‘versions’ thereof that satisfy some dietary constraint; and

• We find through quantitative evaluations, a human survey, and real-world cooking trials that a hierarchical approach produces coherent, constraint-respecting, and practical recipes that home cooks find appealing.

2 RELATED WORK

We specifically aim to help a population under-served by existing resources: people with dietary restrictions. We are inspired by works in nutritional recipe recommendation [4, 10] that aim to recommend healthier food options. To our knowledge, while prior work has focused on ingredient substitution or generating recipe directions, ours is the first to examine both to create a complete recipe. We are motivated by work on single ingredient substitution [31, 35], but our system accommodates multiple simultaneous substitutions when predicting a target ingredient set. Recent work in recipe generation has focused on improving coherence [2] and ingredient specificity [17]. We draw from these as well as sentence editing work that uses retrieved prototype sequences to control generated text [11, 12], to improve conditioning on input ingredients. In our setting, the user-provided base recipe is analogous to a prototype.

The editing of procedural texts like recipes resembles conditional text generation and paraphrase generation. Transformer [32] language models have seen success in both fields [16, 33], and Raffel et al. [26] demonstrated that such models can make use of a set of control codes for simultaneous multi-task learning. Lee et al. [19] train a single model for two tasks—ingredient extraction and recipe step generation conditioned on the recipe directions and ingredients, respectively. A user must thus provide either the complete set

Table 1: For each dietary constraint: demand and percent of recipes on Food.com accommodating each constraint.

Constraint     % Users   % Recipes
Low-Carb       77.6      16.9
Low-Calorie    69.9      14.3
Low-Fat        57.5       8.6
Low-Sugar      24.3       2.6
Vegetarian     65.0      13.3
Gluten-Free    28.2       2.3
Dairy-Free     23.7       1.5

of ingredients or a full set of directions—neither of which may be known to home cooks looking for ways to adapt a recipe to their individual needs. Majumder et al. [20] explore ways to use browsing histories to personalize recipes for cooking website users. Their model uses these histories, a recipe name, and ingredients as input to generate personalized instructions, but cannot be controlled to generate constraint-satisfying (e.g. gluten-free) recipes.

3 DATASET

Several corpora have been proposed for recipe generation and modeling. The 150K-recipe Now You’re Cooking! dataset has been used to support recipe step generation [2, 17]. Larger-scale datasets include Recipe1M+ [21] for cross-modal retrieval tasks, as well as the Food.com [20] dataset, which includes user activity traces (e.g. reviews and comments) from a widely-used online resource that helps identify users who may have dietary restrictions.

We explore supervised recipe editing: predicting a target recipe’s ingredients and instructions given a base recipe and dietary restriction (category). We extend the Food.com recipe dataset by scraping category tags for each of our 459,071 recipes—comprising ‘soft’ constraints (low-sugar, low-fat, low-carb, low-calorie) and ‘hard/strict’ dietary restrictions (gluten-free, dairy-free, vegetarian). We pair recipes into a base recipe that does not satisfy a category (dietary constraint) and a target version of the base recipe that satisfies the constraint. We cannot exhaustively enumerate all possible recipe variants (e.g. a user may enjoy ‘gazpacho’ as a vegetarian alternative to ‘bacon tomato soup’), so we rely on a name similarity heuristic to pair related recipes: we sample target recipes whose name contains a base recipe name (e.g. ‘healthy oat chocolate cookie’ as a gluten-free version of ‘chocolate cookie’), resulting in 84,814 total pairs in the RecipePairs dataset. We sampled 100 random pairs and verified that recipes were sufficiently similar and that target recipes indeed satisfied the dietary constraint in all cases.
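The name-containment pairing heuristic above can be sketched as follows. This is an illustrative reconstruction, not the paper's released code; the recipe dictionary keys (`name`, `tags`) are our own assumptions.

```python
def is_version_of(base_name: str, target_name: str) -> bool:
    """A target is treated as a version of a base recipe when the base
    name appears inside the target name (case-insensitive)."""
    return base_name.lower() in target_name.lower()

def make_pairs(recipes, constraint):
    """Pair (base, target) where the base lacks the constraint tag and
    the target both carries the tag and names a version of the base."""
    pairs = []
    for base in recipes:
        if constraint in base["tags"]:
            continue  # base recipes must not already satisfy the constraint
        for target in recipes:
            if (target is not base
                    and constraint in target["tags"]
                    and is_version_of(base["name"], target["name"])):
                pairs.append((base, target))
    return pairs
```

In the paper's example, 'healthy oat chocolate cookie' pairs with 'chocolate cookie' because the base name is contained in the target name and only the target carries the gluten-free tag.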

The large ingredient space contributes to the difficulty of the task—unlike previous works in recipe generation that collapse redundant ingredients with coarse heuristics [27], ingredient variations such as bacon ranch dressing and ranch dressing are substantially different with respect to categories: the latter is lower fat and vegetarian. We normalize ingredients by stripping brand names and amounts (e.g. ‘Knoxville Farms’ or ‘1 tablespoon’), and remove rare ingredient variants by preserving only the 2,942 unique ingredients that appear across 10+ recipes in our dataset.
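A minimal sketch of this normalization and frequency filtering, under our own assumptions: the regex below strips only leading amounts and common units (a full brand-name stripper would additionally need a brand lexicon), and the 10-recipe threshold is parameterized as `min_count`.

```python
import re
from collections import Counter

# Illustrative pattern for leading amounts/units such as "1 tablespoon".
AMOUNT_RE = re.compile(
    r"^\s*\d+(\s*/\s*\d+)?\s*"
    r"(tablespoons?|tbsp|teaspoons?|tsp|cups?|oz|ounces?|lbs?|grams?|g)?\s*",
    re.IGNORECASE,
)

def normalize_ingredient(raw: str) -> str:
    """Strip leading amounts/units and lowercase the ingredient name."""
    return AMOUNT_RE.sub("", raw).strip().lower()

def build_vocab(recipes, min_count=10):
    """Keep only ingredients appearing in at least `min_count` recipes
    (counted once per recipe, mirroring 'appear across 10+ recipes')."""
    counts = Counter()
    for ingredients in recipes:
        counts.update({normalize_ingredient(i) for i in ingredients})
    return {ing for ing, c in counts.items() if c >= min_count}
```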


[Figure 2 diagram:
End-to-End (E2E): a single Transformer decoder maps (Constraint; Base Recipe) to Edited Ingredients and Edited Steps.
Multi-Task (MT): a shared Transformer decoder is trained on two tasks: (Constraint; Base Name; Base Steps) to Edited Ingredients, and (Constraint; Base Name; Edited Ingredients) to Edited Steps.
Hierarchical (SHARE): a Set Transformer maps (Constraint; Base Name; Base Ingredients) to ingredient logits and a cardinality prediction, yielding Edited Ingredients; a Transformer encoder-decoder with an ingredient embedding, attention fusion, and a pointer gate then maps the Edited Ingredients to Edited Steps.]

Figure 2: We investigate three approaches for controllable recipe editing to satisfy dietary restrictions. We demonstrate the workflow for each approach, using the example of editing a recipe for sugar cookies to accommodate gluten intolerance.

3.1 Recipes Satisfying Dietary Restrictions

To better understand the audience that would benefit from a recipe editing application, we first investigate the prevalence of recipes in the popular Food.com database and how registered users looking for recipes for dietary constraints are served. For user statistics, we investigate 35,121 users who have interacted with at least four recipes on Food.com. Apart from a few dietary restrictions often associated with weight loss or popular diets, a long tail of dietary constraints are poorly supported by online recipe aggregators like Food.com. While over 95% of returning users choose to cook at least one categorical recipe, only 43% of all recipes satisfy one of our dietary categories—the stark difference between the proportion of users looking to follow a dietary constraint and the proportion of recipes accommodating such constraints (hereafter ‘categorical’ recipes) is seen in Table 1. In particular, there is a dearth of low-sugar, dairy-free, and gluten-free recipes: collectively they make up less than 10% of available recipes despite over 20% of users looking for such recipes. Even as up to 7% of the American population suffers from some form of gluten allergy or intolerance and up to 20% opt for a gluten-free diet [14], less than 2.5% of recipes on Food.com are labeled as gluten-free. The prognosis for people with multiple dietary restrictions is even more dire: if you have allergies to both dairy and gluten, less than 0.5% of recipes on Food.com are safe to cook. Users looking for recipes that fit their requirements may thus be discouraged from using popular recipe aggregators.
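The coverage statistics in this section reduce to counting tag sets. A small sketch (our own helper names, not the paper's code) of computing the fraction of recipes satisfying a single constraint or a combination of constraints:

```python
from collections import Counter

def coverage(recipe_tags, constraints):
    """Fraction of recipes tagged with *every* constraint in `constraints`,
    e.g. both 'dairy-free' and 'gluten-free' for multi-restriction users."""
    hits = sum(1 for tags in recipe_tags if set(constraints) <= set(tags))
    return hits / len(recipe_tags)

def per_constraint_coverage(recipe_tags):
    """Fraction of recipes carrying each individual constraint tag."""
    counts = Counter(tag for tags in recipe_tags for tag in set(tags))
    return {tag: c / len(recipe_tags) for tag, c in counts.items()}
```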

4 TASK

To help home cooks discover delicious and appropriate recipes, we desire a system for controllable recipe editing. A user should be able to specify the base recipe they would like to edit as well as the constraint to satisfy. Our model should output a similar but novel recipe that satisfies the provided constraint. In particular, to perform editing on the RecipePairs corpus, we consider a base recipe R𝑏 = {𝑁𝑏; 𝐼𝑏; 𝑆𝑏} comprising name 𝑁𝑏 = {𝑠𝑏⁰, 𝑠𝑏¹, . . . , 𝑠𝑏ⁿ | 𝑠 ∈ W}, ingredients 𝐼𝑏 = {𝑖𝑏⁰, 𝑖𝑏¹, . . . , 𝑖𝑏ᵏ} ∈ I, and directions (steps) 𝑆𝑏 = {𝑠𝑏⁰, 𝑠𝑏¹, . . . , 𝑠𝑏ⁿ | 𝑠 ∈ W}, where W is our vocabulary of natural language tokens. This recipe does not satisfy a dietary constraint 𝑐 ∈ C (e.g. low-fat, dairy-free). The goal of our model is to transform the base recipe into some target recipe R𝑡 = {𝑁𝑡; 𝐼𝑡; 𝑆𝑡} that satisfies 𝑐.
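The task interface above can be summarized with simple data structures. This is a sketch under our own naming (the paper defines recipes abstractly as R = {N; I; S}); the class and field names are illustrative.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Recipe:
    name: List[str]         # N: name tokens drawn from vocabulary W
    ingredients: List[str]  # I: ingredients from ingredient vocabulary I
    steps: List[str]        # S: direction tokens drawn from vocabulary W

@dataclass
class EditExample:
    constraint: str  # c in C, e.g. "dairy-free" or "low-fat"
    base: Recipe     # R_b: does not satisfy c
    target: Recipe   # R_t: a similar recipe that satisfies c
```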

5 APPROACH

We pose controllable recipe editing as a supervised learning task: predict a target recipe conditioned on a base recipe and a provided constraint (e.g. a dietary category). As such a task has not been attempted to our knowledge, we explore three frameworks for controllable recipe editing (Figure 2). First, we explore ways to adapt existing state-of-the-art (SOTA) recipe step generation models for our editing task. In Section 5.1 we train one multi-task model to predict ingredient and step sequences in separate steps. In Section 5.2 we explore an end-to-end framework for generating a modified set of ingredients and recipe directions in a single pass, leveraging pre-trained language models that have been shown to be capable of generating coherent long documents [25]. We then investigate a novel hierarchical approach to train independent models for ingredient editing and step prediction (Section 5.3).

As ingredient quantities in a recipe may vary based on the desired serving size and personal seasoning preferences, we present edited ingredient lists without exact amounts listed. Indeed, in Section 7.2 we show that when asked to cook recipes edited by our system, home cooks find it relatively easy to decide the amount of each ingredient to use. In Section 7.3 we train an additional model to assist users by predicting the ratio of any pair of ingredients in a generated recipe (e.g. a 5:3 ratio of flour and water for bread).

To ensure that recipes produced by our systems are 100% safe for home cooks with allergies and dietary constraints, we further filter edited ingredient lists and incorporate a blacklist module that prevents inappropriate ingredient references when using our system in the real world. We discuss the instrumentation and impact of our filtering and blacklisting modules in Section 6.1.
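A minimal sketch of such a blacklist filter, assuming a per-constraint table of prohibited ingredients (the entries below are toy examples, not the system's actual rule set):

```python
# Toy per-constraint blacklist; the real module would use a curated table.
BLACKLIST = {
    "dairy-free": {"butter", "cream", "milk", "cheese", "parmesan"},
    "gluten-free": {"flour", "breadcrumbs", "pasta"},
}

def filter_ingredients(ingredients, constraint):
    """Drop any edited ingredient that violates the target constraint."""
    banned = BLACKLIST.get(constraint, set())
    return [i for i in ingredients if i not in banned]

def violates(ingredients, constraint):
    """True if the edited list still references a prohibited ingredient."""
    banned = BLACKLIST.get(constraint, set())
    return any(i in banned for i in ingredients)
```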

Why isn’t substitution enough? At first glance, it may seem acceptable to simply replace problematic ingredients with fixed substitutes. To test this, we compiled 128 category-specific food substitution rules from medical and dietary websites² into a baseline model (Rule) that replaces all ingredients in a base recipe that violate the constraint (e.g. for dairy-free, ‘butter’ → ‘margarine’). In the recipe instructions, we replace all references to removed ingredients with the substitutes (e.g. ‘beat sugar and butter’ → ‘beat sugar and margarine’). This baseline struggles to predict target ingredients

² e.g. https://www.mayoclinic.org/


(Table 2), and cannot account for subtle changes (e.g. cooking techniques) necessary to accommodate new ingredients. Indeed, while nutritional impacts of ingredient substitutions are easily inferred, they have more intricate effects on recipe structure and palatability. This suggests that recipe editing is a challenging task that cannot be solved by simple rule-based systems.
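The Rule baseline amounts to string replacement in both the ingredient list and the step text. A sketch with example rules (the two rules shown are in the spirit of the 128 compiled rules, not the full table):

```python
# Example category-specific substitution rules (illustrative subset).
RULES = {
    "dairy-free": {"butter": "margarine", "milk": "soy milk"},
    "gluten-free": {"flour": "rice flour"},
}

def rule_edit(ingredients, steps, constraint):
    """Replace violating ingredients and rewrite references in steps."""
    subs = RULES.get(constraint, {})
    new_ingredients = [subs.get(i, i) for i in ingredients]
    new_steps = []
    for step in steps:
        for old, new in subs.items():
            step = step.replace(old, new)
        new_steps.append(step)
    return new_ingredients, new_steps
```

As the paper notes, this cannot adjust cooking techniques (e.g. the soaking and blending steps needed for a cashew-based sauce), which is why it struggles on insertion metrics.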

5.1 Multi-Task Model for Editing

Many works in recipe generation focus on generating directions given a known recipe name and ingredients [2, 17, 20], and cannot be easily adapted to generate new ingredients in addition to writing recipe steps. Lee et al. [19] trained a multi-task model (RecipeGPT) on Recipe1M+ [21] to extract ingredients from recipe directions and write recipe directions given ingredients. As RecipeGPT takes as input a sequence of text tokens, we investigate whether a similar approach can accommodate reasonable conditional recipe editing. Instead of ingredient extraction, we ask this model (MT) to both predict target ingredients from the category, base recipe name, and base steps ([𝑐; 𝑁𝑏; 𝑆𝑏]), as well as predict target steps from the category, base recipe name and target recipe ingredients ([𝑐; 𝑁𝑏; 𝐼𝑡]).

Byte-pair encoded [29] input text sequences are processed by a 12-layer, 768-hidden-dimension Transformer where each layer applies 12-headed attention with causal masking. The final hidden states are projected into our 50,257-element vocabulary space. Since our input format corresponds to the RecipeGPT pre-training format, we fine-tune an existing RecipeGPT checkpoint on our task, in the hope that we can more easily leverage the common-sense recipe knowledge gained through the pre-training process [23].
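The two MT conditioning formats can be sketched as prompt builders. The separator tokens below (`<cat>`, `<name>`, etc.) are our own placeholders; RecipeGPT's actual special tokens may differ.

```python
def mt_ingredient_input(category, base_name, base_steps):
    """[c; N_b; S_b] -> prompt for predicting target ingredients."""
    return f"<cat> {category} <name> {base_name} <steps> {base_steps} <ingr>"

def mt_step_input(category, base_name, target_ingredients):
    """[c; N_b; I_t] -> prompt for predicting target steps."""
    ingr = " ; ".join(target_ingredients)
    return f"<cat> {category} <name> {base_name} <ingr> {ingr} <steps>"
```

At training time, the model's target continuation after `<ingr>` (resp. `<steps>`) would be the gold ingredient list (resp. gold directions).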

5.2 End-to-End Recipe Editing

We find that direct fine-tuning of RecipeGPT via MT is prone to generating ingredient lists and references that violate target strict dietary constraints, which raises concerns about the safety of such a model for real-world use. We thus look toward simplifying the model training framework in the hopes of generating more appropriate ingredients and steps. Much recent work in NLP has focused on large Transformer [32] language models pre-trained on a wide variety of English documents [7, 24]. Such models have been widely used to generate long documents [25], and Keskar et al. [16] demonstrate that large language models are capable of some level of controllable generation via control codes and a short prompt at the beginning of a sequence. We apply this principle by priming our model with a dietary constraint control code and the base recipe name as a prompt [𝑐; 𝑁𝑏] (e.g. ‘low_fat cheesecake’)—asking the language model to generate a recipe similar to the base but satisfying the restriction. The model generates the edited recipe as a single sequence with the predicted ingredients concatenated to the recipe steps.

We thus adapt RecipeGPT for end-to-end training (E2E), with a single training sequence comprising the concatenation of target category, base recipe name, target ingredients, and target directions of length 𝑛: 𝑥 = [𝑐; 𝑁𝑏; 𝐼𝑡; 𝑆𝑡]. This allows us to treat editing as a conditional language modeling task to maximize the factorized joint probability of the full sequence: 𝑃(𝑥) = ∏ᵢⁿ 𝑃(𝑥ᵢ | 𝑥<ᵢ). We find that this framework produces shorter but more diverse recipes when compared to MT, with 22% fewer recipes violating the target dietary constraint. E2E is also twice as efficient as MT to train, only requiring a single pass to predict ingredients and steps.
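The factorized objective can be illustrated concretely. In the toy sketch below, a hand-written bigram table stands in for the Transformer decoder's next-token distribution; the point is only that the sequence log-probability is the sum of per-token conditional log-probabilities.

```python
import math

# Toy stand-in for P(x_i | x_{<i}); a real model conditions on the full prefix.
BIGRAM_PROBS = {("<cat>", "low_fat"): 0.5, ("low_fat", "cheesecake"): 0.25}

def sequence_log_prob(tokens, probs=BIGRAM_PROBS, floor=1e-6):
    """log P(x) = sum_i log P(x_i | x_{<i}), with a floor for unseen pairs."""
    return sum(math.log(probs.get(pair, floor))
               for pair in zip(tokens, tokens[1:]))
```

Training minimizes the negative of this quantity (cross-entropy) over sequences 𝑥 = [𝑐; 𝑁𝑏; 𝐼𝑡; 𝑆𝑡].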

Prototype Editing Model. While our E2E model enjoys great flexibility—the user needs only provide a brief description by way of a base recipe name—we also implement a form of prototype editing in our end-to-end model, by treating the full base recipe as a prototype which needs to be edited to accommodate a constraint (E2E-P). While methods for sentence editing use a latent edit vector in addition to a prototype sentence [11], we condition our edits on a human-readable attribute (e.g. ‘low_fat’). Here, we prime our model with the fully specified base recipe ([𝑐; 𝑁𝑏; 𝐼𝑏; 𝑆𝑏]) in addition to the category to predict ingredients and steps of an edited recipe. At each time step, our prototype editing model can thus attend over all tokens in the base recipe and make reasonable edits. This model uses the same architecture and initialization as E2E, and trades off diversity in exchange for producing more realistic recipes.

5.3 SHARE: a Hierarchical Edit Model

While the aforementioned approaches toward recipe editing rely on a single model to generate new ingredients and steps, this task seems intuitively sequential: first predict ingredients, and then create a recipe from the known ingredients. Breaking recipe editing into two parts also allows us to consider each task in a different modality, taking advantage of our limited (~2K) ingredient vocabulary I in comparison with the ~50K total vocabulary tokens W.

The first sub-task, ingredient prediction, is posed as a multi-label binary classification task, learning the probability that each ingredient appears in the target recipe: 𝑃(𝑖 ∈ 𝐼𝑡 | R𝑏, 𝑐) ∀𝑖 ∈ I. We treat the second sub-task—step prediction—as generative language modeling, seeking to maximize the conditional likelihood of generating the target steps 𝑆𝑡 using a cross-entropy loss:

𝑃(𝑆𝑡 | 𝐼𝑡) = ∏ₖ₌₀ᵐ 𝑃(𝑠𝑡ᵏ | 𝑠𝑡ᵏ⁻¹ . . . 𝑠𝑡¹, 𝐼𝑡).

Our ingredient classifier takes as input the base recipe name 𝑁𝑏, the sequence of base ingredient IDs 𝐼𝑏, and category 𝑐 to predict a new set of ingredient IDs 𝐼𝑝 that satisfies the dietary constraint. We use a 6-layer, 128-dimensional Set Transformer [27], where the final hidden states are projected to an |I|-dimensional vector representing logits for each ingredient. While a standard Transformer predicts these ingredients in an auto-regressive manner, we instead max-pool outputs across each time step into a single logit for each ingredient. When training the model, we minimize the binary cross-entropy (BCE) between ground-truth ingredients and our predictions (after the max-pool operation), ignoring the EOS terminal token. We use the EOS token instead to determine the number of ingredients 𝑘 to predict: at training, we compute an auxiliary BCE loss between EOS logits at each position and the ground truth (1 at the 𝑘-th position and 0 elsewhere). At inference time, we return the top 𝑘 ingredients from our max-pooled outputs, where 𝑘 is the first position with a positive EOS prediction (using Bernoulli sampling with the EOS logit at each position).
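The inference procedure can be sketched as follows. This is an illustrative, greedy variant of the paper's scheme: we threshold the EOS logit at zero rather than Bernoulli-sampling it, and the per-position logits are toy values standing in for Set Transformer outputs.

```python
def predict_ingredients(position_logits, eos_logits, vocab):
    """Max-pool per-ingredient logits over positions, then return the
    top-k ingredients, where k is the first position whose EOS logit
    is positive (greedy stand-in for Bernoulli sampling)."""
    # Pool: one logit per ingredient, the max over all positions.
    pooled = [max(col) for col in zip(*position_logits)]
    # Cardinality: first position predicting EOS.
    k = next((i for i, e in enumerate(eos_logits) if e > 0), len(eos_logits))
    ranked = sorted(range(len(vocab)), key=lambda j: pooled[j], reverse=True)
    return [vocab[j] for j in ranked[:k]]
```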

Our hierarchical step prediction model (SHARE) then writes a recipe conditioned only on the predicted ingredient names. Here, we once again use a Transformer encoder-decoder architecture to generate the steps in a single sequence. As the step predictor


Table 2: Fidelity of edited ingredient lists when compared to target recipe ingredients, in terms of overall IoU and F1 score, as well as insertion and deletion F1 and precision (%).

                     Insertion            Deletion
Model   IoU   F1     F1     Precision     F1     Precision
Rule    22.1  33.9    1.2     1.9         18.1   41.3
MT      31.6  45.8   25.6    28.9         69.5   82.4
E2E     29.5  43.2   21.1    26.8         69.5   79.9
E2E-P   30.6  44.7   26.2    29.5         73.2   81.0
SHARE   33.0  47.5   26.7    35.2         66.1   83.2

in our hierarchical framework only takes ingredients as input, it is category-agnostic. We can thus train our hierarchical step predictor on all 455K recipes from our dataset (excluding evaluation pairs). At training time, we compute cross-entropy loss over the predicted recipe directions against the tokens of the target recipe steps, and minimize the negative log likelihood of the predicted sequence via the factorized joint distribution:

𝑃(𝑆𝑡 | 𝐼′𝑝) = ∏ⱼⁿ 𝑃(𝑠𝑡ʲ | 𝑠𝑡<ʲ, 𝐼′𝑝).

Encoder-decoders can struggle with recipe coherence and fail to properly reference and make use of input ingredients [17]. We aim to improve this ingredient-step coherence via an attention fusion layer which attends over ingredient embeddings using the decoder final hidden state as a query. To further bias our model toward stronger ingredient conditioning, we apply Pointer Attention [28] to ingredients at each time-step by computing a weighted sum of fusion attention weights and the decoder hidden state. We then pass this sum through a sigmoid non-linearity to give 𝜌gen, the probability of drawing the token from the output vocabulary distribution, with a corresponding 𝜌copy = 1 − 𝜌gen likelihood of copying a token from the input ingredients.
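The gate computation can be sketched with scalars. In practice the fusion context and decoder state are vectors combined by learned weight matrices; the scalar weights below are toy stand-ins.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def copy_gate(fusion_context, decoder_state, w_f=1.0, w_d=1.0, bias=0.0):
    """Combine the attention-fusion context and decoder hidden state into
    a scalar, then squash with a sigmoid to get rho_gen; the complementary
    rho_copy = 1 - rho_gen is the probability of copying an ingredient token."""
    rho_gen = sigmoid(w_f * fusion_context + w_d * decoder_state + bias)
    return rho_gen, 1.0 - rho_gen
```

The final output distribution at each step is then a mixture: 𝜌gen times the vocabulary softmax plus 𝜌copy times the attention distribution over input ingredient tokens.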

SHARE is much smaller than the Transformer decoders used for E2E and MT: we use an 8-layer, 256-dimensional encoder-decoder with a 64-dimensional factorized token embedding [18] for step prediction. SHARE thus has a total of 8.3M learnable parameters, compared with 117M (~14x larger) for our E2E, E2E-P, and MT models. SHARE is also unable to take advantage of weight initialization from pre-trained language models with a demonstrated ability to generate long documents.

However, its smaller size confers some additional benefits: it takes on average 3.9s to generate a recipe from E2E-P and MT, while it takes on average only 0.9s (4.3x faster) for SHARE. Thus, a hierarchical approach to controllable recipe editing accommodates the training of lightweight models for each sub-task that can serve users with acceptable latency.

6 RESULTS

We split our paired recipe dataset into disjoint training, validation, and test sets with 82,707, 1,096, and 1,011 pairs respectively. We train our models using the LAMB optimizer [36], fine-tuning large (E2E, E2E-P, MT) models for up to 10 epochs and training

Table 3: Percent of edited ingredient lists containing items violating target hard dietary restrictions (before filtering).

Model   Dairy-Free   Gluten-Free   Vegetarian   Overall
Rule       15.07        14.47         17.07      16.10
MT         20.55        32.89         10.24      17.23
E2E        39.73        46.05         21.46      30.51
E2E-P      10.96        34.21          4.88      12.43
SHARE       4.11         9.21          0.00       2.82

the SHARE model for up to 100 epochs. We generate recipe steps via greedy sampling. In this section, we use automated metrics to compare the performance of each model: the substitution rule baseline (Rule); methods to adapt SOTA recipe generation models as multi-task editors (MT), end-to-end generators (E2E), and end-to-end prototype editors (E2E-P); and our hierarchical recipe editor primed with ingredients from our ingredient classifier (SHARE). We demonstrate through two human studies that our hierarchical approach produces the highest-quality recipes (Section 7.1) that can be easily followed to produce tasty food (Section 7.2).

To analyze task difficulty, we compare against a human-written baseline: our substitution rule (Rule) model which replaces prohibited ingredients according to human-crafted rules. As noted in Section 5, this model is limited and cannot accommodate procedural changes needed to adapt a recipe to many substitutions. Among our MT, E2E, E2E-P, and SHARE systems, we find that our hierarchical approach—training separate models for ingredient and step editing—not only generates the longest and most diverse recipes but is best able to fit the distribution of existing categorical recipes and best satisfy target dietary restrictions with its edits.

We assess recipe fidelity in terms of similarity of predicted ingredients and steps to a known ‘gold’ recipe. For ingredients, we calculate the Jaccard similarity (IoU) and F1 scores [27], as well as precision and recall of ingredient insertions and deletions. While ingredient substitution is a challenging task, all of our approaches significantly out-perform the Rule baseline (Table 2). We also find that correctly adding ingredients is significantly more challenging than removing inappropriate ingredients. We find that a separately trained ingredient classifier (SHARE) out-performs end-to-end and multi-task approaches in ingredient substitution.
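These set-overlap metrics are straightforward to compute over ingredient sets. A sketch under one reasonable reading of the insertion/deletion metrics (insertions and deletions are measured relative to the base recipe; the exact definitions in the paper may differ in detail):

```python
def iou(pred, gold):
    """Jaccard similarity between predicted and gold ingredient sets."""
    pred, gold = set(pred), set(gold)
    return len(pred & gold) / len(pred | gold)

def f1(pred, gold):
    """Set F1 = 2*TP / (|pred| + |gold|)."""
    pred, gold = set(pred), set(gold)
    return 2 * len(pred & gold) / (len(pred) + len(gold))

def insertion_precision(pred, gold, base):
    """Of the ingredients the model added relative to the base recipe,
    the fraction also added in the gold target."""
    pred_ins = set(pred) - set(base)
    gold_ins = set(gold) - set(base)
    return len(pred_ins & gold_ins) / len(pred_ins) if pred_ins else 0.0

def deletion_precision(pred, gold, base):
    """Of the ingredients the model removed from the base recipe,
    the fraction also removed in the gold target."""
    pred_del = set(base) - set(pred)
    gold_del = set(base) - set(gold)
    return len(pred_del & gold_del) / len(pred_del) if pred_del else 0.0
```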

For step fidelity, we eschew n-gram based metrics as they correlate poorly with human judgement of structured and creative texts [1]. Instead, we directly compare the workflow of two recipes via Normalized Tree Edit Distance (NTED) between structured ingredient-cooking-action tree representations [3]. These trees have ingredients as root nodes, with intermediate nodes drawn from a curated set of 259 lemmatised cooking action verbs. Here, we also observe that our hierarchical model outperforms alternatives, registering the lowest average NTED between generated recipes and the target. This reflects how our hierarchical step generator is trained to generate recipes conditioned solely on input ingredients, and suggests that it best learns how human cooks make use of a given set of ingredients. This suggests that while large language models—such as those used in the E2E, E2E-P, and MT models—may learn common

Li, et al. 2021

Table 4: Automatic metrics for edited recipe directions.

Model Words NTED D-1 (%) D-2 (%)

Rule 114.63 0.623 3.47 24.48

MT 80.11 0.617 1.45 7.31
+Filter 80.08 0.617 1.45 7.31
+Blacklist 80.06 0.616 1.47 7.38

E2E 63.23 0.623 2.06 10.61
+Filter 63.21 0.623 2.06 10.61
+Blacklist 63.38 0.623 2.07 10.71

E2E-P 73.76 0.621 1.61 7.87
+Filter 73.76 0.621 1.61 7.87
+Blacklist 73.66 0.620 1.61 7.90

SHARE 82.15 0.611 2.05 13.04
+Filter 81.87 0.611 1.99 13.01
+Blacklist 82.98 0.614 2.01 13.20

cooking phrases, they are unable to assemble such instructions in a way similar to human cooks.
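The tree comparison can be sketched as below. This is a simplified top-down alignment between ordered (label, children) trees, which upper-bounds the exact tree edit distance computed by RecipeScape-style tooling; the normalization by the larger tree's size and the toy action trees are assumptions for illustration:

```python
def size(t):
    """Number of nodes in a (label, children) tree."""
    return 1 + sum(size(c) for c in t[1])

def tdist(a, b):
    """Constrained edit distance between ordered trees: relabeling a node
    costs 1; deleting or inserting a whole subtree costs its size."""
    d = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    n, m = len(ca), len(cb)
    # DP alignment of the two child sequences
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + size(ca[i - 1])
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + size(cb[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + size(ca[i - 1]),   # delete subtree
                           dp[i][j - 1] + size(cb[j - 1]),   # insert subtree
                           dp[i - 1][j - 1] + tdist(ca[i - 1], cb[j - 1]))
    return d + dp[n][m]

def nted(a, b):
    """Tree edit distance normalized by the larger tree's node count."""
    return tdist(a, b) / max(size(a), size(b))

# Toy action trees: one ingredient substitution out of three nodes
alfredo = ("saute", [("butter", []), ("garlic", [])])
vegan = ("saute", [("olive oil", []), ("garlic", [])])
d = nted(alfredo, vegan)
```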

We note that the hierarchical ingredient-to-step generation format does not require a paired corpus, allowing us to train the SHARE model on all 400K+ recipes found on Food.com. As a consequence, the hierarchical framework may learn that many dishes can be cooked from the same set of ingredients, and thus generates recipes with significantly higher bigram diversity (percentage of unique bigrams in generations: D-2) compared to end-to-end and multi-task approaches to recipe editing. This reflects how there can be many satisfactory ways of editing a recipe to fit a dietary constraint, and addresses a flaw we observed with existing recipe aggregators: a lack of diverse categorical recipes.
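The D-1/D-2 diversity numbers in Table 4 can be computed as a corpus-level percentage of unique n-grams; a minimal sketch, where whitespace tokenization is a simplifying assumption:

```python
def distinct_n(texts, n=2):
    """Percentage of unique n-grams across a set of generated recipes
    (D-1 for unigrams, D-2 for bigrams)."""
    total, unique = 0, set()
    for text in texts:
        tokens = text.lower().split()
        grams = list(zip(*(tokens[i:] for i in range(n))))
        total += len(grams)
        unique.update(grams)
    return 100.0 * len(unique) / total if total else 0.0

# 'melt the' repeats, so 3 of the 4 bigrams are unique
d2 = distinct_n(["melt the butter", "melt the cheese"], n=2)
```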

6.1 Satisfying dietary constraints

Previous work in recipe generation has focused solely on writing recipe steps; we aim instead to create complete recipes satisfying a user’s dietary constraints. As some human-written recipes satisfying a soft constraint (e.g. low-sugar) nonetheless make judicious usage of prohibited ingredients (e.g. corn syrup), we focus our analysis on hard/strict dietary constraints: dairy-free, gluten-free, and vegetarian recipes. Violating such constraints can often result in severe health risks (e.g. gluten and lactose allergies). We compiled lists of 100-300 ingredients that are unacceptable for each strict dietary constraint, following the guidance of medical and dietary websites.3 We track category violations in terms of the percentage of edited recipes that reference prohibited ingredients.
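Violation tracking reduces to a membership check against a prohibited-ingredient list. The sketch below uses toy, heavily abridged lists (our compiled lists contain 100-300 entries per constraint) and simple substring matching as a stand-in for ingredient normalization:

```python
# Toy, abridged prohibited-ingredient lists; illustrative only.
PROHIBITED = {
    "dairy-free": {"butter", "cream", "milk", "parmesan", "yogurt"},
    "gluten-free": {"flour", "breadcrumb", "soy sauce"},
}

def violation_rate(edited_recipes, constraint):
    """Fraction of edited ingredient lists mentioning a prohibited
    ingredient. Substring matching means 'light cream' triggers the
    'cream' entry."""
    banned = PROHIBITED[constraint]
    def violates(ingredients):
        return any(b in ing for ing in ingredients for b in banned)
    return sum(violates(r) for r in edited_recipes) / len(edited_recipes)

edited = [["olive oil", "cashews", "nutritional yeast"],
          ["light cream", "flour"]]
rate = violation_rate(edited, "dairy-free")
```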

The Rule baseline violates the target strict dietary constraint in over 11% of cases, reflecting the impracticality of defining an exhaustive list of substitutions for the wide range of problematic ingredients used by home cooks. Our hierarchical ingredient classifier makes significantly more appropriate edits to ingredient lists when compared to alternatives (Table 3), with less than 3% of edited recipes containing a prohibited ingredient in their ingredients lists. However, we note that we pose the step generation problem as

3 e.g. https://www.mayoclinic.org/, https://www.ncbi.nlm.nih.gov/books/NBK310258/

Table 5: Percent of edited recipe directions referencing ingredients that violate target hard dietary restrictions.

Model Dairy-Free Gluten-Free Vegetarian Overall

Rule 20.55 3.95 11.71 11.86

MT 6.85 22.37 0.49 6.50
+Filter 6.85 22.37 0.49 6.50
+Blacklist 0.00 0.00 0.00 0.00

E2E 13.70 10.53 0.00 5.08
+Filter 13.70 10.53 0.00 5.08
+Blacklist 0.00 0.00 0.00 0.00

E2E-P 6.85 14.47 0.98 5.08
+Filter 6.85 14.47 0.98 5.08
+Blacklist 0.00 0.00 0.00 0.00

SHARE 6.85 14.47 0.00 4.52
+Filter 4.11 11.84 0.00 3.39
+Blacklist 0.00 0.00 0.00 0.00

a conditional language modeling task: our MT, E2E, E2E-P, and SHARE approaches generate directions as unstructured text, which makes it possible for recipe steps to reference ingredients not listed. As such, we identify ingredient references in generated steps, and find a 4.5-6.5% violation rate (Table 5).

Even if our hierarchical model can generate appropriate recipes >95% of the time, a 4.5% chance of generating a potentially dangerous recipe edit remains unacceptable for production use. We address this by 1) filtering out prohibited ingredients after the ingredients list is edited; and 2) blacklisting prohibited ingredients during step generation by setting their likelihoods to zero (Figure 1). We find that while filtering has a relatively minor effect on ingredient references in the final generated steps, further adding the blacklist during generation allows us to remove all problematic ingredient references when editing recipes for strict dietary constraints.
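The blacklisting step can be sketched as hard masking of prohibited token logits before the greedy argmax. This toy version assumes a single token per ingredient; in practice ingredient names span multiple subword tokens, which requires banning sequences rather than individual ids:

```python
import math

def blacklist_decode_step(logits, banned_token_ids):
    """One greedy decoding step with a hard blacklist: banned tokens get
    logit -inf, hence probability exactly zero, before the argmax."""
    masked = [l if i not in banned_token_ids else float("-inf")
              for i, l in enumerate(logits)]
    peak = max(masked)                              # for numerical stability
    unnorm = [math.exp(l - peak) for l in masked]   # math.exp(-inf) == 0.0
    z = sum(unnorm)
    probs = [u / z for u in unnorm]
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Vocabulary of 3 tokens; token 1 (say, 'butter') is prohibited
tok, probs = blacklist_decode_step([0.1, 2.0, 0.5], {1})
```

Because the mask is applied before normalization, the banned token can never be emitted, regardless of how confident the model was in it.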

7 HUMAN STUDIES AND DISCUSSION

Aggregate measures of generation fidelity and quality are good comparative measures between recipe editing approaches, but they cannot tell us whether a generated recipe is truly feasible. Text generation for blog posts or stories can be relatively forgiving—a moment of incoherence or repetition can be explained as artistic license or style. But procedural texts like recipes indicate a precise series of actions that must be performed with the ingredients at hand to construct a dish. While there have been some attempts to judge recipe quality (e.g. in terms of the order of generated steps or number of explicit ingredient references) [2, 17], there currently exist no automatic metrics for whether a recipe can be successfully followed. In Section 7.1 we conduct a crowd-sourced human review of recipes generated by each of our models, and in Section 7.2 we conduct a cooking study with seven cooks—attempting to follow edited instructions and ingredient lists exactly—to assess the real-world viability of our system.

SHARE: a System for Hierarchical Assistive Recipe Editing

Table 6: Issues in edited recipes identified by crowd-sourced human evaluators (%).

Issue Gold Rule MT E2E E2E-P SHARE

Constraint violated 10.7 15.2 14.3 24.1 8.9 8.0
Unlisted ingredient 7.1 22.3 14.3 12.5 14.3 7.1
Inconsistent naming 4.2 2.1 0.0 2.1 10.4 6.3
Lacking detail (steps) 4.2 14.6 6.3 8.3 8.3 6.3
Lacking detail (ingr.) 2.1 0.0 4.2 4.2 6.3 2.1

7.1 Human Evaluation

Having shown that recipe editing is complex and requires expertise, we recruited crowd-workers to evaluate edited recipes. We randomly sampled eight recipes for each dietary constraint, and presented the ground truth target recipe as well as versions edited by each of our models to judges (for a total of 336 recipes); each recipe was reviewed by two different judges to ensure quality. Each judge was asked whether a recipe violated the target dietary constraint (including both hard and soft constraints) and was allowed to provide free-form written feedback about the recipe. Judges were fairly experienced home cooks who were familiar with our dietary constraints: on average, they reported cooking 6.3 meals per week, with 68.3% of them ‘sometimes’, ‘often’, or ‘always’ eating food satisfying the target dietary constraint.

We report strong inter-annotator agreement for constraint violation, with a Kendall’s Tau [15] value of 0.762. We observe a statistically significant difference in the constraint violation rate between models (𝑝 = 0.013) via a 𝜒2 test, with our end-to-end prototype editing and hierarchical editing models generating the most appropriate outputs for any dietary constraint—hard or soft. While some evaluators did mark constraint violations for strict dietary constraints, further inspection revealed these were due to common misconceptions and in fact resulted in safe recipes (e.g. eggs and mayonnaise do not trigger dairy- and lactose-allergies). Soft constraint satisfaction remains subjective: different people may follow soft constraints in different ways—a low-carb diet may allow all types of carbs in moderation, or it may disallow specific types of carbs (e.g. complex starches). We thus observe that even some ground truth recipes are ruled to violate a target soft constraint.
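For reference, Kendall's Tau over paired annotator judgments can be computed as below. This sketch implements the tau-a variant, where tied pairs contribute zero; a tie-corrected variant (tau-b) may have been used for the reported 0.762, so the annotator labels here are purely illustrative:

```python
def kendall_tau(x, y):
    """Kendall's tau-a between two annotators' item-level judgments:
    (concordant - discordant pairs) / total number of pairs."""
    n = len(x)
    conc = disc = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                conc += 1      # pair ordered the same way by both
            elif s < 0:
                disc += 1      # pair ordered oppositely
    return (conc - disc) / (n * (n - 1) / 2)

# Two annotators' binary constraint-violation judgments on six recipes
a1 = [1, 0, 0, 1, 0, 1]
a2 = [1, 0, 0, 1, 0, 0]
tau = kendall_tau(a1, a2)
```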

Judges additionally gave free-form written feedback for 77 (11%) of the recipe reviews. We perform qualitative thematic coding [8] of these responses to identify common themes (Table 6). In 13% of recipe directions, judges noticed references to ingredients that were not mentioned in the ingredients list (‘Unlisted ingredient’). In ground truth recipes, this typically consists of optional flavorings (e.g. salt, confectioner’s sugar) and garnishes (e.g. fruits, herbs), while our large language-model-based models (MT, E2E, E2E-P) occasionally reference unlisted, substantive ingredients (vegetables and starches). Recipes edited by our hierarchical approach were consistently deemed the highest quality in terms of constraint appropriateness and references to unlisted ingredients.

The remainder of feedback reflected how stylistic preferences affected perceptions of recipe quality. Several judges felt that it was confusing when the same ingredient was referred to by a synonym or more general phrase (‘Inconsistent naming’: e.g. ‘cheddar’ on the

Table 7: An example of controllable recipe editing. We ask our SHARE model to create a low-calorie version of Chicken Dijon (base recipe on left); added and removed ingredients are labeled for the edited version.

Base (Chicken Dijon)
Ingredients: Butter, Chicken, Chicken broth, Dijon mustard, Flour, Light cream, Onion, Pepper, Salt, Wine
1) Sprinkle chicken breasts with onion powder, lemon pepper, salt.
2) Saute the chicken in the butter for 20 min, or until cooked through.
3) Remove chicken to a platter and keep warm.
4) Measure pan juices and enough chicken broth for 1 cup liquid.
5) Return the stock mixture to the pan and add the wine.
6) Stir together the light cream and the flour till smooth.
7) Add to the broth, cook and stir till smooth, thickened and bubbly.
8) Stir in the dijon mustard, stir till smooth.
9) Return the chicken to the pan, or serve the sauce separately.

SHARE (Low-Calorie)
Ingredients: Chicken, Dijon mustard, Flour, Pepper, Salt; added: Carrot, Garlic, Mushroom, Soy sauce; removed: Butter, Chicken broth, Light cream, Onion, Wine
1) Preheat oven to 350 degrees F.
2) Place chicken breasts, breast side up, in a shallow roasting pan.
3) Combine soy sauce, mustard, garlic and salt.
4) Brush over both sides of the chicken breasts.
5) Bake for 40 minutes.
6) Meanwhile, heat a small saucepan over medium heat.
7) Stir in carrots, mushrooms and pepper.
8) Cook and stir for 5 minutes.
9) Stir in flour.
10) Pour over chicken breasts, and bake for an additional 10 minutes until chicken is cooked through.
11) Serve chicken with carrots and sauce.

ingredients list referred to in directions as ‘cheese’, or references to both ‘cilantro’ and ‘coriander’). For 5% of recipes, judges expressed a desire for more detail in either the instructions or ingredients list (‘Lacking detail’). In some cases, they desired clarity on whether an ingredient was pre-cooked, pre-processed, or raw (e.g. was shrimp peeled or deveined; was chicken pre-cooked). In others, the feedback reflected a desire to know which utensils to use to ‘fold in dry ingredients’ or how small to ‘dice zucchini’—steps that appeared across ground truth recipes as well as our edited outputs.

7.2 Cooking Study: Recipe Editing in Practice

Ultimately, we can best judge a recipe’s quality by following it in an attempt to cook the dish. We recruited seven home cooks with between 3 and 8 years of non-professional cooking experience, and provided each with 3 random recipes edited by our best-performing model (SHARE), for a total of 21 different recipes covering all seven dietary constraints. Each recipe was named [Constraint] [Base recipe name], e.g. Low-Calorie Chicken Dijon in Table 7. Of the cooks, two frequently cooked vegetarian food and two were dairy-intolerant. Cooks were instructed to follow each recipe to the best of their ability without modifying the instructions or ingredients, and to record 1) how complete the recipe was (i.e. were there missing ingredients or steps); 2) how difficult it was to infer ingredient quantities; 3) the overall difficulty of execution; and 4) whether the recipe was appropriate for its


Figure 3: Cooked dishes following our model-edited recipes (Gluten-Free Meat Pie, Dairy-Free Curry Shrimp, Low-Calorie Chicken Dijon), each shown at three stages: ingredients, in progress, and final dish.

target category. Several cooks also took pictures of the cooking process, with three such sets of photos displayed in Figure 3.

Out of the 21 recipes, 18 (86%) were well-received by tasters, with tasters asking for the recipe in four of the cases. The three remaining recipes were baking recipes, with two dishes exhibiting structural/textural problems and one judged to be improperly spiced. No cooks reported having difficulty following any recipes, although two cooks mentioned that they needed to read the recipes ahead of time to track unlisted ingredients used in the instructions. Once ingredients and quantities were inferred, recipes were uniformly ‘easy to make’.

In the majority (90%) of cases, cooks reported using ‘common sense’ or ‘previous experience cooking similar things’ to infer the amount of each ingredient to use. While baking recipes typically require more precise ingredient measurements and amounts than other types of dishes, several cooks reported that baked goods like ‘Low-Sugar Apple Cheesecake’ seemed fairly ‘forgiving’ and thus deciding on ingredient amounts was easy. Of the two recipes that posed issues in this regard, one cook found it ‘slightly difficult’ to determine a tasty ratio of sugar, butter and chocolate; another was unsure of how much water to use to ensure the correct consistency when simmering a tomato-based sauce.

All cooks agreed that the recipes received were appropriate for their dietary constraint. However, the most common (19%) complaint among cooks was that the final dish did not resemble the base recipe name. In one case, the cook expected a ‘Gluten-Free Chocolate Crunch’ to resemble a puffed rice treat, but the recipe instead omitted dry flour substitutes to produce more of a toffee dessert. A recipe for ‘Gluten-Free Banana Brownie’ produced something resembling a thin sponge cake rather than a brownie. In less egregious cases, a Bundt cake recipe was reported to result in a wrongly-shaped cake, and one cook reported that it was somewhat impractical to make a One-Pot Spaghetti in a single pot. This suggests that there is room for our models to better learn specific textures, shapes, and cooking techniques associated with different food types—while our models edit recipes in constraint-respecting and diverse ways, they may surprise the user in the process (e.g. editing curries into stews, or cakes into cookies).

Ultimately, our cooking study demonstrates that SHARE is able to successfully create recipes that not only satisfy a specified dietary constraint but are delicious and easily followed by a home cook with a medium level of experience.

7.3 Ingredient Quantities

While SHARE is able to predict which ingredients to use and how to make use of them, it does not output quantities for each ingredient. Indeed, previous work in the recipe generation space has ignored ingredient quantities and units [17, 21]. Based on feedback from our cooking study, however, we consider methods to guide cooks when they are uncertain about ingredient quantities in a recipe. It is more useful to predict ingredient ratios than exact amounts—recipes are typically scale-invariant: double the ingredients to double the servings. However, predicting the relative ratios of all ingredients in a recipe is impractical due to imprecise or missing units in the ground truth (e.g. ‘several handfuls of almonds’, ‘two bags of chips’). We thus focus on predicting the ratio of any two ingredients within a recipe. This remains a valuable task: ratios between a small subset of a recipe’s ingredients can majorly impact a final dish’s consistency and texture [13].

We constructed a dataset by sampling randomly ordered pairs of ingredients from recipes in the respective RecipePairs splits, resulting in 1.3M training ingredient pairs and 5K evaluation pairs. We aim to predict the log-transformed (zero-centered) ratio between the first and second ingredients in a pair. We evaluate two models: a linear regression model with categorical variables for the first and second ingredients and 100-dimensional TF-IDF features extracted from the recipe name; and a model that encodes the concatenated sequence of the recipe name and ingredient names via a 2-layer, 64-dimensional Bi-directional GRU [5], followed by a linear layer to project the final hidden state into a single log-ratio (BiGRU).
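The log-ratio target can be illustrated with the 2:1 cashews-to-nutritional-yeast ratio from Figure 1. Zero-centering means a 1:1 ratio maps to a target of 0, and presenting the pair in the opposite order simply flips the sign:

```python
import math

def log_ratio(q_first, q_second):
    """Zero-centered regression target for an ordered ingredient pair."""
    return math.log(q_first / q_second)

# The 2:1 cashews : nutritional yeast ratio from Figure 1
target = log_ratio(2.0, 1.0)        # log 2, about 0.693
recovered_ratio = math.exp(target)  # exponentiate to recover the 2:1 ratio
```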

While a constant predictor that always predicts a log-ratio of 0 (corresponding to a ~1:1 ratio between any pair of ingredients) achieves a coefficient of determination (𝑅2) of 0.0 on the unseen evaluation set, our linear regression and BiGRU models are able to fit and predict unseen ingredient ratios well, with 𝑅2 values of 0.7135 and 0.7565, respectively.
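The constant baseline scores near 0 because, with randomly ordered pairs, the mean log-ratio is close to zero, so always predicting 0 nearly coincides with predicting the mean. A minimal sketch of the metric, on toy (assumed) log-ratio targets:

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot. Predicting the
    mean of y_true for every item scores exactly 0."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

targets = [-0.7, 0.0, 0.7]                      # toy log-ratio targets, mean 0
baseline = r_squared(targets, [0.0, 0.0, 0.0])  # constant 0 equals the mean here
perfect = r_squared(targets, targets)
```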

We may thus use these models to assist cooks faced with a recipe for an unfamiliar dish. In future work, we hope to investigate ways to incorporate simultaneous ingredient- and ratio-prediction in an end-to-end model for recipe editing and generation.

7.4 Remaining Challenges

Finally, we discuss failure modes of our framework that can help guide future work to improve models for recipe editing. As our step predictor depends solely on provided ingredients, it receives no guidance with respect to cooking techniques. Thus, with the same ingredients our model can create a casserole instead of pasta, or a stew in place of a braise. To better serve cooks with a dietary restriction and a craving for a specific type of food, we hope to extend our work in the future by learning structured representations of recipes [2] and generating recipe instructions as procedural directed graphs


[22] instead of raw text. By treating procedural text as actions applied to ingredients, we hope to infer the output and intermediate physical states of a dish as it is cooked, helping to guarantee that a generated recipe is feasible—which we currently must judge via human evaluation and physical experimentation.

While we investigate controllable recipe editing by conditioning on dietary restrictions, our approach can also be applied to other recipe attributes of interest, such as cuisines. We focus on English-language recipes from an American recipe website, but home cooking is an important daily task across the globe, with popular regional recipe aggregators including XiaChuFang and Cookpad.4 Cooks from different cultural contexts often exhibit different patterns of preferences and requirements in their desired recipes [37], and in the future we hope to extend our work to controllable recipe editing using more subtle traits such as cuisines (e.g. making a Chinese version of meatloaf).

8 CONCLUSION

In this work, we propose the task of controllable recipe editing and present a System for Hierarchical Assistive Recipe Editing (SHARE) to tackle this task in the context of categorical dietary constraints. On a novel dataset of 84K pairs of recipes and categorical versions thereof, we demonstrate that this is a challenging task that cannot be solved with rule-based ingredient substitutions or straightforward application of existing recipe generation models. We show that SHARE can edit recipes to produce coherent, diverse versions that accommodate the desired dietary restriction. Human evaluation and cooking trials also reveal that our system produces feasible recipes that can be followed by home cooks to create appealing dishes. In the future, we seek to leverage recipe structure to generate flow graphs [22] and explore the editing of recipes to accommodate more subtle and complex preferences like cuisines.

REFERENCES

[1] Ashutosh Baheti, Alan Ritter, Jiwei Li, and Bill Dolan. 2018. Generating More Interesting Responses in Neural Conversation Models with Distributional Constraints. In EMNLP. https://doi.org/10.18653/v1/d18-1431

[2] Antoine Bosselut, Omer Levy, Ari Holtzman, Corin Ennis, Dieter Fox, and Yejin Choi. 2018. Simulating Action Dynamics with Neural Process Networks. In ICLR.

[3] Minsuk Chang, Leonore V. Guillain, Hyeungshik Jung, Vivian M. Hare, Juho Kim, and Maneesh Agrawala. 2018. RecipeScape: An Interactive Tool for Analyzing Cooking Instructions at Scale. In CHI. https://doi.org/10.1145/3173574.3174025

[4] Meng Chen, Xiaoyi Jia, Elizabeth Gorbonos, Chính T Hoàng, Xiaohui Yu, and Yang Liu. 2019. Eating healthier: Exploring nutrition information for healthier recipe recommendation. Information Processing & Management (2019), 102051.

[5] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR abs/1412.3555 (2014). arXiv:1412.3555 http://arxiv.org/abs/1412.3555

[6] Karen B. DeSalvo, Richard Olson, and Kellie O. Casavale. 2016. Dietary Guidelines for Americans. JAMA 315, 5 (02 2016), 457–458. https://doi.org/10.1001/jama.2015.18396

[7] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In NAACL-HLT. https://doi.org/10.18653/v1/n19-1423

[8] Graham R Gibbs. 2007. Thematic coding and categorizing. Analyzing qualitative data 703 (2007), 38–56.

[9] Jeanne P Goldberg, Lindsay A Tanskey, Elizabeth A Sanders, and Marianne Smith Edge. 2017. The IFIC Foundation Food & Health Survey 2015: 10-Year Trends and Emerging Issues. JAND 117, 3 (2017), 355–357.

[10] Elizabeth Gorbonos, Yang Liu, and Chính T Hoàng. 2018. NutRec: Nutrition Oriented Online Recipe Recommender. In WI. IEEE.

[11] Kelvin Guu, Tatsunori B. Hashimoto, Yonatan Oren, and Percy Liang. 2018. Generating Sentences by Editing Prototypes. TACL 6 (2018), 437–450.

4https://www.xiachufang.com/, https://cookpad.com/

[12] Tatsunori B. Hashimoto, Kelvin Guu, Yonatan Oren, and Percy Liang. 2018. A Retrieve-and-Edit Framework for Predicting Structured Outputs. In NeurIPS. http://papers.nips.cc/paper/8209-a-retrieve-and-edit-framework-for-predicting-structured-outputs

[13] Lee Hyo-Gee, Chung Rak-won, and Sin Su-Jin. 2004. Sensory and mechanical characteristics of Backhapbyung by different ratios of ingredients. Korean Journal of Food and Cookery Science 20 (2004), 480–488.

[14] Samuel O Igbinedion, Junaid Ansari, Anush Vasikaran, Felicity N Gavins, Paul Jordan, Moheb Boktor, and Jonathan S Alexander. 2017. Non-celiac gluten sensitivity: All wheat attack is not celiac. World Journal of Gastroenterology 23, 40 (2017), 7201.

[15] Maurice G Kendall. 1938. A new measure of rank correlation. Biometrika 30, 1/2 (1938), 81–93.

[16] Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, and Richard Socher. 2019. CTRL: A Conditional Transformer Language Model for Controllable Generation. CoRR abs/1909.05858 (2019). arXiv:1909.05858 http://arxiv.org/abs/1909.05858

[17] Chloé Kiddon, Luke Zettlemoyer, and Yejin Choi. 2016. Globally Coherent Text Generation with Neural Checklist Models. In EMNLP. https://doi.org/10.18653/v1/d16-1032

[18] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. 2020. ALBERT: A Lite BERT for Self-supervised Learning of Language Representations. In ICLR. https://openreview.net/forum?id=H1eA7AEtvS

[19] Helena H. Lee, Shu Ke, Palakorn Achananuparp, Philips Kokoh Prasetyo, Yue Liu, Ee-Peng Lim, and Lav R. Varshney. 2020. RecipeGPT: Generative Pre-training Based Cooking Recipe Generation and Evaluation System. In WWW. https://doi.org/10.1145/3366424.3383536

[20] Bodhisattwa Prasad Majumder, Shuyang Li, Jianmo Ni, and Julian J. McAuley. 2019. Generating Personalized Recipes from Historical User Preferences. In EMNLP. https://doi.org/10.18653/v1/D19-1613

[21] Javier Marin, Aritro Biswas, Ferda Ofli, Nicholas Hynes, Amaia Salvador, Yusuf Aytar, Ingmar Weber, and Antonio Torralba. 2019. Recipe1M+: A Dataset for Learning Cross-Modal Embeddings for Cooking Recipes and Food Images. TPAMI (2019).

[22] Shinsuke Mori, Hirokuni Maeta, Yoko Yamakata, and Tetsuro Sasada. 2014. Flow Graph Corpus from Recipe Texts. In LREC. http://www.lrec-conf.org/proceedings/lrec2014/summaries/763.html

[23] Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick S. H. Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander H. Miller. 2019. Language Models as Knowledge Bases?. In EMNLP. https://doi.org/10.18653/v1/D19-1250

[24] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI (2018).

[25] Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language Models are Unsupervised Multitask Learners. OpenAI (2019).

[26] Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2019. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. CoRR abs/1910.10683 (2019). arXiv:1910.10683 http://arxiv.org/abs/1910.10683

[27] Amaia Salvador, Michal Drozdzal, Xavier Giró-i-Nieto, and Adriana Romero. 2019. Inverse Cooking: Recipe Generation From Food Images. In CVPR. https://doi.org/10.1109/CVPR.2019.01070

[28] Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get To The Point: Summarization with Pointer-Generator Networks. In ACL. https://doi.org/10.18653/v1/P17-1099

[29] Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural Machine Translation of Rare Words with Subword Units. In ACL. https://doi.org/10.18653/v1/p16-1162

[30] Richard Shepherd, Claire M Paisley, Paul Sparks, Annie S Anderson, Susan Eley, and Mike EJ Lean. 1996. Constraints on dietary choice: the role of income. Nutrition & Food Science (1996).

[31] ChunYuen Teng, Yu-Ru Lin, and Lada A. Adamic. 2012. Recipe recommendation using ingredient networks. In WebSci. https://doi.org/10.1145/2380718.2380757

[32] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. In NIPS.

[33] Sam Witteveen and Martin Andrews. 2019. Paraphrasing with Large Language Models. In NGT@EMNLP-IJCNLP. https://doi.org/10.18653/v1/D19-5623

[34] Richard Wrangham. 2009. Catching Fire: How Cooking Made Us Human. Basic Books.

[35] Ryosuke Yamanishi, Naoki Shino, Yoko Nishihara, Junichi Fukumoto, and Aya Kaizaki. 2015. Alternative-ingredient Recommendation Based on Co-occurrence Relation on Recipe Database. In KES. https://doi.org/10.1016/j.procs.2015.08.138

[36] Yang You, Jing Li, Sashank J. Reddi, Jonathan Hseu, Sanjiv Kumar, Srinadh Bhojanapalli, Xiaodan Song, James Demmel, Kurt Keutzer, and Cho-Jui Hsieh. 2020. Large Batch Optimization for Deep Learning: Training BERT in 76 minutes. In ICLR.


Table 8: Examples of free-form feedback received from crowd-sourced recipe evaluators and the qualitative codes assigned to each feedback.

Recipe: Grandma’s Meatballs
Feedback: Slight error on ingredient species "Pepper" and not "Bell Pepper", and doesn’t list "Mashed Potatoes" and "Garlic bread sticks".
Codes: Inconsistent naming; Unlisted ingredient

Recipe: Chicken with Rice
Feedback: Does not tell you if rice is cooked ahead of time, interchanges curry and curry powder, tomato is listed as ingredient but recipe says crushed tomato.
Codes: Inconsistent naming; Lacking detail (ingr.)

Recipe: Bran Muffin
Feedback: Steps are slightly basic, doesn’t mention mixing to what consistency, what type of flour, sugar, or oil but should work.
Codes: Unlisted ingredient; Lacking detail (step)

Recipe: Chicken and Dumplings
Feedback: The details are not clear, steps are missing.
Codes: Lacking detail (step)

Recipe: Thai Curry
Feedback: No mention of spareribs in the ingredient list. Doesn’t suggest curry type, there are non-green curries.
Codes: Unlisted ingredient; Lacking detail (ingr.)

[37] Qing Zhang, Christoph Trattner, Bernd Ludwig, and David Elsweiler. 2019. Understanding Cross-Cultural Visual Food Tastes with Online Recipe Platforms. In ICWSM. https://aaai.org/ojs/index.php/ICWSM/article/view/3270

A QUALITATIVE CODING OF HUMAN EVALUATION

As noted in Section 7.1, each evaluator assigned to read and assess recipe quality was given an opportunity to provide free-form written feedback about the recipe. In Table 8 we present five examples of such feedback that we received from our judges, and the corresponding qualitative codes assigned to each piece of feedback.

B ADDITIONAL RECIPE EXAMPLES

We present several additional representative samples of recipes edited using the E2E-P, MT, and SHARE models.

In Table 9, our models were asked to produce a low-fat version of Squash Casserole. All three recipes were able to identify high-fat items from the base recipe (stuffing mix, cream of chicken soup, margarine, sour cream) and remove them. Here we see that human-written recipes can also incorporate ingredients not found in the original ingredients list (e.g. cheese to top the casserole).

In Table 10, our models produced dairy-free versions of Tofu Tikka Masala, a dish of tofu in a creamy curry sauce. We observe that large language models like those used for E2E-P and MT can often reference extraneous ingredients because they match common phrases or patterns in general cooking language—the MT model does not include yogurt in the ingredients list but still uses it in the recipe directions. Interestingly, the hierarchical model eschews tofu in favor of shrimp, making a dairy-free creamy shrimp curry.


Table 9: An example of controllable recipe editing. We ask our E2E-P, MT, and SHARE models to create a low-fat version of Squash Casserole (base recipe on left); added and removed ingredients are labeled for each edited version.

Base (Squash Casserole)
Ingredients: Cream of chicken soup, Egg, Herb stuffing mix, Margarine, Onion, Sour cream, Squash
1) Combine soup, sour cream and eggs.
2) Fold into squash and onion.
3) Melt butter and pour over stuffing mix.
4) Line bottom of 9x13-inch pyrex bowl with 1/2 of stuffing mix.
5) Pour squash mix in bowl and top with other 1/2 of stuffing.
6) Bake at 350° for 30 to 40 minutes.
7) Turn oven off and sprinkle with grated american cheese and leave in oven to melt.

E2E-P (Low-Fat Version)
Ingredients: Onion, Squash; added: Basil, Garlic, Mushroom, Parmesan, Pepper, Salt, Tomato, Zucchini; removed: Cream of chicken soup, Egg, Herb stuffing mix, Margarine, Sour cream
1) Preheat oven to 350 degrees.
2) Spray a 2 quart casserole dish with nonstick cooking spray.
3) In a large bowl, combine all ingredients except parmesan cheese.
4) Pour into prepared casserole dish.
5) Sprinkle with parmesan cheese.
6) Bake for 30 minutes.

MT (Low-Fat Version)
Ingredients: Egg, Onion, Squash; added: Breadcrumb, Butter, Cheddar, Milk, Pepper, Salt; removed: Cream of chicken soup, Herb stuffing mix, Margarine, Sour cream
1) Preheat oven to 350 degrees.
2) Cook squash in boiling water until tender.
3) Drain well.
4) Mash squash.
5) Add butter, milk, salt, pepper, onion, egg, and 1/2 cup of cheese.
6) Mix well.
7) Pour into a greased 2 quart casserole dish.
8) Combine bread crumbs, remaining 1/2 cup cheese, and melted butter.
9) Sprinkle over squash mixture.
10) Bake for 30 minutes.

SHARE (Low-Fat Version)
Ingredients: Egg, Onion, Squash; added: Breadcrumb, Cheddar, Pepper, Salt, Zucchini; removed: Cream of chicken soup, Herb stuffing mix, Margarine, Sour cream
1) Preheat oven to 350 degrees.
2) Combine all ingredients except for cheese.
3) Pour mixture into greased casserole dish.
4) Bake for 40 minutes or until lightly browned.
5) Top with cheese, return to oven for a minute.

Table 10: An example of controllable recipe editing. We ask our E2E-P, MT, and SHARE models to create a dairy-free version of Tofu Tikka Masala (base recipe on left); added and removed ingredients are labeled for each edited version.

Base (Tofu Tikka Masala)
Ingredients: Chili powder, Coriander, Garlic, Ginger, Jalapeno chile, Lemon juice, Onion, Salt, Soymilk, Turmeric, Vegetable oil, Yogurt
1) Mix together lemon juice, ginger, garlic, green chili, half of the cilantro, chili powder, salt, and tofu.
2) Cover and leave to marinate for an hour.
3) Heat oil in a pan and cook onions for 3-5 minutes on medium heat until translucent.
4) Add turmeric powder, yogurt, soymilk and the remaining cilantro.
5) Stir for a few minutes until the sauce thickens.
6) Add the tofu mixture into the sauce and cook for 5 minutes on medium heat.
7) Adjust seasoning to taste.
8) Serve garnished with fresh cilantro with your choice of indian nan bread or heated rice.

E2E-P (Dairy-Free Version)
Ingredients: Coriander, Garlic, Ginger, Onion, Salt, Turmeric, Vegetable oil; added: Bean, Cinnamon, Cumin, Pepper, Tomato, Water; removed: Chili powder, Jalapeno chile, Lemon juice, Soymilk, Yogurt
1) Heat oil in a large skillet over medium heat.
2) Add onion and cook until softened, about 5 minutes.
3) Add garlic and ginger and cook for 1 minute.
4) Add tomatoes, water, cumin, coriander, turmeric, cinnamon, and cayenne and bring to a boil.
5) Reduce heat and simmer for 10 minutes.
6) Add tofu and simmer for 10 minutes.
7) Season with salt and pepper.

MT (Dairy-Free Version)
Ingredients: Chili powder, Coriander, Garlic, Ginger, Lemon juice, Onion, Salt, Turmeric, Vegetable oil; added: Cumin, Pepper, Tofu, Water; removed: Jalapeno chile, Soymilk, Yogurt
1) To make the marinade, combine the lemon juice, water, oil, garlic, ginger, cumin, coriander, turmeric, salt, and pepper in a large bowl.
2) Add the tofu and toss to coat.
3) Cover and refrigerate for at least 1 hour or up to overnight.
4) For sauce, combine yogurt, cilantro, green onions and garlic.
5) Cover and refrigerate.
6) Preheat the oven to 350°f.
7) Heat the oil in a large skillet over medium-high heat.
8) Add the tofu and cook, stirring occasionally, until browned, about 5 minutes.
9) Transfer to a baking sheet and bake until the tofu is cooked through, 10 minutes.
10) To serve, spoon the tofu onto plates and top with sauce.

SHARE (Dairy-Free Version)
Ingredients: Coriander, Garlic, Ginger, Onion, Salt, Turmeric, Vegetable oil; added: Coconut milk, Cumin, Curry, Masala, Pea, Pepper, Rice, Shrimp, Tomato, Water; removed: Chili powder, Jalapeno chile, Lemon juice, Soymilk, Yogurt

1) Heat the oil in a large skillet.2) Stir-fry cumin until fragrant.3) Add the onions and saute untilthey start to turn translucent.4) Add in the turmeric, garammasala and cayenne and sauteuntil fragrant, about 1 minute.5) Stir in the curry powder.6) Stir in the coconut milk.7) Add the water and tomatoes,bring to a boil, then simmer.8) Cover and cook, stirringoccasionally, until the rice istender, about 30 minutes.9) Stir in the peas, salt, andpepper, then cover the pan.10) Simmer until the peas arejust tender, about 5 minutes.11) Stir in the shrimp andsimmer for 3 to 4 minutes.12) Stir in the chopped cilantro,and season with salt and pepper.
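The paper evaluates whether an edited recipe is "appropriate for a target dietary constraint (contain no prohibited ingredients)". A minimal sketch of such a compliance check, using the Tofu Tikka Masala lists from Table 10 — the prohibited-ingredient set here is a hypothetical stand-in, not the paper's actual constraint lexicon:

```python
# Hypothetical dairy lexicon for the dairy-free constraint (illustrative only).
# Note that soymilk and coconut milk are dairy-free, so they are not listed.
DAIRY = {"yogurt", "sour cream", "butter", "milk", "cream", "cheddar", "parmesan"}

def violations(ingredients, prohibited):
    """Return prohibited ingredients present in the recipe.

    Exact-name matching is used, so 'coconut milk' does not
    trip the 'milk' entry the way substring matching would.
    """
    return sorted({name.lower() for name in ingredients} & prohibited)

base_tikka = ["Chili powder", "Coriander", "Garlic", "Ginger",
              "Jalapeno chile", "Lemon juice", "Onion", "Salt",
              "Soymilk", "Turmeric", "Vegetable oil", "Yogurt"]
share_dairy_free = ["Coriander", "Garlic", "Ginger", "Onion", "Salt",
                    "Turmeric", "Vegetable oil", "Coconut milk", "Cumin",
                    "Curry", "Masala", "Pea", "Pepper", "Rice", "Shrimp",
                    "Tomato", "Water"]

print(violations(base_tikka, DAIRY))        # ['yogurt'] - base recipe is non-compliant
print(violations(share_dairy_free, DAIRY))  # [] - SHARE's edit contains no prohibited items
```

Under this check, SHARE's dairy-free edit passes while the base recipe fails on its yogurt; a production lexicon would of course need to be far broader than the illustrative set above.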