everybody-lies-by-seth-stephens-davidowitz.pdf - WordPress ...
-
Upload
khangminh22 -
Category
Documents
-
view
1 -
download
0
Transcript of everybody-lies-by-seth-stephens-davidowitz.pdf - WordPress ...
CONTENTS
CoverTitlePageDedicationForewordbyStevenPinkerIntroduction:TheOutlinesofaRevolution
PARTI : DATA,BIGANDSMALL
1.YourFaultyGut
PARTII : THEPOWERSOFBIGDATA
2.WasFreudRight?
3.DataReimaginedBodiesasDataWordsasDataPicturesasData
4.DigitalTruthSerumTheTruthAboutSexTheTruthAboutHateandPrejudiceTheTruthAbouttheInternetTheTruthAboutChildAbuseandAbortionTheTruthAboutYourFacebookFriendsTheTruthAboutYourCustomersCanWeHandletheTruth?
5.ZoomingIn
What’sReallyGoingOninOurCounties,Cities,andTowns?HowWeFillOurMinutesandHoursOurDoppelgangersDataStories
6.AlltheWorld’saLabTheABCsofA/BTestingNature’sCruel—butEnlightening—Experiments
PARTIII : BIGDATA:HANDLEWITHCARE
7.BigData,BigSchmata?WhatItCannotDoTheCurseofDimensionalityTheOveremphasisonWhatIsMeasurable
8.MoData,MoProblems?WhatWeShouldn’tDoTheDangerofEmpoweredCorporationsTheDangerofEmpoweredGovernments
Conclusion:HowManyPeopleFinishBooks?
AcknowledgmentsNotesIndexAbouttheAuthorCopyrightAboutthePublisher
FOREWORD
Eversincephilosophersspeculatedabouta“cerebroscope,”amythicaldevicethatwoulddisplayaperson’s thoughtsona screen, social scientistshavebeenlookingfortoolstoexposetheworkingsofhumannature.Duringmycareerasan experimental psychologist, different ones have gone in and out of fashion,and I’ve tried themall—ratingscales, reaction times,pupildilation, functionalneuroimaging,evenepilepsypatientswithimplantedelectrodeswhowerehappyto while away the hours in a language experiment while waiting to have aseizure.Yetnoneofthesemethodsprovidesanunobstructedviewintothemind.The
problemisasavagetradeoff.Humanthoughtsarecomplexpropositions;unlikeWoodyAllen speed-readingWarandPeace,wedon’t just think “Itwas aboutsomeRussians.”Butpropositionsinalltheirtangledmultidimensionalgloryaredifficult fora scientist toanalyze.Sure,whenpeoplepour theirheartsout,weapprehendtherichnessoftheirstreamofconsciousness,butmonologuesarenotanidealdatasetfortestinghypotheses.Ontheotherhand,ifweconcentrateonmeasures that are easily quantifiable, like people’s reaction time to words, ortheir skin response to pictures,we can do the statistics, butwe’ve pureed thecomplextextureofcognitionintoasinglenumber.Eventhemostsophisticatedneuroimagingmethodologies can tell us how a thought is splayed out in 3-Dspace,butnotwhatthethoughtconsistsof.As if the tradeoff between tractability and richness weren’t bad enough,
scientists of human nature are vexed by the Law of Small Numbers—AmosTverskyandDanielKahneman’snameforthefallacyofthinkingthatthetraitsofapopulationwillbereflectedinanysample,nomatterhowsmall.Eventhemost numerate scientists have woefully defective intuitions about how manysubjects one really needs in a study before one can abstract away from therandom quirks and bumps and generalize to all Americans, to say nothing of
Homosapiens. It’s all the iffierwhen the sample is gathered by convenience,suchasbyofferingbeermoneytothesophomoresinourcourses.This book is about awhole newway of studying themind.BigData from
internet searches and other online responses are not a cerebroscope, but SethStephens-Davidowitzshowsthattheyofferanunprecedentedpeekintopeople’spsyches.Attheprivacyoftheirkeyboards,peopleconfessthestrangestthings,sometimes (as indatingsitesorsearches forprofessionaladvice)because theyhave real-life consequences, at other times precisely because they don’t haveconsequences:peoplecanunburdenthemselvesofsomewishorfearwithoutareal person reacting in dismay or worse. Either way, the people are not justpressingabuttonorturningaknob,butkeyinginanyoftrillionsofsequencesofcharacters to spell out their thoughts in all their explosive, combinatorialvastness.Betterstill,theylaydownthesedigitaltracesinaformthatiseasytoaggregateandanalyze.Theycomefromallwalksoflife.Theycantakepartinunobtrusive experiments which vary the stimuli and tabulate the responses inrealtime.Andtheyhappilysupplythesedataingargantuannumbers.Everybody Lies is more than a proof of concept. Time and again my
preconceptionsaboutmycountryandmyspecieswere turnedupside-downbyStephens-Davidowitz’s discoveries. Where did Donald Trump’s unexpectedsupportcomefrom?WhenAnnLandersaskedherreadersin1976whethertheyregrettedhavingchildrenandwasshockedtofind thatamajoritydid,wasshemisledbyanunrepresentative,self-selectedsample?Istheinternettoblameforthat redundantly named crisis of the late 2010s, the “filter bubble”? Whattriggershatecrimes?Dopeopleseekjokestocheerthemselvesup?AndthoughI like to think that nothing can shockme, Iwas shocked aplenty bywhat theinternet reveals about human sexuality—including the discovery that everymonth a certain number ofwomen search for “humping stuffed animals.”Noexperiment using reaction time or pupil dilation or functional neuroimagingcouldeverhaveturnedupthatfact.Everybody will enjoy Everybody Lies. With unflagging curiosity and an
endearingwit, Stephens-Davidowitz points to a newpath for social science inthe twenty-first century. With this endlessly fascinating window into humanobsessions,whoneedsacerebroscope?
—StevenPinker,2017
INTRODUCTION
THEOUTLINESOFAREVOLUTION
Surelyhewouldlose,theysaid.In the 2016 Republican primaries, polling experts concluded that Donald
Trumpdidn’tstandachance.Afterall,Trumphadinsultedavarietyofminoritygroups.ThepollsandtheirinterpreterstoldusfewAmericansapprovedofsuchoutrages.MostpollingexpertsatthetimethoughtthatTrumpwouldloseinthegeneral
election.Toomanylikelyvoterssaidtheywereputoffbyhismannerandviews.But therewere actually some clues thatTrumpmight actuallywin both the
primariesandthegeneralelection—ontheinternet.
Iamaninternetdataexpert.Everyday,Itrackthedigitaltrailsthatpeopleleaveastheymaketheirwayacrosstheweb.Fromthebuttonsorkeysweclickortap,I try to understandwhatwe reallywant,whatwewill really do, andwhowereallyare.LetmeexplainhowIgotstartedonthisunusualpath.Thestorybegins—and this seems likeagesago—with the2008presidential
electionandalong-debatedquestioninsocialscience:HowsignificantisracialprejudiceinAmerica?Barack Obama was running as the first African-American presidential
nomineeofamajorparty.Hewon—rathereasily.AndthepollssuggestedthatracewasnotafactorinhowAmericansvoted.Gallup,forexample,conducted
numerous polls before and after Obama’s first election. Their conclusion?AmericanvoterslargelydidnotcarethatBarackObamawasblack.Shortlyaftertheelection,twowell-knownprofessorsattheUniversityofCalifornia,Berkeleypored through other survey-based data, using more sophisticated data-miningtechniques.Theyreachedasimilarconclusion.Andso,duringObama’spresidency,thisbecametheconventionalwisdomin
manypartsofthemediaandinlargeswathsoftheacademy.Thesourcesthatthemedia and social scientists have used for eighty-plus years to understand theworld told us that the overwhelmingmajority of Americans did not care thatObamawasblackwhenjudgingwhetherheshouldbetheirpresident.This country, long soiled by slavery and JimCrow laws, seemed finally to
havestoppedjudgingpeoplebythecoloroftheirskin.ThisseemedtosuggestthatracismwasonitslastlegsinAmerica.Infact,somepunditsevendeclaredthatwelivedinapost-racialsociety.In2012,Iwasagraduatestudent ineconomics, lost in life,burnt-out inmy
field,andconfident,evencocky,thatIhadaprettygoodunderstandingofhowtheworldworked, ofwhat people thought and cared about in the twenty-firstcentury.Andwhenitcametothisissueofprejudice,Iallowedmyselftobelieve,basedoneverythingIhadreadinpsychologyandpoliticalscience,thatexplicitracismwas limited to a small percentageofAmericans—themajorityof themconservativeRepublicans,mostofthemlivinginthedeepSouth.Then,IfoundGoogleTrends.GoogleTrends,atoolthatwasreleasedwithlittlefanfarein2009,tellsusers
how frequently anywordorphrasehasbeen searched indifferent locations atdifferent times. It was advertised as a fun tool—perhaps enabling friends todiscusswhichcelebritywasmostpopularorwhatfashionwassuddenlyhot.Theearliestversionsincludedaplayfuladmonishmentthatpeople“wouldn’twanttowriteyourPhDdissertation”withthedata,whichimmediatelymotivatedmetowritemydissertationwithit.*
At the time, Google search data didn’t seem to be a proper source ofinformationfor“serious”academicresearch.Unlikesurveys,Googlesearchdatawasn’t createdas away tohelpusunderstand thehumanpsyche.Googlewasinvented so that people could learn about theworld, not so researchers couldlearnaboutpeople.Butitturnsoutthetrailsweleaveasweseekknowledgeon
theinternetaretremendouslyrevealing.Inotherwords,people’ssearchforinformationis,initself,information.When
andwheretheysearchforfacts,quotes,jokes,places,persons,things,orhelp,itturnsout,cantellusalotmoreaboutwhattheyreallythink,reallydesire,reallyfear,andreallydothananyonemighthaveguessed.Thisisespeciallytruesincepeoplesometimesdon’tsomuchqueryGoogleasconfideinit:“Ihatemyboss.”“Iamdrunk.”“Mydadhitme.”Theeverydayactoftypingawordorphraseintoacompact,rectangularwhite
box leaves a small traceof truth that,whenmultipliedbymillions, eventuallyrevealsprofoundrealities.ThefirstwordItypedinGoogleTrendswas“God.”Ilearned that the states thatmake themostGoogle searchesmentioning “God”wereAlabama,Mississippi,andArkansas—theBibleBelt.And thosesearchesare most frequently on Sundays. None of which was surprising, but it wasintriguing that search data could reveal such a clear pattern. I tried “Knicks,”whichitturnsoutisGoogledmostinNewYorkCity.Anotherno-brainer.ThenItyped inmy name. “We’re sorry,”GoogleTrends informedme. “There is notenough search volume” to show these results. Google Trends, I learned, willprovidedataonlywhenlotsofpeoplemakethesamesearch.But the power of Google searches is not that they can tell us that God is
populardownSouth,theKnicksarepopularinNewYorkCity,orthatI’mnotpopularanywhere.Anysurveycouldtellyouthat.ThepowerinGoogledataisthatpeopletellthegiantsearchenginethingstheymightnottellanyoneelse.Take,forexample,sex(asubjectIwillinvestigateinmuchgreaterdetaillater
inthisbook).Surveyscannotbetrustedtotellusthetruthaboutoursexlives.Ianalyzeddata from theGeneralSocialSurvey,which is consideredoneof themost influential and authoritative sources for information on Americans’behaviors.Accordingtothatsurvey,whenitcomestoheterosexualsex,womensay they have sex, on average, fifty-five times per year, using a condom 16percentofthetime.Thisaddsuptoabout1.1billioncondomsusedperyear.Butheterosexualmensaytheyuse1.6billioncondomseveryyear.Thosenumbers,by definition,would have to be the same. Sowho is telling the truth,men orwomen?Neither, it turns out. According to Nielsen, the global information and
measurement company that tracks consumer behavior, fewer than 600million
condomsaresoldeveryyear.Soeveryoneislying;theonlydifferenceisbyhowmuch.Thelyingis infactwidespread.Menwhohaveneverbeenmarriedclaimto
useonaveragetwenty-ninecondomsperyear.Thiswouldadduptomorethanthe total number of condoms sold in the United States to married and singlepeoplecombined.Marriedpeopleprobablyexaggeratehowmuchsextheyhave,too.Onaverage,marriedmenundersixty-fivetellsurveystheyhavesexonceaweek. Only 1 percent say they have gone the past year without sex.Marriedwomenreporthavingalittlelesssexbutnotmuchless.Google searches give a far less lively—and, I argue, far more accurate—
pictureofsexduringmarriage.OnGoogle,thetopcomplaintaboutamarriageisnothavingsex.Searchesfor“sexlessmarriage”arethreeandahalftimesmorecommonthan“unhappymarriage”andeighttimesmorecommonthan“lovelessmarriage.” Even unmarried couples complain somewhat frequently about nothaving sex. Google searches for “sexless relationship” are second only tosearches for “abusive relationship.” (This data, I should emphasize, is allpresented anonymously. Google, of course, does not report data about anyparticularindividual’ssearches.)And Google searches presented a picture of America that was strikingly
different from that post-racial utopia sketched out by the surveys. I rememberwhenI first typed“nigger” intoGoogleTrends.Callmenaïve.Butgivenhowtoxic theword is, I fully expected this tobe a low-volume search.Boy,was Iwrong. In theUnitedStates, theword “nigger”—or its plural, “niggers”—wasincluded in roughly the same number of searches as the word “migraine(s),”“economist,”and“Lakers.”Iwonderedifsearchesforraplyricswereskewingtheresults?Nope.Thewordusedinrapsongsisalmostalways“nigga(s).”SowhatwasthemotivationofAmericanssearchingfor“nigger”?Frequently,theywere looking for jokes mocking African-Americans. In fact, 20 percent ofsearcheswiththeword“nigger”alsoincludedtheword“jokes.”Othercommonsearchesincluded“stupidniggers”and“Ihateniggers.”There were millions of these searches every year. A large number of
Americanswere, in the privacyof their ownhomes,making shockingly racistinquiries.ThemoreIresearched,themoredisturbingtheinformationgot.OnObama’s first election night,whenmost of the commentary focused on
praise of Obama and acknowledgment of the historic nature of his election,roughlyoneineveryhundredGooglesearchesthatincludedtheword“Obama”alsoincluded“kkk”or“nigger(s).”Maybethatdoesn’tsoundsohigh,butthinkof the thousands of nonracist reasons to Google this young outsider with acharmingfamilyabouttotakeovertheworld’smostpowerfuljob.Onelectionnight, searches and sign-ups for Stormfront, a white nationalist site withsurprisingly high popularity in the United States, were more than ten timeshigher than normal. In some states, there were more searches for “niggerpresident”than“firstblackpresident.”Therewasadarknessandhatredthatwashiddenfromthetraditionalsources
butwasquiteapparentinthesearchesthatpeoplemade.Thosesearchesarehardtoreconcilewithasocietyinwhichracismisasmall
factor.In2012IknewofDonaldJ.Trumpmostlyasabusinessmanandrealityshowperformer.Ihadnomoreideathananyoneelsethathewould,fouryearslater,beaseriouspresidentialcandidate.Butthoseuglysearchesarenothardtoreconcilewiththesuccessofacandidatewho—inhisattacksonimmigrants,inhisangersandresentments—oftenplayedtopeople’sworstinclinations.
The Google searches also told us that much of what we thought about thelocationofracismwaswrong.Surveysandconventionalwisdomplacedmodernracism predominantly in the South and mostly among Republicans. But theplaceswith thehighest racist search rates includedupstateNewYork,westernPennsylvania, eastern Ohio, industrialMichigan and rural Illinois, along withWest Virginia, southern Louisiana, and Mississippi. The true divide, Googlesearchdatasuggested,wasnotSouthversusNorth;itwasEastversusWest.Youdon’t get this sort of thingmuchwest of theMississippi.And racismwasnotlimited toRepublicans. In fact, racistsearcheswerenohigher inplaceswithahigh percentage of Republicans than in places with a high percentage ofDemocrats.Googlesearches,inotherwords,helpeddrawanewmapofracismin theUnited States—and thismap looked very different fromwhat youmayhaveguessed.RepublicansintheSouthmaybemorelikelytoadmittoracism.ButplentyofDemocratsintheNorthhavesimilarattitudes.Four years later, this map would prove quite significant in explaining the
politicalsuccessofTrump.
In 2012, I was using this map of racism I had developed using Googlesearches toreevaluateexactly therole thatObama’sraceplayed.Thedatawasclear.Inpartsofthecountrywithahighnumberofracistsearches,Obamadidsubstantially worse than John Kerry, the white Democratic presidentialcandidate, had four years earlier. The relationship was not explained by anyother factor about these areas, including education levels, age, churchattendance,orgunownership.RacistsearchesdidnotpredictpoorperformanceforanyotherDemocraticcandidate.OnlyforObama.Andtheresultsimpliedalargeeffect.Obamalostroughly4percentagepoints
nationwidejustfromexplicitracism.Thiswasfarhigherthanmighthavebeenexpected based on any surveys. Barack Obama, of course, was elected andreelected president, helped by some very favorable conditions for Democrats,but he had to overcome quite a bit more than anyone who was relying ontraditionaldatasources—andthatwasjustabouteveryone—hadrealized.TherewereenoughraciststohelpwinaprimaryortipageneralelectioninayearnotsofavorabletoDemocrats.Mystudywas initially rejectedbyfiveacademic journals.Manyof thepeer
reviewers,ifyouwillforgivealittledisgruntlement,saidthatitwasimpossibleto believe that somanyAmericans harbored such vicious racism.This simplydidnotfitwhatpeoplehadbeensaying.Besides,Googlesearchesseemedlikesuchabizarredataset.NowthatwehavewitnessedtheinaugurationofPresidentDonaldJ.Trump,
myfindingseemsmoreplausible.
The more I have studied, the more I have learned that Google has lots ofinformation that ismissed by the polls that can be helpful in understanding—amongmany,manyothersubjects—anelection.Thereisinformationonwhowillactuallyturnouttovote.Morethanhalfof
citizens who don’t vote tell surveys immediately before an election that theyintendto,skewingourestimationofturnout,whereasGooglesearchesfor“howto vote” or “where to vote” weeks before an election can accurately predictwhichpartsofthecountryaregoingtohaveabigshowingatthepolls.Theremight even be information onwho theywill vote for. Canwe really
predict which candidate people will vote for just based on what they search?
Clearly,wecan’t juststudywhichcandidatesaresearchedformostfrequently.Manypeoplesearchforacandidatebecausetheylovehim.Asimilarnumberofpeoplesearchforacandidatebecausetheyhatehim.Thatsaid,StuartGabriel,aprofessor of finance at the University of California, Los Angeles, and I havefound a surprising clue aboutwhichwaypeople are planning to vote.A largepercentage of election-related searches contain queries with both candidates’names. During the 2016 election between Trump and Hillary Clinton, somepeople searched for “TrumpClinton polls.”Others looked for highlights fromthe“ClintonTrumpdebate.”Infact,12percentofsearchquerieswith“Trump”alsoincludedtheword“Clinton.”Morethanone-quarterofsearchquerieswith“Clinton”alsoincludedtheword“Trump.”We have found that these seemingly neutral searches may actually give us
somecluestowhichcandidateapersonsupports.How?Theorderinwhichthecandidatesappear.Ourresearchsuggeststhata
person is significantlymore likely to put the candidate they support first in asearchthatincludesbothcandidates’names.In the previous three elections, the candidate who appeared first in more
searchesreceivedthemostvotes.Moreinteresting,theorderthecandidatesweresearchedwaspredictiveofwhichwayaparticularstatewouldgo.Theorderinwhichcandidatesaresearchedalsoseemstocontaininformation
that the polls canmiss. In the 2012 election betweenObama and RepublicanMitt Romney, Nate Silver, the virtuoso statistician and journalist, accuratelypredictedtheresultinallfiftystates.However,wefoundthatinstatesthatlistedRomneybeforeObamainsearchesmostfrequently,Romneyactuallydidbetterthan Silver had predicted. In states that most frequently listed Obama beforeRomney,ObamadidbetterthanSilverhadpredicted.This indicator could contain information that polls miss because voters are
either lying to themselves or uncomfortable revealing their true preferences topollsters. Perhaps if they claimed that theywere undecided in 2012, butwereconsistently searching for “Romney Obama polls,” “Romney Obama debate,”and “Romney Obama election,” they were planning to vote for Romney allalong.SodidGooglepredictTrump?Well,westillhavealotofworktodo—andI’ll
have to be joined by lotsmore researchers—beforewe knowhowbest to use
Googledatatopredictelectionresults.Thisisanewscience,andweonlyhaveafewelectionsforwhichthisdataexists.Iamcertainlynotsayingweareatthepoint—oreverwillbeatthepoint—wherewecanthrowoutpublicopinionpollscompletelyasatoolforhelpinguspredictelections.Butthereweredefinitelyportents,atmanypoints,ontheinternetthatTrump
mightdobetterthanthepollswerepredicting.During the general election, therewere clues that the electoratemight be a
favorable one for Trump. Black Americans told polls they would turn out inlargenumberstoopposeTrump.ButGooglesearchesforinformationonvotinginheavilyblackareaswerewaydown.Onelectionday,Clintonwouldbehurtbylowblackturnout.TherewereevensignsthatsupposedlyundecidedvotersweregoingTrump’s
way. Gabriel and I found that thereweremore searches for “TrumpClinton”than“ClintonTrump”inkeystatesintheMidwestthatClintonwasexpectedtowin. Indeed,Trumpowedhiselection to the fact thathesharplyoutperformedhispollsthere.But the major clue, I would argue, that Trump might prove a successful
candidate—in theprimaries, tobeginwith—wasall that secret racism thatmyObama study had uncovered. The Google searches revealed a darkness andhatredamongameaningfulnumberofAmericansthatpundits,formanyyears,missed.Searchdata revealed thatwe lived inaverydifferentsociety fromtheone academics and journalists, relying on polls, thought that we lived in. Itrevealedanasty,scary,andwidespreadragethatwaswaitingforacandidatetogivevoicetoit.People frequently lie—to themselvesand toothers. In2008,Americans told
surveys that theyno longercaredabout race.Eightyears later, theyelectedaspresidentDonaldJ.Trump,amanwhoretweetedafalseclaimthatblackpeopleare responsible for themajority ofmurders ofwhiteAmericans, defended hissupportersforroughingupaBlackLivesMattersprotesteratoneofhisrallies,andhesitatedinrepudiatingsupportfromaformerleaderoftheKuKluxKlan.ThesamehiddenracismthathurtBarackObamahelpedDonaldTrump.Earlyintheprimaries,NateSilverfamouslyclaimedthattherewasvirtually
no chance that Trumpwouldwin.As the primaries progressed and it becameincreasinglyclearthatTrumphadwidespreadsupport,Silverdecidedtolookat
the data to see if he could understandwhatwas going on.How couldTrumppossiblybedoingsowell?Silver noticed that the areaswhereTrump performed bestmade for an odd
map.TrumpperformedwellinpartsoftheNortheastandindustrialMidwest,aswell as the South. He performed notably worse out West. Silver looked forvariablestotrytoexplainthismap.Wasitunemployment?Wasitreligion?Wasitgunownership?Wasitratesofimmigration?WasitoppositiontoObama?Silver found that the single factor that best correlatedwithDonaldTrump’s
support in the Republican primaries was that measure I had discovered fouryearsearlier.AreasthatsupportedTrumpinthelargestnumberswerethosethatmadethemostGooglesearchesfor“nigger.”
IhavespentjustabouteverydayofthepastfouryearsanalyzingGoogledata.ThisincludedastintasadatascientistatGoogle,whichhiredmeafterlearningabout my racism research. And I continue to explore this data as an opinionwriter and data journalist for theNew York Times. The revelations have keptcoming. Mental illness; human sexuality; child abuse; abortion; advertising;religion;health.Notexactlysmall topics,andthisdataset,whichdidn’texistacouple of decades ago, offered surprising new perspectives on all of them.Economists and other social scientists are always hunting for new sources ofdata,soletmebeblunt:IamnowconvincedthatGooglesearchesarethemostimportantdatasetevercollectedonthehumanpsyche.This dataset, however, is not the only tool the internet has delivered for
understanding ourworld. I soon realized there are other digital goldmines aswell. I downloaded all of Wikipedia, pored through Facebook profiles, andscrapedStormfront.Inaddition,PornHub,oneofthelargestpornographicsiteson the internet, gaveme its completedataon the searches andvideoviewsofanonymouspeoplearound theworld. Inotherwords, Ihave takenaverydeepdive intowhat is now calledBigData. Further, I have interviewed dozens ofothers—academics,datajournalists,andentrepreneurs—whoarealsoexploringthesenewrealms.Manyoftheirstudieswillbediscussedhere.Butfirst,aconfession:IamnotgoingtogiveaprecisedefinitionofwhatBig
Data is.Why?Because it’s an inherently vague concept.Howbig is big?Are18,462observationsSmallDataand18,463observationsBigData? Iprefer totakeaninclusiveviewofwhatqualifies:whilemostofthedataIfiddlewithisfrom the internet, I will discuss other sources, too.We are living through anexplosionintheamountandqualityofallkindsofavailableinformation.Muchof the new information flows fromGoogle and socialmedia. Some of it is aproduct of digitization of information that was previously hidden away incabinets and files. Some of it is from increased resources devoted to marketresearch.Someofthestudiesdiscussedinthisbookdon’tusehugedatasetsatallbut instead just employ anewand creative approach todata—approaches thatarecrucialinaneraoverflowingwithinformation.SowhyexactlyisBigDatasopowerful?Thinkofalltheinformationthatis
scatteredonlineonagivenday—wehaveanumber,infact,forjusthowmuchinformation there is. On an average day in the early part of the twenty-first
century,humanbeingsgenerate2.5milliontrillionbytesofdata.Andthesebytesareclues.A woman is bored on a Thursday afternoon. She Googles for some more
“funnycleanjokes.”Shechecksheremail.ShesignsontoTwitter.SheGoogles“niggerjokes.”A man is feeling blue. He Googles for “depression symptoms” and
“depressionstories.”Heplaysagameofsolitaire.AwomanseestheannouncementofherfriendgettingengagedonFacebook.
Thewoman,whoissingle,blocksthefriend.AmantakesabreakfromGooglingabouttheNFLandrapmusictoaskthe
searchengineaquestion:“Isitnormaltohavedreamsaboutkissingmen?”AwomanclicksonaBuzzFeedstoryshowingthe“15cutestcats.”Amanseesthesamestoryaboutcats.Butonhisscreenitiscalled“15most
adorablecats.”Hedoesn’tclick.AwomanGoogles“Ismysonagenius?”AmanGoogles“howtogetmydaughtertoloseweight.”Awomanisonavacationwithhersixbestfemalefriends.Allherfriendskeep
sayinghowmuchfunthey’rehaving.ShesneaksofftoGoogle“lonelinesswhenawayfromhusband.”Aman,thepreviouswoman’shusband,isonavacationwithhissixbestmale
friends.HesneaksofftoGoogletotype“signsyourwifeischeating.”Some of this data will include information that would otherwise never be
admittedtoanybody.Ifweaggregateitall,keepitanonymoustomakesureweneverknowabout the fears,desires,andbehaviorsofanyspecific individuals,andaddsomedatascience,westart togetanew lookathumanbeings—theirbehaviors,theirdesires,theirnatures.Infact,attheriskofsoundinggrandiose,Ihavecometobelievethatthenewdataincreasinglyavailableinourdigitalagewillradicallyexpandourunderstandingofhumankind.Themicroscopeshowedus there ismore toadropofpondwater thanwe thinkwe see.The telescopeshowedusthereismoretothenightskythanwethinkwesee.Andnew,digitaldatanowshowsusthereismoretohumansocietythanwethinkwesee.Itmaybe our era’s microscope or telescope—making possible important, evenrevolutionaryinsights.There is another risk in making such declarations—not just sounding
grandiosebutalsotrendy.ManypeoplehavebeenmakingbigclaimsaboutthepowerofBigData.Buttheyhavebeenshortonevidence.ThishasinspiredBigDataskeptics,ofwhomtherearealsomany,todismiss
thesearchforbiggerdatasets.“IamnotsayingherethatthereisnoinformationinBigData,”essayistandstatisticianNassimTalebhaswritten.“Thereisplentyofinformation.Theproblem—thecentralissue—isthattheneedlecomesinanincreasinglylargerhaystack.”Oneoftheprimarygoalsofthisbook,then,istoprovidethemissingevidence
ofwhatcanbedonewithBigData—howwecanfindtheneedles,ifyouwill,inthose larger and larger haystacks. I hope to provide enough examples of BigDataofferingnewinsightsintohumanpsychologyandbehaviorsothatyouwillbegintoseetheoutlinesofsomethingtrulyrevolutionary.“Holdon,Seth,”youmightbesaying rightaboutnow.“You’repromisinga
revolution.You’rewaxingpoeticaboutthesebig,newdatasets.Butthusfar,youhaveusedallofthisamazing,remarkable,breathtaking,groundbreakingdatatotellmebasicallytwothings:thereareplentyofracistsinAmerica,andpeople,particularlymen,exaggeratehowmuchsextheyhave.”I admit sometimes thenewdatadoes just confirm theobvious. Ifyou think
thesefindingswereobvious,waituntilyougettoChapter4,whereIshowyouclear,unimpeachableevidencefromGooglesearchesthatmenhavetremendousconcernandinsecurityaround—waitforit—theirpenissize.Thereis,Iwouldclaim,somevalueinprovingthingsyoumayhavealready
suspected but had otherwise little evidence for. Suspecting something is onething. Proving it is another. But if all Big Data could do is confirm yoursuspicions,itwouldnotberevolutionary.Thankfully,BigDatacandoalotmorethan that. Time and again, data shows me the world works in precisely theoppositewayasIwouldhaveguessed.Herearesomeexamplesyoumightfindmoresurprising.You might think that a major cause of racism is economic insecurity and
vulnerability.Youmightnaturallysuspect,then,thatwhenpeoplelosetheirjobs,racism increases. But, actually, neither racist searches nor membership inStormfrontriseswhenunemploymentdoes.Youmightthinkthatanxietyishighestinovereducatedbigcities.Theurban
neuroticisafamousstereotype.ButGooglesearchesreflectinganxiety—suchas
“anxietysymptoms”or“anxietyhelp”—tend tobehigher inplaceswith lowerlevels of education, lowermedian incomes, andwhere a larger portion of thepopulationlivesinruralareas.Therearehighersearchratesforanxietyinrural,upstateNewYorkthanNewYorkCity.Youmightthinkthataterroristattackthatkillsdozensorhundredsofpeople
wouldautomaticallybefollowedbymassive,widespreadanxiety.Terrorism,bydefinition, is supposed to instill a sense of terror. I looked atGoogle searchesreflecting anxiety. I tested howmuch these searches rose in a country in thedays,weeks,andmonthsfollowingeverymajorEuropeanorAmericanterroristattacksince2004.So,onaverage,howmuchdidanxiety-relatedsearchesrise?Theydidn’t.Atall.Youmight think thatpeople search for jokesmoreoftenwhen theyare sad.
Many of history’s greatest thinkers have claimed that we turn to humor as arelease frompain.Humorhas longbeen thoughtof asaway tocopewith thefrustrations,thepain,theinevitabledisappointmentsoflife.AsCharlieChaplinputit,“Laughteristhetonic,therelief,thesurceasefrompain.”However, searches for jokes are lowest onMondays, the day when people
report theyaremostunhappy.Theyare lowestoncloudyand rainydays.Andtheyplummetafteramajor tragedy, suchaswhen twobombskilled threeandinjured hundreds during the 2013BostonMarathon. People are actuallymorelikelytoseekoutjokeswhenthingsaregoingwellinlifethanwhentheyaren’t.Sometimesanewdataset revealsabehavior,desire,orconcern that Iwould
haveneverevenconsidered.Therearenumeroussexualproclivitiesthatfallintothis category.For example, didyouknow that in India thenumberone searchbeginning “my husband wants . . .” is “my husband wants me to breastfeedhim”? This comment is far more common in India than in other countries.Moreover, porn searches for depictions ofwomen breastfeedingmen are fourtimeshigher inIndiaandBangladesh than inanyothercountry in theworld. IcertainlyneverwouldhavesuspectedthatbeforeIsawthedata.Further,whilethefactthatmenareobsessedwiththeirpenissizemaynotbe
toosurprising,thebiggestbodilyinsecurityforwomen,asexpressedonGoogle,issurprisingindeed.Basedonthisnewdata,thefemaleequivalentofworryingabout the size of your penis may be—pausing to build suspense—worryingabout whether your vagina smells. Women make nearly as many searches
expressingconcernabouttheirgenitalsasmendoworryingabouttheirs.Andthetop concern women express is its odor—and how they might improve it. Icertainlydidn’tknowthatbeforeIsawthedata.Sometimes new data reveals cultural differences I had never even
contemplated.Oneexample:theverydifferentwaysthatmenaroundtheworldrespond to theirwivesbeingpregnant. InMexico, the top searches about “mypregnantwife” include“frasesdeamorparamiesposaembarazada”(wordsoflovetomypregnantwife)and“poemasparamiesposaembarazada”(poemsformypregnantwife). In theUnitedStates, the top searches include “mywife ispregnantnowwhat”and“mywifeispregnantwhatdoIdo.”Butthisbookismorethanacollectionofoddfactsorone-offstudies,though
therewillbeplentyof those.Because thesemethodologiesaresonewandareonlygoingtogetmorepowerful,Iwilllayoutsomeideasonhowtheyworkandwhat makes them groundbreaking. I will also acknowledge Big Data’slimitations.Someoftheenthusiasmforthedatarevolution’spotentialhasbeenmisplaced.
MostofthoseenamoredwithBigDatagushabouthowimmensethesedatasetscanget.This obsessionwithdataset size is not new.BeforeGoogle,Amazon,andFacebook,before thephrase“BigData”existed, aconferencewasheld inDallas, Texas, on “Large and Complex Datasets.” Jerry Friedman, a statisticsprofessor atStanfordwhowas a colleagueofminewhen IworkedatGoogle,recallsthat1977conference.Onedistinguishedstatisticianwouldgetuptotalk.He would explain that he had accumulated an amazing, astonishing fivegigabytes of data.The next distinguished statisticianwould get up to talk.Hewould begin, “The last speaker had gigabytes. That’s nothing. I’ve gotterabytes.” The emphasis of the talk, in other words, was on how muchinformation you could accumulate, notwhat you hoped to dowith it, orwhatquestions you planned to answer. “I found it amusing, at the time,” Friedmansays,that“thethingthatyouweresupposedtobeimpressedwithwashowlargetheirdatasetis.Itstillhappens.”Too many data scientists today are accumulating massive sets of data and
telling us very little of importance—e.g., that the Knicks are popular in NewYork.Toomanybusinessesaredrowningindata.Theyhavelotsofterabytesbutfewmajorinsights.Thesizeofadataset,Ibelieve,isfrequentlyoverrated.There
isasubtle,butimportant,explanationforthis.Thebiggeraneffect,thefewerthenumberofobservationsnecessarytoseeit.Youonlyneedtotouchahotstoveonce to realize that it’sdangerous.Youmayneed todrinkcoffee thousandsoftimes to determine whether it tends to give you a headache.Which lesson ismore important? Clearly, the hot stove, which, because of the intensity of itsimpact,showsupsoquickly,withsolittledata.Infact,thesmartestBigDatacompaniesareoftencuttingdowntheirdata.At
Google,majordecisionsarebasedononlyatinysamplingofalltheirdata.Youdon’t alwaysneed a tonof data to find important insights.Youneed the rightdata.AmajorreasonthatGooglesearchesaresovaluableisnotthattherearesomany of them; it is that people are so honest in them. People lie to friends,lovers, doctors, surveys, and themselves. But on Google they might shareembarrassing information, about, among other things, their sexless marriages,theirmental health issues, their insecurities, and their animosity toward blackpeople.Mostimportant,tosqueezeinsightsoutofBigData,youhavetoasktheright
questions.Justasyoucan’tpointatelescoperandomlyatthenightskyandhaveitdiscoverPlutoforyou,youcan’tdownloadawholebunchofdataandhaveitdiscoverthesecretsofhumannatureforyou.Youmustlookinpromisingplaces—Googlesearchesthatbegin“myhusbandwants...”inIndia,forexample.Thisbook isgoing toshowhowBigData isbestusedandexplain indetail
whyitcanbesopowerful.Andalongtheway,you’llalsolearnaboutwhatIandothershavealreadydiscoveredwithit,including:
›Howmanymenaregay?›Doesadvertisingwork?›WhywasAmericanPharoahagreatracehorse?›Isthemediabiased?›AreFreudianslipsreal?›Whocheatsontheirtaxes?›Doesitmatterwhereyougotocollege?›Canyoubeatthestockmarket?›What’sthebestplacetoraisekids?›Whatmakesastorygoviral?
›Whatshouldyoutalkaboutonafirstdateifyouwantasecond?
...andmuch,muchmore.Butbeforewegettoallthat,weneedtodiscussamorebasicquestion:why
doweneeddataatall?Andforthat,Iamgoingtointroducemygrandmother.
1
YOURFAULTYGUT
Ifyou’rethirty-threeyearsoldandhaveattendedafewThanksgivingsinarowwithout a date, the topic of mate choice is likely to arise. And just abouteverybodywillhaveanopinion.“Sethneedsacrazygirl,likehim,”mysistersays.“You’recrazy!Heneedsanormalgirl,tobalancehimout,”mybrothersays.“Seth’snotcrazy,”mymothersays.“You’recrazy!Ofcourse,Sethiscrazy,”myfathersays.Allofasudden,myshy,soft-spokengrandmother,quiet through thedinner,
speaks.The loud,aggressiveNewYorkvoicesgosilent,andalleyes focusonthesmalloldladywithshortyellowhairandstillatraceofanEasternEuropeanaccent.“Seth,youneedanicegirl.Nottoopretty.Verysmart.Goodwithpeople.Social,soyouwilldothings.Senseofhumor,becauseyouhaveagoodsenseofhumor.”Whydoesthisoldwoman’sadvicecommandsuchattentionandrespectinmy
family? Well, my eighty-eight-year-old grandmother has seen more thaneverybodyelseat the table.She’sobservedmoremarriages,many thatworkedandmanythatdidn’t.Andoverthedecades,shehascatalogedthequalitiesthatmakeforsuccessfulrelationships.AtthatThanksgivingtable,forthatquestion,my grandmother has access to the largest number of data points. MygrandmotherisBigData.Inthisbook,Iwanttodemystifydatascience.Likeitornot,dataisplayingan
increasinglyimportantroleinallofourlives—anditsroleisgoingtogetlarger.Newspapersnowhavefullsectionsdevotedtodata.Companieshaveteamswiththe exclusive task of analyzing their data. Investors give start-ups tens ofmillionsofdollars if theycanstoremoredata.Evenifyounever learnhowtorunaregressionorcalculateaconfidenceinterval,youaregoingtoencounteralotofdata—inthepagesyouread,thebusinessmeetingsyouattend,thegossipyouhearnexttothewatercoolersyoudrinkfrom.Manypeopleareanxiousoverthisdevelopment.Theyareintimidatedbydata,
easily lost andconfused in aworldofnumbers.They think that aquantitativeunderstanding of the world is for a select few left-brained prodigies, not forthem.Assoonastheyencounternumbers, theyarereadytoturnthepage,endthemeeting,orchangetheconversation.But I have spent ten years in the data analysis business and have been
fortunatetoworkwithmanyofthetoppeopleinthefield.Andoneofthemostimportant lessons Ihave learned is this:Gooddatascience is lesscomplicatedthanpeoplethink.Thebestdatascience,infact,issurprisinglyintuitive.Whatmakesdatascienceintuitive?Atitscore,datascienceisaboutspotting
patternsandpredictinghowonevariablewillaffectanother.Peopledo thisallthetime.Justthinkhowmygrandmothergavemerelationshipadvice.Sheutilizedthe
largedatabaseofrelationshipsthatherbrainhasuploadedoveranearcenturyoflife—inthestoriesshehasheardfromherfamily,herfriends,heracquaintances.Shelimitedheranalysistoasampleofrelationshipsinwhichthemanhadmanyqualities that Ihave—asensitive temperament,a tendency to isolatehimself,asense of humor. She zeroed in on key qualities of thewoman—howkind shewas,howsmartshewas,howprettyshewas.Shecorrelatedthesekeyqualitiesofthewomanwithakeyqualityoftherelationship—whetheritwasagoodone.Finally, she reported her results. In other words, she spotted patterns andpredictedhowonevariablewillaffectanother.Grandmaisadatascientist.Youareadatascientist,too.Whenyouwereakid,younoticedthatwhenyou
cried, yourmom gave you attention. That is data science.When you reachedadulthood,younoticedthatifyoucomplaintoomuch,peoplewanttohangoutwithyouless.Thatisdatascience,too.Whenpeoplehangoutwithyouless,younoticed, you are less happy.When you are less happy, you are less friendly.
Whenyouare less friendly,peoplewant tohangoutwithyoueven less.Datascience.Datascience.Datascience.Becausedatascienceissonatural,thebestBigDatastudies,Ihavefound,can
beunderstoodby justaboutanysmartperson. Ifyoucan’tunderstandastudy,theproblemisprobablywiththestudy,notwithyou.Wantproofthatgreatdatasciencetendstobeintuitive?Irecentlycameacross
astudythatmaybeoneofthemostimportantconductedinthepastfewyears.ItisalsooneofthemostintuitivestudiesI’veeverseen.Iwantyoutothinknotjustabouttheimportanceofthestudy—buthownaturalandgrandma-likeitis.The study was by a team of researchers from Columbia University and
Microsoft.The teamwanted to findwhat symptomspredict pancreatic cancer.Thisdiseasehasalowfive-yearsurvivalrate—onlyabout3percent—butearlydetectioncandoubleapatient’schances.The researchers’ method? They utilized data from tens of thousands of
anonymous users of Bing, Microsoft’s search engine. They coded a user ashaving recently been given a diagnosis of pancreatic cancer based onunmistakablesearches,suchas“justdiagnosedwithpancreaticcancer”or“IwastoldIhavepancreaticcancer,whattoexpect.”Next,theresearcherslookedatsearchesforhealthsymptoms.Theycompared
thatsmallnumberofuserswholaterreportedapancreaticcancerdiagnosiswiththosewhodidn’t.Whatsymptoms,inotherwords,predictedthat,inafewweeksormonths,auserwillbereportingadiagnosis?The results were striking. Searching for back pain and then yellowing skin
turnedout tobeasignofpancreaticcancer;searchingfor justbackpainalonemade it unlikely someone had pancreatic cancer. Similarly, searching forindigestion and then abdominal painwas evidence of pancreatic cancer,whilesearching for just indigestion without abdominal pain meant a person wasunlikelytohaveit.Theresearcherscouldidentify5to15percentofcaseswithalmostnofalsepositives.Now,thismaynotsoundlikeagreatrate;butifyouhave pancreatic cancer, even a 10 percent chance of possibly doubling yourchancesofsurvivalwouldfeellikeawindfall.Thepaperdetailingthisstudywouldbedifficultfornon-expertstofullymake
senseof.Itincludesalotoftechnicaljargon,suchastheKolmogorov-Smirnovtest, the meaning of which, I have to admit, I had forgotten. (It’s a way to
determinewhetheramodelcorrectlyfitsdata.)However,notehownaturaland intuitive this remarkablestudy isat itsmost
fundamentallevel.Theresearcherslookedatawidearrayofmedicalcasesandtried toconnectsymptomstoaparticular illness.Youknowwhoelseuses thismethodologyintryingtofigureoutwhethersomeonehasadisease?Husbandsandwives,mothers and fathers, and nurses and doctors. Based on experienceandknowledge,theytrytoconnectfevers,headaches,runnynoses,andstomachpains to various diseases. In other words, the Columbia and Microsoftresearchers wrote a groundbreaking study by utilizing the natural, obviousmethodologythateverybodyusestomakehealthdiagnoses.Butwait.Let’sslowdownhere.Ifthemethodologyofthebestdatascienceis
frequently natural and intuitive, as I claim, this raises a fundamental questionabout the value of Big Data. If humans are naturally data scientists, if datascienceisintuitive,whydoweneedcomputersandstatisticalsoftware?WhydoweneedtheKolmogorov-Smirnovtest?Can’twejustuseourgut?Can’twedoitlikeGrandmadoes,likenursesanddoctorsdo?Thisgets toanargument intensifiedafter thereleaseofMalcolmGladwell’s
bestselling book Blink, which extols the magic of people’s gut instincts.Gladwell tells the stories of peoplewho, relying solely on their guts, can tellwhetherastatueisfake;whetheratennisplayerwillfaultbeforehehitstheball;how much a customer is willing to pay. The heroes in Blink do not runregressions; they do not calculate confidence intervals; they do not runKolmogorov-Smirnov tests. But they generally make remarkable predictions.Many people have intuitively supported Gladwell’s defense of intuition: theytrust their guts and feelings. Fans ofBlinkmight celebrate thewisdom ofmygrandmother giving relationship advicewithout the aid of computers. Fans ofBlinkmaybelessapttocelebratemystudiesortheotherstudiesprofiledinthisbook,whichusecomputers.IfBigData—ofthecomputertype,ratherthanthegrandmatype—isarevolution, ithastoprovethat it’smorepowerful thanourunaidedintuition,which,asGladwellhaspointedout,canoftenberemarkable.The Columbia andMicrosoft study offers a clear example of rigorous data
scienceandcomputersteachingusthingsourgutalonecouldneverfind.Thisisalso one case where the size of the dataset matters. Sometimes there isinsufficientexperienceforourunaidedguttodrawupon.Itisunlikelythatyou
—or your close friends or family members—have seen enough cases ofpancreatic cancer to tease out the difference between indigestion followed byabdominal pain compared to indigestion alone. Indeed, it is inevitable, as theBing dataset gets bigger, that the researchers will pick up many more subtlepatterns in the timing of symptoms—for this and other illnesses—that evendoctorsmightmiss.Moreover,whileourgutmayusuallygiveusagoodgeneralsenseofhowthe
worldworks, it is frequentlynotprecise.Weneeddata to sharpen thepicture.Consider, for example, the effects of weather on mood. You would probablyguessthatpeoplearemorelikelytofeelmoregloomyona10-degreedaythanona70-degreeday.Indeed,thisiscorrect.Butyoumightnotguesshowbiganimpact this temperaturedifferencecanmake.I lookedforcorrelationsbetweenanarea’sGooglesearchesfordepressionandawiderangeoffactors,includingeconomic conditions, education levels, and church attendance.Winter climateswampedalltherest.Inwintermonths,warmclimates,suchasthatofHonolulu,Hawaii,have40percent fewerdepressionsearches thancoldclimates, suchasthatofChicago,Illinois.Justhowsignificantisthiseffect?Anoptimisticreadofthe effectiveness of antidepressants would find that the most effective drugsdecreasetheincidenceofdepressionbyonlyabout20percent.TojudgefromtheGoogle numbers, a Chicago-to-Honolulu move would be at least twice aseffectiveasmedicationforyourwinterblues.*
Sometimes our gut, when not guided by careful computer analysis, can bedeadwrong.Wecangetblindedbyourownexperiencesandprejudices.Indeed,eventhoughmygrandmotherisabletoutilizeherdecadesofexperiencetogivebetterrelationshipadvicethantherestofmyfamily,shestillhassomedubiousviews on what makes a relationship last. For example, she has frequentlyemphasizedtometheimportanceofhavingcommonfriends.Shebelievesthatthiswasakeyfactor inhermarriage’ssuccess:shespentmostwarmeveningswithherhusband,mygrandfather,intheirsmallbackyardinQueens,NewYork,sittingonlawnchairsandgossipingwiththeirtightgroupofneighbors.However, at the risk of throwingmy own grandmother under the bus, data
sciencesuggeststhatGrandma’stheoryiswrong.Ateamofcomputerscientistsrecentlyanalyzedthebiggestdataseteverassembledonhumanrelationships—Facebook.Theylookedata largenumberofcoupleswhowere,atsomepoint,
“in a relationship.” Some of these couples stayed “in a relationship.” Othersswitched their status to “single.”Having a commoncoregroupof friends, theresearchersfound,isastrongpredictorthatarelationshipwillnotlast.Perhapshangingouteverynightwithyourpartnerandthesamesmallgroupofpeopleisnot such a good thing; separate social circles may help make relationshipsstronger.Asyoucansee,our intuitionalone,whenwestayawayfromthecomputers
and go with our gut, can sometimes amaze. But it can make big mistakes.Grandmamay have fallen into one cognitive trap: we tend to exaggerate therelevanceof our ownexperience. In theparlanceof data scientists,weweightourdata,andwegivefartoomuchweighttooneparticulardatapoint:ourselves.Grandmawasso focusedonhereveningschmoozeswithGrandpaand their
friends that she did not think enough about other couples. She forgot to fullyconsider her brother-in-law and his wife, who chitchatted most nights with asmall,consistentgroupoffriendsbutfoughtfrequentlyanddivorced.Sheforgotto fullyconsidermyparents,herdaughterandson-in-law.Myparentsgo theirseparatewaysmanynights—mydadtoajazzcluborballgamewithhisfriends,mymomtoarestaurantorthetheaterwithherfriends;yettheyremainhappilymarried.When relying on our gut, we can also be thrown off by the basic human
fascination with the dramatic. We tend to overestimate the prevalence ofanything that makes for a memorable story. For example, when asked in asurvey, people consistently rank tornadoes as amore common cause of deaththanasthma.Infact,asthmacausesaboutseventytimesmoredeaths.Deathsbyasthmadon’tstandout—anddon’tmakethenews.Deathsbytornadoesdo.Weareoftenwrong,inotherwords,abouthowtheworldworkswhenwerely
justonwhatwehearorpersonallyexperience.Whilethemethodologyofgooddata science is often intuitive, the results are frequently counterintuitive.Datascience takes a natural and intuitive human process—spotting patterns andmakingsenseofthem—andinjectsitwithsteroids,potentiallyshowingusthatthe world works in a completely different way from how we thought it did.That’swhathappenedwhenIstudiedthepredictorsofbasketballsuccess.
WhenIwasalittleboy,Ihadonedreamandonedreamonly:Iwantedtogrow
up to be an economist and data scientist. No. I’m just kidding. I wanteddesperately tobeaprofessionalbasketballplayer, to follow in the footstepsofmyhero,PatrickEwing,all-starcenterfortheNewYorkKnicks.Isometimessuspectthatinsideeverydatascientistisakidtryingtofigureout
whyhischildhooddreamsdidn’tcometrue.SoitisnotsurprisingthatIrecentlyinvestigatedwhatittakestomaketheNBA.Theresultsoftheinvestigationweresurprising. In fact, they demonstrate once again how good data science canchangeyourviewoftheworld,andhowcounterintuitivethenumberscanbe.TheparticularquestionIlookedatisthis:areyoumorelikelytomakeitinthe
NBAifyougrowuppoorormiddle-class?Mostpeoplewouldguesstheformer.Conventionalwisdomsaysthatgrowing
upindifficultcircumstances,perhapsintheprojectswithasingle,teenagemom,helps foster the drive necessary to reach the top levels of this intenselycompetitivesport.ThisviewwasexpressedbyWilliamEllerbee,ahighschoolbasketballcoach
inPhiladelphia, inaninterviewwithSportsIllustrated.“Suburbankids tend toplay for the fun of it,” Ellerbee said. “Inner-city kids look at basketball as amatteroflifeordeath.”I,alas,wasraisedbymarriedparentsintheNewJerseysuburbs.LeBron James, the best player ofmy generation,was born poor to asixteen-year-oldsinglemotherinAkron,Ohio.Indeed, an internet survey I conducted suggested that the majority of
Americans think thesame thingCoachEllerbeeandI thought: thatmostNBAplayersgrowupinpoverty.Isthisconventionalwisdomcorrect?Let’s look at the data. There is no comprehensive data source on the
socioeconomicsofNBAplayers.Butbybeingdatadetectives,byutilizingdatafrom a whole bunch of sources—basketball-reference.com, ancestry.com, theU.S.Census,andothers—wecanfigureoutwhatfamilybackgroundisactuallymostconducivetomakingtheNBA.Thisstudy,youwillnote,usesavarietyofdatasources,someofthembigger,someofthemsmaller,someofthemonline,andsomeofthemoffline.Asexcitingassomeofthenewdigitalsourcesare,agood data scientist is not above consulting old-fashioned sources if they canhelp. The best way to get the right answer to a question is to combine allavailabledata.
The first relevantdata is thebirthplaceofeveryplayer.Foreverycounty intheUnitedStates, I recordedhowmanyblackandwhitemenwerebornin the1980s.IthenrecordedhowmanyofthemreachedtheNBA.Icomparedthistoacounty’s average household income. I also controlled for the racialdemographicsofacounty,since—andthisisasubjectforawholeotherbook—blackmenareaboutfortytimesmorelikelythanwhitementoreachtheNBA.Thedatatellsusthatamanhasasubstantiallybetterchanceofreachingthe
NBA if he was born in a wealthy county. A black kid born in one of thewealthiest counties in the United States, for example, is more than twice aslikelytomaketheNBAthanablackkidborninoneofthepoorestcounties.Fora white kid, the advantage of being born in one of the wealthiest countiescomparedtobeingborninoneofthepoorestis60percent.This suggests, contrary to conventionalwisdom, that poormen are actually
underrepresented in the NBA. However, this data is not perfect, since manywealthycounties in theUnitedStates, suchasNewYorkCounty (Manhattan),also include poor neighborhoods, such as Harlem. So it’s still possible that adifficult childhood helps youmake theNBA.We still needmore clues,moredata.So I investigated the family backgrounds ofNBAplayers.This information
wasfoundinnewsstoriesandonsocialnetworks.Thismethodologywasquitetime-consuming,soIlimitedtheanalysistotheonehundredAfrican-AmericanNBAplayers born in the 1980swho scored themost points.Compared to theaverageblackmanintheUnitedStates,NBAsuperstarswereabout30percentlesslikelytohavebeenborntoateenagemotheroranunwedmother.Inotherwords,thefamilybackgroundsofthebestblackNBAplayersalsosuggestthatacomfortablebackgroundisabigadvantageforachievingsuccess.Thatsaid,neither thecounty-levelbirthdatanor thefamilybackgroundofa
limitedsampleofplayersgivesperfectinformationonthechildhoodsofallNBAplayers. So I was still not entirely convinced that two-parent, middle-classfamilies producemoreNBA stars than single-parent, poor families. Themoredatawecanthrowatthisquestion,thebetter.Then I remembered onemore data point that can provide telling clues to a
man’sbackground.Itwassuggestedinapaperbytwoeconomists,RolandFryerand Steven Levitt, that a black person’s first name is an indication of his
socioeconomic background. Fryer and Levitt studied birth certificates inCalifornia in the 1980s and found that, among African-Americans, poor,uneducated, and single moms tend to give their kids different names than domiddle-class,educated,andmarriedparents.Kidsfrombetter-offbackgroundsaremorelikelytobegivencommonnames,
such asKevin,Chris, and John.Kids from difficult homes in the projects aremore likely to be given unique names, such as Knowshon, Uneek, andBreionshay.African-Americankidsbornintopovertyarenearlytwiceaslikelytohaveanamethatisgiventonootherchildborninthatsameyear.SowhataboutthefirstnamesofblackNBAplayers?Dotheysoundmorelike
middle-classorpoorblacks?Lookingat the same timeperiod,California-bornNBA players were half as likely to have unique names as the average blackmale,astatisticallysignificantdifference.KnowsomeonewhothinkstheNBAisaleagueforkidsfromtheghetto?Tell
him to just listen closely to thenext gameon the radio.Tell him tonote howfrequentlyRussell dribbles pastDwight and then tries to slip the ball past theoutstretchedarmsofJoshandintothewaitinghandsofKevin.IftheNBAreallywerealeaguefilledwithpoorblackmen,itwouldsoundquitedifferent.TherewouldbealotmoremenwithnameslikeLeBron.Now, we have gathered three different pieces of evidence—the county of
birth,themaritalstatusofthemothersofthetopscorers,andthefirstnamesofplayers. No source is perfect. But all three support the same story. Bettersocioeconomic status means a higher chance of making the NBA. Theconventionalwisdom,inotherwords,iswrong.Among all African-Americans born in the 1980s, about 60 percent had
unmarried parents. But I estimate that amongAfrican-Americans born in thatdecade who reached the NBA, a significant majority had married parents. Inotherwords,theNBAisnotcomposedprimarilyofmenwithbackgroundslikethat of LeBron James. There are more men like Chris Bosh, raised by twoparentsinTexaswhocultivatedhisinterestinelectronicgadgets,orChrisPaul,the second son of middle-class parents in Lewisville, North Carolina, whosefamilyjoinedhimonanepisodeofFamilyFeudin2011.The goal of a data scientist is to understand the world. Once we find the
counterintuitiveresult,wecanusemoredatasciencetohelpusexplainwhythe
worldisnotasitseems.Why,forexample,domiddle-classmenhaveanedgeinbasketballrelativetopoormen?Thereareatleasttwoexplanations.First,becausepoormentendtoendupshorter.Scholarshavelongknownthat
childhoodhealthcareandnutritionplayalargeroleinadulthealth.Thisiswhytheaveragemaninthedevelopedworldisnowfourinchestallerthanacenturyand a half ago. Data suggests that Americans from poor backgrounds, due toweakerearly-lifehealthcareandnutrition,areshorter.Data can also tell us the effect of height on reaching the NBA. You
undoubtedlyintuitedthatbeingtallcanbeofassistancetoanaspiringbasketballplayer.Justcontrasttheheightofthetypicalballplayeronthecourttothetypicalfaninthestands.(TheaverageNBAplayeris6’7”;theaverageAmericanmanis5’9”.)Howmuchdoesheightmatter?NBAplayerssometimesfibalittleabouttheir
height, and there is no listingof the complete height distributionofAmericanmales.ButworkingwitharoughmathematicalestimateofwhatthisdistributionmightlooklikeandtheNBA’sownnumbers,itiseasytoconfirmthattheeffectsof height are enormous—maybe even more than we might have suspected. IestimatethateachadditionalinchroughlydoublesyouroddsofmakingittotheNBA.Andthisistruethroughouttheheightdistribution.A5’11”manhastwicetheoddsofreachingtheNBAasa5’10”man.A6’11”manhastwicetheoddsofreachingtheNBAasa6’10”man.Itappears that,amongmenless thansixfeettall,onlyaboutoneintwomillionreachtheNBA.Amongthoseoversevenfeettall,Iandothershaveestimated,somethinglikeoneinfivereachtheNBA.Data, you will note, clarifies why my dream of basketball stardom was
derailed.ItwasnotbecauseIwasbroughtupinthesuburbs.ItwasbecauseIam5’9”andwhite(nottomentionslow).Also,Iamlazy.AndIhavepoorstamina,awfulshooting form,andoccasionallyapanicattackwhen theballgets inmyhand.Asecondreasonthatboysfromtoughbackgroundsmaystruggletomakethe
NBAisthattheysometimeslackcertainsocialskills.Usingdataonthousandsofschoolchildren,economistshavefoundthatmiddle-class,two-parentfamiliesareon average substantially better at raising kids who are trusting, disciplined,persistent,focused,andorganized.Sohowdopoorsocialskillsderailanotherwisepromisingbasketballcareer?
Let’s look at the story ofDougWrenn, one of themost talented basketballprospects in the 1990s. His college coach, Jim Calhoun at the University ofConnecticut,whohas trained futureNBAall-stars, claimedWrenn jumped thehighest of any man he had ever worked with. But Wrenn had a challengingupbringing.HewasraisedbyasinglemotherinBloodAlley,oneoftheroughestneighborhoods in Seattle. In Connecticut, he consistently clashed with thosearound him.Hewould taunt players, question coaches, andwear loose-fittingclothes in violation of team rules. He also had legal troubles—he stole shoesfrom a store and snapped at police officers. Calhoun finally had enough andkickedhimofftheteam.WrenngotasecondchanceattheUniversityofWashington.Butthere,too,an
inability togetalongwithpeoplederailedhim.He foughtwithhiscoachoverplaying time and shot selection andwas kicked off this team as well.WrennwentundraftedbytheNBA,bouncedaroundlowerleagues,movedinwithhismother,andwaseventuallyimprisonedforassault.“Mycareerisover,”Wrenntold the Seattle Times in 2009. “My dreams, my aspirations are over. DougWrennisdead.Thatbasketballplayer,thatdudeisdead.It’sover.”WrennhadthetalentnotjusttobeanNBAplayer,buttobeagreat,evenalegendaryplayer.Butheneverdevelopedthetemperamenttoevenstayonacollegeteam.Perhapsifhe’dhadastableearlylife,hecouldhavebeenthenextMichaelJordan.MichaelJordan,ofcourse,alsohadan impressivevertical leap.Plusa large
ego and intense competitiveness—a personality at times that was not unlikeWrenn’s.Jordancouldbeadifficultkid.Attheageoftwelve,hewaskickedoutofschoolforfighting.ButhehadatleastonethingthatWrennlacked:astable,middle-class upbringing. His father was an equipment supervisor for GeneralElectric,hismotherabanker.Andtheyhelpedhimnavigatehiscareer.Infact,Jordan’slifeisfilledwithstoriesofhisfamilyguidinghimawayfrom
the traps thatagreat,competitive talentcan fall into.After Jordanwaskickedoutofschool,hismotherrespondedbytakinghimwithhertowork.Hewasnotallowed to leave thecarand insteadhad to sit there in theparking lot readingbooks.AfterhewasdraftedbytheChicagoBulls,hisparentsandsiblingstookturnsvisitinghimtomakesureheavoidedthetemptationsthatcomewithfameandmoney.Jordan’scareerdidnotendlikeWrenn’s,withalittle-readquoteintheSeattle
Times. ItendedwithaspeechuponinductionintotheBasketballHallofFamethatwaswatchedbymillionsof people. In his speech, Jordan saidhe tried tostay “focused on the good things about life—you know how people perceiveyou,howyourespectthem...howyouareperceivedpublicly.Takeapauseandthinkaboutthethingsthatyoudo.Andthatallcamefrommyparents.”ThedatatellsusJordanisabsolutelyrighttothankhismiddle-class,married
parents.Thedata tellsus that inworse-off families, inworse-offcommunities,thereareNBA-leveltalentswhoarenotintheNBA.Thesemenhadthegenes,had the ambition, but never developed the temperament to become basketballsuperstars.Andno—whateverwemightintuit—beingincircumstancessodesperatethat
basketball seems “amatter of life or death”doesnot help.Stories like that ofDougWrenncanhelpillustratethis.Anddataprovesit.InJune2013,LeBronJameswasinterviewedontelevisionafterwinninghis
secondNBAchampionship.(Hehassincewonathird.)“I’mLeBronJames,”heannounced.“FromAkron,Ohio.Fromtheinnercity.Iamnotevensupposedtobehere.”Twitter andother socialnetworks eruptedwith criticism.Howcouldsuch a supremely gifted person, identified from an absurdly young age as thefutureofbasketball, claim tobe anunderdog? In fact, anyone fromadifficultenvironment,nomatterhisathleticprowess,has theoddsstackedagainsthim.James’saccomplishments, inotherwords,areevenmoreexceptional than theyappeartobeatfirst.Dataprovesthat,too.
2
WASFREUDRIGHT?
Irecentlysawapersonwalkingdownastreetdescribedasa“penistrian.”Youcaught that, right?A “penistrian” insteadof a “pedestrian.” I saw it in a largedataset of typos peoplemake.A person sees someonewalking andwrites theword“penis.”Hastomeansomething,right?Irecentlylearnedofamanwhodreamedofeatingabananawhilewalkingto
thealtartomarryhiswife.Isawitinalargedatasetofdreamspeoplerecordonanapp.Amanimaginesmarryingawomanwhileeatingaphallic-shapedfood.Thatalsohastomeansomething,right?WasSigmundFreudright?Sincehistheoriesfirstcametopublicattention,the
mosthonestanswer to thisquestionwouldbeashrug. ItwasKarlPopper, theAustrian-British philosopher, who made this point clearest. Popper famouslyclaimed that Freud’s theories were not falsifiable. There was no way to testwhethertheyweretrueorfalse.Freudcouldsaythepersonwritingofa“penistrian”wasrevealingapossibly
repressed sexual desire. The person could respond that she wasn’t revealinganything; that she could have just as easily made an innocent typo, such as“pedaltrian.” It would be a he-said, she-said situation. Freud could say thegentleman dreaming of eating a banana on his wedding day was secretlythinking of a penis, revealing his desire to really marry a man rather than awoman.Thegentlemancouldsayhejusthappenedtobedreamingofabanana.Hecouldhavejustaseasilybeendreamingofeatinganappleashewalkedto
thealtar.Itwouldbehe-said,he-said.TherewasnowaytoputFreud’stheorytoarealtest.Untilnow,thatis.Data science makes many parts of Freud falsifiable—it puts many of his
famoustheories to the test.Let’sstartwithphallicsymbols indreams.Usingahuge dataset of recorded dreams,we can readily note how frequently phallic-shapedobjectsappear.Foodisagoodplacetofocusthisstudy.Itshowsupinmanydreams,andmanyfoodsareshapedlikephalluses—bananas,cucumbers,hotdogs,etc.Wecanthenmeasurethefactorsthatmightmakeusdreammoreaboutcertainfoodsthanothers—howfrequentlytheyareeaten,howtastymostpeoplefindthem,and,yes,whethertheyarephallicinnature.Wecantestwhethertwofoods,bothofwhichareequallypopular,butoneof
whichisshapedlikeaphallus,appearindreamsindifferentamounts.Ifphallus-shaped foods are no more likely to be dreamed about than other foods, thenphallicsymbolsarenotasignificantfactorinourdreams.ThankstoBigData,thispartofFreud’stheorymayindeedbefalsifiable.IreceiveddatafromShadow,anappthatasksuserstorecordtheirdreams.I
codedthefoodsincludedintensofthousandsofdreams.Overall,whatmakesusdreamoffoods?Themainpredictorishowfrequently
weconsumethem.Thesubstancethat ismostdreamedaboutiswater.Thetoptwenty foods include chicken, bread, sandwiches, and rice—all notably un-Freudian.Thesecondpredictorofhowfrequentlyafoodappearsindreamsishowtasty
people find it. The two foodswe dream aboutmost often are the notably un-Freudianbutfamouslytastychocolateandpizza.So what about phallic-shaped foods? Do they sneak into our dreams with
unexpectedfrequency?Nope.Bananasarethesecondmostcommonfruittoappearindreams.Buttheyare
also the second most commonly consumed fruit. So we don’t need Freud toexplain how oftenwe dream about bananas. Cucumbers are the seventhmostcommon vegetable to appear in dreams.They are the seventhmost consumedvegetable.Soagain their shape isn’tnecessary toexplain theirpresence inourmindsaswesleep.Hotdogsaredreamedoffarlessfrequentlythanhamburgers.Thisistrueevencontrollingforthefactthatpeopleeatmoreburgersthandogs.
Overall,usingaregressionanalysis(amethodthatallowssocialscientiststotease apart the impact of multiple factors) across all fruits and vegetables, Ifoundthatafood’sbeingshapedlikeaphallusdidnotgiveitmorelikelihoodofappearing in dreams thanwould be expected by its popularity. This theory ofFreud’sisfalsifiable—and,atleastaccordingtomylookatthedata,false.Next,considerFreudianslips.Thepsychologisthypothesizedthatweuseour
errors—thewayswemisspeakormiswrite—torevealoursubconsciousdesires,frequentlysexual.CanweuseBigDatatotestthis?Here’soneway:seeifourerrors—our slips—lean in the direction of the naughty. If our buried sexualdesires sneak out in our slips, there should be a disproportionate number oferrorsthatincludewordslike“penis,”“cock,”and“sex.”ThisiswhyIstudiedadatasetofmorethan40,000typingerrorscollectedby
Microsoftresearchers.Thedatasetincludedmistakesthatpeoplemakebutthenimmediately correct. In these tensof thousandsof errors, therewereplentyofindividuals committing errors of a sexual sort. There was the aforementioned“penistrian.”Therewasalsosomeonewhotyped“sexurity”insteadof“security”and “cocks” instead of “rocks.” But there were also plenty of innocent slips.Peoplewroteof“pindows”and“fegetables,”“aftermoons”and“refriderators.”Sowasthenumberofsexualslipsunusual?Totestthis,IfirstusedtheMicrosoftdatasettomodelhowfrequentlypeople
mistakenlyswitchparticularletters.Icalculatedhowoftentheyreplaceatwithans,agwithanh.Ithencreatedacomputerprogramthatmademistakesinthewaythatpeopledo.WemightcallitErrorBot.ErrorBotreplacedatwithanswiththesamefrequencythathumansintheMicrosoftstudydid.Itreplacedagwithanhasoftenastheydid.Andsoon.IrantheprogramonthesamewordspeoplehadgottenwrongintheMicrosoftstudy.Inotherwords,thebottriedtospell“pedestrian”and“rocks,”“windows”and“refrigerator.”Butitswitchedanrwithatasoftenaspeopledoandwrote,forexample,“tocks.”Itswitchedanrwithacasoftenashumansdoandwrote“cocks.”So what do we learn from comparing Error Bot with normally careless
humans?Aftermakinga fewmillionerrors, just frommisplacing letters in theways that humans do, Error Bot had made numerous mistakes of a Freudiannature. It misspelled “seashell” as “sexshell,” “lipstick” as “lipsdick,” and“luckiest” as “fuckiest,” alongwithmany other similarmistakes.And—here’s
the keypoint—ErrorBot,whichof coursedoesnot have a subconscious,wasjust as likely tomake errors that could be perceived as sexual as real peoplewere.Withthecaveat,aswesocialscientistsliketosay,thatthereneedstobemore research, thismeans that sexually oriented errors are nomore likely forhumanstomakethancanbeexpectedbychance.Inotherwords,forpeopletomakeerrorssuchas“penistrian,”“sexurity,”and
“cocks,”it isnotnecessarytohavesomeconnectionbetweenmistakesandtheforbidden,sometheoryofthemindwherepeoplerevealtheirsecretdesiresviatheir errors.These slipsof the fingers canbe explainedentirelyby the typicalfrequency of typos. People make lots of mistakes. And if you make enoughmistakes, eventually you start saying things like “lipsdick,” “fuckiest,” and“penistrian.”Ifamonkeytypeslongenough,hewilleventuallywrite“tobeornottobe.”Ifapersontypeslongenough,shewilleventuallywrite“penistrian.”Freud’stheorythaterrorsrevealoursubconsciouswantsisindeedfalsifiable
—and,accordingtomyanalysisofthedata,false.BigData tellsus abanana is always just abanana anda “penistrian” just a
misspelled“pedestrian.”
SowasFreud totally off-target in all his theories?Not quite.When I first gotaccess to PornHub data, I found a revelation there that struck me as at leastsomewhat Freudian. In fact, this is among the most surprising things I havefoundyetduringmydata investigations:a shockingnumberofpeoplevisitingmainstreampornsitesarelookingforportrayalsofincest.Of the top hundred searches bymen on PornHub, one of themost popular
porn sites, sixteen are looking for incest-themed videos. Fairwarning—this isgoingtogetalittlegraphic:theyinclude“brotherandsister,”“stepmomfucksson,” “mom and son,” “mom fucks son,” and “real brother and sister.” Thepluralityofmaleincestuoussearchesareforscenesfeaturingmothersandsons.Andwomen?Nineof the tophundredsearchesbywomenonPornHubareforincest-themedvideos,andtheyfeaturesimilarimagery—thoughwiththegenderofanyparentandchildwhoismentionedusuallyreversed.Thusthepluralityofincestuous searches made by women are for scenes featuring fathers anddaughters.It’s not hard to locate in this data at least a faint echo of Freud’s Oedipal
complex.He hypothesized a near-universal desire in childhood,which is laterrepressed, for sexual involvement with opposite-sex parents. If only theViennese psychologist had lived long enough to turn his analytic skills toPornHubdata,where interest inopposite-sexparentsseemstobeborneoutbyadults—withgreatexplicitness—andlittleisrepressed.Ofcourse,PornHubdatacan’t tellus forcertainwhopeopleare fantasizing
aboutwhenwatchingsuchvideos.Aretheyactuallyimagininghavingsexwiththeir own parents? Google searches can give some more clues that there areplentyofpeoplewithsuchdesires.Consider all searches of the form “I want to have sex with my . . .” The
number oneway to complete this search is “mom.”Overall,more than three-fourths of searches of this form are incestuous. And this is not due to theparticularphrasing.Searchesoftheform“Iamattractedto. . . ,”forexample,areevenmoredominatedbyadmissionsofincestuousdesires.NowIconcede—attheriskofdisappointingHerrFreud—thatthesearenotparticularlycommonsearches: a few thousand people every year in theUnited States admitting anattractiontotheirmother.SomeonewouldalsohavetobreakthenewstoFreudthatGoogle searches, aswill be discussed later in this book, sometimes skewtowardtheforbidden.Butstill.ThereareplentyofinappropriateattractionsthatpeoplehavethatI
wouldhaveexpectedtohavebeenmentionedmorefrequentlyinsearches.Boss?Employee? Student? Therapist? Patient? Wife’s best friend? Daughter’s bestfriend?Wife’s sister?Best friend’swife?None of these confessed desires cancompetewithmom.Maybe,combinedwith thePornHubdata, that reallydoesmeansomething.And Freud’s general assertion that sexuality can be shaped by childhood
experiencesissupportedelsewhereinGoogleandPornHubdata,whichrevealsthatmen,atleast,retainaninordinatenumberoffantasiesrelatedtochildhood.Accordingtosearchesfromwivesabouttheirhusbands,someofthetopfetishesof adult men are the desire to wear diapers and wanting to be breastfed,particularly, as discussed earlier, in India. Moreover, cartoon porn—animatedexplicit sex scenes featuring characters from shows popular among adolescentboys—hasachievedahighdegreeofpopularity.Orconsidertheoccupationsofwomenmostfrequentlysearchedforinpornbymen.Menwhoare18–24years
old searchmost frequently forwomenwhoarebabysitters.Asdo25–64-year-oldmen.Andmen65yearsandolder.Andformenineveryagegroup,teacherandcheerleaderarebothinthetopfour.Clearly,theearlyyearsoflifeseemtoplayanoutsizeroleinmen’sadultfantasies.Ihavenotyetbeenabletouseallthisunprecedenteddataonadultsexualityto
figure out precisely how sexual preferences form.Over the next few decades,other social scientists and I will be able to create new, falsifiable theories onadultsexualityandtestthemwithactualdata.Already I can predict somebasic themes thatwill undoubtedly be part of a
data-based theory of adult sexuality. It is clearly not going to be the identicalstorytotheoneFreudtold,withhisparticular,well-defined,universalstagesofchildhood and repression. But, based onmy first look at PornHub data, I amabsolutely certain the final verdict on adult sexuality will feature some keythemes that Freud emphasized. Childhood will play a major role. So willmothers.
ItlikelywouldhavebeenimpossibletoanalyzeFreudinthiswaytenyearsago.Itcertainlywouldhavebeenimpossibleeightyyearsago,whenFreudwasstillalive. So let’s think throughwhy these data sources helped.This exercise canhelpusunderstandwhyBigDataissopowerful.Remember,wehavesaidthatjusthavingmoundsandmoundsofdatabyitself
doesn’tautomaticallygenerate insights.Data size,by itself, isoverrated.Why,then, isBigData so powerful?Whywill it create a revolution in howwe seeourselves?Thereare,Iclaim,fouruniquepowersofBigData.ThisanalysisofFreudprovidesagoodillustrationofthem.Youmayhavenoticed,tobeginwith,thatwe’retakingpornographyseriously
inthisdiscussionofFreud.Andwearegoingtoutilizedatafrompornographyfrequently in this book. Somewhat surprisingly, porn data is rarely utilized bysociologists, most of whom are comfortable relying on the traditional surveydatasets theyhavebuilt theircareerson.Butamoment’s reflectionshows thatthewidespreaduseofporn—andthesearchandviewsdatathatcomeswithit—isthemostimportantdevelopmentinourabilitytounderstandhumansexualityin, well . . . Actually, it’s probably the most important ever. It is data thatSchopenhauer, Nietzsche, Freud, and Foucault would have drooled over. This
datadidnotexistwhentheywerealive.Itdidnotexistacoupledecadesago.Itexistsnow.Therearemanyuniquedatasources,onarangeoftopics,thatgiveuswindowsintoareasaboutwhichwecouldpreviouslyjustguess.OfferingupnewtypesofdataisthefirstpowerofBigData.TheporndataandtheGooglesearchdataarenotjustnew;theyarehonest.In
thepre-digitalage,peoplehidtheirembarrassingthoughtsfromotherpeople.Inthedigitalage,theystillhidethemfromotherpeople,butnotfromtheinternetand in particular sites such as Google and PornHub, which protect theiranonymity. These sites function as a sort of digital truth serum—hence ourability to uncover awidespread fascinationwith incest.BigData allows us tofinallyseewhatpeoplereallywantandreallydo,notwhat theysay theywantandsaytheydo.ProvidinghonestdataisthesecondpowerofBigData.Becausethereisnowsomuchdata,thereismeaningfulinformationoneven
tiny slices of a population.We can compare, say, the number of people whodreamofcucumbersversusthosewhodreamoftomatoes.AllowingustozoominonsmallsubsetsofpeopleisthethirdpowerofBigData.BigData has onemore impressive power—one thatwas not utilized inmy
quickstudyofFreudbutcouldbeinafutureone:itallowsustoundertakerapid,controlled experiments. This allows us to test for causality, not merelycorrelations.Thesekindsof tests aremostlyusedbybusinessesnow,but theywillproveapowerful tool forsocialscientists.Allowingus todomanycausalexperimentsisthefourthpowerofBigData.Nowit is timetounpackeachof thesepowersandexploreexactlywhyBig
Datamatters.
3
DATAREIMAGINED
At 6 A.M. on a particular Friday of every month, the streets of most ofManhattanwillbelargelydesolate.Thestoresliningthesestreetswillbeclosed,their façades covered by steel security gates, the apartments above dark andsilent.The floors of Goldman Sachs, the global investment banking institution in
lower Manhattan, on the other hand, will be brightly lit, its elevators takingthousands of workers to their desks. By 7 A.M. most of these desks will beoccupied.Itwouldnotbeunfaironanyotherday todescribe thishour in thispartof
townassleepy.OnthisFridaymorning,however,therewillbeabuzzofenergyand excitement.On this day, information thatwillmassively impact the stockmarketissettoarrive.Minutes after its release, this information will be reported by news sites.
Seconds after its release, this information will be discussed, debated, anddissected,loudly,atGoldmanandhundredsofotherfinancialfirms.Butmuchoftherealactioninfinancethesedayshappensinmilliseconds.Goldmanandotherfinancialfirmspaidtensofmillionsofdollarstogetaccesstofiber-opticcablesthat reduced the time information travels fromChicago toNew Jersey by justfourmilliseconds (from17 to13).Financial firmshave algorithms inplace toreadtheinformationandtradebasedonit—allinamatterofmilliseconds.After
this crucial information is released, themarket willmove in less time than ittakesyoutoblinkyoureye.SowhatisthiscrucialdatathatissovaluabletoGoldmanandnumerousother
financialinstitutions?Themonthlyunemploymentrate.The rate, however—which has such a profound impact on the stockmarket
that financial institutions have done whatever it takes to maximize the speedwithwhichtheyreceive,analyze,andactuponit—isfromaphonesurveythattheBureauofLaborStatisticsconductsandtheinformationissomethreeweeks—or2billionmilliseconds—oldbythetimeitisreleased.Whenfirmsarespendingmillionsofdollarstochipamillisecondofftheflow
ofinformation,itmightstrikeyouasmorethanabitstrangethatthegovernmenttakessolongtocalculatetheunemploymentrate.Indeed,gettingthesecriticalnumbersoutsoonerwasoneofAlanKrueger’s
primary agendas when he took over as President Obama’s chairman of theCouncilofEconomicAdvisors in2011.Hewasunsuccessful.“Either theBLSdoesn’t have the resources,” he concluded. “Or they are stuck in twentieth-centurythinking.”Withthegovernmentclearlynotpickingupthepaceanytimesoon,istherea
way to get at least a roughmeasure of the unemployment statistics at a fasterrate? In this high-tech era—whennearly every click anyhumanmakeson theinternet is recorded somewhere—dowe really have towaitweeks to find outhowmanypeopleareoutofwork?OnepotentialsolutionwasinspiredbytheworkofaformerGoogleengineer,
Jeremy Ginsberg. Ginsberg noticed that health data, like unemployment data,wasreleasedwithadelaybythegovernment.TheCentersforDiseaseControland Prevention takes oneweek to release influenza data, even though doctorsandhospitalswouldbenefitfromhavingthedatamuchsooner.Ginsbergsuspectedthatpeoplesickwiththefluarelikelytomakeflu-related
searches. In essence, they would report their symptoms to Google. Thesesearches, he thought, could give a reasonably accuratemeasure of the currentinfluenza rate. Indeed, searches such as “flu symptoms” and “muscle aches”haveprovenimportantindicatorsofhowfastthefluisspreading.*
Meanwhile,Googleengineerscreatedaservice,GoogleCorrelate, thatgives
outside researchers the means to experiment with the same type of analysesacross a wide range of fields, not just health. Researchers can take any dataseries that they are trackingover timeand seewhatGoogle searches correlatemostwiththatdataset.Forexample,usingGoogleCorrelate,HalVarian,chiefeconomistatGoogle,
andIwereabletoshowwhichsearchesmostcloselytrackhousingprices.Whenhousingpricesare rising,Americans tend tosearch forsuchphrasesas“80/20mortgage,”“newhomebuilder,”and“appreciation rate.”Whenhousingpricesare falling,Americans tend to search for suchphrases as “short sale process,”“underwatermortgage,”and“mortgageforgivenessdebtrelief.”SocanGooglesearchesbeusedasalitmustestforunemploymentinthesame
way they can for housing prices or influenza? Can we tell, simply by whatpeopleareGoogling,howmanypeopleareunemployed,andcanwedosowellbeforethegovernmentcollatesitssurveyresults?Oneday,IputtheUnitedStatesunemploymentratefrom2004through2011
intoGoogleCorrelate.OfthetrillionsofGooglesearchesduringthattime,whatdoyouthinkturned
out to be most tightly connected to unemployment? You might imagine“unemploymentoffice”—orsomethingsimilar.Thatwashighbutnotattheverytop.“Newjobs”?Alsohighbutalsonotattheverytop.The highest during the period I searched—and these terms do shift—was
“Slutload.”That’s right, themost frequent searchwas for a pornographic site.Thismayseemstrangeatfirstblush,butunemployedpeoplepresumablyhavealotoftimeontheirhands.Manyarestuckathome,aloneandbored.Anotherofthehighlycorrelatedsearches—thisoneinthePGrealm—is“SpiderSolitaire.”Again,notsurprisingforagroupofpeoplewhopresumablyhavealotoftimeontheirhands.Now,Iamnotarguing,basedonthisoneanalysis,thattracking“Slutload”or
“SpiderSolitaire”isthebestwaytopredicttheunemploymentrate.Thespecificdiversions that unemployed people use can change over time (at one point,“Rawtube,”adifferentpornsite,wasamongthestrongestcorrelations)andnoneoftheseparticulartermsbyitselfattractsanythingapproachingapluralityoftheunemployed.ButIhavegenerallyfoundthatamixofdiversion-relatedsearchescan track the unemployment rate—and would be a part of the best model
predictingit.ThisexampleillustratesthefirstpowerofBigData,thereimaginingofwhat
qualifiesasdata.Frequently,thevalueofBigDataisnotitssize;it’sthatitcanoffer you new kinds of information to study—information that had neverpreviouslybeencollected.BeforeGoogletherewasinformationavailableoncertainleisureactivities—
movie ticket sales, for example—that could yield some clues as to howmuchtimepeoplehaveontheirhands.Buttheopportunitytoknowhowmuchsolitaireisbeingplayedorpornisbeingwatchedisnew—andpowerful.Inthisinstancethis datamight help usmore quicklymeasure how the economy is doing—atleastuntilthegovernmentlearnstoconductandcollateasurveymorequickly.
LifeonGoogle’s campus inMountainView,California, is verydifferent fromthatinGoldmanSachs’sManhattanheadquarters.At9A.M.Google’sofficesarenearlyempty.Ifanyworkersarearound,itisprobablytoeatbreakfastforfree—banana-blueberry pancakes, scrambled egg whites, filtered cucumber water.Someemployeesmightbeoutoftown:atanoff-sitemeetinginBoulderorLasVegasorperhapsonafreeski trip toLakeTahoe.Aroundlunchtime, thesandvolleyballcourtsandgrasssoccerfieldswillbefilled.ThebestburritoI’veevereatenwasatGoogle’sMexicanrestaurant.Howcanoneofthebiggestandmostcompetitivetechcompaniesintheworld
seeminglybesorelaxedandgenerous?GoogleharnessedBigDatainawaythatnoothercompanyeverhastobuildanautomatedmoneystream.ThecompanyplaysacrucialroleinthisbooksinceGooglesearchesarebyfarthedominantsource of BigData. But it is important to remember that Google’s success isitselfbuiltonthecollectionofanewkindofdata.Ifyouareoldenoughtohaveusedtheinternetinthetwentiethcentury,you
might remember the various search engines that existed back then—MetaCrawler,Lycos,AltaVista, to name a few.Andyoumight remember thatthesesearchengineswere,atbest,mildlyreliable.Sometimes,ifyouwerelucky,theymanagedtofindwhatyouwanted.Often,theywouldnot.Ifyoutyped“BillClinton”into themostpopularsearchengines in the late1990s, the topresultsincludeda randomsite that justproclaimed“BillClintonSucks”or a site thatfeaturedabadClintonjoke.Hardlythemostrelevantinformationaboutthethen
presidentoftheUnitedStates.In 1998, Google showed up. And its search results were undeniably better
those that of every one of its competitors. If you typed “Bill Clinton” intoGooglein1998,youweregivenhiswebsite,theWhiteHouseemailaddress,andthebestbiographiesofthemanthatexistedontheinternet.Googleseemedtobemagic.WhathadGoogle’sfounders,SergeyBrinandLarryPage,donedifferently?Othersearchengineslocatedfortheirusersthewebsitesthatmostfrequently
includedthephraseforwhichtheysearched.Ifyouwerelookingforinformationon“BillClinton,”thosesearchengineswouldfind,acrosstheentireinternet,thewebsitesthathadthemostreferencestoBillClinton.Thereweremanyreasonsthisrankingsystemwasimperfectandoneofthemwasthatitwaseasytogamethesystem.Ajokesitewiththetext“BillClintonBillClintonBillClintonBillClintonBillClinton”hiddensomewhereonitspagewouldscorehigherthantheWhiteHouse’sofficialwebsite.*
WhatBrinandPagedidwasfindawaytorecordanewtypeofinformationthatwasfarmorevaluablethanasimplecountofwords.Websitesoftenwould,when discussing a subject, link to the sites they thoughtweremost helpful inunderstanding that subject. For example, theNew York Times, if it mentionedBill Clinton, might allow readers who clicked on his name to be sent to theWhiteHouse’sofficialwebsite.Everywebsitecreatingoneoftheselinkswas,inasense,givingitsopinionof
the best information on Bill Clinton. Brin and Page could aggregate all theseopinions on every topic. It could crowdsource the opinions of theNew YorkTimes, millions of Listservs, hundreds of bloggers, and everyone else on theinternet.Ifawholeslewofpeoplethoughtthatthemostimportantlinkfor“BillClinton”washisofficialwebsite,thiswasprobablythewebsitethatmostpeoplesearchingfor“BillClinton”wouldwanttosee.Thesekindsoflinksweredatathatothersearchenginesdidn’tevenconsider,
and theywere incrediblypredictiveof themost useful informationon a giventopic.ThepointhereisthatGoogledidn’tdominatesearchmerelybycollectingmoredatathaneveryoneelse.Theydiditbyfindingabettertypeofdata.Fewerthantwoyearsafteritslaunch,Google,poweredbyitslinkanalysis,grewtobethe internet’s most popular search engine. Today, Brin and Page are together
worthmorethan$60billion.AswithGoogle, sowith everyone else trying to use data to understand the
world.TheBigDatarevolutionislessaboutcollectingmoreandmoredata.Itisaboutcollectingtherightdata.Buttheinternetisn’ttheonlyplacewhereyoucancollectnewdataandwhere
gettingtherightdatacanhaveprofoundlydisruptiveresults.Thisbookislargelyabouthowthedataon thewebcanhelpusbetterunderstandpeople.Thenextsection,however,doesn’thaveanythingtodowithwebdata.Infact,itdoesn’thaveanythingtodowithpeople.Butitdoeshelpillustratethemainpointofthischapter: the outsize value of new, unconventional data. And the principles itteachesusarehelpfulinunderstandingthedigital-baseddatarevolution.
BODIESASDATA
In the summer of 2013, a reddish-brown horse, of above-average size,with ablackmane,sat inasmallbarn inupstateNewYork.Hewasoneof152one-year-old horses at August’s Fasig-Tipton Select Yearling Sale in SaratogaSprings, and one of ten thousand one-year-old horses being auctioned off thatyear.Wealthymenandwomen,whentheyshelloutalotofmoneyonaracehorse,
wantthehonorofchoosingthehorse’sname.Thusthereddish-brownhorsedidnotyethaveanameand,likemosthorsesattheauction,wasinsteadreferredtobyhisbarnnumber,85.TherewaslittlethatmadeNo.85standoutatthisauction.Hispedigreewas
good but not great. His sire (father), Pioneerof [sic] the Nile, was a topracehorse,butotherkidsofPioneeroftheNilehadnothadmuchracingsuccess.TherewerealsodoubtsbasedonhowNo.85 looked.Hehada scratchonhisankle,forexample,whichsomebuyersworriedmightbeevidenceofaninjury.ThecurrentownerofNo.85wasanEgyptianbeermagnate,AhmedZayat,
who had come to upstate NewYork looking to sell the horse and buy a fewothers.Like almost all owners, Zayat hired a team of experts to help him choose
which horses to buy. But his experts were a bit different than those used bynearlyeveryotherowner.The typicalhorseexpertsyou’d seeat anevent like
this were middle-aged men, many from Kentucky or rural Florida with littleeducationbutwitha familybackground in thehorsebusiness.Zayat’sexperts,however,camefromasmallfirmcalledEQB.TheheadofEQBwasnotanold-school horse man. The head of EQB, instead, was Jeff Seder, an eccentric,Philadelphia-bornmanwithapileofdegreesfromHarvard.ZayathadworkedwithEQBbefore,sotheprocesswasfamiliar.Afterafew
daysofevaluatinghorses,Seder’steamwouldcomebacktoZayatwithfiveorsohorsestheyrecommendedbuyingtoreplaceNo.85.This time, though,wasdifferent.Seder’s teamcameback toZayat and told
him theywereunable to fulfillhis request.Theysimplycouldnot recommendthathebuyanyofthe151otherhorsesofferedupforsalethatday.Instead,theyoffered an unexpected and near-desperate plea. Zayat absolutely, positivelycould not sell horse No. 85. This horse, EQB declared, was not just the besthorse in theauction;hewas thebesthorseof theyearand,quitepossibly, thedecade.“Sellyourhouse,”theteamimploredhim.“Donotsellthishorse.”Thenextday,withlittlefanfare,horseNo.85wasboughtfor$300,000bya
mancallinghimselfIncardoBloodstock.Bloodstock,itwaslaterrevealed,wasapseudonymusedbyAhmedZayat.InresponsetothepleasofSeder,Zayathadbought backhis ownhorse, an almost unprecedented action. (The rules of theauctionpreventedZayatfromsimplyremovingthehorsefromtheauction,thusnecessitating the pseudonymous transaction.) Sixty-two horses at the auctionsoldforahigherpricethanhorseNo.85,withtwofetchingmorethan$1millioneach.Threemonthslater,ZayatfinallychoseanameforNo.85:AmericanPharoah.
Andeighteenmonths later,ona75-degreeSaturdayevening in the suburbsofNewYork City, American Pharoah became the first horse inmore than threedecadestowintheTripleCrown.What did Jeff Seder know about horse No. 85 that apparently nobody else
knew?HowdidthisHarvardmangetsogoodatevaluatinghorses?I first met up with Seder, who was then sixty-four, on a scorching June
afternoon inOcala,Florida,more than ayear afterAmericanPharoah’sTripleCrown. The event was a weeklong showcase for two-year-old horses,culminatinginanauction,notdissimilartothe2013eventwhereZayatboughthisownhorseback.
Seder has a booming, Mel Brooks–like voice, a full head of hair, and adiscernablebounceinhisstep.Hewaswearingsuspenders,khakis,ablackshirtwithhiscompany’slogoonit,andahearingaid.Over thenext threedays, he toldmehis life story—andhowhebecame so
goodatpredictinghorses.Itwashardlyadirectroute.AftergraduatingmagnacumlaudeandPhiBetaKappafromHarvard,Sederwenton toget,alsofromHarvard,alawdegreeandabusinessdegree.Atagetwenty-six,hewasworkingas an analyst forCitigroup inNewYorkCity but felt unhappy and burnt-out.Oneday,sittingintheatriumatthefirm’snewofficesonLexingtonAvenue,hefoundhimself studying a largemural of an open field.The painting remindedhimof his love of the countryside and his love of horses.Hewent home andlookedathimselfinthemirrorwithhisthree-piecesuiton.Heknewthenthathewasnotmeant tobeabankerandhewasnotmeant to live inNewYorkCity.Thenextmorning,hequithisjob.Sedermoved to ruralPennsylvania and ambled through a variety of jobs in
textiles and sports medicine before devoting his life full-time to his passion:predictingthesuccessofracehorses.Thenumbersinhorseracingarerough.Oftheonethousandtwo-year-oldhorsesshowcasedatOcala’sauction,oneof thenation’s most prestigious, perhaps five will end up winning a race with asignificantpurse.Whatwillhappentotheother995horses?Roughlyone-thirdwill prove too slow. Another one-third will get injured—most because theirlimbscan’twithstand theenormouspressureofgallopingat full speed. (Everyyear,hundredsofhorsesdieonAmericanracetracks,mostlyduetobrokenlegs.)Andtheremainingone-thirdwillhavewhatyoumightcallBartlebysyndrome.Bartleby, the scrivener in Herman Melville’s extraordinary short story, stopsworkingandanswerseveryrequesthisemployermakeswith“Iwouldprefernotto.”Manyhorses, early in their racingcareers, apparentlycome to realize thattheydon’tneed to run if theydon’t feel like it.Theymaystart a race runningfast, but, at some point, they’ll simply slow down or stop running altogether.Why run around anoval as fast as you can, especiallywhenyour hooves andhocks ache? “I would prefer not to,” they decide. (I have a soft spot forBartlebys,horseorhuman.)Withtheoddsstackedagainstthem,howcanownerspickaprofitablehorse?
Historically,peoplehavebelieved that thebestway topredictwhetherahorse
willsucceedhasbeentoanalyzehisorherpedigree.Beingahorseexpertmeansbeingabletorattleoffeverythinganybodycouldpossiblywanttoknowaboutahorse’sfather,mother,grandfathers,grandmothers,brothers,andsisters.Agentsannounce, for instance, that a big horse “came to her size legitimately” if hermother’slinehaslotsofbighorses.Thereisoneproblem,however.Whilepedigreedoesmatter, itcanstillonly
explainasmallpartofaracinghorse’ssuccess.Considerthetrackrecordoffullsiblings of all the horses named Horse of the Year, racing’s most prestigiousannual award. These horses have the best possible pedigrees—the identicalfamily history asworld-historical horses. Still,more than three-fourths do notwinamajorrace.Thetraditionalwayofpredictinghorsesuccess,thedatatellsus,leavesplentyofroomforimprovement.It’s actuallynot that surprising thatpedigree isnot thatpredictive.Thinkof
humans.ImagineanNBAownerwhoboughthisfuture team,as ten-year-olds,based on their pedigrees. He would have hired an agent to examine EarvinJohnson III, son of “Magic” Johnson. “He’s got nice size, thus far,” an agentmight say. “It’s legitimate size, from the Johnson line. He should have greatvision,selflessness,size,andspeed.Heseemstobeoutgoing,greatpersonality.Confidentwalk.Personable.This isagreatbet.”Unfortunately, fourteenyearslater,thisownerwouldhavea6’2”(shortforaproballplayer)fashionbloggerforE!EarvinJohnsonIIImightbeofgreatassistanceindesigningtheuniforms,buthewouldprobablyofferlittlehelponthecourt.Alongwith the fashionblogger, anNBAownerwhochose a teamasmany
ownerschoosehorseswouldlikelysnapupJeffreyandMarcusJordan,bothsonsofMichael Jordan, andboth ofwhomprovedmediocre college players.Goodluck against the Cleveland Cavaliers. They are led by LeBron James, whosemom is 5’5”. Or imagine a country that elected its leaders based on theirpedigrees.We’dbeledbypeoplelikeGeorgeW.Bush.(Sorry,couldn’tresist.)Horse agents do use other information besides pedigree. For example, they
analyzethegaitsoftwo-year-oldsandexaminehorsesvisually.InOcala,Ispenthours chattingwith various agents, whichwas long enough to determine thattherewaslittleagreementonwhatinfacttheywerelookingfor.Addtotheserampantcontradictionsanduncertaintiesthefactthatsomehorse
buyers havewhat seems like infinite funds, and you get amarketwith rather
largeinefficiencies.Tenyearsago,HorseNo.153wasa two-year-oldwhoranfaster than every other horse, looked beautiful to most agents, and had awonderful pedigree—adescendant ofNorthernDancer andSecretariat, twoofthegreatest racehorsesofall time.An IrishbillionaireandaDubai sheikbothwantedtopurchasehim.Theygotintoabiddingwarthatquicklyturnedintoacontestofpride.Ashundredsofstunnedhorsemenandwomenlookedon,thebidskeptgettinghigherandhigher,untilthetwo-year-oldhorsefinallysoldfor$16million,byfarthehighestpriceeverpaidforahorse.HorseNo.153,whowas given the nameTheGreenMonkey, ran three races, earned just $10,000,andwasretired.Sederneverhadany interest in the traditionalmethodsofevaluatinghorses.
He was interested only in data. He planned to measure various attributes ofracehorses and see which of them correlated with their performance. It’simportanttonotethatSederworkedouthisplanhalfadecadebeforetheWorldWideWebwasinvented.Buthisstrategywasverymuchbasedondatascience.AndthelessonsfromhisstoryareapplicabletoanybodyusingBigData.Foryears,Seder’spursuitproducednothingbutfrustration.Hemeasuredthe
size of horses’ nostrils, creating the world’s first and largest dataset on horsenostril sizeandeventualearnings.Nostril size,he found,didnotpredicthorsesuccess.HegavehorsesEKGstoexaminetheirheartsandcutthelimbsoffdeadhorses tomeasure thevolumeof their fast-twitchmuscles.Heoncegrabbedashoveloutsideabarntodeterminethesizeofhorses’excrement,onthetheorythatsheddingtoomuchweightbeforeaneventcanslowahorsedown.Noneofthiscorrelatedwithracingsuccess.Then, twelveyearsago,hegothisfirstbigbreak.Sederdecidedtomeasure
the sizeof thehorses’ internalorgans.Since thiswas impossiblewith existingtechnology, he constructed his own portable ultrasound. The results wereremarkable.Hefoundthat thesizeof theheart,andparticularly thesizeof theleft ventricle, was a massive predictor of a horse’s success, the single mostimportant variable. Another organ that mattered was the spleen: horses withsmallspleensearnedvirtuallynothing.Seder had a couple more hits. He digitized thousands of videos of horses
galloping and found that certain gaits did correlatewith racetrack success.Healsodiscoveredthatsometwo-year-oldhorseswheezeafterrunningone-eighth
of a mile. Such horses sometimes sell for as much as a million dollars, butSeder’sdatatoldhimthatthewheezersvirtuallyneverpanout.Hethusassignsanassistanttositnearthefinishlineandweedoutthewheezers.Ofaboutathousandhorsesat theOcalaauction,roughlytenwillpassallof
Seder’stests.Heignorespedigreeentirely,exceptasitwillinfluencethepriceahorsewillsellfor.“Pedigreetellsusahorsemighthaveaverysmallchanceofbeinggreat,” he says. “But if I can see he’s great,what do I care howhegotthere?”Onenight,Seder invitedme tohis roomat theHiltonhotel inOcala. In the
room,hetoldmeabouthischildhood,hisfamily,andhiscareer.Heshowedmepicturesofhiswife,daughter,andson.HetoldmehewasoneofthreeJewishstudentsinhisPhiladelphiahighschool,andthatwhenheenteredhewas4’10”.(He grew in college to 5’9”.) He told me about his favorite horse: PinkyPizwaanski. Seder bought and named this horse after a gay rider.He felt thatPinky, the horse, always gave a great effort even if he wasn’t the mostsuccessful.Finally,heshowedme the file that includedall thedatahehad recordedon
No. 85, the file that drove the biggest prediction of his career.Was he givingawayhissecret?Perhaps,buthesaidhedidn’tcare.Moreimportanttohimthanprotecting his secretswas being proven right, showing to theworld that thesetwenty years of cracking limbs, shoveling poop, and jerry-rigging ultrasoundshadbeenworthit.Here’ssomeofthedataonhorseNo.85:
NO.85(LATERAMERICANPHAROAH)PERCENTILES
ASAONE-YEAR-OLD
PERCENTILE
Height 56
Weight 61
Pedigree 70
LeftVentricle 99.61
Thereitwas,starkandclear,thereasonthatSederandhisteamhadbecomesoobsessedwithNo.85.Hisleftventriclewasinthe99.61stpercentile!Notonlythat,butallhisotherimportantorgans,includingtherestofhisheart
andspleen,wereexceptionallylargeaswell.Generallyspeaking,whenitcomesto racing, Seder had found, the bigger the left ventricle, the better. But a leftventricle as big as this can be a sign of illness if the other organs are tiny. InAmerican Pharoah, all the key organs were bigger than average, and the leftventriclewasenormous.Thedatascreamed thatNo.85wasa1-in-100,000orevenaone-in-a-millionhorse.
WhatcandatascientistslearnfromSeder’sproject?First,andperhapsmostimportant, ifyouaregoingtotrytousenewdatato
revolutionizeafield,itisbesttogointoafieldwhereoldmethodsarelousy.Thepedigree-obsessed horse agents whom Seder beat left plenty of room forimprovement.Sodidtheword-count-obsessedsearchenginesthatGooglebeat.Oneweakness ofGoogle’s attempt to predict influenza using search data is
thatyoucanalreadypredictinfluenzaverywelljustusinglastweek’sdataandasimple seasonal adjustment. There is still debate about howmuch search dataaddstothatsimple,powerfulmodel.Inmyopinion,Googlesearcheshavemorepromise measuring health conditions for which existing data is weaker andthereforesomethinglikeGoogleSTDmayprovemorevaluableinthelonghaulthanGoogleFlu.Thesecondlessonisthat,whentryingtomakepredictions,youneedn’tworry
toomuchaboutwhyyourmodelswork.Sedercouldnotfullyexplaintomewhythe left ventricle is so important in predicting a horse’s success.Nor couldhepreciselyaccountforthevalueofthespleen.Perhapsonedayhorsecardiologistsand hematologists will solve these mysteries. But for now it doesn’t matter.Seder is in the prediction business, not the explanation business. And, in thepredictionbusiness,youjustneedtoknowthatsomethingworks,notwhy.For example,Walmart uses data from sales in all their stores to knowwhat
products to shelve. Before Hurricane Frances, a destructive storm that hit theSoutheastin2004,Walmartsuspected—correctly—thatpeople’sshoppinghabits
may change when a city is about to be pummeled by a storm. They poredthrough sales data fromprevioushurricanes to seewhat peoplemightwant tobuy. A major answer? Strawberry Pop-Tarts. This product sells seven timesfasterthannormalinthedaysleadinguptoahurricane.Basedontheiranalysis,WalmarthadtrucksloadedwithstrawberryPop-Tarts
heading down Interstate 95 toward stores in the path of the hurricane. Andindeed,thesePop-Tartssoldwell.WhyPop-Tarts?Probablybecausetheydon’trequirerefrigerationorcooking.
Why strawberry?No clue.Butwhen hurricanes hit, people turn to strawberryPop-Tartsapparently.Sointhedaysbeforeahurricane,WalmartnowregularlystocksitsshelveswithboxesuponboxesofstrawberryPop-Tarts.Thereasonfortherelationshipdoesn’tmatter.But therelationshipitselfdoes.Maybeonedayfood scientists will figure out the association between hurricanes and toasterpastries filled with strawberry jam. But, while waiting for some suchexplanation,Walmart stillneeds to stock its shelveswith strawberryPop-Tartswhenhurricanes are approaching and save theRiceKrispies treats for sunnierdays.This lesson is also clear in the storyofOrleyAshenfelter.WhatSeder is to
horses,Ashenfelter,aneconomistatPrinceton,maybetowine.Alittleoveradecadeago,Ashenfelterwasfrustrated.Hehadbeenbuyinga
lotof redwine fromtheBordeauxregionofFrance.Sometimes thiswinewasdelicious,worthyofitshighprice.Manytimes,though,itwasaletdown.Why, Ashenfelter wondered, was he paying the same price for wine that
turnedoutsodifferently?One day, Ashenfelter received a tip from a journalist friend and wine
connoisseur. There was indeed a way to figure out whether a wine would begood. The key, Ashenfelter’s friend told him, was the weather during thegrowingseason.Ashenfelter’sinterestwaspiqued.Hewentonaquesttofigureoutifthiswas
trueandhecouldconsistentlypurchasebetterwine.Hedownloadedthirtyyearsof weather data on the Bordeaux region. He also collected auction prices ofwines.Theauctions,whichoccurmanyyearsafterthewinewasoriginallysold,wouldtellyouhowthewineturnedout.Theresultwasamazing.Ahugepercentageofthequalityofawinecouldbe
explainedsimplybytheweatherduringthegrowingseason.Infact,awine’squalitycouldbebrokendowntoonesimpleformula,which
wemightcalltheFirstLawofViticulture:
Price=12.145+0.00117winterrainfall+0.0614averagegrowingseasontemperature–0.00386harvestrainfall.
So why does wine quality in the Bordeaux region work like this? Whatexplains the First Law of Viticulture? There is some explanation forAshenfelter’swineformula—heatandearlyirrigationarenecessaryforgrapestoproperlyripen.But the precise details of his predictive formula gowell beyond any theory
andwilllikelyneverbefullyunderstoodevenbyexpertsinthefield.Whydoesacentimeterofwinterrainadd,onaverage,exactly0.1centstothe
priceofafullymaturedbottleofredwine?Whynot0.2cents?Whynot0.05?Nobody can answer these questions. But if there are 1,000 centimeters ofadditional rain inawinter,youshouldbewilling topayanadditional$1forabottleofwine.Indeed,Ashenfelter,despitenotknowingexactlywhyhis regressionworked
exactly as it did, used it to purchasewines.According to him, “Itworkedoutgreat.”Thequalityofthewineshedranknoticeablyimproved.Ifyourgoalistopredictthefuture—whatwinewilltastegood,whatproducts
willsell,whichhorseswillrunfast—youdonotneedtoworrytoomuchaboutwhyyourmodelworksexactlyasitdoes.Justgetthenumbersright.ThatisthesecondlessonofJeffSeder’shorsestory.The final lesson to be learned from Seder’s successful attempt to predict a
potential Triple Crown winner is that you have to be open and flexible indeterminingwhat countsasdata. It isnot as if theold-timehorseagentswereoblivious to data before Seder came along. They scrutinized race times andpedigreecharts.Seder’sgeniuswastolookfordatawhereothershadn’tlookedbefore,toconsidernontraditionalsourcesofdata.Foradatascientist,afreshandoriginalperspectivecanpayoff.
WORDSASDATA
Onedayin2004,twoyoungeconomistswithanexpertiseinmedia,thenPh.D.studentsatHarvard,werereadingaboutarecentcourtdecisioninMassachusettslegalizinggaymarriage.The economists, Matt Gentzkow and Jesse Shapiro, noticed something
interesting:twonewspapersemployedstrikinglydifferentlanguagetoreportthesame story. The Washington Times, which has a reputation for beingconservative,headlinedthestory:“Homosexuals‘Marry’inMassachusetts.”TheWashingtonPost,whichhasareputationforbeingliberal,reportedthattherehadbeenavictoryfor“same-sexcouples.”It’snosurprisethatdifferentnewsorganizationscantiltindifferentdirections,
that newspapers can cover the same storywith adifferent focus.Foryears, infact, Gentzkow and Shapiro had been pondering if they might use theireconomics training to help understand media bias. Why do some newsorganizationsseemto takeamore liberalviewandothersamoreconservativeone?ButGentzkowandShapiro didn’t really have any ideas on how theymight
tacklethisquestion;theycouldn’tfigureouthowtheycouldsystematicallyandobjectivelymeasuremediasubjectivity.WhatGentzkowandShapirofoundinteresting, then,about thegaymarriage
storywasnotthatnewsorganizationsdifferedintheircoverage;itwashow thenewspapers’coveragediffered—itcamedowntoadistinctshiftinwordchoice.In2004,“homosexuals,”asusedbytheWashingtonTimes,wasanold-fashionedand disparaging way to describe gay people, whereas “same-sex couples,” asused by the Washington Post, emphasized that gay relationships were justanotherformofromance.Thescholarswonderedwhether languagemightbe thekey tounderstanding
bias.Didliberalsandconservativesconsistentlyusedifferentphrases?Couldthewordsthatnewspapersuseinstoriesbeturnedintodata?WhatmightthisrevealabouttheAmericanpress?Couldwefigureoutwhetherthepresswasliberalorconservative? And could we figure out why? In 2004, these weren’t idlequestions.ThebillionsofwordsinAmericannewspaperswerenolongertrappedonnewsprintormicrofilm.Certainwebsitesnowrecordedeverywordincludedinevery story fornearlyeverynewspaper in theUnitedStates.GentzkowandShapiro could scrape these sites andquickly test the extent towhich language
could measure newspaper bias. And, by doing this, they could sharpen ourunderstandingofhowthenewsmediaworks.But,beforedescribingwhattheyfound,let’sleaveforamomentthestoryof
GentzkowandShapiroandtheirattempttoquantifythelanguageinnewspapers,anddiscusshowscholars,acrossawiderangeoffields,haveutilized thisnewtypeofdata—words—tobetterunderstandhumannature.
Language has, of course, always been a topic of interest to social scientists.However, studying language generally required the close reading of texts, andturninghugeswathsoftextintodatawasn’tfeasible.Now,withcomputersanddigitization, tabulating words across massive sets of documents is easy.Languagehas thusbecomesubject toBigDataanalysis.ThelinksthatGoogleutilizedwerecomposedofwords.SoaretheGooglesearchesthatIstudy.Wordsfeature frequently in this book. But language is so important to the BigDatarevolution,itdeservesitsownsection.Infact,itisbeingusedsomuchnowthatthereisanentirefielddevotedtoit:“textasdata.”Amajordevelopment in this field isGoogleNgrams.A fewyears ago, two
young biologists, Erez Aiden and Jean-Baptiste Michel, had their researchassistants counting words one by one in old, dusty texts to try to find newinsights on how certain usages of words spread. One day, Aiden andMichelheardaboutanewprojectbyGoogle todigitizea largeportionof theworld’sbooks. Almost immediately, the biologists grasped that this would be amucheasierwaytounderstandthehistoryoflanguage.“Werealizedourmethodsweresohopelesslyobsolete,”AidentoldDiscover
magazine. “It was clear that you couldn’t compete with this juggernaut ofdigitization.”Sotheydecidedtocollaboratewiththesearchcompany.Withthehelp of Google engineers, they created a service that searches through themillions of digitized books for a particular word or phrase. It then will tellresearchers how frequently that word or phrase appeared in every year, from1800to2010.Sowhatcanwelearnfromthefrequencywithwhichwordsorphrasesappear
in books in different years?For one thing,we learn about the slowgrowth inpopularityofsausageandtherelativelyrecentandrapidgrowthinpopularityofpizza.
But there are lessons far more profound than that. For instance, GoogleNgramscanteachushownationalidentityformed.OnefascinatingexampleispresentedinAidenandMichel’sbook,Uncharted.First,aquickquestion.DoyouthinktheUnitedStatesiscurrentlyaunitedor
adividedcountry?Ifyouarelikemostpeople,youwouldsaytheUnitedStatesisdivided thesedaysdue to thehigh levelofpoliticalpolarization.Youmightevensaythecountryisaboutasdividedasithaseverbeen.America,afterall,isnowcolor-coded:redstatesareRepublican;bluestatesareDemocratic.But,inUncharted,Aiden andMichel note one fascinating data point that reveals justhow much more divided the United States once was. The data point is thelanguagepeopleusetotalkaboutthecountry.Note the words I used in the previous paragraph when I discussed how
dividedthecountryis.Iwrote,“TheUnitedStatesisdivided.”IreferredtotheUnited States as a singular noun. This is natural; it is proper grammar andstandardusage.Iamsureyoudidn’tevennotice.However,Americans didn’t always speak thisway. In the early days of the
country, Americans referred to the United States using the plural form. Forexample,JohnAdams, inhis1799Stateof theUnionaddress, referred to“theUnited States in their treaties with his Britanic Majesty.” If my book werewrittenin1800,Iwouldhavesaid,“TheUnitedStatesaredivided.”Thislittleusagedifferencehaslongbeenafascinationforhistorians,sinceitsuggeststherewasapointwhenAmericastoppedthinkingofitselfasacollectionofstatesandstartedthinkingofitselfasonenation.
Sowhendidthishappen?Historians,Unchartedinformsus,haveneverbeensure, as there has been no systematic way to test it. But many have longsuspected the cause was the Civil War. In fact, James McPherson, formerpresident of the AmericanHistorical Association and a Pulitzer Prize winner,notedbluntly: “Thewarmarked a transitionof theUnitedStates to a singularnoun.”But it turns out McPherson was wrong. Google Ngrams gave Aiden and
Michelasystematicwaytocheckthis.TheycouldseehowfrequentlyAmericanbooksusedthephrase“TheUnitedStatesare...”versus“TheUnitedStatesis. . .” for every year in the country’s history. The transformation was moregradualanddidn’taccelerateuntilwellaftertheCivilWarended.
Fifteenyears after theCivilWar, therewere stillmoreusesof “TheUnitedStatesare. . .”than“TheUnitedStatesis . . . ,”showingthecountrywasstilldivided linguistically. Military victories happen quicker than changes inmindsets.
Somuchforhowacountryunites.Howdoamanandwomanunite?Wordscanhelphere,too.Forexample,wecanpredictwhetheramanandwomanwillgoonasecond
datebasedonhowtheyspeakonthefirstdate.Thiswas shown by an interdisciplinary team of Stanford andNorthwestern
scientists:DanielMcFarland,Dan Jurafsky, andCraigRawlings.They studied
hundreds of heterosexual speed daters and tried to determine what predictswhethertheywillfeelaconnectionandwantaseconddate.Theyfirstusedtraditionaldata.Theyaskeddatersfortheirheight,weight,and
hobbiesandtestedhowthesefactorscorrelatedwithsomeonereportingasparkof romantic interest.Women, on average, prefermenwho are taller and sharetheirhobbies;men,onaverage,preferwomenwhoareskinnierandsharetheirhobbies.Nothingnewthere.Butthescientistsalsocollectedanewtypeofdata.Theyinstructedthedaters
totaketaperecorderswiththem.Therecordingsofthedateswerethendigitized.Thescientistswere thusable tocode thewordsused, thepresenceof laughter,andthetoneofvoice.Theycouldtestbothhowmenandwomensignaledtheywereinterestedandhowpartnersearnedthatinterest.Sowhatdid the linguisticdata tellus?First,howamanorwomanconveys
thatheorsheisinterested.Oneofthewaysamansignalsthatheisattractedisobvious:helaughsatawoman’sjokes.Anotherislessobvious:whenspeaking,helimitstherangeofhispitch.Thereisresearchthatsuggestsamonotonevoiceis often seen by women as masculine, which implies that men, perhapssubconsciously,exaggeratetheirmasculinitywhentheylikeawoman.The scientists found that awoman signalsher interest byvaryingherpitch,
speakingmoresoftly,andtakingshorterturnstalking.Therearealsomajorcluesabout awoman’s interest basedon the particularwords she uses.Awoman isunlikely to be interested when she uses hedge words and phrases such as“probably”or“Iguess.”Fellas,ifawomanishedgingherstatementsonanytopic—ifshe“sorta”likes
herdrinkor“kinda”feelschillyor“probably”willhaveanotherhorsd’oeuvre—youcanbetthatsheis“sorta”“kinda”“probably”notintoyou.Awomanis likely tobe interestedwhenshe talksaboutherself. It turnsout
that,foramanlookingtoconnect,themostbeautifulwordyoucanhearfromawoman’smouthmaybe“I”:it’sasignsheisfeelingcomfortable.Awomanalsois likely tobe interested if sheuses self-markingphrases such as “Yaknow?”and“Imean.”Why?Thescientistsnotedthatthesephrasesinvitethelistener’sattention. They are friendly and warm and suggest a person is looking toconnect,yaknowwhatImean?Now,howcanmenandwomencommunicateinordertogetadateinterested
inthem?Thedatatellsusthatthereareplentyofwaysamancantalktoraisethechancesawomanlikeshim.Womenlikemenwhofollowtheirlead.Perhapsnotsurprisingly,awomanismorelikelytoreportaconnectionifamanlaughsather jokes and keeps the conversation on topics she introduces rather thanconstantly changing the subject to those hewants to talk about.*Women alsolikemenwhoexpresssupportandsympathy.Ifamansays,“That’sawesome!”or “That’s really cool,” a woman is significantly more likely to report aconnection.Likewiseifheusesphrasessuchas“That’stough”or“Youmustbesad.”For women, there is some bad news here, as the data seems to confirm a
distasteful truth aboutmen.Conversation plays only a small role in how theyrespondtowomen.Physicalappearancetrumpsallelseinpredictingwhetheramanreportsaconnection.Thatsaid,thereisonewordthatawomancanusetoat least slightly improve the odds aman likes her and it’s onewe’ve alreadydiscussed:“I.”Menaremorelikelytoreportclickingwithawomanwhotalksaboutherself.Andaspreviouslynoted,awomanisalsomorelikelytoreportaconnectionafteradatewhereshetalksaboutherself.Thusitisagreatsign,onafirstdate,ifthereissubstantialdiscussionaboutthewoman.Thewomansignalsher comfort and probably appreciates that the man is not hogging theconversation.Andthemanlikesthatthewomanisopeningup.Aseconddateislikely.Finally, thereisoneclear indicatorof troubleinadatetranscript:aquestion
mark.Iftherearelotsofquestionsaskedonadate,itislesslikelythatboththemanandthewomanwill reportaconnection.Thisseemscounterintuitive;youmightthinkthatquestionsareasignofinterest.Butnotsoonafirstdate.Onafirstdate,mostquestionsaresignsofboredom.“Whatareyourhobbies?”“Howmanybrothersandsistersdoyouhave?”Thesearethekindsofthingspeoplesaywhentheconversationstalls.Agreatfirstdatemayincludeasinglequestionattheend:“Willyougooutwithmeagain?”Ifthisistheonlyquestiononthedate,theanswerislikelytobe“Yes.”Andmen andwomendon’t just talk differentlywhen they’re trying towoo
eachother.Theytalkdifferentlyingeneral.Ateamofpsychologistsanalyzedthewordsusedinhundredsofthousandsof
Facebookposts.Theymeasuredhowfrequentlyeverywordisusedbymenand
women. They could then declare which are the most masculine and mostfemininewordsintheEnglishlanguage.Many of these word preferences, alas, were obvious. For example, women
talkabout“shopping”and“myhair”muchmorefrequently thanmendo.Mentalk about “football” and “Xbox”muchmore frequently thanwomen do.Youprobablydidn’tneedateamofpsychologistsanalyzingBigDatatotellyouthat.Someof thefindings,however,weremore interesting.Womenuse theword
“tomorrow”farmoreoftenthanmendo,perhapsbecausemenaren’tsogreatatthinking ahead. Adding the letter “o” to the word “so” is one of the mostfeminine linguistic traits. Among the words most disproportionately used bywomenare“soo,”“sooo,”“soooo,”“sooooo,”and“soooooo.”Maybeitwasmychildhoodexposuretowomenwhoweren’tafraidtothrow
theoccasional f-bomb.But I always thought cursingwasanequal-opportunitytrait.Notso.Amongthewordsusedmuchmorefrequentlybymenthanwomenare“fuck,”“shit,”“fucks,”“bullshit,”“fucking,”and“fuckers.”Here are word clouds showing words used mostly by men and those used
mostly by women. The larger a word appears, the more that word’s use tiltstowardthatgender.
Males
Females
WhatI likeaboutthisstudyisthenewdatainformsusofpatternsthathavelong existed but we hadn’t necessarily been aware of.Men and women havealways spoken indifferentways.But, for tensof thousandsofyears, thisdatadisappeared as soon as the sound waves faded in space. Now this data ispreservedoncomputersandcanbeanalyzedbycomputers.Or perhapswhat I should have said, givenmy gender: “Thewords used to
fuckingdisappear.NowwecantakeabreakfromwatchingfootballandplayingXboxandlearnthisshit.Thatis,ifanyonegivesafuck.”Itisn’tjustmenandwomenwhospeakdifferently.Peopleusedifferentwords
as they age.Thismight even give us some clues as to how the aging processplaysout.Here,fromthesamestudy,arethewordsmostdisproportionatelyusedbypeopleofdifferentagesonFacebook.Icallthisgraphic“Drink.Work.Pray.”Inpeople’s teens, they’redrinking.In their twenties, theyareworking.In theirthirtiesandonward,theyarepraying.
DRINK.WORK.PRAY19-to22-year-olds
Apowerfulnewtool foranalyzing text issomethingcalledsentimentanalysis.Scientistscannowestimatehowhappyorsadaparticularpassageoftextis.How?Teamsofscientistshaveaskedlargenumbersofpeopletocodetensof
thousands ofwords in theEnglish language as positive or negative.Themostpositive words, according to this methodology, include “happy,” “love,” and“awesome.”Themostnegativewordsinclude“sad,”“death,”and“depression.”Theythushavebuiltanindexofthemoodofahugesetofwords.Usingthisindex,theycanmeasuretheaveragemoodofwordsinapassageof
text. If someone writes “I am happy and in love and feeling awesome,”sentimentanalysiswouldcodethatasextremelyhappytext.Ifsomeonewrites“Iamsadthinkingaboutalltheworld’sdeathanddepression,”sentimentanalysiswouldcodethatasextremelysadtext.Otherpiecesoftextwouldbesomewhereinbetween.So what can you learn when you code the mood of text? Facebook data
scientists have shown one exciting possibility. They can estimate a country’sGross National Happiness every day. If people’s status messages tend to bepositive, thecountryisassumedhappyfor theday.If theytendtobenegative,thecountryisassumedsadfortheday.Among the Facebook data scientists’ findings: Christmas is one of the
happiestdaysof theyear.Now, Iwas skepticalof this analysis—andamabitskepticalof thiswholeproject.Generally,I thinkmanypeoplearesecretlysad
on Christmas because they are lonely or fighting with their family. Moregenerally, I tend not to trust Facebook status updates, for reasons that I willdiscuss in the next chapter—namely, our propensity to lie about our lives onsocialmedia.IfyouarealoneandmiserableonChristmas,doyoureallywanttobotherall
ofyourfriendsbypostingabouthowunhappyyouare?Isuspecttherearemanypeople spending a joyless Christmas who still post on Facebook about howgratefultheyarefortheir“wonderful,awesome,amazing,happylife.”TheythengetcodedassubstantiallyraisingAmerica’sGrossNationalHappiness.IfwearegoingtoreallycodeGrossNationalHappiness,weshouldusemoresourcesthanjustFacebookstatusupdates.That said, the finding thatChristmas is, onbalance, a joyousoccasiondoes
seemlegitimatelytobetrue.GooglesearchesfordepressionandGallupsurveysalsotellusthatChristmasisamongthehappiestdaysoftheyear.And,contrarytoanurbanmyth,suicidesdroparoundtheholidays.EveniftherearesomesadandlonelypeopleonChristmas,therearemanymoremerryones.These days,when people sit down to read,most of the time it is to peruse
status updates on Facebook. But, once upon a time, not so long ago, humanbeings read stories, sometimes inbooks.Sentiment analysis can teachus a lothere,too.Ateamofscientists,ledbyAndyReagan,nowattheUniversityofCalifornia
atBerkeleySchoolof Information,downloaded the textof thousandsofbooksandmovie scripts. They could then code how happy or sad each point of thestorywas.Consider,forexample,thebookHarryPotterandtheDeathlyHallows.Here,
fromthatteamofscientists,ishowthemoodofthestorychanges,alongwithadescriptionofkeyplotpoints.
Notethatthemanyrisesandfallsinmoodthatthesentimentanalysisdetectscorrespondtokeyevents.Most stories have simpler structures. Take, for example, Shakespeare’s
tragedyKing John. In this play, nothing goes right. King John of England isasked to renouncehis throne.He is excommunicated for disobeying the pope.Warbreaksout.Hisnephewdies,perhapsbysuicide.Otherpeopledie.Finally,Johnispoisonedbyadisgruntledmonk.Andhereisthesentimentanalysisastheplayprogresses.
In other words, just from the words, the computer was able to detect thatthingsgofrombadtoworsetoworst.
Orconsiderthemovie127Hours.Abasicplotsummaryof thismovie isasfollows:
A mountaineer goes to Utah’s Canyonlands National Park to hike. Hebefriendsotherhikersbutthenpartswayswiththem.Suddenly,heslipsandknockslooseaboulder,whichtrapshishandandwrist.Heattemptsvariousescapes, but each one fails.He becomes depressed. Finally, he amputateshis arm and escapes. He gets married, starts a family, and continuesclimbing, although nowhemakes sure to leave a notewhenever he goesoff.
Andhereisthesentimentanalysisasthemovieprogresses,againbyReagan’steamofscientists.
Sowhatdowelearnfromthemoodofthousandsofthesestories?Thecomputerscientistsfoundthatahugepercentageofstoriesfitintooneof
sixrelativelysimplestructures.Theyare,borrowingachartfromReagan’steam:
RagstoRiches(rise)RichestoRags(fall)ManinaHole(fall,thenrise)Icarus(rise,thenfall)Cinderella(rise,thenfall,thenrise)
Oedipus(fall,thenrise,thenfall)
Theremightbesmalltwistsandturnsnotcapturedbythissimplescheme.Forexample, 127 Hours ranks as a Man in a Hole story, even though there aremomentsalongthewaydownwhensentimentstemporarilyimprove.Thelarge,overarching structure ofmost stories fits into one of the six categories.HarryPotterandtheDeathlyHallowsisanexception.There are a lot of additional questionswemight answer. For example, how
has the structure of stories changed through time? Have stories gotten morecomplicated through the years?Do cultures differ in the types of stories theytell?What types of stories do people likemost? Do different story structuresappealtomenandwomen?Whataboutpeopleindifferentcountries?Ultimately, text as data may give us unprecedented insights into what
audiencesactuallywant,whichmaybedifferentfromwhatauthorsorexecutivesthinktheywant.Alreadytherearesomecluesthatpointinthisdirection.Consider a study by two Wharton School professors, Jonah Berger and
KatherineL.Milkman,onwhattypesofstoriesgetshared.TheytestedwhetherpositivestoriesornegativestoriesweremorelikelytomaketheNewYorkTimes’most-emailed list. They downloaded every Times article over a three-monthperiod. Using sentiment analysis, the professors coded the mood of articles.Examplesofpositivestoriesincluded“Wide-EyedNewArrivalsFallinginLovewith the City” and “Tony Award for Philanthropy.” Stories such as “WebRumors Tied to Korean Actress’ Suicide” and “Germany: Baby Polar Bear’sFeederDies”proved,notsurprisingly,tobenegative.Theprofessorsalsohadinformationaboutwherethestorywasplaced.Wasit
on the home page?On the top right?The top left?And they had informationaboutwhenthestorycameout.LateTuesdaynight?Mondaymorning?Theycouldcomparetwoarticles—oneofthempositive,oneofthemnegative
—that appeared in a similarplaceon theTimes site and cameout at a similartimeandseewhichonewasmorelikelytobeemailed.Sowhatgetsshared,positiveornegativearticles?Positivearticles.Astheauthorsconclude,“Contentismorelikelytobecome
viralthemorepositiveitis.”Note thiswould seem to contrastwith the conventional journalisticwisdom
thatpeopleareattracted toviolentandcatastrophicstories. Itmaybe true thatnews media give people plenty of dark stories. There is something to thenewsroom adage, “If it bleeds, it leads.” The Wharton professors’ study,however, suggests that peoplemay actually wantmore cheery stories. It maysuggest a new adage: “If it smiles, it’s emailed,” though that doesn’t reallyrhyme.
Somuchforsadandhappytext.Howdoyoufigureoutwhatwordsareliberalorconservative?Andwhatdoesthattellusaboutthemodernnewsmedia?Thisis a bit more complicated, which brings us back to Gentzkow and Shapiro.Remember,theyweretheeconomistswhosawgaymarriagedescribeddifferentways in twodifferentnewspapers andwondered if theycoulduse language touncoverpoliticalbias.The first thing these two ambitious young scholars did was examine
transcriptsoftheCongressionalRecord.Sincethisrecordwasalreadydigitized,theycoulddownloadeverywordusedbyeveryDemocratic congressperson in2005andeverywordusedbyeveryRepublicancongressperson in2005.Theycould then see if certain phraseswere significantlymore likely to be used byDemocratsorRepublicans.Somewereindeed.Hereareafewexamplesineachcategory.
PHRASESUSEDFARMOREBYDEMOCRATS
PHRASESUSEDFARMOREBYREPUBLICANS
Estatetax DeathtaxPrivatizesocialsecurity ReformsocialsecurityRosaParks SaddamHusseinWorkersrights PrivatepropertyrightsPoorpeople Governmentspending
Whatexplainsthesedifferencesinlanguage?SometimesDemocratsandRepublicansusedifferentphrasingtodescribethe
sameconcept.In2005,Republicanstriedtocutthefederalinheritancetax.They
tendedtodescribeitasa“deathtax”(whichsoundslikeanimpositionuponthenewlydeceased).Democratsdescribeditasan“estatetax”(whichsoundslikeatax on thewealthy). Similarly,Republicans tried tomoveSocial Security intoindividual retirement accounts. To Republicans, this was a “reform.” ToDemocrats,thiswasamoredangerous-sounding“privatization.”Sometimes differences in language are a question of emphasis.Republicans
and Democrats presumably both have great respect for Rosa Parks, the civilrights hero. But Democrats talked about her more frequently. Likewise,Democrats and Republicans presumably both think that Saddam Hussein, theformer leader of Iraq, was an evil dictator. But Republicans repeatedlymentioned him in their attempt to justify the Iraq War. Similarly, “workers’rights” and concern for “poor people” are core principles of the DemocraticParty. “Private property rights” and cutting “government spending” are coreprinciplesofRepublicans.Andthesedifferencesinlanguageusearesubstantial.Forexample, in2005,
congressional Republicans used the phrase “death tax” 365 times and “estatetax”only46times.ForcongressionalDemocrats,thepatternwasreversed.Theyusedthephrase“deathtax”only35timesand“estatetax”195times.And if thesewordscan telluswhetheracongressperson isaDemocratora
Republican, the scholars realized, they could also tell uswhether a newspapertiltsleftorright.JustasRepublicancongresspeoplemightbemorelikelytousethephrase“deathtax”topersuadepeopletoopposeit,conservativenewspapersmight do the same. The relatively liberal Washington Post used the phrase“estate tax”13.7 timesmore frequently than theyused thephrase“death tax.”TheconservativeWashingtonTimesused“deathtax”and“estatetax”aboutthesameamount.Thanks to thewondersof the internet,GentzkowandShapirocouldanalyze
the language used in a large number of the nation’s newspapers.The scholarsutilized two websites, newslibrary.com and proquest.com, which together haddigitized433newspapers.Theythencountedhowfrequentlyonethousandsuchpolitically charged phrases were used in newspapers in order to measure thepapers’politicalslant.Themostliberalnewspaper,bythismeasure,provedtobethe Philadelphia Daily News; the most conservative: the Billings (Montana)Gazette.
When you have the first comprehensive measure of media bias for such awide swath of outlets, you can answer perhaps the most important questionaboutthepress:whydosomepublicationsleanleftandothersright?Theeconomistsquicklyhomed inononekey factor: thepoliticsof agiven
area.Ifanareaisgenerallyliberal,asPhiladelphiaandDetroitare,thedominantnewspaper there tends to be liberal. If an area is more conservative, as areBillingsandAmarillo,Texas,thedominantpapertheretendstobeconservative.Inotherwords, the evidence strongly suggests that newspapers are inclined togivetheirreaderswhattheywant.Youmightthinkapaper’sownerwouldhavesomeinfluenceontheslantofits
coverage,butasa rule,whoownsapaperhas lesseffect thanwemight thinkupon its political bias.Notewhat happenswhen the same person or companyowns papers in differentmarkets.Consider theNewYorkTimesCompany. ItownswhatGentzkowandShapirofindtobetheliberal-leaningNewYorkTimes,based in New York City, where roughly 70 percent of the population isDemocratic.Italsoowned,atthetimeofthestudy,theconservative-leaning,bytheir measure, Spartanburg Herald-Journal, in Spartanburg, South Carolina,whereroughly70percentofthepopulationisRepublican.Thereareexceptions,of course: RupertMurdoch’sNewsCorporation ownswhat just about anyonewould find to be the conservative New York Post. But, overall, the findingssuggestthatthemarketdeterminesnewspapers’slantsfarmorethanownersdo.The study has a profound impact on howwe think about the newsmedia.
Many people, particularly Marxists, have viewed American journalism ascontrolled by rich people or corporations with the goal of influencing themasses, perhaps to push people toward their political views. Gentzkow andShapiro’spapersuggests,however,thatthisisnotthepredominantmotivationofowners. The owners of the American press, instead, are primarily giving themasseswhattheywantsothattheownerscanbecomeevenricher.Oh, and one more question—a big, controversial, and perhaps even more
provocative question. Do the American newsmedia, on average, slant left orright?Arethemediaonaverageliberalorconservative?Gentzkow and Shapiro found that newspapers slant left. The average
newspaperismoresimilar,inthewordsituses,toaDemocraticcongresspersonthanitistoaRepublicancongressperson.
“Aha!”conservativereadersmaybereadytoscream,“I toldyouso!”Manyconservatives have long suspected newspapers have been biased to try tomanipulatethemassestosupportleft-wingviewpoints.Not so, say the authors. In fact, the liberal bias is well calibrated to what
newspaperreaderswant.Newspaperreadership,onaverage,tiltsabitleft.(Theyhave data on that.) And newspapers, on average, tilt a bit left to give theirreaderstheviewpointstheydemand.Thereisnograndconspiracy.Thereisjustcapitalism.The news media, Gentzkow and Shapiro’s results imply, often operate like
every other industry on the planet. Just as supermarkets figure out what icecream people want and fill their shelves with it, newspapers figure out whatviewpointspeoplewantandfilltheirpageswithit.“It’sjustabusiness,”Shapirotoldme.Thatiswhatyoucanlearnwhenyoubreakdownandquantifymattersasconvolutedasnews,analysis,andopinionintotheircomponentparts:words.
PICTURESASDATA
Traditionally, when academics or businesspeoplewanted data, they conductedsurveys.Thedatacameneatly formed,drawn fromnumbersorcheckedboxeson questionnaires. This is no longer the case. The days of structured, clean,simple,survey-baseddataareover.Inthisnewage,themessytracesweleaveaswegothroughlifearebecomingtheprimarysourceofdata.Aswe’vealreadyseen,wordsaredata.Clicksaredata.Linksaredata.Typos
aredata.Bananas indreamsaredata.Toneofvoice isdata.Wheezing isdata.Heartbeats are data. Spleen size is data. Searches are, I argue, the mostrevelatorydata.Pictures,itturnsout,aredata,too.Just aswords,whichwereonce confined tobooks andperiodicalsondusty
shelves,havenowbeendigitized,pictureshavebeenliberatedfromalbumsandcardboardboxes.Theytoohavebeentransformedintobitsandreleasedintothecloud.And as text can give us history lessons—showing us, for example, thechanging ways people have spoken—pictures can give us history lessons—showingus,forexample,thechangingwayspeoplehaveposed.Consideraningeniousstudybya teamoffourcomputerscientistsatBrown
andBerkeley.Theytookadvantageofaneatdigital-eradevelopment:manyhighschoolshavescannedtheirhistoricalyearbooksandmadethemavailableonline.Across the internet, the researchers found 949 scanned yearbooks fromAmerican high schools spanning the years 1905–2013. This included tens ofthousandsofseniorportraits.Usingcomputersoftware,theywereabletocreatean “average” face out of the pictures fromevery decade. In otherwords, theycouldfigureouttheaveragelocationandconfigurationofpeople’snoses,eyes,lips, and hair. Here are the average faces from across the last century plus,brokendownbygender:
Noticeanything?Americans—andparticularlywomen—startedsmiling.Theywentfromnearlystone-facedatthestartofthetwentiethcenturytobeamingbytheend.Sowhythechange?DidAmericansgethappier?Nope.Otherscholarshavehelpedanswerthisquestion.Thereasonis,atleast
to me, fascinating. When photographs were first invented, people thought ofthemlikepaintings.Therewasnothingelsetocomparethemto.Thus,subjectsin photos copied subjects in paintings. And since people sitting for portraitscouldn’t hold a smile for the many hours the painting took, they adopted aseriouslook.Subjectsinphotosadoptedthesamelook.Whatfinallygotthemtochange?Business,profit,andmarketing,ofcourse.
In the mid-twentieth century, Kodak, the film and camera company, wasfrustratedby the limitednumberof pictures peoplewere taking anddevised astrategytogetthemtotakemore.Kodak’sadvertisingbeganassociatingphotos
with happiness. The goal was to get people in the habit of taking a picturewhenever theywanted toshowotherswhatagood time theywerehaving.Allthose smilingyearbookphotos are a resultof that successful campaign (as aremostofthephotosyouseeonFacebookandInstagramtoday).Butphotosasdatacantellusmuchmorethanwhenhighschoolseniorsbegan
tosay“cheese.”Surprisingly,imagesmaybeabletotellushowtheeconomyisdoing.Consider one provocatively titled academic paper: “Measuring Economic
GrowthfromOuterSpace.”Whenapaperhasatitlelikethat,youcanbetI’mgoing to read it. The authors of this paper—J. Vernon Henderson, AdamStoreygard, and David N. Weil—begin by noting that in many developingcountries, existing measures of gross domestic product (GDP) are inefficient.Thisisbecauselargeportionsofeconomicactivityhappenoffthebooks,andthegovernmentagenciesmeanttomeasureeconomicoutputhavelimitedresources.Theauthors’ratherunconventionalidea?TheycouldhelpmeasureGDPbased
onhowmuchlightthereisinthesecountriesatnight.Theygotthatinformationfrom photographs taken by a U.S. Air Force satellite that circles the earthfourteentimesperday.WhymightlightatnightbeagoodmeasureofGDP?Well,inverypoorparts
of the world, people struggle to pay for electricity. And as a result, wheneconomicconditionsarebad,householdsandvillageswilldramatically reducetheamountoflighttheyallowthemselvesatnight.Night light dropped sharply in Indonesia during the 1998 Asian financial
crisis. In South Korea, night light increased 72 percent from 1992 to 2008,correspondingtoaremarkablystrongeconomicperformanceoverthisperiod.InNorthKorea, over the same time, night light actually fell, corresponding to adismaleconomicperformanceduringthistime.In1998,insouthernMadagascar,alargeaccumulationofrubiesandsapphires
wasdiscovered.ThetownofIlakakawentfromlittlemorethanatruckstoptoamajortradingcenter.TherewasvirtuallynonightlightinIlakakapriorto1998.Inthenextfiveyears,therewasanexplosionoflightatnight.The authors admit their night light data is far from a perfect measure of
economicoutput.Youmostdefinitelycannotknowexactlyhowaneconomyisdoing just fromhowmuch lightsatellitescanpickupatnight.Theauthorsdo
not recommend using thismeasure at all for developed countries, such as theUnitedStates,wheretheexistingeconomicdataismoreaccurate.Andtobefair,evenindevelopingcountries,theyfindthatnightlightisonlyaboutasusefulastheofficialmeasures.Butcombiningboththeflawedgovernmentdatawiththeimperfectnightlightdatagivesabetterestimatethaneithersourcealonecouldprovide. You can, in other words, improve your understanding of developingeconomiesusingpicturestakenfromouterspace.JosephReisinger,acomputersciencePh.D.withasoftvoice,sharesthenight
light authors’ frustration with the existing datasets on the economies indevelopingcountries. InApril2014,Reisingernotes,Nigeriaupdated itsGDPestimate, taking into account new sectors they may have missed in previousestimates.TheirestimatedGDPwasnow90percenthigher.“They’re the largest economy in Africa,” Reisinger said, his voice slowly
rising.“Wedon’tevenknowthemostbasicthingwewouldwanttoknowaboutthatcountry.”Hewantedtofindawaytogetasharperlookateconomicperformance.His
solutionisquiteanexampleofhowtoreimaginewhatconstitutesdataandthevalueofdoingso.Reisingerfoundedacompany,Premise,whichemploysagroupofworkersin
developing countries, armed with smartphones. The employees’ job? To takepicturesofinterestinggoings-onthatmighthaveeconomicimport.The employees might get snapshots outside gas stations or of fruit bins in
supermarkets.Theytakepicturesofthesamelocationsoverandoveragain.ThepicturesaresentbacktoPremise,whosesecondgroupofemployees—computerscientists—turn the photos into data. The company’s analysts can codeeverything from the length of lines in gas stations to how many apples areavailable inasupermarket to theripenessof theseapples to theprice listedontheapples’bin.Basedonphotographsofallsortsofactivity,Premisecanbeginto put together estimates of economic output and inflation. In developingcountries,longlinesingasstationsarealeadingindicatorofeconomictrouble.Soareunavailableorunripeapples.Premise’son-the-groundpicturesofChinahelped themdiscover food inflation there in 2011 and food deflation in 2012,longbeforetheofficialdatacamein.Premisesells this information tobanksorhedge fundsandalsocollaborates
withtheWorldBank.Likemany good ideas, Premise’s is a gift that keeps on giving. TheWorld
Bankwasrecentlyinterestedinthesizeoftheundergroundcigaretteeconomyinthe Philippines. In particular, they wanted to know the effects of thegovernment’s recent efforts, which included random raids, to crack down onmanufacturers that produced cigarettes without paying a tax. Premise’s cleveridea?Takephotosofcigaretteboxesseenonthestreet.Seehowmanyofthemhave tax stamps,which all legitimate cigarettes do.They have found that thispartoftheundergroundeconomy,whilelargein2015,gotsignificantlysmallerin2016.Thegovernment’seffortsworked,althoughseeingsomethingusuallysohidden—illegalcigarettes—requirednewdata.
Aswe’veseen,whatconstitutesdatahasbeenwildlyreimagined in thedigitalageandalotofinsightshavebeenfoundinthisnewinformation.Learningwhatdrivesmediabias,whatmakesagoodfirstdate,andhowdevelopingeconomiesarereallydoingisjustthebeginning.Not incidentally, a lot of money has also been made from such new data,
startingwithMessrs.Brin’sandPage’stensofbillions.JosephReisingerhasn’tdone badly himself. Observers estimate that Premise is now making tens ofmillionsofdollarsinannualrevenue.Investorsrecentlypoured$50millionintothe company. This means some investors consider Premise among the mostvaluableenterprisesintheworldprimarilyinthebusinessoftakingandsellingphotos,inthesameleagueasPlayboy.Thereis,inotherwords,outsizevalue,forscholarsandentrepreneursalike,in
utilizingallthenewtypesofdatanowavailable,inthinkingbroadlyaboutwhatcountsasdata.Thesedays,adatascientistmustnotlimitherselftoanarrowortraditional view of data. These days, photographs of supermarket lines arevaluabledata.Thefullnessofsupermarketbinsisdata.Theripenessofapplesisdata.Photosfromouterspacearedata.Thecurvatureoflipsisdata.Everythingisdata!Andwithallthisnewdata,wecanfinallyseethroughpeople’slies.
4
DIGITALTRUTHSERUM
Everybodylies.Peoplelieabouthowmanydrinkstheyhadonthewayhome.Theylieabout
howoften theygo to thegym,howmuch thosenew shoes cost,whether theyreadthatbook.Theycallinsickwhenthey’renot.Theysaythey’llbeintouchwhentheywon’t.Theysayit’snotaboutyouwhenitis.Theysaytheyloveyouwhentheydon’t.Theysaythey’rehappywhileinthedumps.Theysaytheylikewomenwhentheyreallylikemen.Peoplelietofriends.Theylietobosses.Theylietokids.Theylietoparents.
They lie to doctors. They lie to husbands. They lie to wives. They lie tothemselves.Andtheydamnsurelietosurveys.Here’smybriefsurveyforyou:
Haveyouevercheatedonanexam?__________Haveyoueverfantasizedaboutkillingsomeone?_________
Were you tempted to lie?Many people underreport embarrassing behaviorsandthoughtsonsurveys.Theywanttolookgood,eventhoughmostsurveysareanonymous.Thisiscalledsocialdesirabilitybias.Animportantpaperin1950providedpowerfulevidenceofhowsurveyscan
fallvictimtosuchbias.Researcherscollecteddata,fromofficialsources,onthe
residentsofDenver:whatpercentageofthemvoted,gavetocharity,andownedalibrarycard.Theythensurveyedtheresidentstoseeifthepercentageswouldmatch.Theresultswere,atthetime,shocking.Whattheresidentsreportedtothesurveys was very different from the data the researchers had gathered. Eventhough nobody gave their names, people, in large numbers, exaggerated theirvoterregistrationstatus,votingbehavior,andcharitablegiving.
Has anything changed in sixty-five years? In the age of the internet, notowningalibrarycardisnolongerembarrassing.But,whilewhat’sembarrassingordesirablemayhavechanged,people’s tendency todeceivepollsters remainsstrong.A recent survey asked University of Maryland graduates various questions
about theircollegeexperience.Theanswerswerecomparedtoofficial records.Peopleconsistentlygavewronginformation,inwaysthatmadethemlookgood.Fewerthan2percentreportedthattheygraduatedwithlowerthana2.5GPA.(Inreality, about 11 percent did.) And 44 percent said they had donated to theuniversityinthepastyear.(Inreality,about28percentdid.)Anditiscertainlypossiblethatlyingplayedaroleinthefailureofthepollsto
predict Donald Trump’s 2016 victory. Polls, on average, underestimated hissupportbyabout2percentagepoints.Somepeoplemayhavebeenembarrassedto say theywere planning to support him.Somemayhave claimed theywereundecidedwhentheywerereallygoingTrump’swayallalong.Whydopeoplemisinformanonymoussurveys?IaskedRogerTourangeau,a
research professor emeritus at the University of Michigan and perhaps theworld’s foremost expert on social desirability bias. Our weakness for “white
lies”isanimportantpartoftheproblem,heexplained.“Aboutone-thirdofthetime,peoplelieinreallife,”hesuggests.“Thehabitscarryovertosurveys.”Thenthere’sthatoddhabitwesometimeshaveoflyingtoourselves.“Thereis
an unwillingness to admit to yourself that, say, you were a screw-up as astudent,”saysTourangeau.Lyingtooneselfmayexplainwhysomanypeoplesaytheyareaboveaverage.
Howbigisthisproblem?Morethan40percentofonecompany’sengineerssaidtheyareinthetop5percent.Morethan90percentofcollegeprofessorssaytheydoabove-averagework.One-quarterofhighschoolseniorsthinktheyareinthetop1percentintheirabilitytogetalongwithotherpeople.Ifyouaredeludingyourself,youcan’tbehonestinasurvey.Anotherfactorthatplaysintoourlyingtosurveysisourstrongdesiretomake
agoodimpressiononthestrangerconductingtheinterview,ifthereissomeoneconducting the interview, that is.AsTourangeauputs it, “Apersonwho lookslikeyourfavoriteauntwalksin....Doyouwanttotellyourfavoriteauntyouusedmarijuanalastmonth?”*Doyouwanttoadmitthatyoudidn’tgivemoneytoyourgoodoldalmamater?For this reason, themore impersonal theconditions, themorehonestpeople
will be. For eliciting truthful answers, internet surveys are better than phonesurveys,whicharebetterthanin-personsurveys.Peoplewilladmitmoreiftheyarealonethanifothersareintheroomwiththem.However, on sensitive topics, every survey method will elicit substantial
misreporting. Tourangeau here used a word that is often thrown around byeconomists:“incentive.”Peoplehavenoincentivetotellsurveysthetruth.How,therefore,canwelearnwhatourfellowhumansarereallythinkingand
doing?Insomeinstances, thereareofficialdatasourceswecanreferencetoget the
truth.Evenifpeoplelieabouttheircharitabledonations,forexample,wecangetrealnumbersaboutgivinginanareafromthecharitiesthemselves.Butwhenwearetryingtolearnaboutbehaviorsthatarenottabulatedinofficialrecordsorweare trying to learn what people are thinking—their true beliefs, feelings, anddesires—thereisnoothersourceofinformationexceptwhatpeoplemaydeigntotellsurveys.Untilnow,thatis.This is the second power of BigData: certain online sources get people to
admit thingstheywouldnotadmitanywhereelse.Theyserveasadigital truthserum.Think ofGoogle searches.Remember the conditions thatmake peoplemorehonest.Online?Check.Alone?Check.Nopersonadministeringasurvey?Check.And there’s another huge advantage that Google searches have in getting
people to tell the truth: incentives. If you enjoy racist jokes, you have zeroincentive to share that un-PC fact with a survey. You do, however, have anincentivetosearchforthebestnewracistjokesonline.Ifyouthinkyoumaybesufferingfromdepression,youdon’thaveanincentivetoadmitthistoasurvey.YoudohaveanincentivetoaskGoogleforsymptomsandpotentialtreatments.Evenifyouarelyingtoyourself,Googlemayneverthelessknowthetruth.A
couple of days before the election, you and some of your neighbors maylegitimatelythinkyouwilldrivetoapollingplaceandcastballots.But,ifyouandtheyhaven’tsearchedforanyinformationonhowtovoteorwheretovote,data scientists likemecan figureout that turnout inyourareawill actuallybelow.Similarly,maybeyouhaven’tadmittedtoyourselfthatyoumaysufferfromdepression,evenasyou’reGooglingaboutcryingjagsanddifficultygettingoutofbed.Youwould showup,however, in anarea’sdepression-related searchesthatIanalyzedearlierinthisbook.Thinkof your own experience usingGoogle. I amguessing youhave upon
occasiontypedthingsintothatsearchboxthatrevealabehaviororthoughtthatyou would hesitate to admit in polite company. In fact, the evidence isoverwhelmingthatalargemajorityofAmericansaretellingGooglesomeverypersonal things. Americans, for instance, search for “porn” more than theysearchfor“weather.”This isdifficult,bytheway, toreconcilewith thesurveydata since only about 25 percent ofmen and 8 percent ofwomen admit theywatchpornography.YoumayhavealsonoticedacertainhonestyinGooglesearcheswhenlooking
at theway this search engine automatically tries to complete your queries. Itssuggestions are based on the most common searches that other people havemade.Soauto-completecluesusintowhatpeopleareGoogling.Infact,auto-completecanbeabitmisleading.Googlewon’tsuggestcertainwordsitdeemsinappropriate, such as “cock,” “fuck,” and “porn.” This means auto-completetellsusthatpeople’sGooglethoughtsarelessracythantheyactuallyare.Even
so,somesensitivestuffoftenstillcomesup.Ifyou type“Why is . . .” the first twoGoogle auto-completes currently are
“Whyistheskyblue?”and“Whyistherealeapday?”suggestingthesearethetwomost commonways tocomplete this search.The third: “Why ismypoopgreen?”AndGoogleauto-completecangetdisturbing.Today,ifyoutypein“Isit normal towant to . . . ,” the first suggestion is “kill.” If you type in “Is itnormaltowanttokill...,”thefirstsuggestionis“myfamily.”Needmoreevidence thatGooglesearchescangiveadifferentpictureof the
worldthantheoneweusuallysee?Considersearchesrelatedtoregretsaroundthedecisiontohaveornottohavechildren.Beforedeciding,somepeoplefeartheymightmakethewrongchoice.And,almostalways,thequestioniswhetherthey will regret not having kids. People are seven times more likely to askGoogle whether they will regret not having children than whether they willregrethavingchildren.Aftermaking their decision—either to reproduce (or adopt) or not—people
sometimes confess to Google that they rue their choice. This may come assomethingofashockbutpost-decision, thenumbersare reversed.Adultswithchildrenare3.6timesmorelikelytotellGoogletheyregrettheirdecisionthanareadultswithoutchildren.Onecaveat thatshouldbekept inmind throughout thischapter:Googlecan
displayabiastowardunseemlythoughts,thoughtspeoplefeeltheycan’tdiscusswith anyone else. Nonetheless, if we are trying to uncover hidden thoughts,Google’sabilitytoferretthemoutcanbeuseful.Andthelargedisparitybetweenregretsonhavingversusnothavingkidsseemstobetellingusthattheunseemlythoughtinthiscaseisasignificantone.Let’s pause for amoment to considerwhat it evenmeans tomake a search
suchas“Iregrethavingchildren.”Googlepresentsitselfasasourcefromwhichwe can seek information directly, on topics like the weather, who won lastnight’sgame,orwhentheStatueofLibertywaserected.ButsometimeswetypeouruncensoredthoughtsintoGoogle,withoutmuchhopethatitwillbeabletohelpus.Inthiscase,thesearchwindowservesasakindofconfessional.There are thousands of searches every year, for example, for “I hate cold
weather,”“Peopleareannoying,”and“Iamsad.”Ofcourse,thosethousandsofGooglesearchesfor“Iamsad”representonlyatinyoffractionofthehundreds
ofmillionsofpeoplewhofeelsadinagivenyear.Searchesexpressingthoughts,ratherthanlookingforinformation,myresearchhasfound,areonlymadebyasmallsampleofeveryoneforwhomthatthoughtcomestomind.Similarly,myresearchsuggeststhattheseventhousandsearchesbyAmericanseveryyearfor“Iregrethavingchildren”representasmallsampleofthosewhohavehadthatthought.Kidsareobviouslyahugejoyformany,probablymost,people.And,despite
mymom’s fear that“youandyourstupiddataanalysis”aregoing to limithernumberofgrandchildren,thisresearchhasnotchangedmydesiretohavekids.Butthatunseemlyregretisinteresting—andanotheraspectofhumanitythatwetendnot tosee in the traditionaldatasets.Ourculture isconstantlyfloodinguswith images ofwonderful, happy families.Most peoplewould never considerhavingchildrenassomethingtheymightregret.Butsomedo.Theymayadmitthistonoone—exceptGoogle.
THETRUTHABOUTSEX
HowmanyAmericanmen are gay? This is a legendary question in sexualityresearch.Yet it has been among the toughest questions for social scientists toanswer. Psychologists no longer believe Alfred Kinsey’s famous estimate—basedonsurveysthatoversampledprisonersandprostitutes—that10percentofAmericanmenaregay.Representativesurveysnowtellusabout2to3percentare.Butsexualpreferencehaslongbeenamongthesubjectsuponwhichpeoplehave tended to lie. I think I can use BigData to give a better answer to thisquestionthanwehaveeverhad.First,moreonthatsurveydata.Surveystellustherearefarmoregaymenin
tolerantstatesthanintolerantstates.Forexample,accordingtoaGallupsurvey,the proportion of the population that is gay is almost twice as high in RhodeIsland,thestatewiththehighestsupportforgaymarriage,thanMississippi,thestatewiththelowestsupportforgaymarriage.There are two likely explanations for this. First, gaymenborn in intolerant
statesmaymovetotolerantstates.Second,gaymeninintolerantstatesmaynotdivulgethattheyaregay;theyareevenmorelikelytolie.Some insight into explanation number one—gay mobility—can be gleaned
fromanotherBigDatasource:Facebook,whichallowsuserstolistwhatgenderthey are interested in. About 2.5 percent of male Facebook users who list agenderofinterestsaytheyareinterestedinmen;thatcorrespondsroughlywithwhat thesurveys indicate.AndFacebook tooshowsbigdifferences in thegaypopulation in states with high versus low tolerance: Facebook has the gaypopulationmorethantwiceashighinRhodeIslandasinMississippi.Facebook also can provide information on how peoplemove around. Iwas
able to code the hometown of a sample of openly gay Facebook users. Thisallowedmetodirectlyestimatehowmanygaymenmoveoutofintolerantstatesinto more tolerant parts of the country. The answer? There is clearly somemobility—fromOklahomaCity to San Francisco, for example. But I estimatethatmen packing up their JudyGarlandCDs and heading to someplacemoreopen-minded can explain less than half of the difference in the openly gaypopulationintolerantversusintolerantstates.*
Inaddition,Facebookallowsustofocusinonhighschoolstudents.Thisisaspecialgroup,becausehighschoolboysrarelygettochoosewheretheylive.Ifmobility explained the state-by-state differences in the openly gay population,thesedifferencesshouldnotappearamonghighschoolusers.Sowhatdoesthehigh school data say? There are far fewer openly gay high school boys inintolerant states. Only two in one thousand male high school students inMississippiareopenlygay.Soitain’tjustmobility.If a similarnumberofgaymenareborn in every state andmobility cannot
fully explainwhy some stateshave somanymoreopenlygaymen, the closetmustbeplayingabigrole.WhichbringsusbacktoGoogle,withwhichsomanypeoplehaveprovedwillingtosharesomuch.Might therebeaway tousepornsearches to testhowmanygaymen there
really are in different states? Indeed, there is.Countrywide, I estimate—usingdatafromGooglesearchesandGoogleAdWords—thatabout5percentofmaleporn searches are for gay-male porn. (Thesewould include searches for suchtermsas“RocketTube,”apopulargaypornographicsite,aswellas“gayporn.”)Andhowdoes thisvary indifferentpartsof the country?Overall, there are
more gay porn searches in tolerant states compared to intolerant states. Thismakessense,giventhatsomegaymenmoveoutofintolerantplacesintotolerantplaces.Butthedifferencesarenotnearlyaslargeasthedifferencessuggestedby
either surveysorFacebook. InMississippi, I estimate that4.8percentofmalepornsearchesareforgayporn,farhigherthanthenumberssuggestedbyeithersurveys or Facebook and reasonably close to the 5.2 percent of pornographysearchesthatareforgayporninRhodeIsland.SohowmanyAmericanmenaregay?Thismeasureofpornographysearches
bymen—roughly5percent are same-sex—seemsa reasonable estimateof thetrue sizeof thegaypopulation in theUnitedStates.And there is another, lessstraightforward way to get at this number. It requires some data science.Wecouldutilize therelationshipbetweentoleranceandtheopenlygaypopulation.Bearwithmeabithere.Mypreliminary research indicates that in a given state every 20 percentage
pointsof support forgaymarriagemeansaboutoneandahalf timesasmanymenfromthatstatewillidentifyopenlyasgayonFacebook.Basedonthis,wecan estimate how many men born in a hypothetically fully tolerant place—where, say, 100 percent of people supported gaymarriage—would be openlygay.My estimate is about 5 percent would be, which fits the data from pornsearches nicely. The closest we might have to growing up in a fully tolerantenvironment ishigh schoolboys inCalifornia’sBayArea.About4percentofthemareopenlygayonFacebook.Thatseemsinlinewithmycalculation.I should note that I have not yet been able to come upwith an estimate of
same-sexattractionforwomen.Thepornographynumbersarelessusefulhere,since far fewer women watch pornography, making the sample lessrepresentative.Andofthosewhodo,evenwomenwhoareprimarilyattractedtomeninreallifeseemtoenjoyviewinglesbianporn.Fully20percentofvideoswatchedbywomenonPornHubarelesbian.FivepercentofAmericanmenbeinggayisanestimate,ofcourse.Somemen
are bisexual; some—especially when young—are not sure what they are.Obviously,youcan’tcountthisaspreciselyasyoumightthenumberofpeoplewhovoteorattendamovie.But one consequence of my estimate is clear: an awful lot of men in the
UnitedStates,particularlyinintolerantstates,arestill inthecloset.Theydon’treveal their sexual preferences on Facebook. They don’t admit it on surveys.Andinmanycases,theymayevenbemarriedtowomen.It turnsout thatwivessuspect theirhusbandsofbeinggayratherfrequently.
They demonstrate that suspicion in the surprisingly common search: “Is myhusbandgay?”“Gay”is10percentmorelikelytocompletesearchesthatbegin“Ismyhusband . . .” than the second-placeword, “cheating.” It is eight timesmore common than “an alcoholic” and ten times more common than“depressed.”Most tellingly perhaps, searches questioning a husband’s sexuality are far
more prevalent in the least tolerant regions. The states with the highestpercentageofwomenaskingthisquestionareSouthCarolinaandLouisiana.Infact, in twenty-one of the twenty-five states where this question is mostfrequentlyasked,supportforgaymarriageislowerthanthenationalaverage.Googleandpornsitesaren’ttheonlyusefuldataresourceswhenitcomesto
men’ssexuality.ThereismoreevidenceavailableinBigDataonwhatitmeansto live in thecloset. IanalyzedadsonCraigslist formales lookingfor“casualencounters.”Thepercentageoftheseadsthatareseekingcasualencounterswithmentendstobelargerinlesstolerantstates.AmongthestateswiththehighestpercentagesareKentucky,Louisiana,andAlabama.Andforevenmoreofaglimpseintothecloset,let’sreturntoGooglesearch
data and get a little more granular. One of the most common searches madeimmediatelybeforeorafter“gayporn”is“gaytest.”(Thesetestspresumetotellmenwhetherornottheyarehomosexual.)Andsearchesfor“gaytest”areabouttwiceasprevalentintheleasttolerantstates.Whatdoesitmeantogobackandforthbetweensearchingfor“gayporn”and
searchingfor“gaytest”?Presumably,itsuggestsafairlyconfusedifnottorturedmind. It’s reasonable to suspect that someof thesemenarehoping toconfirmthattheirinterestingayporndoesnotactuallymeanthey’regay.TheGoogle search data does not allow us to see a particular user’s search
history over time. However, in 2006, AOL released a sample of their users’searches to academic researchers. Here are some of one anonymous user’ssearchesoverasix-dayperiod.
Friday03:49:55 freegaypicks[sic]
Friday03:59:37 lockerroomgaypicks
Friday04:00:14 gaypicks
Friday04:00:35 gaysexpicks
Friday05:08:23 alonggayquiz
Friday05:10:00 agoodgaytest
Friday05:25:07 gaytestsforaconfusedman
Friday05:26:38 gaytests
Friday05:27:22 amigaytests
Friday05:29:18 gaypicks
Friday05:30:01 nakedmenpicks
Friday05:32:27 freenudemenpicks
Friday05:38:19 hotgaysexpicks
Friday05:41:34 hotmanbuttsex
Wednesday13:37:37 amigaytests
Wednesday13:41:20 gaytests
Wednesday13:47:49 hotmanbuttsex
Wednesday13:50:31 freegaysexvidio[sic]
Thiscertainlyreadslikeamanwhoisnotcomfortablewithhissexuality.AndtheGoogledatatellsustherearestillmanymenlikehim.Mostofthem,infact,liveinstatesthatarelesstolerantofsame-sexrelationships.For an even closer look at the people behind these numbers, I asked a
psychiatrist inMississippi,whospecializesinhelpingclosetedgaymen,ifanyofhispatientsmightwant to talk tome.Onemanreachedout.He toldmehewasaretiredprofessor,inhissixties,andmarriedtothesamewomanformore
thanfortyyears.About ten years ago, overwhelmedwith stress, he saw the psychiatrist and
finally acknowledged his sexuality.He has always known hewas attracted tomen,hesays,butthoughtthatthiswasuniversalandsomethingthatallmenjusthid. Shortly after beginning therapy, he had his first, and only, gay sexualencounter,withastudentofhiswhowasinhis late twenties,anexperiencehedescribesas“wonderful.”Heandhiswifedonothavesex.Hesaysthathewouldfeelguiltyeverending
hismarriageoropenlydatingaman.Heregretsvirtuallyeveryoneofhismajorlifedecisions.Theretiredprofessorandhiswifewillgoanothernightwithoutromanticlove,
withoutsex.Despiteenormousprogress,thepersistenceofintolerancewillcausemillionsofotherAmericanstodothesame.
Youmaynotbeshocked to learn that5percentofmenaregayand thatmanyremaininthecloset.Therehavebeentimeswhenmostpeoplewouldhavebeenshocked. And there are still places where many people would be shocked aswell.“In Iran we don’t have homosexuals like in your country,” Mahmoud
Ahmadinejad, thenpresidentofIran, insistedin2007.“InIranwedonothavethis phenomenon.” Likewise, Anatoly Pakhomov, mayor of Sochi, Russia,shortly before his city hosted the 2014Winter Olympics, said of gay people,“We do not have them in our city.” Yet internet behavior reveals significantinterestingayporninSochiandIran.Thisraisesanobviousquestion:arethereanycommonsexualinterestsinthe
United States today that are still considered shocking? It depends what youconsidercommonandhoweasilyshockedyouare.MostofthetopsearchesonPornHubarenotsurprising—theyincludeterms
like“teen,”“threesome,”and“blowjob”formen,phraseslike“passionatelovemaking,”“nipplesucking,”and“maneatingpussy”forwomen.Leaving themainstream,PornHubdatadoes tellusaboutsomefetishes that
youmightnothaveeverguessedexisted.Therearewomenwhosearchfor“analapples” and “humping stuffed animals.” There are men who search for “snotfetish”and“nudecrucifixion.”Butthesesearchesarerare—onlyabouttenevery
monthevenonthishugepornsite.AnotherrelatedpointthatbecomesquiteclearwhenreviewingPornHubdata:
there’s someoneout there for everyone.Women,not surprisingly, often searchfor “tall” guys, “dark” guys, and “handsome” guys. But they also sometimessearch for “short” guys, “pale” guys, and “ugly” guys.There arewomenwhosearch for “disabled” guys, “chubby guy with small dick,” and “fat ugly oldman.” Men frequently search for “thin” women, women with “big tits,” andwomenwith “blonde” hair. But they also sometimes search for “fat” women,womenwith“tinytits,”andwomenwith“greenhair.”Therearemenwhosearchfor“bald”women,“midget”women,andwomenwith“nonipples.”Thisdatacan be cheering for those who are not tall, dark, and handsome or thin, big-breasted,andblonde.*
Whataboutothersearchesthatarebothcommonandsurprising?Amongthe150 most common searches by men, the most surprising for me are theincestuous ones I discussed in the chapter on Freud. Other little-discussedobjects of men’s desire are “shemales” (77th most common search) and“granny” (110th most common search). Overall, about 1.4 percent of men’sPornHubsearchesare forwomenwithpenises.About0.6percent (0.4percentfor men under the age of thirty-four) are for the elderly. Only 1 in 24,000PornHubsearchesbymenareexplicitlyforpreteens;thatmayhavesomethingtodowith the fact thatPornHub, for obvious reasons, bans all formsof childpornographyandpossessingitisillegal.AmongthetopPornHubsearchesbywomenisagenreofpornographythat,I
warn you, will disturb many readers: sex featuring violence against women.Fully25percentoffemalesearchesforstraightpornemphasizethepainand/orhumiliation of the woman—“painful anal crying,” “public disgrace,” and“extreme brutal gangbang,” for example. Five percent look for nonconsensualsex—“rape”or“forced”sex—eventhoughthesevideosarebannedonPornHub.Andsearchratesforallthesetermsareatleasttwiceascommonamongwomenas among men. If there is a genre of porn in which violence is perpetratedagainst awoman,myanalysisof thedata shows that it almost always appealsdisproportionatelytowomen.Of course,when trying to come to termswith this, it is really important to
remember that there is a difference between fantasy and real life. Yes, of the
minority of women who visit PornHub, there is a subset who search—unsuccessfully—for rape imagery. To state the obvious, this does not meanwomenwanttoberapedinreallifeanditcertainlydoesn’tmakerapeanylesshorrificacrime.Whattheporndatadoestellusisthatsometimespeoplehavefantasies they wish they didn’t have and which they may never mention toothers.
Closetsarenotjustrepositoriesoffantasies.Whenitcomestosex,peoplekeepmanysecrets—abouthowmuchtheyarehaving,forexample.In the introduction, I noted that Americans report using far more condoms
than are sold every year. You might therefore think this means they are justsaying they use condoms more often during sex than they actually do. Theevidence suggests they also exaggerate how frequently they are having sex tobeginwith.About11percentofwomenbetween theagesof fifteenandforty-four say they are sexually active, not currently pregnant, and not usingcontraception.Evenwith relatively conservative assumptions about howmanytimestheyarehavingsex,scientistswouldexpect10percentofthemtobecomepregnanteverymonth.ButthiswouldalreadybemorethanthetotalnumberofpregnanciesintheUnitedStates(whichis1in113womenofchildbearingage).Inoursex-obsessedcultureitcanbehardtoadmitthatyouarejustnothavingthatmuch.But ifyou’re looking forunderstandingoradvice,youhave,onceagain, an
incentive to tell Google. OnGoogle, there are sixteen timesmore complaintsaboutaspousenotwantingsexthanaboutamarriedpartnernotbeingwillingtotalk.Therearefiveandahalftimesmorecomplaintsaboutanunmarriedpartnernotwantingsexthananunmarriedpartnerrefusingtotextback.AndGoogle searches suggest a surprising culprit formany of these sexless
relationships.Thereare twiceasmanycomplaints thataboyfriendwon’thavesex than that a girlfriend won’t have sex. By far, the number one searchcomplaintaboutaboyfriendis“Myboyfriendwon’thavesexwithme.”(Googlesearches are not broken downby gender, but, since the previous analysis saidthat95percentofmenarestraight,wecanguessthatnottoomany“boyfriend”searchesarecomingfrommen.)Howshouldweinterpretthis?Doesthisreallyimplythatboyfriendswithhold
sex more than girlfriends? Not necessarily. As mentioned earlier, Googlesearches canbebiased in favorof stuff people areuptight talking about.Menmay feelmore comfortable telling their friends about their girlfriend’s lack ofsexualinterestthanwomenaretellingtheirfriendsabouttheirboyfriend’s.Still,eveniftheGoogledatadoesnotimplythatboyfriendsarereallytwiceaslikelytoavoidsexasgirlfriends,itdoessuggestthatboyfriendsavoidingsexismorecommonthanpeopleleton.Googledataalsosuggestsareasonpeoplemaybeavoidingsexsofrequently:
enormousanxiety,withmuchofitmisplaced.Startwithmen’sanxieties.Itisn’tnews thatmenworryabouthowwell-endowed theyare,but thedegreeof thisworryisratherprofound.Men Google more questions about their sexual organ than any other body
part: more than about their lungs, liver, feet, ears, nose, throat, and braincombined.Men conduct more searches for how to make their penises biggerthanhowtotuneaguitar,makeanomelet,orchangeatire.Men’stopGoogledconcernaboutsteroidsisn’twhethertheymaydamagetheirhealthbutwhethertakingthemmightdiminishthesizeoftheirpenis.Men’stopGoogledquestionrelatedtohowtheirbodyormindwouldchangeastheyagedwaswhethertheirpeniswouldgetsmaller.Side note:One of themore common questions forGoogle regardingmen’s
genitaliais“Howbigismypenis?”ThatmenturntoGoogle,ratherthanaruler,with this question is, inmy opinion, a quintessential expression of our digitalera.*
Dowomencareaboutpenissize?Rarely,accordingtoGooglesearches.Forevery search women make about a partner’s phallus, men make roughly 170searches about their own. True, on the rare occasions women do expressconcerns about a partner’s penis, it is frequently about its size, but notnecessarilythatit’ssmall.Morethan40percentofcomplaintsaboutapartner’spenissizesaythatit’stoobig.“Pain”isthemostGoogledwordusedinsearcheswiththephrase“___duringsex.”(“Bleeding,”“peeing,”“crying,”and“farting”roundoutthetopfive.)Yetonly1percentofmen’ssearcheslookingtochangetheirpenissizeareseekinginformationonhowtomakeitsmaller.Men’s second-most-common sex question is how to make their sexual
encounters longer.Onceagain, the insecuritiesofmendonot appear tomatch
theconcernsofwomen.Thereareroughlythesamenumberofsearchesaskinghow tomakeaboyfriendclimaxmorequicklyasclimaxmoreslowly. In fact,the most common concern women have related to a boyfriend’s orgasm isn’taboutwhenithappenedbutwhyitisn’thappeningatall.Wedon’toftentalkaboutbodyimageissueswhenitcomestomen.Andwhile
it’s true that overall interest in personal appearance skews female, it’s not aslopsided as stereotypes would suggest. According to my analysis of GoogleAdWords, which measures the websites people visit, interest in beauty andfitnessis42percentmale,weightlossis33percentmale,andcosmeticsurgeryis39percentmale.Amongallsearcheswith“howto”relatedtobreasts,about20percentaskhowtogetridofmanbreasts.But,evenifthenumberofmenwholackconfidenceintheirbodiesishigher
than most people would think, women still outpace them when it comes toinsecurityabouthowtheylook.Sowhatcanthisdigitaltruthserumrevealaboutwomen’sself-doubt?EveryyearintheUnitedStates,therearemorethansevenmillionsearcheslookingintobreastimplants.Officialstatisticstellusthatabout300,000womengothroughwiththeprocedureannually.Women also show a great deal of insecurity about their behinds, although
manywomenhaverecentlyflip-floppedonwhatitistheydon’tlikeaboutthem.In 2004, in some parts of the United States, the most common search
regardingchangingone’sbuttwashowtomake itsmaller.Thedesire tomakeone’sbottombiggerwasoverwhelminglyconcentratedinareaswithlargeblackpopulations.Beginningin2010,however,thedesireforbiggerbuttsgrewintherestoftheUnitedStates.Thisinterest,ifnottheposteriordistributionitself,hastripledinfouryears.In2014,thereweremoresearchesaskinghowtomakeyourbutt bigger than smaller in every state. These days, for every five searcheslookingintobreast implantsintheUnitedStates, thereisonelookingintobuttimplants.(Thankyou,KimKardashian!)Does women’s growing preference for a larger bottom match men’s
preferences?Interestingly,yes.“Bigbuttporn”searches,whichalsousedtobeconcentrated in black communities, have recently shot up in popularitythroughouttheUnitedStates.Whatelsedomenwantinawoman’sbody?Asmentionedearlier,andasmost
willfindblindinglyobvious,menshowapreferenceforlargebreasts.About12
percentofnongenericpornographicsearchesarelookingforbigbreasts.Thisisnearlytwentytimeshigherthanthesearchvolumeforsmall-breastporn.That said, it is not clear that this means men want women to get breast
implants.About3percentofbig-breastpornsearchesexplicitlysaytheywanttoseenaturalbreasts.Googlesearchesaboutone’swifeandbreastimplantsareevenlysplitbetween
askinghowtopersuadehertogetimplantsandperplexityastowhyshewantsthem.Orconsiderthemostcommonsearchaboutagirlfriend’sbreasts:“Ilovemy
girlfriend’s boobs.” It is not clear what men are hoping to find from Googlewhenmakingthissearch.Women, like men, have questions about their genitals. In fact, they have
nearlyasmanyquestionsabout theirvaginasasmenhaveabout theirpenises.Women’s worries about their vaginas are often health related. But at least 30percentoftheirquestionstakeupotherconcerns.Womenwanttoknowhowtoshave it, tighten it, andmake it taste better.A strikingly common concern, astoucheduponearlier,ishowtoimproveitsodor.Women are most frequently concerned that their vaginas smell like fish,
followedbyvinegar,onions,ammonia,garlic,cheese,bodyodor,urine,bread,bleach,feces,sweat,metal,feet,garbage,androttenmeat.In general, men do not make many Google searches involving a partner’s
genitalia.Menmake roughly the samenumberof searchesaboutagirlfriend’svaginaaswomendoaboutaboyfriend’spenis.Whenmendosearchaboutapartner’svagina,itisusuallytocomplainabout
whatwomenworryaboutmost: theodor.Mostly,menare trying to figureouthowtotellawomanaboutabadodorwithouthurtingherfeelings.Sometimes,however, men’s questions about odor reveal their own insecurities. Menoccasionallyask forways touse thesmell todetectcheating—if it smells likecondoms,forexample,oranotherman’ssemen.Whatshouldwemakeofallthissecretinsecurity?Thereisclearlysomegood
newshere.Googlegivesuslegitimatereasonstoworrylessthanwedo.Manyofour deepest fears about how our sexual partners perceive us are unjustified.Alone,attheircomputers,withnoincentivetolie,partnersrevealthemselvestobe fairly nonsuperficial and forgiving. In fact,we are all so busy judging our
ownbodiesthatthereislittleenergyleftovertojudgeotherpeople’s.Thereisalsoprobablyaconnectionbetweentwoofthebigconcernsrevealed
in the sexual searches on Google: lack of sex and an insecurity about one’ssexual attractiveness and performance.Maybe these are related.Maybe if weworriedlessaboutsex,we’dhavemoreofit.WhatelsecanGoogle searches tellusabout sex?Wecandoabattleof the
sexes, to seewho ismost generous.Take all searches looking forways to getbetteratperformingoralsexontheoppositegender.Domenlookformoretipsor women? Who is more sexually generous, men or women? Women, duh.Adding up all the possibilities, I estimate the ratio is 2:1 in favor of womenlookingforadviceonhowtobetterperformoralsexontheirpartner.Andwhenmendolookfortipsonhowtogiveoralsex, theyarefrequently
not looking forwaysofpleasinganotherperson.Menmakeasmany searcheslooking forways to performoral sexon themselves as theydohow to give awomananorgasm.(ThisisamongmyfavoritefactsinGooglesearchdata.)
THETRUTHABOUTHATEANDPREJUDICE
Sexandromancearehardlytheonlytopicscloakedinshameand,therefore,notthe only topics about which people keep secrets. Many people are, for goodreason,inclinedtokeeptheirprejudicestothemselves.Isupposeyoucouldcallit progress thatmanypeople today feel theywill be judged if theyadmit theyjudgeotherpeoplebasedon their ethnicity, sexualorientation,or religion.ButmanyAmericansstilldo.(Thisisanothersection,Iwarnreaders,thatincludesdisturbingmaterial.)You can see this on Google, where users sometimes ask questions such as
“Whyareblackpeoplerude?”or“WhyareJewsevil?”Below,inorder,arethetopfivenegativewordsusedinsearchesaboutvariousgroups.
A few patterns among these stereotypes stand out. For example, AfricanAmericansaretheonlygroupthatfacesa“rude”stereotype.Nearlyeverygroupisavictimofa“stupid”stereotype;theonlytwothatarenot:JewsandMuslims.The “evil” stereotype is applied to Jews, Muslims, and gays but not blackpeople,Mexicans,Asians,andChristians.Muslims are the only group stereotyped as terrorists. When a Muslim
American plays into this stereotype, the response can be instantaneous andvicious. Google search data can give us a minute-by-minute peek into sucheruptionsofhate-fueledrage.Considerwhat happened shortly after themass shooting inSanBernardino,
California, onDecember2, 2015.Thatmorning,RizwanFarookandTashfeenMalik entered a meeting of Farook’s coworkers armed with semiautomaticpistols and semiautomatic rifles andmurdered fourteen people. That evening,literally minutes after the media first reported one of the shooters’ Muslim-sounding name, a disturbing number of Californians had decided what theywantedtodowithMuslims:killthem.ThetopGooglesearchinCaliforniawiththeword“Muslims”initatthetime
was “kill Muslims.” And overall, Americans searched for the phrase “killMuslims”withaboutthesamefrequencythattheysearchedfor“martinirecipe,”“migraine symptoms,” and “Cowboys roster.” In the days following the SanBernardinoattack,foreveryAmericanconcernedwith“Islamophobia,”anotherwas searching for “killMuslims.”While hate searcheswere approximately20
percent of all searches aboutMuslims before the attack,more than half of allsearchvolumeaboutMuslimsbecamehatefulinthehoursthatfollowedit.And thisminute-by-minute searchdata can tellushowdifficult it canbe to
calmthisrage.Fourdaysaftertheshooting,then-presidentObamagaveaprime-time address to the country. He wanted to reassure Americans that thegovernment could both stop terrorism and, perhapsmore important, quiet thisdangerousIslamophobia.Obamaappealedtoourbetterangels,speakingoftheimportanceofinclusion
and tolerance.The rhetoricwaspowerful andmoving.TheLosAngelesTimespraisedObamafor“[warning]againstallowingfeartocloudourjudgment.”TheNew York Times called the speech both “tough” and “calming.” The websiteThink Progress praised it as “a necessary tool of good governance, gearedtowards saving the lives of Muslim Americans.” Obama’s speech, in otherwords,wasjudgedamajorsuccess.Butwasit?Google search data suggests otherwise. Together with Evan Soltas, then at
Princeton, I examined the data. In his speech, the president said, “It is theresponsibility of allAmericans—of every faith—to reject discrimination.”Butsearches calling Muslims “terrorists,” “bad,” “violent,” and “evil” doubledduring and shortly after the speech. President Obama also said, “It is ourresponsibility to reject religious tests onwhowe admit into this country.”Butnegative searches about Syrian refugees, a mostly Muslim group thendesperatelylookingforasafehaven,rose60percent,whilesearchesaskinghowto help Syrian refugees dropped 35 percent. Obama askedAmericans to “notforgetthatfreedomismorepowerfulthanfear.”Yetsearchesfor“killMuslims”tripledduringhisspeech.Infact,justabouteverynegativesearchwecouldthinkto test regardingMuslims shot up during and after Obama’s speech, and justabouteverypositivesearchwecouldthinktotestdeclined.Inotherwords,Obamaseemedtosayall theright things.All the traditional
media congratulated Obama on his healing words. But new data from theinternet,offeringdigitaltruthserum,suggestedthatthespeechactuallybackfiredinitsmaingoal.Insteadofcalmingtheangrymob,aseverybodythoughthewasdoing,theinternetdatatellsusthatObamaactuallyinflamedit.Thingsthatwethink areworking can have the exact opposite effect from the onewe expect.Sometimesweneedinternetdata tocorrectour instinct topatourselvesonthe
back.So what should Obama have said to quell this particular form of hatred
currentlysovirulentinAmerica?We’llcirclebacktothatlater.Rightnowwe’regoingtotakealookatanage-oldveinofprejudiceintheUnitedStates,theformof hate that in fact stands out above the rest, the one that has been themostdestructiveandthetopicoftheresearchthatbeganthisbook.InmyworkwithGooglesearchdata, thesinglemost tellingfactIhavefoundregardinghateontheinternetisthepopularityoftheword“nigger.”Either singular or in its plural form, theword “nigger” is included in seven
million American searches every year. (Again, the word used in rap songs isalmostalways“nigga,”not“nigger,”sothere’snosignificantimpactfromhip-hoplyricstoaccountfor.)Searchesfor“niggerjokes”areseventeentimesmorecommon than searches for “kike jokes,” “gook jokes,” “spic jokes,” “chinkjokes,”and“fagjokes”combined.When are searches for “nigger(s)”—or “nigger jokes”—most common?
WheneverAfrican-Americans are in the news.Among the periodswhen suchsearcheswerehighestwastheimmediateaftermathofHurricaneKatrina,whentelevision and newspapers showed images of desperate black people in NewOrleans struggling for their survival. They also shot up during Obama’s firstelection.And searches for “nigger jokes” rise on average about 30percent onMartinLutherKingJr.Day.The frightening ubiquity of this racial slur throws into doubt some current
understandingsofracism.AnytheoryofracismhastoexplainabigpuzzleinAmerica.Ontheonehand,
theoverwhelmingmajorityofblackAmericansthinktheysufferfromprejudice—and they have ample evidence of discrimination in police stops, jobinterviews, and jury decisions. On the other hand, very fewwhite Americanswilladmittobeingracist.The dominant explanation among political scientists recently has been that
thisisdue,inlargepart,towidespreadimplicitprejudice.WhiteAmericansmaymeanwell,thistheorygoes,buttheyhaveasubconsciousbias,whichinfluencestheirtreatmentofblackAmericans.Academicsinventedaningeniouswaytotestforsuchabias.Itiscalledtheimplicit-associationtest.The tests have consistently shown that it takes most people milliseconds
longer to associateblack faceswithpositivewords, suchas “good,” thanwithnegativewords, such as “awful.”Forwhite faces, the pattern is reversed.Theextratimeittakesisevidenceofsomeone’simplicitprejudice—aprejudicethepersonmaynotevenbeawareof.There is, though, an alternative explanation for the discrimination that
African-Americansfeelandwhitesdeny:hiddenexplicit racism.Suppose thereis a reasonably widespread conscious racism of which people are very muchawarebut towhich theywon’tconfess—certainlynot inasurvey.That’swhatthesearchdataseemstobesaying.Thereisnothingimplicitaboutsearchingfor“niggerjokes.”Andit’shardtoimaginethatAmericansareGooglingtheword“nigger” with the same frequency as “migraine” and “economist” withoutexplicitracismhavingamajorimpactonAfrican-Americans.PriortotheGoogledata,wedidn’thaveaconvincingmeasureofthisvirulentanimus.Nowwedo.Weare,therefore,inapositiontoseewhatitexplains.It explains, as discussed earlier,whyObama’s vote totals in 2008 and2012
were depressed inmany regions. It also correlateswith the black-whitewagegap,asateamofeconomistsrecentlyreported.TheareasthatIhadfoundmakethemostracistsearches,inotherwords,underpayblackpeople.AndthenthereisthephenomenonofDonaldTrump’scandidacy.Asnotedintheintroduction,when Nate Silver, the polling guru, looked for the geographic variable thatcorrelated most strongly with support in the 2016 Republican primary forTrump, he found it in themap of racism I had developed. That variable wassearchesfor“nigger(s).”Scholars have recently put together a state-by-state measure of implicit
prejudiceagainstblackpeople,whichhasenabledmetocomparetheeffectsofexplicitracism,asmeasuredbyGooglesearches,andimplicitbias.Forexample,I tested how much each worked against Obama in both of his presidentialelections. Using regression analysis, I found that, to predict where Obamaunderperformed, an area’s racist Google searches explained a lot. An area’sperformanceonimplicit-associationtestsaddedlittle.Tobeprovocativeandtoencouragemoreresearchinthisarea,letmeputforth
thefollowingconjecture,readytobetestedbyscholarsacrossarangeoffields.TheprimaryexplanationfordiscriminationagainstAfricanAmericanstodayisnot the fact that the peoplewho agree to participate in lab experimentsmake
subconscious associations between negative words and black people; it is thefact that millions of white Americans continue to do things like search for“niggerjokes.”
The discrimination black people regularly experience in the United Statesappearstobefueledmorewidelybyexplicit,ifhidden,hostility.But,forothergroups, subconscious prejudice may have a more fundamental impact. Forexample,IwasabletouseGooglesearchestofindevidenceofimplicitprejudiceagainstanothersegmentofthepopulation:younggirls.Andwho,mightyouask,wouldbeharboringbiasagainstgirls?Theirparents.It’shardlysurprising thatparentsofyoungchildrenareoftenexcitedby the
thoughtthattheirkidsmightbegifted.Infact,ofallGooglesearchesstarting“Ismy2-year-old,”themostcommonnextwordis“gifted.”Butthisquestionisnotasked equally about young boys and young girls. Parents are two and a halftimes more likely to ask “Is my son gifted?” than “Is my daughter gifted?”Parentsshowasimilarbiaswhenusingotherphrasesrelatedtointelligencethattheymayshyawayfromsayingaloud,like,“Ismysonagenius?”Are parents picking up on legitimate differences between young girls and
boys?Perhapsyoungboysaremorelikelythanyounggirlstousebigwordsorotherwise show objective signs of giftedness? Nope. If anything, it’s theopposite. At young ages, girls have consistently been shown to have largervocabulariesandusemorecomplexsentences.InAmericanschools,girlsare9percentmorelikelythanboystobeingiftedprograms.Despiteallthis,parentslooking around the dinner table appear to seemore gifted boys than girls.* Infact, on every search term related to intelligence I tested, including thoseindicatingitsabsence,parentsweremorelikelytobeinquiringabouttheirsonsratherthantheirdaughters.Therearealsomoresearchesfor“ismysonbehind”or“stupid”thancomparablesearchesfordaughters.Butsearcheswithnegativewordslike“behind”and“stupid”arelessspecificallyskewedtowardsonsthansearcheswithpositivewords,suchas“gifted”or“genius.”What then are parents’ overriding concerns regarding their daughters?
Primarily, anything related to appearance. Consider questions about a child’sweight. Parents Google “Is my daughter overweight?” roughly twice as
frequentlyas theyGoogle“Ismysonoverweight?”Parentsareabout twiceaslikelytoaskhowtogettheirdaughterstoloseweightastheyaretoaskhowtoget their sons to do the same. Just as with giftedness, this gender bias is notgroundedinreality.About28percentofgirlsareoverweight,while35percentof boys are. Even though scales measure more overweight boys than girls,parents see—or worry about—overweight girls much more frequently thanoverweightboys.Parentsarealsooneandahalftimesmorelikelytoaskwhethertheirdaughter
isbeautifulthanwhethertheirsonishandsome.Andtheyarenearlythreetimesmorelikelytoaskwhethertheirdaughterisuglythanwhethertheirsonisugly.(HowGoogleisexpectedtoknowwhetherachildisbeautifuloruglyishardtosay.)Ingeneral,parentsseemmorelikelytousepositivewordsinquestionsabout
sons. They aremore apt to askwhether a son is “happy” and less apt to askwhetherasonis“depressed.”Liberal readers may imagine that these biases are more common in
conservativepartsofthecountry,butIdidn’tfindanyevidenceofthat.Infact,Idid not find a significant relationship between any of these biases and thepolitical or culturalmakeup of a state.Nor is there evidence that these biaseshave decreased since 2004, the year for which Google search data is firstavailable. Itwould seem thisbias againstgirls ismorewidespreadanddeeplyingrainedthanwe’dcaretobelieve.
Sexismisnottheonlyplaceourstereotypesaboutprejudicemaybeoff.Vikingmaiden88 is twenty-six years old. She enjoys reading history and
writingpoetry.HersignaturequoteisfromShakespeare.IgleanedallthisfromherprofileandpostsonStormfront.org,America’smostpopularonlinehatesite.I also learned thatVikingmaiden88 has enjoyed the content on the site of thenewspaperIworkfor,theNewYorkTimes.ShewroteanenthusiasticpostaboutaparticularTimesfeature.I recently analyzed tens of thousands of such Stormfront profiles, inwhich
registered members can enter their location, birth date, interests, and otherinformation.Stormfront was founded in 1995 by Don Black, a former Ku Klux Klan
leader.Itsmostpopular“socialgroups”are“UnionofNationalSocialists”and“Fans and Supporters of Adolf Hitler.” Over the past year, according toQuantcast,roughly200,000to400,000Americansvisitedthesiteeverymonth.ArecentSouthernPovertyLawCenterreportlinkednearlyonehundredmurdersinthepastfiveyearstoregisteredStormfrontmembers.StormfrontmembersarenotwhomIwouldhaveguessed.They tend to be young, at least according to self-reported birth dates. The
mostcommonageatwhichpeoplejointhesiteisnineteen.Andfourtimesmorenineteen-year-oldssignupthanforty-year-olds.Internetandsocialnetworkusersleanyoung,butnotnearlythatyoung.Profiles do not have a field for gender. But I looked at all the posts and
completeprofilesof a randomsampleofAmericanusers, and it turnsout thatyoucanworkoutthegenderofmostofthemembership:Iestimatethatabout30percentofStormfrontmembersarefemale.ThestateswiththemostmemberspercapitaareMontana,Alaska,andIdaho.
Thesestatestendtobeoverwhelminglywhite.Doesthismeanthatgrowingupwithlittlediversityfostershate?Probably not. Rather, since those states have a higher proportion of non-
Jewishwhitepeople,theyhavemorepotentialmembersforagroupthatattacksJewsandnonwhites.ThepercentageofStormfront’stargetaudiencethatjoinsisactuallyhigherinareaswithmoreminorities.Thisisparticularlytruewhenyoulook atStormfront’smemberswho are eighteen andyounger and therefore donotthemselveschoosewheretheylive.Among this age group, California, a state with one of the largest minority
populations,hasamembershiprate25percenthigherthanthenationalaverage.One of the most popular social groups on the site is “In Support of Anti-
Semitism.” The percentage of members who join this group is positivelycorrelatedwithastate’sJewishpopulation.NewYork,thestatewiththehighestJewishpopulation,hasabove-averagepercapitamembershipinthisgroup.In 2001, Dna88 joined Stormfront, describing himself as a “good looking,
raciallyaware” thirty-year-old Internetdeveloper living in“JewYorkCity.” Inthe next four months, he wrote more than two hundred posts, like “JewishCrimesAgainstHumanity”and“JewishBloodMoney,”anddirectedpeopletoawebsite, jewwatch.com, which claims to be a “scholarly library” on “Zionist
criminality.”Stormfrontmemberscomplainaboutminorities’speakingdifferentlanguages
andcommittingcrimes.ButwhatIfoundmost interestingwerethecomplaintsaboutcompetitioninthedatingmarket.AmancallinghimselfWilliamLyonMackenzieKing, after a formerprime
minister of Canada who once suggested that “Canada should remain a whiteman’s country,” wrote in 2003 that he struggled to “contain” his “rage” afterseeingawhitewoman“carryingaroundherhalfblackuglymongrelniglet.”Inherprofile,Whitepride26,aforty-one-year-oldstudentinLosAngeles,says,“Idislikeblacks,Latinos,andsometimesAsians,especiallywhenmen find themmoreattractive”than“awhitefemale.”Certainpoliticaldevelopmentsplayarole.Thedaythatsawthebiggestsingle
increaseinmembershipinStormfront’shistory,byfar,wasNovember5,2008,the day after Barack Obama was elected president. There was, however, noincreased interest in Stormfront duringDonald Trump’s candidacy and only asmall rise immediatelyafterhewon.Trumprodeawaveofwhitenationalism.Thereisnoevidenceherethathecreatedawaveofwhitenationalism.Obama’selection led toa surge in thewhitenationalistmovement.Trump’s
electionseemstobearesponsetothat.Onethingthatdoesnotseemtomatter:economics.Therewasnorelationship
between monthly membership registration and a state’s unemployment rate.States disproportionately affected by theGreatRecession saw no comparativeincreaseinGooglesearchesforStormfront.But perhaps what wasmost interesting—and surprising—were some of the
topicsofconversationStormfrontmembershave.Theyaresimilar to thosemyfriends and I talk about. Maybe it was my own naïveté, but I would haveimagined white nationalists inhabiting a different universe from that of myfriends andme. Instead theyhave long threads praisingGameofThrones anddiscussing thecomparativemeritsofonlinedatingsites, likePlentyOfFishandOkCupid.And the key fact that shows that Stormfront users are inhabiting similar
universes as people like me and my friends: the popularity of theNew YorkTimesamongStormfrontusers.Itisn’tjustVikingMaiden88hangingaroundtheTimes site.Thesite ispopularamongmanyof itsmembers. In fact,whenyou
compareStormfrontuserstopeoplewhovisittheYahooNewssite,itturnsoutthattheStormfrontcrowdistwiceaslikelytovisitnytimes.com.Members of a hate site perusing the oh-so-liberal nytimes.com?How could
thispossiblybe?IfasubstantialnumberofStormfrontmembersgettheirnewsfromnytimes.com,itmeansourconventionalwisdomaboutwhitenationalistsiswrong.Italsomeansourconventionalwisdomabouthowtheinternetworksiswrong.
THETRUTHABOUTTHEINTERNET
The internet,mosteverybodyagrees, isdrivingAmericansapart,causingmostpeople to hole up in sites geared toward people like them. Here’s how CassSunsteinofHarvardLawSchooldescribedthesituation:“Ourcommunicationsmarket is rapidlymoving[towardasituationwhere]people restrict themselvesto their own points of view—liberals watching and reading mostly or onlyliberals; moderates, moderates; conservatives, conservatives; Neo-Nazis, Neo-Nazis.”This viewmakes sense.After all, the internet gives us a virtually unlimited
numberofoptionsfromwhichwecanconsumethenews.IcanreadwhateverIwant.Youcanreadwhateveryouwant.VikingMaiden88canreadwhatevershewants.Andpeople,iflefttotheirowndevices,tendtoseekoutviewpointsthatconfirmwhat theybelieve.Thus, surely, the internetmustbe creatingextremepoliticalsegregation.Thereisoneproblemwiththisstandardview.Thedatatellsusthatitissimply
nottrue.Theevidenceagainst thispieceofconventionalwisdomcomes froma2011
study byMatt Gentzkow and Jesse Shapiro, two economists whose work wediscussedearlier.Gentzkow and Shapiro collected data on the browsing behavior of a large
sampleofAmericans.Theirdatasetalsoincludedtheideology—self-reported—of their subjects: whether people considered themselves more liberal orconservative. They used this data to measure the political segregation on theinternet.How?Theyperformedaninterestingthoughtexperiment.
Suppose you randomly sampled two Americans who happen to both bevisiting the same news website. What is the probability one of them will beliberal and theotherconservative?Howfrequently, inotherwords,do liberalsandconservatives“meet”onnewssites?Tothinkaboutthisfurther,supposeliberalsandconservativesontheinternet
never got their online news from the same place. In other words, liberalsexclusivelyvisitedliberalwebsites,conservativesexclusivelyconservativeones.Ifthiswerethecase,thechancesthattwoAmericansonagivennewssitehaveopposing political viewswould be 0 percent. The internet would be perfectlysegregated.Liberalsandconservativeswouldnevermix.Suppose,incontrast,thatliberalsandconservativesdidnotdifferatallinhow
they got their news. In otherwords, a liberal and a conservativewere equallylikelytovisitanyparticularnewssite.Ifthiswerethecase,thechancesthattwoAmericans on a given news website have opposing political views would beroughly50percent.Theinternetwouldbeperfectlydesegregated.Liberalsandconservativeswouldperfectlymix.Sowhat does the data tell us? In theUnitedStates, according toGentzkow
and Shapiro, the chances that two people visiting the same news site havedifferentpoliticalviews is about45percent. Inotherwords, the internet is farcloser to perfect desegregation than perfect segregation. Liberals andconservativesare“meeting”eachotherontheweballthetime.What really puts the lack of segregation on the internet in perspective is
comparing it to segregation in other parts of our lives.GentzkowandShapirocouldrepeattheiranalysisforvariousofflineinteractions.Whatarethechancesthat two familymembers have different political views?Two neighbors? Twocolleagues?Twofriends?UsingdatafromtheGeneralSocialSurvey,GentzkowandShapirofoundthat
allthesenumberswerelowerthanthechancesthattwopeopleonthesamenewswebsitehavedifferentpolitics.
PROBABILITYTHATSOMEONEYOUMEETHASOPPOSINGPOLITICALVIEWS
OnaNewsWebsite 45.2%
Coworker 41.6%
OfflineNeighbor 40.3
FamilyMember 37%
Friend 34.7%
Inotherwords,youaremore likely to comeacross someonewithopposingviewsonlinethanyouareoffline.Why isn’t the internet more segregated? There are two factors that limit
politicalsegregationontheinternet.First,somewhatsurprisingly,theinternetnewsindustryisdominatedbyafew
massive sites. We usually think of the internet as appealing to the fringes.Indeed, there are sites for everybody, no matter your viewpoints. There arelanding spots for pro-gun and anti-gun crusaders, cigar rights and dollar coinactivists,anarchistsandwhitenationalists.Butthesesitestogetheraccountforasmallfractionof the internet’snewstraffic.Infact, in2009,foursites—YahooNews,AOLNews,msnbc.com,andcnn.com—collectedmorethanhalfofnewsviews.YahooNewsremainsthemostpopularnewssiteamongAmericans,withclose to 90million uniquemonthly visitors—or some 600 times Stormfront’saudience. Mass media sites like Yahoo News appeal to a broad, politicallydiverseaudience.The second reason the internet isn’t all that segregated is thatmany people
withstrongpoliticalopinionsvisitsitesoftheoppositeviewpoint,ifonlytogetangry and argue.Political junkies donot limit themselves only to sites gearedtoward them. Someone who visits thinkprogress.org and moveon.org—twoextremely liberal sites—is more likely than the average internet user to visitfoxnews.com, a right-leaning site. Someone who visits rushlimbaugh.com orglennbeck.com—two extremely conservative sites—is more likely than theaverageinternetusertovisitnytimes.com,amoreliberalsite.Gentzkow and Shapiro’s studywas based on data from 2004–09, relatively
early in the history of the internet. Might the internet have grown morecompartmentalized since then?Have socialmedia and, inparticular,Facebookalteredtheirconclusion?Clearly,ifourfriendstendtoshareourpoliticalviews,theriseofsocialmediashouldmeanariseofechochambers.Right?
Again, the story is not so simple.While it is true that people’s friends onFacebookaremorelikelythannottosharetheirpoliticalviews,ateamofdatascientists—Eytan Bakshy, Solomon Messing, and Lada Adamic—have foundthatasurprisingamountoftheinformationpeoplegetonFacebookcomesfrompeoplewithopposingviews.Howcanthisbe?Don’tourfriendstendtoshareourpoliticalviews?Indeed,
they do. But there is one crucial reason that Facebook may lead to a morediverse political discussion than offline socializing. People, on average, havesubstantiallymorefriendsonFacebookthantheydooffline.Andtheseweaktiesfacilitated by Facebook are more likely to be people with opposite politicalviews.In otherwords, Facebook exposes us toweak social connections—the high
schoolacquaintance,thecrazythirdcousin,thefriendofthefriendofthefriendyousortof,kindof,maybeknow.Thesearepeopleyoumightnevergobowlingwithortoabarbecuewith.Youmightnotinvitethemovertoadinnerparty.ButyoudoFacebookfriendthem.Andyoudoseetheirlinkstoarticleswithviewsyoumighthaveneverotherwiseconsidered.Insum,theinternetactuallybringspeopleofdifferentpoliticalviewstogether.
Theaverageliberalmayspendhermorningwithherliberalhusbandandliberalkids; her afternoon with her liberal coworkers; her commute surrounded byliberalbumperstickers;hereveningwithherliberalyogaclassmates.Whenshecomes home and peruses a few conservative comments on cnn.com or gets aFacebook link fromherRepublicanhigh schoolacquaintance, thismaybeherhighestconservativeexposureoftheday.I probably never encounterwhite nationalists inmy favorite coffee shop in
Brooklyn.ButVikingMaiden88andIbothfrequenttheNewYorkTimessite.
THETRUTHABOUTCHILDABUSEANDABORTION
The internet can give us insights into not just disturbing attitudes but alsodisturbing behaviors. Indeed, Google data may be effective at alerting us tocrisesthataremissedbyall theusualsources.People,afterall, turntoGooglewhentheyareintrouble.ConsiderchildabuseduringtheGreatRecession.
Whenthismajoreconomicdownturnstartedinlate2007,manyexpertswerenaturally worried about the effect it might have on children. After all, manyparents would be stressed and depressed, and these aremajor risk factors formaltreatment.Childabusemightskyrocket.Thentheofficialdatacamein,anditseemedthattheworrywasunfounded.
Childprotectiveserviceagenciesreportedthattheyweregettingfewercasesofabuse. Further, these drops were largest in states that were hardest hit by therecession. “The doom-and-gloom predictions haven’t come true,” RichardGelles, a child welfare expert at the University of Pennsylvania, told theAssociatedPressin2011.Yes,ascounterintuitiveasitmayhaveseemed,childabuseseemedtohaveplummetedduringtherecession.Butdidchildabusereallydropwithsomanyadultsoutofworkandextremely
distressed?Ihadtroublebelievingthis.SoIturnedtoGoogledata.It turns out, somekidsmake some tragic, andheart-wrenching, searches on
Google—suchas“mymombeatme”or“mydadhitme.”And thesesearchespresentadifferent—andagonizing—pictureofwhathappenedduringthistime.The number of searches like this shot up during theGreat Recession, closelytrackingtheunemploymentrate.Here’swhat I thinkhappened: itwas the reportingofchildabusecases that
declined, not the child abuse itself.After all, it is estimated that only a smallpercentageofchildabusecasesarereportedtoauthoritiesanyway.Andduringarecession,manyofthepeoplewhotendtoreportchildabusecases(teachersandpoliceofficers,forexample)andhandlecases(childprotectiveserviceworkers)aremorelikelytobeoverworkedoroutofwork.Thereweremany stories during the economicdownturn of people trying to
reportpotentialcasesfacinglongwaittimesandgivingup.Indeed, there ismore evidence, this time not fromGoogle, that child abuse
actuallyroseduringtherecession.Whenachilddiesduetoabuseorneglectithastobereported.Suchdeaths,althoughrare,didriseinstatesthatwerehardesthitbytherecession.And there is some evidence fromGoogle thatmore peoplewere suspecting
abuse inhard-hit areas.Controlling forpre-recession ratesandnational trends,states that had comparatively suffered themost had increased search rates forchild abuse and neglect. For every percentage point increase in the
unemploymentrate,therewasanassociated3percentincreaseinthesearchratefor “child abuse” or “child neglect.” Presumably, most of these people neversuccessfully reported the abuse, as these states had the biggest drops in thereporting.Searchesbysufferingkids increase.Therateofchilddeathsspike.Searches
bypeoplesuspectingabusegoupinhard-hitstates.Butreportingofcasesgoesdown.ArecessionseemstocausemorekidstotellGooglethattheirparentsarehittingorbeatingthemandmorepeopletosuspectthattheyseeabuse.Buttheoverworkedagenciesareabletohandlefewercases.I thinkit’ssafe tosaythat theGreatRecessiondidmakechildabuseworse,
althoughthetraditionalmeasuresdidnotshowit.
AnytimeIsuspectpeoplemaybesufferingoffthebooksnow,IturntoGoogledata.Oneofthepotentialbenefitsofthisnewdata,andknowinghowtointerpretit, is the possibility of helping vulnerable people who might otherwise gooverlookedbyauthorities.So when the Supreme Court was recently looking into the effects of laws
makingitmoredifficulttogetanabortion,Iturnedtothequerydata.Isuspectedwomen affected by this legislation might look into off-the-books ways toterminateapregnancy.Theydid.Andthesesearcheswerehighestinstatesthathadpassedlawsrestrictingabortions.Thesearchdatahereisbothusefulandtroubling.In2015,intheUnitedStates,thereweremorethan700,000Googlesearches
lookingintoself-inducedabortions.Bycomparison,thereweresome3.4millionsearchesforabortionclinicsthatyear.Thissuggeststhatasignificantpercentageofwomenconsideringanabortionhavecontemplateddoingitthemselves.Women searched, about 160,000 times, for ways of getting abortion pills
through unofficial channels—“buy abortion pills online” and “free abortionpills.”TheyaskedGoogleaboutabortionbyherbslikeparsleyorbyvitaminC.Thereweresome4,000searcheslookingfordirectionsoncoathangerabortions,includingabout1,300fortheexactphrase“howtodoacoathangerabortion.”Therewere also a few hundred looking into abortion through bleaching one’suterusandpunchingone’sstomach.What drives interest in self-induced abortion?Thegeography and timingof
theGoogle searches point to a likely culprit:when it’s hard to get an officialabortion,womenlookintooff-the-booksapproaches.Search rates for self-inducedabortionwere fairly steady from2004 through
2007.Theybegantoriseinlate2008,coincidingwiththefinancialcrisisandtherecessionthatfollowed.Theytookabigleapin2011,jumping40percent.TheGuttmacherInstitute,areproductiverightsorganization,singlesout2011asthebeginning of the country’s recent crackdown on abortion; ninety-two stateprovisionsthatrestrictaccesstoabortionwereenacted.LookingbycomparisonatCanada,whichhasnotseenacrackdownonreproductiverights,therewasnocomparableincreaseinsearchesforself-inducedabortionsduringthistime.ThestatewiththehighestrateofGooglesearchesforself-inducedabortionsis
Mississippi,astatewithroughlythreemillionpeopleand,now,justoneabortionclinic. Eight of the ten states with the highest search rates for self-inducedabortionsareconsideredbytheGuttmacherInstitutetobehostileorveryhostiletoabortion.Noneofthetenstateswiththelowestsearchratesforself-inducedabortionareineithercategory.Of course, we cannot know from Google searches how many women
successfullygive themselvesabortions,butevidencesuggests thatasignificantnumbermay.Onewaytoilluminatethisistocompareabortionandbirthdata.In2011,thelastyearwithcompletestate-levelabortiondata,womenlivingin
stateswithfewabortionclinicshadmanyfewerlegalabortions.Compare the ten stateswith themost abortion clinics per capita (a list that
includes New York and California) to the ten states with the fewest abortionclinicspercapita(alistthatincludesMississippiandOklahoma).Womenlivinginstateswiththefewestabortionclinicshad54percentfewerlegalabortions—adifference of eleven abortions for every thousandwomenbetween the ages offifteen and forty-four.Women living in stateswith the fewest abortion clinicsalsohadmore livebirths.However, thedifferencewasnotenoughtomakeupfor the lower number of abortions. Therewere sixmore live births for everythousandwomenofchildbearingage.Inotherwords,thereappeartohavebeensomemissingpregnanciesinparts
ofthecountrywhereitwashardesttogetanabortion.Theofficialsourcesdon’ttelluswhathappenedtothosefivemissingbirthsforeachthousandwomeninstateswhereitishardtogetanabortion.
Googleprovidessomeprettygoodclues.Wecan’tblindlytrustgovernmentdata.Thegovernmentmaytellusthatchild
abuseorabortionhasfallenandpoliticiansmaycelebratethisachievement.Buttheresultswethinkwe’reseeingmaybeanartifactofflawsinthemethodsofdatacollection.Thetruthmaybedifferent—and,sometimes,fardarker.
THETRUTHABOUTYOURFACEBOOKFRIENDS
ThisbookisaboutBigData,ingeneral.ButthischapterhasmostlyemphasizedGooglesearches,whichIhavearguedrevealahiddenworldverydifferentfromtheonewe thinkwesee.SoareotherBigDatasourcesdigital truthserum,aswell? The fact is, many Big Data sources, such as Facebook, are often theoppositeofdigitaltruthserum.On socialmedia, as in surveys, you have no incentive to tell the truth. On
socialmedia,muchmoresothaninsurveys,youhavealargeincentivetomakeyourself lookgood.Youronlinepresence is not anonymous, after all.You arecourting an audience and telling your friends, family members, colleagues,acquaintances,andstrangerswhoyouare.Toseehowbiaseddatapulledfromsocialmediacanbe,considertherelative
popularityoftheAtlantic,arespected,highbrowmonthlymagazine,versus theNational Enquirer, a gossipy, often-sensational magazine. Both publicationshavesimilaraveragecirculations, sellinga fewhundred thousandcopies. (TheNationalEnquirer isaweekly,soitactuallysellsmoretotalcopies.)TherearealsoacomparablenumberofGooglesearchesforeachmagazine.However,onFacebook,roughly1.5millionpeopleeitherliketheAtlanticor
discuss articles from theAtlantic on their profiles.Only about 50,000 like theEnquirerordiscussitscontents.
ATLANTICVS.NATIONALENQUIRERPOPULARITYCOMPAREDBYDIFFERENTSOURCES
Circulation Roughly1Atlanticforevery1NationalEnquirer
Google 1Atlanticforevery1NationalEnquirer
Searches
FacebookLikes
27Atlanticforevery1NationalEnquirer
Forassessingmagazinepopularity,circulationdataisthegroundtruth.Googledatacomesclose tomatching it.AndFacebookdata isoverwhelminglybiasedagainstthetrashytabloid,makingittheworstdatafordeterminingwhatpeoplereallylike.And as with reading preferences, so with life. On Facebook, we show our
cultivatedselves,notourtrueselves.IuseFacebookdatainthisbook,infactinthischapter,butalwayswiththiscaveatinmind.
To gain a better understanding of what social media misses, let’s return topornographyforamoment.First,weneedtoaddressthecommonbeliefthattheinternet is dominated by smut. This isn’t true. Themajority of content on theinternet isnonpornographic.For instance,of the top tenmostvisitedwebsites,notoneispornographic.Sothepopularityofporn,whileenormous,shouldnotbeoverstated.Yet, that said, taking a close look at how we like and share pornography
makes it clear that Facebook, Instagram, and Twitter only provide a limitedwindowintowhat’strulypopularontheinternet.Therearelargesubsetsofthewebthatoperatewithmassivepopularitybutlittlesocialpresence.The most popular video of all time, as of this writing, is Psy’s “Gangnam
Style,”agoofypopmusicvideothatsatirizestrendyKoreans.It’sbeenviewedabout 2.3 billion times on YouTube alone since its debut in 2012. And itspopularity is clear no matter what site you are on. It’s been shared acrossdifferentsocialmediaplatformstensofmillionsoftimes.Themostpopularpornographicvideoofalltimemaybe“GreatBody,Great
Sex, Great Blowjob.” It’s been viewed more than 80 million times. In otherwords,foreverythirtyviewsof“GangnamStyle,”therehasbeenaboutatleastoneviewof“GreatBody,GreatSex,GreatBlowjob.”Ifsocialmediagaveusanaccurate view of the videos people watched, “Great Body, Great Sex, GreatBlowjob”shouldbepostedmillionsoftimes.Butthisvideohasbeensharedon
socialmediaonlya fewdozen timesandalwaysbypornstars,notbyaverageusers.Peopleclearlydonotfeeltheneedtoadvertisetheirinterestinthisvideototheirfriends.Facebook is digital brag-to-my-friends-about-how-good-my-life-is serum. In
Facebookworld, theaverageadultseemstobehappilymarried,vacationingintheCaribbean,andperusing theAtlantic. In the realworld, a lotofpeopleareangry,onsupermarketcheckoutlines,peekingattheNationalEnquirer,ignoringthe phone calls from their spouse, whom they haven’t slept with in years. InFacebook world, family life seems perfect. In the real world, family life ismessy.Itcanoccasionallybesomessythatasmallnumberofpeopleevenregrethavingchildren.InFacebookworld,itseemseveryyoungadultisatacoolpartySaturdaynight. In the realworld,mostarehomealone,binge-watchingshowsonNetflix.InFacebookworld,agirlfriendpoststwenty-sixhappypicturesfromhergetawaywithherboyfriend.Intherealworld,immediatelyafterpostingthis,sheGoogles“myboyfriendwon’thavesexwithme.”And,perhapsatthesametime,theboyfriendwatches“GreatBody,GreatSex,GreatBlowjob.”
DIGITALTRUTH DIGITALLIES
•Searches •Socialmediaposts•Views •Socialmedialikes•Clicks •Datingprofiles•Swipes
THETRUTHABOUTYOURCUSTOMERS
IntheearlymorningofSeptember5,2006,Facebookintroducedamajorupdatetoitshomepage.TheearlyversionsofFacebookhadonlyalloweduserstoclickon profiles of their friends to learn what they were doing. The website,consideredabigsuccess,hadatthetime9.4millionusers.Butaftermonthsofhardwork,engineershadcreatedsomething theycalled
“NewsFeed,”whichwould provide userswith updates on the activities of alltheirfriends.Users immediately reported that they hated News Feed. Ben Parr, a
Northwestern undergraduate, created “Students Against Facebook news feed.”Hesaid that“newsfeedis just toocreepy, toostalker-esque,andafeature thathas togo.”Withinafewdays, thegrouphad700,000membersechoingParr’ssentiment. One University of Michigan junior told theMichigan Daily, “I’mreallycreepedoutbythenewFacebook.Itmakesmefeellikeastalker.”DavidKirkpatrick tells this story in his authorized account of thewebsite’s
history, The Facebook Effect: The Inside Story of the Company That IsConnectingtheWorld.HedubstheintroductionofNewsFeed“thebiggestcrisisFacebook has ever faced.” But Kirkpatrick reports that when he interviewedMarkZuckerberg,cofounderandheadoftherapidlygrowingcompany,theCEOwasunfazed.The reason? Zuckerberg had access to digital truth serum: numbers on
people’sclicksandvisitstoFacebook.AsKirkpatrickwrites:
Zuckerberginfactknewthatpeople likedtheNewsFeed,nomatterwhattheywere saying in thegroups.Hehad thedata toprove it.Peoplewerespending more time on Facebook, on average, than before News Feedlaunched.Andtheyweredoingmorethere—dramaticallymore.InAugust,usersviewed12billionpageson the service.ButbyOctober,withNewsFeedunderway,theyviewed22billion.
And that was not all the evidence at Zuckerberg’s disposal. Even the viralpopularity of the anti–News Feed group was evidence of the power of NewsFeed.Thegroupwasabletogrowsorapidlypreciselybecausesomanypeoplehadheardthattheirfriendshadjoined—andtheylearnedthisthroughtheirNewsFeed.In otherwords,while peoplewere joining in a big public uproar over how
unhappy they were about seeing all the details of their friends’ lives onFacebook, they were coming back to Facebook to see all the details of theirfriends’lives.NewsFeedstayed.Facebooknowhasmorethanonebilliondailyactiveusers.InhisbookZerotoOne,PeterThiel,anearlyinvestorinFacebook,saysthat
greatbusinessesarebuiltonsecrets,eithersecretsaboutnatureorsecretsaboutpeople. JeffSeder, asdiscussed inChapter3, found thenatural secret that leftventricle size predicted horse performance.Google found the natural secret of
howpowerfultheinformationinlinkscanbe.Thieldefines“secretsaboutpeople”as“thingsthatpeopledon’tknowabout
themselvesorthingstheyhidebecausetheydon’twantotherstoknow.”Thesekindsofbusinesses,inotherwords,arebuiltonpeople’slies.YoucouldarguethatallofFacebookisfoundedonanunpleasantsecretabout
people that Zuckerberg learned while at Harvard. Zuckerberg, early in hissophomore year, created a website for his fellow students called Facemash.Modeledonasitecalled“AmIHotorNot?,”Facemashwouldpresentpicturesof two Harvard students and then have other students judge who was betterlooking.Thesophomore’ssitewasgreetedwithoutrage.TheHarvardCrimson, inan
editorial, accusedyoungZuckerberg of “catering to theworst side” of people.HispanicandAfrican-Americangroupsaccusedhimofsexismandracism.Yet,before Harvard administrators shut down Zuckerberg’s internet access—just afewhoursafterthesitewasfounded—450peoplehadviewedthesiteandvoted22,000 timesondifferent images.Zuckerberghad learnedan important secret:people can claim they’re furious, they candecry something as distasteful, andyetthey’llstillclick.And he learned one more thing: for all their professions of seriousness,
responsibility,andrespectforothers’privacy,people,evenHarvardstudents,hadagreatinterestinevaluatingpeople’slooks.Theviewsandvotestoldhimthat.Andlater—sinceFacemashprovedtoocontroversial—hetookthisknowledgeofjusthowinterestedpeoplecouldbeinsuperficialfactsaboutotherstheysortofknewandharnesseditintothemostsuccessfulcompanyofhisgeneration.Netflix learned a similar lesson early on in its life cycle: don’t trust what
peopletellyou;trustwhattheydo.Originally, the company allowed users to create a queue of movies they
wantedtowatchinthefuturebutdidn’thavetimeforatthemoment.Thisway,whentheyhadmoretime,Netflixcouldremindthemofthosemovies.However,Netflixnoticedsomethingoddin thedata.Userswerefillingtheir
queueswithplentyofmovies.Butdays later,when theywereremindedof themoviesonthequeue,theyrarelyclicked.Whatwas theproblem?Askuserswhatmovies theyplan towatch ina few
days, and they will fill the queue with aspirational, highbrow films, such as
black-and-whiteWorldWar II documentaries or serious foreign films. A fewdayslater,however,theywillwanttowatchthesamemoviestheyusuallywanttowatch:lowbrowcomediesorromancefilms.Peoplewereconsistentlylyingtothemselves.Facedwiththisdisparity,Netflixstoppedaskingpeopletotellthemwhatthey
wanted to see in the future and started building amodel based onmillions ofclicksandviewsfromsimilarcustomers.Thecompanybegangreetingitsuserswithsuggestedlistsoffilmsbasednotonwhattheyclaimedtolikebutonwhatthedatasaidtheywerelikelytoview.Theresult:customersvisitedNetflixmorefrequentlyandwatchedmoremovies.“The algorithms know you better than you know yourself,” says Xavier
Amatriain,aformerdatascientistatNetflix.
CANWEHANDLETHETRUTH?
Youmayfindpartsofthischapterdepressing.Digitaltruthserumhasrevealedan abiding interest in judging people based on their looks; the continuedexistenceofmillionsofclosetedgaymen;ameaningfulpercentageofwomenfantasizingaboutrape;widespreadanimusagainstAfrican-Americans;ahiddenchild abuse and self-induced abortion crisis; and an outbreak of violentIslamophobic rage that only got worse when the president appealed fortolerance.Not exactly cheery stuff. Often, after I give a talk onmy research,people come up to me and say, “Seth, it’s all very interesting. But it’s sodepressing.”I can’t pretend there isn’t a darkness in some of this data. If people
consistently telluswhat they thinkwewant tohear,wewillgenerallybe toldthingsthataremorecomfortingthanthetruth.Digitaltruthserum,onaverage,willshowusthattheworldisworsethanwehavethought.Dowe need to know this? Learning aboutGoogle searches, porn data, and
whoclicksonwhatmightnotmakeyouthink,“Thisisgreat.Wecanunderstandwho we really are.” You might instead think, “This is horrible. We canunderstandwhowereallyare.”But the truth helps—and not just forMarkZuckerberg or others looking to
attractclicksorcustomers.Thereareatleastthreewaysthatthisknowledgecanimproveourlives.First, there can be comfort in knowing that you are not alone in your
insecurities and embarrassing behavior. It can be nice to know others areinsecure about their bodies. It is probably nice formany people—particularlythosewhoaren’thavingmuchsex—toknowthewholeworld isn’t fornicatinglikerabbits.AnditmaybevaluableforahighschoolboyinMississippiwithacrushon thequarterback toknow that,despite the lownumbersofopenlygaymenaroundhim,plentyofothersfeelthesamekindsofattraction.There’s another area—one I haven’t yet discussed—whereGoogle searches
canhelpshowyouarenotalone.Whenyouwereyoung,ateachermayhavetoldyouthat,ifyouhaveaquestion,youshouldraiseyourhandandaskitbecauseifyou’reconfused,othersare,too.Ifyouwereanythinglikeme,youignoredyourteacher’sadviceandsattheresilently,afraidtoopenyourmouth.Yourquestions
were too dumb, you thought; everyone else’s were more profound. Theanonymous, aggregateGoogle data can tell us once and for all how right ourteacherswere.Plentyofbasic,sub-profoundquestionslurkinotherminds,too.ConsiderthetopquestionsAmericanshadduringObama’s2014Stateofthe
Unionspeech.(Seethecolorphotoatendofthebook.)
YOU’RENOTTHEONLYONEWONDERING:TOPGOOGLEDQUESTIONSDURINGTHESTATEOFTHEUNION
HowoldisObama?WhoissittingnexttoBiden?WhyisBoehnerwearingagreentie?WhyisBoehnerorange?
Now, you might read these questions and think they speak poorly of ourdemocracy.Tobemoreconcernedabout thecolorofsomeone’s tieorhisskintone insteadof thecontentof thepresident’sspeechdoesn’t reflectwellonus.To not know who John Boehner, then the Speaker of the House ofRepresentatives,isalsodoesn’tsaymuchforourpoliticalengagement.Ipreferinsteadtothinkofsuchquestionsasdemonstratingthewisdomofour
teachers. These are the types of questions people usually don’t raise, becausetheysoundtoosilly.Butlotsofpeoplehavethem—andGooglethem.In fact, I thinkBigData cangive a twenty-first-centuryupdate to a famous
self-helpquote:“Nevercompareyourinsidestoeveryoneelse’soutsides.”A Big Data update may be: “Never compare your Google searches to
everyoneelse’ssocialmediaposts.”Compare,forexample,thewaythatpeopledescribetheirhusbandsonpublic
socialmediaandinanonymoussearches.
TOPWAYSPEOPLEDESCRIBETHEIRHUSBANDS
SOCIALMEDIAPOSTS SEARCHESthebest gay
mybestfriend ajerkamazing amazingthegreatest annoyingsocute mean
Sinceweseeotherpeople’ssocialmediapostsbutnottheirsearches,wetendtoexaggeratehowmanywomenconsistentlythinktheirhusbandsare“thebest,”“the greatest,” and “so cute.”*We tend to minimize howmany women thinktheirhusbandsare“ajerk,”“annoying,”and“mean.”Byanalyzinganonymousandaggregatedata,wemayallunderstandthatwe’renottheonlyoneswhofindmarriage, and life, difficult.Wemay learn to stop comparing our searches toeveryoneelse’ssocialmediaposts.Thesecondbenefitofdigitaltruthserumisthatitalertsustopeoplewhoare
suffering. The Human Rights Campaign has asked me to work with them inhelpingeducatemenincertainstatesaboutthepossibilityofcomingoutofthecloset.TheyarelookingtousetheanonymousandaggregateGooglesearchdatato help them decide where best to target their resources. Similarly, childprotective service agencies have contacted me to learn in what parts of thecountrytheremaybefarmorechildabusethantheyarerecording.Onesurprising topic Iwasalsocontactedabout:vaginalodors.WhenI first
wroteaboutthisintheNewYorkTimes,ofallplaces,Ididsoinanironictone.Thesectionmademe,andothers,chuckle.However, when I later explored some of themessage boards that come up
whensomeonemakesthesesearchestheyincludednumerouspostsfromyounggirlsconvincedthattheirliveswereruinedduetoanxietyaboutvaginalodor.It’snojoke.Sexedexpertshavecontactedme,askinghowtheycanbestincorporatesomeoftheinternetdatatoreducetheparanoiaamongyounggirls.WhileIfeelabitoutofmydepthonallthesematters,theyareserious,andI
believedatasciencecanhelp.The final—and, I think,most powerful—value in this digital truth serum is
indeed its ability to lead us from problems to solutions. With moreunderstanding, we might find ways to reduce the world’s supply of nastyattitudes.Let’s return to Obama’s speech about Islamophobia. Recall that every time
ObamaarguedthatpeopleshouldrespectMuslimsmore,theverypeoplehewastryingtoreachbecamemoreenraged.Googlesearches,however, reveal that therewasone line thatdid trigger the
type of response then-presidentObamamight havewanted.He said, “MuslimAmericansareourfriendsandourneighbors,ourco-workers,oursportsheroesand, yes, they are ourmen andwomen in uniform,who arewilling to die indefenseofourcountry.”After this line, for the first time inmore thanayear, the topGooglednoun
after “Muslim” was not “terrorists,” “extremists,” or “refugees.” It was“athletes,”followedby“soldiers.”And,infact,“athletes”keptthetopspotforafulldayafterward.When we lecture angry people, the search data implies that their fury can
grow. But subtly provoking people’s curiosity, giving new information, andoffering new images of the group that is stoking their rage may turn theirthoughtsindifferent,morepositivedirections.Twomonthsafterthatoriginalspeech,Obamagaveanothertelevisedspeech
on Islamophobia, this time at a mosque. Perhaps someone in the president’soffice had read Soltas’s and my Times column, which discussed what hadworkedandwhatdidn’t.Forthecontentofthisspeechwasnoticeablydifferent.Obamaspentlittletimeinsistingonthevalueoftolerance.Instead,hefocused
overwhelminglyonprovokingpeople’scuriosityandchangingtheirperceptionsofMuslimAmericans.Many of the slaves fromAfricawereMuslim,Obamatoldus;ThomasJeffersonandJohnAdamshadtheirowncopiesoftheKoran;thefirstmosqueonU.S.soilwasinNorthDakota;aMuslimAmericandesignedskyscrapers in Chicago. Obama again spoke of Muslim athletes and armedservice members but also talked of Muslim police officers and firefighters,teachersanddoctors.And my analysis of the Google searches suggests this speech was more
successful than thepreviousone.Manyof thehateful, ragefulsearchesagainstMuslimsdroppedinthehoursafterthepresident’saddress.There are other potential ways to use search data to learn what causes, or
reduces,hate.Forexample,wemightlookathowracistsearcheschangeafterablack quarterback is drafted in a city or how sexist searches change after awoman is elected to office.Wemight seehow racism responds to community
policingorhowsexismrespondstonewsexualharassmentlaws.Learningofoursubconsciousprejudicescanalsobeuseful.Forexample,we
might all make an extra effort to delight in little girls’ minds and show lessconcernwiththeirappearance.Googlesearchdataandotherwellspringsoftruthon the internet give us an unprecedented look into the darkest corners of thehuman psyche. This is at times, I admit, difficult to face. But it can also beempowering.Wecanusethedatatofightthedarkness.Collectingrichdataontheworld’sproblemsisthefirststeptowardfixingthem.
5
ZOOMINGIN
My brother, Noah, is four years younger than I. Most people, upon firstmeeting us, find us eerily similar.We both talk too loudly, are balding in thesameway,andhavegreatdifficultykeepingourapartmentstidy.Buttherearedifferences:Icountpennies.Noahbuysthebest.IloveLeonard
CohenandBobDylan.ForNoah,it’sCakeandBeck.Perhaps the most notable difference between us is our attitude toward
baseball. I am obsessed with baseball and, in particular, my love of the NewYork Mets has always been a core part of my identity. Noah finds baseballimpossiblyboring,andhishatredof thesporthas longbeenacorepartofhisidentity.*
SethStephens-Davidowitz
Baseball-o-Phile
NoahStephens-DavidowitzBaseball-o-Phobe
Howcantwoguyswithsuchsimilargenes,raisedbythesameparents,inthesame town, have such opposite feelings about baseball?What determines theadults we become?More fundamentally, what’swrong with Noah? There’s agrowing field within developmental psychology that mines massive adultdatabasesandcorrelates themwithkeychildhoodevents. It canhelpus tacklethis and related questions. We might call this increasing use of Big Data toanswerpsychologicalquestionsBigPsych.Toseehowthisworks, let’sconsiderastudyIconductedonhowchildhood
experiencesinfluencewhichbaseballteamyousupport—orwhetheryousupportanyteamatall.Forthisstudy,IusedFacebookdataon“likes”ofbaseballteams.(InthepreviouschapterInotedthatFacebookdatacanbedeeplymisleadingonsensitivetopics.Withthisstudy,Iamassumingthatnobody,notevenaPhilliesfan, is embarrassed to acknowledge a rooting interest in a particular team onFacebook.)To beginwith, I downloaded the number ofmales of every agewho “like”
eachofNewYork’stwobaseballteams.HerearethepercentthatareMetsfans,byyearofbirth.
Thehigherthepoint,themoreMetsfans.Thepopularityoftheteamrisesandfalls then rises and falls again,with theMetsbeingverypopular among thoseborn in 1962 and 1978. I’m guessing baseball fans might have an idea as towhat’sgoingonhere.TheMetshavewon just twoWorldSeries: in1969and1986. Thesemenwere roughly seven to eight years oldwhen theMetswon.Thus a huge predictor ofMets fandom, for boys at least, iswhether theMetswonaWorldSerieswhentheywerearoundtheagesevenoreight.In fact,wecanextend thisanalysis. Idownloaded informationonFacebook
showing how many fans of every age “like” every one of a comprehensiveselectionofMajorLeagueBaseballteams.I found that there are also an unusually high number of male Baltimore
Oriolesfansbornin1962andmalePittsburghPiratesfansbornin1963.Thosemen were eight-year-old boys when these teams were champions. Indeed,calculatingtheageofpeakfandomforalltheteamsIstudied,thenfiguringouthowoldthesefanswouldhavebeen,gavemethischart:
Once again we see that the most important year in a man’s life, for thepurposesofcementinghisfavoritebaseballteamasanadult,iswhenheismoreorlesseightyearsold.Overall,fivetofifteenisthekeyperiodtowinoveraboy.Winningwhenamanisnineteenortwentyisaboutone-eighthasimportant indeterminingwhohewillrootforaswinningwhenheiseight.Bythen,hewillalreadyeitherloveateamforlifeorhewon’t.Youmightbeasking,whataboutwomenbaseballfans?Thepatternsaremuch
lesssharp,butthepeakageappearstobetwenty-twoyearsold.Thisismyfavoritestudy.Itrelatestotwoofmymostbelovedtopics:baseball
and thesourcesofmyadultdiscontent. Iwasfirmlyhooked in1986andhavebeen suffering along—rooting for the Mets—ever since. Noah had the goodsensetobebornfouryearslaterandwassparedthispain.Now,baseball is not themost important topic in theworld, or somyPh.D.
advisorsrepeatedlytoldme.Butthismethodologymighthelpustacklesimilarquestions, including how people develop their political preferences, sexualproclivities, musical taste, and financial habits. (I would be particularlyinterestedontheoriginsofmybrother’swackyideasonthelattertwosubjects.)Mypredictionisthatwewillfindthatmanyofouradultbehaviorsandinterests,eventhosethatweconsiderfundamentaltowhoweare,canbeexplainedbythearbitraryfactsofwhenwewerebornandwhatwasgoingonincertainkeyyearswhilewewereyoung.Indeed, some work has already been done on the origin of political
preferences.YairGhitza,chiefscientistatCatalist,adataanalysiscompany,andAndrew Gelman, a political scientist and statistician at Columbia University,triedtotesttheconventionalideathatmostpeoplestartoutliberalandbecome
increasingly conservative as they age. This is the view expressed in a famousquoteoftenattributedtoWinstonChurchill:“Anymanwhoisunder30,andisnot a liberal, has no heart; and any man who is over 30, and is not aconservative,hasnobrains.”GhitzaandGelmanporedthroughsixtyyearsofsurveydata,takingadvantage
ofmorethan300,000observationsonvotingpreferences.Theyfound,contraryto Churchill’s claim, that teenagers sometimes tilt liberal and sometimes tiltconservative.Asdothemiddle-agedandtheelderly.These researchersdiscovered thatpoliticalviewsactually form inawaynot
dissimilar to thewayour sports teampreferencesdo.There isacrucialperiodthat imprintsonpeople for life.Between thekeyagesof fourteenand twenty-four,numerousAmericanswillformtheirviewsbasedonthepopularityofthecurrent president.ApopularRepublicanorunpopularDemocratwill influencemanyyoungadultstobecomeRepublicans.AnunpopularRepublicanorpopularDemocratputsthisimpressionablegroupintheDemocraticcolumn.Andtheseviews,inthesekeyyears,will,onaverage,lastalifetime.To see how thisworks, compareAmericans born in 1941 and those born a
decadelater.Those in the first group came of age during the presidency of Dwight D.
Eisenhower,apopularRepublican.Intheearly1960s,despitebeingunderthirty,thisgenerationstronglytiltedtowardtheRepublicanParty.AndmembersofthisgenerationhaveconsistentlytiltedRepublicanastheyhaveaged.Americans born ten years later—baby boomers—came of age during the
presidencies of John F. Kennedy, an extremely popular Democrat; Lyndon B.Johnson, an initially popular Democrat; and RichardM. Nixon, a Republicanwho eventually resigned in disgrace. Members of this generation have tiltedliberaltheirentirelives.With all this data, the researchers were able to determine the single most
importantyearfordevelopingpoliticalviews:ageeighteen.And they found that these imprint effects are substantial. Their model
estimatesthattheEisenhowerexperienceresultedinabouta10percentagepointlifetime boost forRepublicans amongAmericans born in 1941.TheKennedy,Johnson,andNixonexperiencegaveDemocratsa7percentagepointadvantageamongAmericansbornin1952.
I’vemadeitclearthatIamskepticalofsurveydata,butIamimpressedwiththelargenumberofresponsesexaminedhere.Infact,thisstudycouldnothavebeen done with one small survey. The researchers needed the hundreds ofthousands of observations, aggregated from many surveys, to see howpreferenceschangeaspeopleage.Datasizewasalsocrucialformybaseballstudy.Ineededtozoominnotonly
on fansof each teambutonpeopleof everyage.Millionsofobservationsarerequired todo thisandFacebookandotherdigitalsources routinelyoffersuchnumbers.ThisiswherethebignessofBigDatareallycomesintoplay.Youneedalotof
pixelsinaphotoinordertobeabletozoominwithclarityononesmallportionofit.Similarly,youneedalotofobservationsinadatasetinordertobeabletozoominwithclarityononesmallsubsetofthatdata—forexample,howpopulartheMetsareamongmenbornin1978.Asmallsurveyofacoupleofthousandpeoplewon’thavealargeenoughsampleofsuchmen.ThisisthethirdpowerofBigData:BigDataallowsustomeaningfullyzoom
inonsmallsegmentsofadatasettogainnewinsightsonwhoweare.Andwecanzoominonotherdimensionsbesidesage.Ifwehaveenoughdata,wecansee how people in particular towns and cities behave. And we can see howpeoplecarryonhour-by-hourorevenminute-by-minute.Inthischapter,humanbehaviorgetsitsclose-up.
WHAT’SREALLYGOINGONINOURCOUNTIES,CITIES,ANDTOWNS?
Inhindsight it’s surprising.ButwhenRajChetty, then aprofessor atHarvard,and a small research team first got a hold of a rather large dataset—allAmericans’taxrecordssince1996—theywerenotcertainanythingwouldcomeof it. The IRS had handed over the data because they thought the researchersmightbeabletouseittohelpclarifytheeffectsoftaxpolicy.TheinitialattemptsChettyandhisteammadetousethisBigDataled,infact,
to numerous dead ends. Their investigations of the consequences of state andfederaltaxpoliciesreachedmostlythesameconclusionseverybodyelsehadjustbyusing surveys.PerhapsChetty’s answers,using thehundredsofmillionsof
IRS data points, were a bit more precise. But getting the same answers aseverybody else, with a little more precision, is not a major social scienceaccomplishment.Itisnotthetypeofworkthattopjournalsareeagertopublish.Plus,organizingandanalyzingall the IRSdatawas time-consuming.Chetty
andhisteam—drowningindata—weretakingmoretimethaneverybodyelsetofindthesameanswersaseverybodyelse.ItwasbeginningtolookliketheBigDataskepticswereright.Youdidn’tneed
dataforhundredsofmillionsofAmericanstounderstandtaxpolicy;asurveyoften thousand people was plenty. Chetty and his team were understandablydiscouraged.Andthen,finally,theresearchersrealizedtheirmistake.“BigDataisnotjust
aboutdoingthesamethingyouwouldhavedonewithsurveysexceptwithmoredata,” Chetty explains. They were asking little data questions of the massivecollectionofdata theyhadbeenhanded.“BigData really shouldallowyou touse completely different designs than what you would have with a survey,”Chettyadds.“Youcan,forexample,zoominongeographies.”Inotherwords,withdataonhundredsofmillionsofpeople,Chettyandhis
team could spot patterns among cities, towns, and neighborhoods, large andsmall.As a graduate student at Harvard, I was in a seminar room when Chetty
presented his initial results using the tax records of every American. Socialscientistsreferintheirworktoobservations—howmanydatapointstheyhave.Ifasocialscientistisworkingwithasurveyofeighthundredpeople,hewouldsay, “Wehaveeighthundredobservations.” Ifhe isworkingwitha laboratoryexperimentwithseventypeople,hewouldsay,“Wehaveseventyobservations.”“Wehaveone-point-twobillionobservations,”Chettysaid,straight-faced.The
audiencegigglednervously.Chettyandhiscoauthorsbegan,inthatseminarroomandtheninaseriesof
papers,togiveusimportantnewinsightsintohowAmericaworks.Considerthisquestion:isAmericaalandofopportunity?Doyouhaveashot,
ifyourparentsarenotrich,tobecomerichyourself?The traditional way to answer this question is to look at a representative
sampleofAmericansandcomparethistosimilardatafromothercountries.Here is the data for a variety of countries on equality of opportunity. The
questionasked:what is thechance thatapersonwithparents in thebottom20percent of the income distribution reaches the top 20 percent of the incomedistribution?
CHANCESAPERSONWITHPOORPARENTSWILLBECOMERICH(SELECTEDCOUNTRIES)
UnitedStates 7.5
UnitedKingdom 9.0
Denmark 11.7
Canada 13.5
Asyoucansee,Americadoesnotscorewell.But this simple analysismisses the real story. Chetty’s team zoomed in on
geography.TheyfoundtheoddsdifferahugeamountdependingonwhereintheUnitedStatesyouwereborn.
CHANCESAPERSONWITHPOORPARENTSWILLBECOMERICH(SELECTEDPARTSOFTHEUNITEDSTATES)
SanJose,CA 12.9
Washington,DC 10.5
UnitedStatesAverage 7.5
Chicago,IL 6.5
Charlotte,NC 4.4
InsomepartsoftheUnitedStates,thechanceofapoorkidsucceedingisashigh as in any developed country in the world. In other parts of the UnitedStates, the chance of a poor kid succeeding is lower than in any developedcountryintheworld.These patterns would never be seen in a small survey, which might only
include a few people in Charlotte and San Jose, and which therefore wouldpreventyoufromzoominginlikethis.Infact,Chetty’steamcouldzoominevenfurther.Becausetheyhadsomuch
data—data on every singleAmerican—they could even zoom in on the smallgroups of people who moved from city to city to see how that might haveaffectedtheirprospects:thosewhomovedfromNewYorkCitytoLosAngeles,Milwaukee to Atlanta, San Jose to Charlotte. This allowed them to test forcausation,notjustcorrelation(adistinctionI’lldiscussinthenextchapter).And,yes, moving to the right city in one’s formative years made a significantdifference.SoisAmericaa“landofopportunity”?The answer is neither yes nor no.The answer is: someparts are, and some
partsaren’t.Astheauthorswrite,“TheU.S.isbetterdescribedasacollectionofsocieties,
some of which are ‘lands of opportunity’ with high rates of mobility acrossgenerations,andothersinwhichfewchildrenescapepoverty.”So what is it about parts of the United States where there is high income
mobility? What makes some places better at equaling the playing field, ofallowing a poor kid to have a pretty good life? Areas that spend more oneducation provide a better chance to poor kids. Places with more religiouspeople and lower crime do better. Places with more black people do worse.Interestingly, thishasaneffectonnot just theblackkidsbutonthewhitekidslivingthereaswell.Placeswithlotsofsinglemothersdoworse.Thiseffecttooholdsnotjustforkidsofsinglemothersbutforkidsofmarriedparentslivinginplaceswithlotsofsinglemothers.Someoftheseresultssuggestthatapoorkid’speersmatter.Ifhisfriendshaveadifficultbackgroundandlittleopportunity,hemaystrugglemoretoescapepoverty.ThedatatellsusthatsomepartsofAmericaarebetteratgivingkidsachance
toescapepoverty.Sowhatplacesarebestatgivingpeopleachance toescapethegrimreaper?
Weliketothinkofdeathasthegreatequalizer.Nobody,afterall,canavoidit.Notthepaupernortheking,thehomelessmannorMarkZuckerberg.Everybodydies.
Butifthewealthycan’tavoiddeath,datatellsusthattheycannowdelayit.American women in the top 1 percent of income live, on average, ten yearslonger thanAmericanwomenin thebottom1percentof income.Formen, thegapisfifteenyears.HowdothesepatternsvaryindifferentpartsoftheUnitedStates?Doesyour
lifeexpectancyvarybasedonwhereyoulive?Isthisvariationdifferentforrichandpoorpeople?Again,byzoominginongeography,RajChetty’steamfoundtheanswers.Interestingly,forthewealthiestAmericans,lifeexpectancyishardlyaffected
bywhere they live. Ifyouhaveexcessesofmoney,youcanexpect tomake itroughlyeighty-nineyearsasawomanandabouteighty-sevenyearsasaman.Rich people everywhere tend to develop healthier habits—on average, theyexercisemore,eatbetter,smokeless,andarelesslikelytosufferfromobesity.Rich people can afford the treadmill, the organic avocados, the yoga classes.AndtheycanbuythesethingsinanycorneroftheUnitedStates.Forthepoor,thestoryisdifferent.ForthepoorestAmericans,lifeexpectancy
varies tremendously depending onwhere they live. In fact, living in the rightplacecanaddfiveyearstoapoorperson’slifeexpectancy.So why do some places seem to allow the impoverished to live so much
longer?Whatattributesdocitieswherepoorpeoplelivethelongestshare?Here are four attributes of a city—three of themdo not correlatewith poor
people’slifeexpectancy,andoneofthemdoes.Seeifyoucanguesswhichonematters.
WHATMAKESPOORPEOPLEINACITYLIVEMUCHLONGER?
Thecityhasahighlevelofreligiosity.Thecityhaslowlevelsofpollution.Thecityhasahigherpercentageofresidentscoveredbyhealthinsurance.
Alotofrichpeopleliveinthecity.
Thefirstthree—religion,environment,andhealthinsurance—donotcorrelatewith longer lifespans for thepoor.Thevariable thatdoesmatter,according to
Chettyandtheotherswhoworkedonthisstudy?Howmanyrichpeopleliveinacity.Morerichpeopleinacitymeansthepoortherelivelonger.PoorpeopleinNewYorkCity,forexample,livealotlongerthanpoorpeopleinDetroit.Whyisthepresenceofrichpeoplesuchapowerfulpredictorofpoorpeople’s
life expectancy? One hypothesis—and this is speculative—was put forth byDavidCutler,oneoftheauthorsofthestudyandoneofmyadvisors.Contagiousbehaviormaybedrivingsomeofthis.There is a large amount of research showing that habits are contagious. So
poorpeople livingnear richpeoplemaypickupa lotof theirhabits.Someofthese habits—say, pretentious vocabulary—aren’t likely to affect one’s health.Others—working out—will definitely have a positive impact. Indeed, poorpeoplelivingnearrichpeopleexercisemore,smokeless,andarelesslikelytosufferfromobesity.
My personal favorite study by Raj Chetty’s team, which had access to thatmassivecollectionofIRSdata,wastheirinquiryintowhysomepeoplecheatontheirtaxeswhileothersdonot.Explainingthisstudyisabitmorecomplicated.Thekeyisknowingthat there isaneasywayforself-employedpeoplewith
one child to maximize the money they receive from the government. If youreport that you had taxable income of exactly $9,000 in a given year, thegovernment will write you a check for $1,377—that amount represents theEarned IncomeTaxCredit, a grant to supplement the earnings of theworkingpoor, minus your payroll taxes. Report any more than that, and your payrolltaxeswillgoup.Reportany less than that,and theEarned IncomeTaxCreditdrops.Ataxableincomeof$9,000isthesweetspot.And, wouldn’t you know it, $9,000 is the most common taxable income
reportedbyself-employedpeoplewithonechild.DidtheseAmericansadjusttheirworkschedulestomakesuretheyearnedthe
perfectincome?Nope.Whentheseworkerswererandomlyaudited—averyrareoccurrence—itwasalmostalwaysfoundthattheymadenowherenear$9,000—theyearnedeithersubstantiallylessorsubstantiallymore.In other words, they cheated on their taxes by pretending they made the
amountthatwouldgivethemthefattestcheckfromthegovernment.Sohowtypicalwasthistypeoftaxfraudandwhoamongtheself-employed
withonechildwasmostlikelytocommitit?Itturnsout,Chettyandcolleaguesreported, that there were huge differences across the United States in howcommonthistypeofcheatingwas.InMiami,amongpeopleinthiscategory,anastonishing30percentreportedtheymade$9,000.InPhiladelphia,just2percentdid.What predictswho is going to cheat?What is it about places that have the
greaternumberofcheatersandthosethathavelowernumbers?Wecancorrelateratesof cheatingwithother city-leveldemographics and it turnsout that therearetwostrongpredictors:ahighconcentrationofpeopleintheareaqualifyingfortheEarnedIncomeTaxCreditandahighconcentrationoftaxprofessionalsintheneighborhood.What do these factors indicate?Chetty and the authors had an explanation.
Thekeymotivatorforcheatingonyourtaxesinthismannerwasinformation.Most self-employed one-kid taxpayers simply did not know that themagic
numberforgettingabigfatcheckfromthegovernmentwas$9,000.Butlivingnear others who might—either their neighbors or tax assisters—dramaticallyincreasedtheoddsthattheywouldlearnaboutit.In fact,Chetty’s team found evenmore evidence that knowledge drove this
kindofcheating.WhenAmericansmovedfromanareawherethisvarietyoftaxfraudwaslowtoanareawhereitwashigh,theylearnedandadoptedthetrick.Through time, cheating spread from region to region throughout the UnitedStates.Likeavirus,cheatingontaxesiscontagious.Now stop for a moment and think about how revealing this study is. It
demonstratedthat,whenitcomestofiguringoutwhowillcheatontheirtaxes,thekeyisn’tdeterminingwhoishonestandwhoisdishonest.Itisdeterminingwhoknowshowtocheatandwhodoesn’t.Sowhen someone tellsyou theywouldnever cheaton their taxes, there’s a
pretty good chance that they are—you guessed it—lying. Chetty’s researchsuggeststhatmanywouldiftheyknewhow.If youwant to cheat on your taxes (and I amnot recommending this), you
shouldliveneartaxprofessionalsorliveneartaxcheaterswhocanshowyoutheway. If youwant tohavekidswho areworld-famous,where shouldyou live?This ability to zoom in on data and get really granular can help answer thisquestion,too.
Iwas curiouswhere themost successfulAmericans come from, so one day IdecidedtodownloadWikipedia.(Youcandothatsortofthingnowadays.)Withalittlecoding,Ihadadatasetofmorethan150,000Americansdeemed
byWikipedia’s editors to be notable enough to warrant an entry. The datasetincludedcountyofbirth,dateofbirth,occupation,andgender.Imergeditwithcounty-levelbirthdatagatheredbytheNationalCenterforHealthStatistics.Forevery county in the United States, I calculated the odds of making it intoWikipediaifyouwerebornthere.IsbeingprofiledinWikipediaameaningfulmarkerofnotableachievement?
Therearecertainlylimitations.Wikipedia’seditorsskewyoungandmale,whichmaybias thesample.Andsometypesofnotabilityarenotparticularlyworthy.TedBundy, for example, rates aWikipedia entry because he killed dozens ofyoungwomen.Thatsaid, Iwasable to removecriminalswithoutaffecting theresultsmuch.I limited the study to baby boomers (those born between 1946 and 1964)
becausetheyhavehadnearlyafulllifetimetobecomenotable.Roughlyonein2,058American-bornbabyboomersweredeemednotableenough towarrantaWikipedia entry. About 30 percent made it through achievements in art orentertainment,29percentthroughsports,9percentviapolitics,and3percentinacademiaorscience.The first striking fact I noticed in the data was the enormous geographic
variation in the likelihood of becoming a big success, at least onWikipedia’sterms.Yourchancesofachievingnotabilitywerehighlydependentonwhereyouwereborn.Roughly one in 1,209 baby boomers born in California reachedWikipedia.
Onlyonein4,496babyboomersborninWestVirginiadid.Zoominbycountyandtheresultsbecomemoretelling.Roughlyonein748babyboomersborninSuffolkCounty,Massachusetts,whereBostonis located,madeit toWikipedia.Insomeothercounties,thesuccessratewastwentytimeslower.Whydosomepartsofthecountryappeartobesomuchbetteratchurningout
America’smoversandshakers?Icloselyexaminedthetopcounties.Itturnsoutthatnearlyallofthemfitintooneoftwocategories.First, and this surprised me, many of these counties contained a sizable
college town. Justaboutevery time I saw thenameofacounty that Ihadnot
heardofnear the topof the list, likeWashtenaw,Michigan, I foundout that itwasdominatedbyaclassiccollegetown,inthiscaseAnnArbor.ThecountiesgracedbyMadison,Wisconsin;Athens,Georgia;Columbia,Missouri;Berkeley,California; Chapel Hill, North Carolina; Gainesville, Florida; Lexington,Kentucky;andIthaca,NewYork,areallinthetop3percent.Why is this? Some of it is may well be due to the gene pool: sons and
daughtersofprofessorsandgraduatestudentstendtobesmart(atraitthat,inthegameofbigsuccess,canbemightyuseful).And, indeed,havingmorecollegegraduatesinanareaisastrongpredictorofthesuccessofthepeoplebornthere.But there is most likely something more going on: early exposure to
innovation. One of the fields where college towns are most successful inproducingtopdogsismusic.Akidinacollegetownwillbeexposedtouniqueconcerts, unusual radio stations, and even independent record stores.And thisisn’t limited to the arts.College towns also incubatemore than their expectedshareofnotablebusinesspeople.Maybeearlyexposure tocutting-edgeart andideashelpsthem,too.The success of college towns does not just cross regions. It crosses race.
African-Americans were noticeably underrepresented on Wikipedia innonathleticfields,especiallybusinessandscience.Thisundoubtedlyhasalottodowithdiscrimination.Butonesmallcounty,wherethe1950populationwas84percentblack,producednotablebabyboomersataratenearthoseofthehighestcounties.Offewerthan13,000boomersborninMaconCounty,Alabama,fifteenmade
it toWikipedia—oronein852.Everysingleoneofthemisblack.Fourteenofthem were from the town of Tuskegee, home of Tuskegee University, ahistoricallyblack college foundedbyBookerT.Washington.The list includedjudges,writers, and scientists. In fact, a black child born inTuskegee had thesameprobabilityof becominganotable in a fieldoutsideof sports as awhitechildborninsomeofthehighest-scoring,majority-whitecollegetowns.Thesecondattributemostlikelytomakeacounty’snativessuccessfulwasthe
presenceinthatcountyofabigcity.BeingborninSanFranciscoCounty,LosAngelesCounty,orNewYorkCityallofferedamongthehighestprobabilitiesofmaking it to Wikipedia. (I grouped New York City’s five counties togetherbecausemanyWikipediaentriesdidnotspecifyaboroughofbirth.)
Urbanareastendtobewellsuppliedwithmodelsofsuccess.Toseethevalueofbeingnearsuccessfulpractitionersofacraftwhenyoung,compareNewYorkCity, Boston, and Los Angeles. Among the three, New York City producesnotablejournalistsat thehighestrate;Bostonproducesnotablescientistsat thehighest rate; and Los Angeles produces notable actors at the highest rate.Remember,weare talkingaboutpeoplewhowereborn there,notpeoplewhomoved there. And this holds true even after subtracting people with notableparentsinthatfield.Suburbancounties,unlesstheycontainedmajorcollegetowns,performedfar
worse than their urban counterparts. My parents, like many boomers, movedaway from crowded sidewalks to tree-shaded streets—in this case fromManhattan to Bergen County, New Jersey—to raise their three children. Thiswas potentially a mistake, at least from the perspective of having notablechildren.AchildborninNewYorkCityis80percentmorelikelytomakeitintoWikipediathanakidborninBergenCounty.Thesearejustcorrelations,buttheydosuggest thatgrowingupnearbigideasisbetter thangrowingupwithabigbackyard.ThestarkeffectsidentifiedheremightbeevenstrongerifIhadbetterdataon
places lived throughout childhood, since many people grow up in differentcountiesthantheonewheretheywereborn.Thesuccessofcollegetownsandbigcitiesisstrikingwhenyoujustlookat
the data. But I also delved more deeply to undertake a more sophisticatedempiricalanalysis.Doingsoshowedthattherewasanothervariablethatwasastrongpredictorof
aperson’ssecuringanentryinWikipedia:theproportionofimmigrantsinyourcountyofbirth.Thegreaterthepercentageofforeign-bornresidentsinanarea,thehigher theproportionofchildrenborn therewhogoon tonotable success.(Take that, Donald Trump!) If two places have similar urban and collegepopulations, the one with more immigrants will produce more prominentAmericans.Whatexplainsthis?Alotofitseemstobedirectlyattributabletothechildrenofimmigrants.Idid
anexhaustivesearchofthebiographiesofthehundredmostfamouswhitebabyboomers, according to the Massachusetts Institute of Technology’s Pantheonproject, which is also working with Wikipedia data. Most of these were
entertainers.Atleastthirteenhadforeign-bornmothers,includingOliverStone,SandraBullock,andJulianneMoore.Thisrate ismore than three timeshigherthan the national average during this period. (Many had fathers who wereimmigrants, including Steve Jobs and John Belushi, but this data was moredifficult to compare to national averages, since information on fathers is notalwaysincludedonbirthcertificates.)Whataboutvariablesthatdon’timpactsuccess?OnethatIfoundmorethana
littlesurprisingwashowmuchmoneyastatespendsoneducation.Instateswithsimilarpercentagesofitsresidentslivinginurbanareas,educationspendingdidnotcorrelatewithratesofproducingnotablewriters,artists,orbusinessleaders.It is interesting to compare myWikipedia study to one of Chetty’s team’s
studiesdiscussedearlier.RecallthatChetty’steamwastryingtofigureoutwhatareasaregoodatallowingpeopletoreachtheuppermiddleclass.Mystudywastryingtofigureoutwhatareasaregoodatallowingpeople toreachfame.Theresultsarestrikinglydifferent.Spendinga lotoneducationhelpskidsreachtheuppermiddleclass.Itdoes
little tohelp thembecomeanotablewriter, artist, orbusiness leader.Manyofthesehugesuccesseshatedschool.Somedroppedout.NewYorkCity,Chetty’steamfound,isnotaparticularlygoodplacetoraisea
childifyouwanttoensurehereachestheuppermiddleclass.Itisagreatplace,mystudyfound,ifyouwanttogivehimachanceatfame.Whenyou lookat the factors thatdrivesuccess, the largevariationbetween
countiesbeginstomakesense.Manycountiescombineallthemainingredientsforsuccess.Return,again, toBoston.Withnumerousuniversities, it isstewingin innovative ideas. It is an urban area with many extremely accomplishedpeopleofferingyoungstersexamplesofhowtomakeit.Anditdrawsplentyofimmigrants,whosechildrenaredriventoapplytheselessons.What if an areahasnoneof thesequalities? Is it destined toproduce fewer
superstars? Not necessarily. There is another path: extreme specialization.Roseau County, Minnesota, a small rural county with few foreigners and nomajoruniversities,isagoodexample.Roughly1in740peoplebornheremadeit intoWikipedia.Their secret?All ninewere professional hockey players, nodoubt helped by the county’s world-class youth and high school hockeyprograms.
So is thepoint here—assumingyou’re not so interested in raising a hockeystar—tomovetoBostonorTuskegeeifyouwanttogiveyourfuturechildrentheutmost advantage? It can’t hurt. But there are larger lessons here. Usually,economists and sociologists focus on how to avoid bad outcomes, such aspoverty and crime. Yet the goal of a great society is not only to leave fewerpeople behind; it is to help as many people as possible to really stand out.Perhapsthisefforttozoominontheplaceswherehundredsofthousandsofthemost famous Americans were born can give us some initial strategies:encouraging immigration, subsidizing universities, and supporting the arts,amongthem.
Usually,IstudytheUnitedStates.SowhenIthinkofzoominginbygeography,Ithinkofzoominginonourcitiesandtowns—oflookingatplaceslikeMaconCounty,Alabama,andRoseauCounty,Minnesota.Butanotherhuge—andstillgrowing—advantage of data from the internet is that it is easy to collect datafromaroundtheworld.Wecanthenseehowcountriesdiffer.Anddatascientistsgetanopportunitytotiptoeintoanthropology.One somewhat random topic I recently explored: how does pregnancy play
out in different countries around the world? I examined Google searches bypregnantwomen.ThefirstthingIfoundwasastrikingsimilarityinthephysicalsymptomsaboutwhichwomencomplain.I testedhowoftenvarioussymptomsweresearchedincombinationwiththe
word“pregnant.”Forexample,howoftenis“pregnant”searchedinconjunctionwith “nausea,” “back pain,” or “constipation”?Canada’s symptomswere veryclosetothoseintheUnitedStates.SymptomsincountrieslikeBritain,Australia,andIndiawereallroughlysimilar,too.Pregnantwomenaround theworldapparentlyalsocrave thesame things. In
theUnitedStates,thetopGooglesearchinthiscategoryis“cravingiceduringpregnancy.”The next four are salt, sweets, fruit, and spicy food. InAustralia,thosecravingsdon’tdifferallthatmuch:thelistfeaturessalt,sweets,chocolate,ice,andfruit.WhataboutIndia?Asimilarstory:spicyfood,sweets,chocolate,salt,andicecream.Infact,thetopfiveareverysimilarinallofthecountriesIlookedat.Preliminaryevidencesuggeststhatnopartoftheworldhasstumbledupona
diet or environment that drastically changes the physical experience ofpregnancy.Butthethoughtsthatsurroundpregnancymostdefinitelydodiffer.Start with questions about what pregnant women can safely do. The top
questionsintheUnitedStates:canpregnantwomen“eatshrimp,”“drinkwine,”“drinkcoffee,”or“takeTylenol”?Whenitcomestosuchconcerns,othercountriesdon’thavemuchincommon
with the United States or one another. Whether pregnant women can “drinkwine” is not among the top ten questions in Canada, Australia, or Britain.Australia’sconcernsaremostlyrelatedtoeatingdairyproductswhilepregnant,particularlycreamcheese. InNigeria,where30percentof thepopulationusestheinternet,thetopquestioniswhetherpregnantwomencandrinkcoldwater.Are these worries legitimate? It depends. There is strong evidence that
pregnantwomenareatan increased riskof listeria fromunpasteurizedcheese.Links have been established between drinking toomuch alcohol and negativeoutcomes for thechild. Insomepartsof theworld, it isbelieved thatdrinkingcoldwatercangiveyourbabypneumonia;Idon’tknowofanymedicalsupportforthis.The huge differences in questions posed around the world are most likely
causedbytheoverwhelmingfloodofinformationcomingfromdisparatesourcesineachcountry:legitimatescientificstudies,so-soscientificstudies,oldwives’tales,andneighborhoodchatter.Itisdifficultforwomentoknowwhattofocuson—orwhattoGoogle.Wecanseeanothercleardifferencewhenwelookatthetopsearchesfor“how
to___duringpregnancy?”IntheUnitedStates,Australia,andCanada,thetopsearchis“howtopreventstretchmarksduringpregnancy.”ButinGhana,India,andNigeria,preventingstretchmarksisnoteveninthetopfive.Thesecountriestendtobemoreconcernedwithhowtohavesexorhowtosleep.
Thereisundoubtedlymoretolearnfromzoominginonaspectsofhealthandculture indifferentcornersof theworld.Butmypreliminaryanalysis suggeststhatBigDatawill tellus thathumansareeven lesspowerful thanwe realizedwhen it comes to transcending our biology.Yetwe come upwith remarkablydifferentinterpretationsofwhatitallmeans.
HOWWEFILLOURMINUTESANDHOURS
“The adventures of a young man whose principal interests are rape, ultra-violence,andBeethoven.”That was how Stanley Kubrick’s controversial A Clockwork Orange was
advertised. In the movie, the fictional young protagonist, Alex DeLarge,committed shocking acts of violence with chilling detachment. In one of thefilm’smostnotoriousscenes,herapedawomanwhilebeltingout“Singin’intheRain.”Almostimmediately,therewerereportsofcopycatincidents.Indeed,agroup
ofmenrapedaseventeen-year-oldgirlwhilesingingthesamesong.Themoviewas shut down inmany European countries, and some of themore shocking
sceneswereremovedforaversionshowninAmerica.There are, in fact, many examples of real life imitating art, with men
seeminglyhypnotizedbywhat theyhad just seenon-screen.Ashowingof thegangmovieColorswasfollowedbyaviolentshooting.AshowingofthegangmovieNewJackCitywasfollowedbyriots.Perhapsmostdisturbing,fourdaysafterthereleaseofTheMoneyTrain,men
used lighter fluid to ignite a subway toll booth, almost perfectlymimicking asceneinthefilm.Theonlydifferencebetweenthefictionalandreal-worldarson:Inthemovie,theoperatorescaped.Inreallife,heburnedtodeath.There is also some evidence from psychological experiments that subjects
exposedtoaviolentfilmwillreportmoreangerandhostility,eveniftheydon’tpreciselyimitateoneofthescenes.Inotherwords,anecdotesandexperimentssuggestviolentmoviescanincite
violent behavior. But how big an effect do they really have? Are we talkingabout one or two murders every decade or hundreds of murders every year?Anecdotesandexperimentscan’tanswerthis.To see if Big Data could, two economists, Gordon Dahl and Stefano
DellaVigna,mergedtogetherthreeBigDatasetsfortheyears1995to2004:FBIhourlycrimedata,box-officenumbers,andameasureof theviolence ineverymoviefromkids-in-mind.com.Theinformationtheywereusingwascomplete—everymovieandeverycrime
committed in every hour in cities throughout the United States. This wouldproveimportant.Key to their study was the fact that on some weekends, the most popular
moviewasaviolentone—HannibalorDawnoftheDead, forexample—whileonotherweekends, themostpopularmoviewasnonviolent, suchasRunawayBrideorToyStory.The economists could see exactly how many murders, rapes, and assaults
werecommittedonweekendswhenaprominentviolentmoviewasreleasedandcompare that to the number of murders, rapes, and assaults there were onweekendswhenaprominentpeacefulmoviewasreleased.Sowhatdid theyfind?Whenaviolentmoviewasshown,didcrimerise,as
someexperimentssuggest?Ordiditstaythesame?On weekends with a popular violent movie, the economists found, crime
dropped.Youreadthatright.Onweekendswithapopularviolentmovie,whenmillions
ofAmericanswereexposedtoimagesofmenkillingothermen,crimedropped—significantly.Whenyougetaresult thisstrangeandunexpected,yourfirst thought is that
you’vedonesomethingwrong.Eachauthorcarefullywentoverthecoding.Nomistakes. Your second thought is that there is some other variable that willexplaintheseresults.Theycheckediftimeofyearaffectedtheresults.Itdidn’t.Theycollecteddataonweather,thinkingperhapssomehowthiswasdrivingtherelationship.Itwasn’t.“Wecheckedallourassumptions,everythingweweredoing,”Dahltoldme.
“Wecouldn’tfindanythingwrong.”Despite theanecdotes,despite the labevidence,andasbizarreas it seemed,
showingaviolentmoviesomehowcausedabigdropincrime.Howcouldthispossiblybe?ThekeytofiguringitoutforDahlandDellaVignawasutilizingtheirBigData
tozoomincloser.Surveydatatraditionallyprovidedinformationthatwasannualor at best perhaps monthly. If we are really lucky, we might get data for aweekend. By comparison, as we’ve increasingly been using comprehensivedatasets,ratherthansmall-samplesurveys,wehavebeenabletohomeinbythehourandeventheminute.Thishasallowedustolearnalotmoreabouthumanbehavior.Sometimes fluctuations over time are amusing, if not earth-shattering.
EPCOR, a utility company in Edmonton, Canada, reported minute-by-minutewater consumption data during the 2010 Olympic gold medal hockey matchbetween the United States and Canada, which an estimated 80 percent ofCanadianswatched.Thedatatellsusthatshortlyaftereachperiodended,waterconsumptionshotup.ToiletsacrossEdmontonwereclearlyflushing.Google searches can also be broken down by the minute, revealing some
interestingpatternsintheprocess.Forexample,searchesfor“unblockedgames”soarat8A.M.onweekdaysandstayhighthrough3P.M.,nodoubtinresponsetoschools’ attempts toblock access tomobilegameson school propertywithoutbanningstudents’cellphones.
Search rates for “weather,” “prayer,” and “news” peak before 5:30 A.M.,evidence that most people wake up far earlier than I do. Search rates for“suicide”peakat12:36A.M.andareatthelowestlevelsaround9A.M.,evidencethatmostpeoplearefarlessmiserableinthemorningthanIam.Thedata shows that thehoursbetween2 and4A.M. areprime time forbig
questions:Whatisthemeaningofconsciousness?Doesfreewillexist?Istherelifeonotherplanets?Thepopularityof thesequestions late atnightmaybearesult, in part, of cannabis use. Search rates for “how to roll a joint” peakbetween1and2A.M.And in their large dataset, Dahl and DellaVigna could look at how crime
changed by the hour on thosemoviesweekends. They found that the drop incrimewhenpopularviolentmovieswereshown—relativetootherweekends—beganintheearlyevening.Crimewaslower,inotherwords,beforetheviolentscenesevenstarted,whentheatergoersmayhavejustbeenwalkingin.Can you guesswhy?Think, first, aboutwho is likely to choose to attend a
violentmovie.It’syoungmen—particularlyyoung,aggressivemen.Think, next, about where crimes tend to be committed. Rarely in a movie
theater.Therehavebeenexceptions,mostnotablya2012premeditatedshootingin aColorado theater. But, by and large,men go to theaters unarmed and sit,silently.
Offeryoung,aggressivementhechancetoseeHannibal,andtheywillgotothemovies.Offer young, aggressivemenRunaway Bride as their option, andtheywill takeapassand insteadgoout,perhaps toabar,club,orapoolhall,wheretheincidenceofviolentcrimeishigher.Violentmovieskeeppotentiallyviolentpeopleoffthestreets.Puzzlesolved.Right?Notquiteyet.Therewasonemorestrangethinginthe
data.Theeffectsstartedrightwhen themoviesstartedshowing;however, theydidnot stop after themovie ended and the theater closed.On eveningswhereviolent movies were showing, crime was lower well into the night, frommidnightto6A.M.Even if crime was lower while the young men were in the movie theater,
shouldn’t it riseafter they leftandwereno longerpreoccupied?Theyhad justwatchedaviolentmovie,whichexperimentssaymakespeoplemoreangryandaggressive.Canyouthinkofanyexplanationsforwhycrimestilldroppedafterthemovie
ended?Aftermuch thought, the authors,whowere crime experts, had another“Aha”moment. They knew that alcohol is a major contributor to crime. TheauthorshadsatinenoughmovietheaterstoknowthatvirtuallynotheatersintheUnitedStatesserveliquor.Indeed,theauthorsfoundthatalcohol-relatedcrimesplummetedinlate-nighthoursafterviolentmovies.Of course,Dahl andDellaVigna’s resultswere limited. They could not, for
instance,testthemonths-out,lastingeffects—toseehowlongthedropincrimemight last. And it’s still possible that consistent exposure to violent moviesultimatelyleadstomoreviolence.However,theirstudydoesputtheimmediateimpactofviolentmovies,whichhasbeenthemainthemeintheseexperiments,intoperspective.Perhapsaviolentmoviedoesinfluencesomepeopleandmakethemunusuallyangryandaggressive.However,doyouknowwhatundeniablyinfluences people in a violent direction? Hanging out with other potentiallyviolentmenanddrinking.*
Thismakessensenow.Butitdidn’tmakesensebeforeDahlandDellaVignabegananalyzingpilesofdata.Onemoreimportantpointthatbecomesclearwhenwezoomin:theworldis
complicated. Actions we take today can have distant effects, most of themunintended. Ideas spread—sometimes slowly; other times exponentially, like
viruses.Peoplerespondinunpredictablewaystoincentives.Theseconnectionsandrelationships,thesesurgesandswells,cannotbetraced
with tiny surveys or traditional datamethods. Theworld, quite simply, is toocomplexandtoorichforlittledata.
OURDOPPELGANGERS
In June 2009, David “Big Papi” Ortiz looked like he was done. During theprevious half decade, Boston had fallen in love with their Dominican-bornsluggerwiththefriendlysmileandgappedteeth.He had made five consecutive All-Star games, won an MVP Award, and
helped end Boston’s eighty-six-year championship drought. But in the 2008season, at the age of thirty-two, his numbers fell off.His batting average haddropped68points,hison-basepercentage76points,hissluggingpercentage114points. And at the start of the 2009 season, Ortiz’s numbers were droppingfurther.Here’showBillSimmons,asportswriterandpassionateBostonRedSoxfan,
describedwhatwashappeningintheearlymonthsofthe2009season:“It’sclearthatDavidOrtiznolongerexcelsatbaseball. . . .Beefysluggersare likepornstars,wrestlers,NBA centers and trophywives:When it goes, it goes.”Greatsportsfanstrusttheireyes,andSimmons’seyestoldhimOrtizwasfinished.Infact,Simmonspredictedhewouldbebenchedorreleasedshortly.WasOrtizreallyfinished?Ifyou’retheBostongeneralmanager,in2009,do
you cut him?More generally, how canwe predict how a baseball playerwillperforminthefuture?Evenmoregenerally,howcanweuseBigDatatopredictwhatpeoplewilldointhefuture?A theory that will get you far in data science is this: look at what
sabermetricians (those who have used data to study baseball) have done andexpect it to spreadout toother areasofdata science.Baseballwas among thefirstfieldswithcomprehensivedatasetsonjustabouteverything,andanarmyofsmartpeoplewillingtodevotetheirlivestomakingsenseofthatdata.Now,justabouteveryfieldisthereorgettingthere.Baseballcomesfirst;everyotherfieldfollows.Sabermetricseatstheworld.The simplestway to predict a baseball player’s future is to assume hewill
continueperformingashecurrentlyis.Ifaplayerhasstruggledforthepast1.5years,youmightguessthathewillstruggleforthenext1.5years.Bythismethodology,BostonshouldhavecutDavidOrtiz.However,theremightbemorerelevantinformation.Inthe1980s,BillJames,
whomost consider the founderof sabermetrics, emphasized the importanceofage.Baseballplayers,Jamesfound,peakedearly—ataroundtheageoftwenty-seven.Teamstendedtoignorejusthowmuchplayersdeclineastheyage.Theyoverpaidforagingplayers.Bythismoreadvancedmethodology,BostonshoulddefinitelyhavecutDavid
Ortiz.But this age adjustment might miss something. Not all players follow the
same path through life. Some players might peak at twenty-three, others atthirty-two.Shortplayersmayagedifferentlyfromtallplayers,fatplayersfromskinnyplayers.Baseballstatisticiansfoundthatthereweretypesofplayers,eachfollowing a different aging path. This storywas evenworse forOrtiz: “beefysluggers”indeeddo,onaverage,peakearlyandcollapseshortlypastthirty.IfBostonconsideredhisrecentpast,hisage,andhissize,theyshould,without
adoubt,havecutDavidOrtiz.Then, in 2003, statistician Nate Silver introduced a new model, which he
calledPECOTA, topredictplayerperformance. It proved tobe thebest—and,also, the coolest. Silver searched for players’ doppelgangers. Here’s how itworks.BuildadatabaseofeveryMajorLeagueBaseballplayerever,morethan18,000men.Andincludeeverythingyouknowaboutthoseplayers:theirheight,age, and position; their home runs, batting average, walks, and strikeouts foreach year of their careers. Now, find the twenty ballplayers who look mostsimilartoOrtizrightupuntilthatpointinhiscareer—thosewhoplayedlikehedidwhenhewas24,25,26,27,28,29,30,31,32,and33.Inotherwords,findhisdoppelgangers.ThenseehowOrtiz’sdoppelgangers’careersprogressed.Adoppelgangersearchisanotherexampleofzoomingin.Itzoomsinonthe
smallsubsetofpeoplemostsimilartoagivenperson.And,aswithallzoomingin,itgetsbetterthemoredatayouhave.Itturnsout,Ortiz’sdoppelgangersgavea very different prediction for Ortiz’s future. Ortiz’s doppelgangers includedJorgePosadaandJimThome.Theseplayersstartedtheircareersabitslow;hadamazingburstsintheirlatetwenties,withworld-classpower;andthenstruggled
intheirearlythirties.SilverthenpredictedhowOrtizwoulddobasedonhowthesedoppelgangers
endedupdoing.Andhere’swhathefound:theyregainedtheirpower.Fortrophywives, Simmons may be right: when it goes, it goes. But for Ortiz’sdoppelgangers,whenitwent,itcameback.Thedoppelgangersearch,thebestmethodologyeverusedtopredictbaseball
player performance, said Boston should be patient with Ortiz. And Bostonindeed was patient with their aging slugger. In 2010, Ortiz’s average rose to.270.Hehit32homerunsandmade theAll-Star team.ThisbeganastringoffourconsecutiveAll-StargamesforOrtiz.In2013,battinginhistraditionalthirdspotinthelineup,attheageofthirty-seven,Ortizbatted.688asBostondefeatedSt. Louis, 4 games to 2, in the World Series. Ortiz was voted World SeriesMVP.*
As soon as I finished reading Nate Silver’s approach to predicting thetrajectory of ballplayers, I immediately began thinking aboutwhether Imighthaveadoppelganger,too.Doppelgangersearchesarepromisinginmanyfields,notjustathletics.Could
Ifind thepersonwhoshares themost interestswithme?Maybeif I foundthepersonmost similar to me, we could hang out.Maybe he would know somerestaurantswewouldlike.MaybehecouldintroducemetothingsIhadnoideaImighthaveanaffinityfor.A doppelganger search zooms in on individuals and even on the traits of
individuals.And,aswithallzoomingin,itgetssharperthemoredatayouhave.SupposeIsearchedformydoppelgangerinadatasetoftenorsopeople.Imightfind someone who shared my interest in books. Suppose I searched for mydoppelgangerinadatasetofathousandorsopeople.Imightfindsomeonewhohad a thing for popular physics books. But suppose I searched for mydoppelganger in a dataset of hundreds ofmillions of people.Then Imight beabletofindsomeonewhowasreally,trulysimilartome.One day, I went doppelganger hunting on social media. Using the entire
corpusofTwitter profiles, I looked for thepeopleon theplanetwhohave themostcommoninterestswithme.You can certainly tell a lot aboutmy interests fromwhom I follow onmy
Twitter account.Overall, I follow some 250 people, showingmy passions for
sports,politics,comedy,science,andmoroseJewishfolksingers.So is there anybody out there in the universewho follows all 250 of these
accounts,myTwittertwin?Ofcoursenot.Doppelgangersaren’tidenticaltous,onlysimilar.Noristhereanybodywhofollows200oftheaccountsIfollow.Oreven150.However,Idideventuallyfindanaccountthatfollowedanamazing100ofthe
accounts I follow: Country Music Radio Today. Huh? It turns out, CountryMusicRadioTodaywasabot(itnolongerexists)thatfollowed750,000Twitterprofilesinthehopethattheywouldfollowback.Ihaveanex-girlfriendwhoIsuspectwouldgetakickoutofthisresult.She
oncetoldmeIwasmorelikearobotthanahumanbeing.All joking aside, my initial finding that my doppelganger was a bot that
followed 750,000 random accounts does make an important point aboutdoppelgangersearches.Foradoppelgangersearchtobetrulyaccurate,youdon’twanttofindsomeonewhomerelylikesthesamethingsyoulike.Youalsowanttofindsomeonewhodislikesthethingsyoudislike.MyinterestsareapparentnotjustfromtheaccountsIfollowbutfromthoseI
choosenottofollow.Iaminterestedinsports,politics,comedy,andsciencebutnotfood,fashion,ortheater.MyfollowsshowthatIlikeBernieSandersbutnotElizabethWarren,SarahSilvermanbutnotAmySchumer, theNewYorker butnottheAtlantic,myfriendsNoahPopp,EmilySands,andJoshGottliebbutnotmyfriendSamAsher.(Sorry,Sam.ButyourTwitterfeedisasnooze.)Ofall200millionpeopleonTwitter,whohasthemostsimilarprofiletome?
ItturnsoutmydoppelgangerisVoxwriterDylanMatthews.Thiswaskindofaletdown, for the purposes of improving my media consumption, as I alreadyfollowMatthewsonTwitterandFacebookandcompulsivelyreadhisVoxposts.Solearninghewasmydoppelgangerhasn’treallychangedmylife.Butit’sstillprettycooltoknowthepersonmostsimilartoyouintheworld,especiallyifit’ssomeone you admire. And when I finish this book and stop being a hermit,maybe Matthews and I can hang out and discuss the writings of JamesSurowiecki.The Ortiz doppelganger search was neat for baseball fans. And my
doppelganger searchwas entertaining, at least tome.Butwhat else can thesesearchesreveal?Foronething,doppelgangersearcheshavebeenusedbymany
of the biggest internet companies to dramatically improve their offerings anduserexperience.Amazonusessomethinglikeadoppelgangersearchtosuggestwhatbooksyoumightlike.Theyseewhatpeoplesimilartoyouselectandbasetheirrecommendationsonthat.Pandoradoesthesameinpickingwhatsongsyoumightwanttolistento.And
thisishowNetflixfiguresoutthemoviesyoumightlike.Theimpacthasbeenso profound that when Amazon engineer Greg Linden originally introduceddoppelgangersearchestopredictreaders’bookpreferences,theimprovementinrecommendationswassogoodthatAmazonfounderJeffBezosgottohiskneesandshouted,“I’mnotworthy!”toLinden.Butwhat is really interestingaboutdoppelgangersearches,considering their
power,isnothowthey’recommonlybeingusednow.Itishowfrequentlytheyarenotused.Therearemajorareasoflifethatcouldbevastlyimprovedbythekindofpersonalizationthesesearchesallow.Takeourhealth,forinstance.Isaac Kohane, a computer scientist and medical researcher at Harvard, is
tryingtobringthisprincipletomedicine.Hewantstoorganizeandcollectallofour health information so that instead of using a one-size-fits-all approach,doctorscanfindpatientsjustlikeyou.Thentheycanemploymorepersonalized,morefocuseddiagnosesandtreatments.Kohaneconsidersthisanaturalextensionforthemedicalfieldandnotevena
particularlyradicalone.“Whatisadiagnosis?”Kohaneasks.“Adiagnosisreallyis a statement that you share properties with previously studied populations.When I diagnose you with a heart attack, God forbid, I say you have apathophysiology that I learned fromother peoplemeans youhave had a heartattack.”A diagnosis is, in essence, a primitive kind of doppelganger search. The
problemisthatthedatasetsdoctorsusetomaketheirdiagnosesaresmall.Thesedaysadiagnosisisbasedonadoctor’sexperiencewiththepopulationofpatientsheorshehastreatedandperhapssupplementedbyacademicpapersfromsmallpopulationsthatotherresearchershaveencountered.Aswe’veseen,though,foradoppelgangersearch to reallygetgood, itwouldhave to includemanymorecases.Here is a fieldwhere someBigData could reallyhelp.Sowhat’s taking so
long?Whyisn’t italreadywidelyused?Theproblemlieswithdatacollection.
Mostmedical reportsstillexistonpaper,buried in files,andfor those thatarecomputerized, they’reoften lockedup in incompatible formats.Weoftenhavebetter data, Kohane notes, on baseball than on health. But simple measureswould go a long way. Kohane talks repeatedly of “low-hanging fruit.” Hebelieves, for instance, that merely creating a complete dataset of children’sheight and weight charts and any diseases they might have would berevolutionaryforpediatrics.Eachchild’sgrowthpaththencouldbecomparedtoeveryotherchild’sgrowthpath.Acomputercouldfindchildrenwhowereonasimilartrajectoryandautomaticallyflaganytroublingpatterns.Itmightdetectachild’sheight levelingoffprematurely,which incertainscenarioswould likelypoint to one of two possible causes: hypothyroidism or a brain tumor. Earlydiagnosisinbothcaseswouldbeahugeboon.“Thesearerarebirds,”accordingto Kohane, “one-in-ten-thousand kind of events. Children, by and large, arehealthy. I think we could diagnose them earlier, at least a year earlier. Onehundredpercent,wecould.”JamesHeywoodisanentrepreneurwhohasadifferentapproachtodealwith
difficulties linking medical data. He created a website, PatientsLikeMe.com,where individuals can report their own information—their conditions,treatments, and side effects. He’s already had a lot of success charting thevarying courses diseases can take and how they compare to our commonunderstandingofthem.His goal is to recruit enough people, covering enough conditions, so that
people can find their health doppelganger. Heywood hopes that you can findpeopleofyourageandgender,withyourhistory,reportingsymptomssimilartoyours—andseewhathasworkedforthem.Thatwouldbeaverydifferentkindofmedicine,indeed.
DATASTORIES
Inmanywaystheactofzoominginismorevaluabletomethantheparticularfindingsofaparticularstudy,becauseitoffersanewwayofseeingandtalkingaboutlife.WhenpeoplelearnthatIamadatascientistandawriter,theysometimeswill
share some fact or survey with me. I often find this data boring—static and
lifeless.Ithasnostorytotell.Likewise, friends have tried to get me to join them in reading novels and
biographies.But these hold little interest forme aswell. I always findmyselfasking, “Would that happen in other situations? What’s the more generalprinciple?”Theirstoriesfeelsmallandunrepresentative.What I have tried to present in this book is something that, forme, is like
nothingelse.Itisbasedondataandnumbers;itisillustrativeandfar-reaching.Andyetthedataissorichthatyoucanvisualizethepeopleunderneathit.WhenwezoominoneveryminuteofEdmonton’swaterconsumption,Iseethepeoplegettingupfromtheircouchattheendoftheperiod.Whenwezoominonpeoplemoving fromPhiladelphia toMiami and starting to cheat on their taxes, I seethesepeopletalkingtotheirneighborsintheirapartmentcomplexandlearningabout the tax trick.Whenwezoominonbaseball fansofeveryage, Iseemyownchildhoodandmybrother’schildhoodandmillionsofadultmenstillcryingoverateamthatwonthemoverwhentheywereeightyearsold.Attheriskofonceagainsoundinggrandiose,Ithinktheeconomistsanddata
scientistsfeaturedinthisbookarecreatingnotonlyanewtoolbutanewgenre.WhatIhavetriedtopresentinthischapter,andmuchofthisbook,isdatasobigandsorich,allowingustozoominsoclosethat,withoutlimitingourselvestoany particular, unrepresentative human being, we can still tell complex andevocativestories.
6
ALLTHEWORLD’SALAB
February 27, 2000, started as an ordinary day on Google’s Mountain Viewcampus. The sun was shining, the bikers were pedaling, the masseuses weremassaging, the employeeswere hydratingwith cucumberwater.And then, onthisordinaryday,a fewGoogleengineershadan idea thatunlocked thesecretthattodaydrivesmuchoftheinternet.Theengineersfoundthebestwaytogetyouclicking,comingback,andstayingontheirsites.Before describing what they did, we need to talk about correlation versus
causality,ahugeissueindataanalysis—andonethatwehavenotyetadequatelyaddressed.Themedia bombard uswith correlation-based studies seemingly every day.
Forexample,wehavebeentoldthatthoseofuswhodrinkamoderateamountofalcoholtendtobeinbetterhealth.Thatisacorrelation.Does this mean drinking a moderate amount will improve one’s health—a
causation? Perhaps not. It could be that good health causes people to drink amoderateamount.Socialscientistscallthisreversecausation.Oritcouldbethatthere is an independent factor that causes both moderate drinking and goodhealth. Perhaps spending a lot of time with friends leads to both moderatealcoholconsumptionandgoodhealth.Socialscientistscallthisomitted-variablebias.How,then,canwemoreaccuratelyestablishcausality?Thegoldstandardisa
randomized,controlledexperiment.Here’show itworks.Yourandomlydivide
people into two groups. One, the treatment group, is asked to do or takesomething.Theother, the control group, is not.You then see howeachgroupresponds.Thedifferenceintheoutcomesbetweenthetwogroupsisyourcausaleffect.Forexample,totestwhethermoderatedrinkingcausesgoodhealth,youmight
randomly pick some people to drink one glass of wine per day for a year,randomly choose others to drink no alcohol for a year, and then compare thereportedhealthofbothgroups.Sincepeoplewererandomlyassignedtothetwogroups,thereisnoreasontoexpectonegroupwouldhavebetterinitialhealthorhave socialized more. You can trust that the effects of the wine are causal.Randomized,controlledexperimentsarethemosttrustedevidenceinanyfield.Ifapillcanpassarandomized,controlledexperiment,itcanbedispensedtothegeneral populace. If it cannot pass this test, it won’t make it onto pharmacyshelves.Randomizedexperimentshaveincreasinglybeenusedinthesocialsciencesas
well.EstherDuflo,aFrencheconomistatMIT,hasledthecampaignforgreateruseofexperiments indevelopmentaleconomics,a field that tries to figureoutthebestways tohelp thepoorestpeople in theworld.ConsiderDuflo’s study,with colleagues, of how to improve education in rural India,wheremore thanhalf of middle school students cannot read a simple sentence. One potentialreasonstudentsstrugglesomuchisthatteachersdon’tshowupconsistently.OnagivendayinsomeschoolsinruralIndia,morethan40percentofteachersareabsent.Duflo’s test? She and her colleagues randomly divided schools into two
groups.Inone(thetreatmentgroup),inadditiontotheirbasepay,teacherswerepaidasmallamount—50rupees,orabout$1.15—foreverydaytheyshoweduptowork. In the other, no extra payment for attendancewas given.The resultswereremarkable.Whenteacherswerepaid,teacherabsenteeismdroppedinhalf.Studenttestperformancealsoimprovedsubstantially,withthebiggesteffectsonyounggirls.Bytheendoftheexperiment,girlsinschoolswhereteacherswerepaidtocometoclasswere7percentagepointsmorelikelytobeabletowrite.AccordingtoaNewYorkerarticle,whenBillGateslearnedofDuflo’swork,
hewassoimpressedhetoldher,“Weneedtofundyou.”
THEABCSOFA/BTESTING
Sorandomizedexperimentsarethegoldstandardforprovingcausality,andtheiruse has spread through the social sciences.Which brings us back toGoogle’soffices on February 27, 2000. What did Google do on that day thatrevolutionizedtheinternet?Onthatday,a fewengineersdecided toperformanexperimentonGoogle’s
site. They randomly divided users into two groups. The treatment group wasshown twenty linkson the search resultspages.Thecontrolgroupwas shownthe usual ten.The engineers then compared the satisfaction of the two groupsbasedonhowfrequentlytheyreturnedtoGoogle.This is a revolution? It doesn’t seem so revolutionary. I already noted that
randomized experiments have been used by pharmaceutical companies andsocialscientists.Howcancopyingthembesuchabigdeal?The key point—and this was quickly realized by the Google engineers—is
that experiments in the digital world have a huge advantage relative toexperiments in the offline world. As convincing as offline randomizedexperimentscanbe,theyarealsoresource-intensive.ForDuflo’sstudy,schoolshadtobecontacted,fundinghadtobearranged,someteachershadtobepaid,and all students had to be tested. Offline experiments can cost thousands orhundredsofthousandsofdollarsandtakemonthsoryearstoconduct.Inthedigitalworld,randomizedexperimentscanbecheapandfast.Youdon’t
need to recruit and pay participants. Instead, you can write a line of code torandomly assign them to a group. You don’t need users to fill out surveys.Instead,youcanmeasuremousemovementsandclicks.Youdon’tneedtohand-code and analyze the responses.You can build a program to automatically dothat for you.You don’t have to contact anybody.You don’t even have to telluserstheyarepartofanexperiment.ThisisthefourthpowerofBigData:itmakesrandomizedexperiments,which
canfind trulycausaleffects,much,mucheasier toconduct—anytime,moreorlessanywhere,aslongasyou’reonline.IntheeraofBigDataalltheworld’salab.ThisinsightquicklyspreadthroughGoogleandthentherestofSiliconValley,
whererandomizedcontrolledexperimentshavebeenrenamed“A/Btesting.”In
2011,GoogleengineersranseventhousandA/Btests.Andthisnumberisonlyrising.IfGooglewantstoknowhowtogetmorepeopletoclickonadsontheirsites,
theymay try two shades of blue in ads—one shade forGroupA, another forGroup B. Google can then compare click rates. Of course, the ease of suchtesting can lead to overuse. Some employees felt that because testing was soeffortless,Googlewasoverexperimenting.In2009,onefrustrateddesignerquitafterGooglewentthroughforty-onemarginallydifferentshadesofblueinA/Btesting.Butthisdesigner’sstandinfavorofartoverobsessivemarketresearchhasdonelittletostopthespreadofthemethodology.Facebooknowrunsa thousandA/B testsperday,whichmeans thata small
numberofengineersatFacebookstartmorerandomized,controlledexperimentsinagivendaythantheentirepharmaceuticalindustrystartsinayear.A/B testing has spread beyond the biggest tech firms. A former Google
employee, Dan Siroker, brought this methodology to Barack Obama’s firstpresidentialcampaign,whichA/B-testedhomepagedesigns,emailpitches,anddonationforms.ThenSirokerstartedanewcompany,Optimizely,whichallowsorganizations to perform rapidA/B testing. In 2012, Optimizelywas used byObamaaswellashisopponent,MittRomney,tomaximizesign-ups,volunteers,anddonations.It’salsousedbycompaniesasdiverseasNetflix,TaskRabbit,andNewYorkmagazine.Toseehowvaluabletestingcanbe,considerhowObamausedittogetmore
people engaged with his campaign. Obama’s home page initially included apicture of the candidate and a button below the picture that invited people to“SignUp.”
Wasthisthebestwaytogreetpeople?WiththehelpofSiroker,Obama’steamcould test whether a different picture and button might get more people toactuallysignup.Wouldmorepeopleclick if thehomepage instead featuredapicture of Obama with a more solemn face?Would more people click if thebutton instead said “Join Now”? Obama’s team showed users differentcombinationsofpicturesandbuttonsandmeasuredhowmanyof themclickedthebutton.Seeifyoucanpredictthewinningpictureandwinningbutton.
PicturesTested
ButtonsTested
ThewinnerwasthepictureofObama’sfamilyandthebutton“LearnMore.”Andthevictorywashuge.Byusingthatcombination,Obama’scampaignteamestimateditgot40percentmorepeopletosignup,nettingthecampaignroughly$60millioninadditionalfunding.
WinningCombination
Thereisanothergreatbenefittothefactthatallthisgold-standardtestingcanbe done so cheap and easy: it further frees us from our reliance upon ourintuition,which,asnotedinChapter1,hasitslimitations.AfundamentalreasonforA/Btesting’simportanceisthatpeopleareunpredictable.Ourintuitionoftenfailstopredicthowtheywillrespond.WasyourintuitioncorrectonObama’soptimalwebsite?Here are some more tests for your intuition. The Boston Globe A/B-tests
headlinestofigureoutwhichonesgetthemostpeopletoclickonastory.Trytoguessthewinnersfromthesepairs:
Ipredictyougotmorethanhalfright,perhapsbyconsideringwhatyouwouldclickon.Butyouprobablydidnotguessallofthesecorrectly.Why?Whatdidyoumiss?Whatinsightsintohumanbehaviordidyoulack?
Whatlessonscanyoulearnfromyourmistakes?Weusuallyaskquestionssuchastheseaftermakingbadpredictions.But look how difficult it is to draw general conclusions from the Globe
headlines.Inthefirstheadlinetest,changingasingleword,“this”to“SnotBot,”led to a big win. This might suggest more details win. But in the secondheadline,“deflatedballs,”thedetailedterm,loses.Inthefourthheadline,“makesbank”beatsthenumber$179,000.Thismightsuggestslangtermswin.Buttheslangterm“hookupcontest”losesinthethirdheadline.ThelessonofA/Btesting,toalargedegree,istobewaryofgenerallessons.
ClarkBensonistheCEOofranker.com,anewsandentertainmentsitethatrelies
heavilyonA/B testing tochooseheadlinesandsitedesign.“At theendof theday,youcan’tassumeanything,”Bensonsays.“Testliterallyeverything.”Testing fills in gaps in our understanding of humannature.These gapswill
alwaysexist.Ifweknew,basedonourlifeexperience,whattheanswerwouldbe,testingwouldnotbeofvalue.Butwedon’t,soitis.Another reasonA/B testing is so important is that seemingly small changes
can have big effects. As Benson puts it, “I’m constantly amazed with minor,minorfactorshavingoutsizedvalueintesting.”In December 2012, Google changed its advertisements. They added a
rightward-pointingarrowsurroundedbyasquare.
Noticehowbizarrethisarrowis.Itpointsrightwardtoabsolutelynothing.Infact, when these arrows first appeared, many Google customers were critical.Whyweretheyaddingmeaninglessarrowstothead,theywondered?Well,Google is protective of its business secrets, so they don’t say exactly
howvaluable the arrowswere.But theydid say that these arrowshadwon inA/Btesting.ThereasonGoogleaddedthemisthattheygotalotmorepeopletoclick.Andthisminor,seeminglymeaninglesschangemadeGoogleandtheiradpartnersoodlesofmoney.So how can you find these small tweaks that produce outsize profits? You
have to test lotsof things,evenmany thatseemtrivial. In fact,Google’susershavenotednumeroustimesthatadshavebeenchangedatinybitonlytoreturnto their previous form. They have unwittingly become members of treatmentgroupsinA/Btests—butatthecostonlyofseeingtheseslightvariations.
CenteringExperiment(Didn’tWork)
GreenStarExperiment(Didn’tWork)
NewFontExperiment(Didn’tWork)
Theabovevariationsnevermade it to themasses.They lost.But theywerepartof theprocessofpickingwinners.The road toa clickablearrow ispavedwithuglystars,faultypositionings,andgimmickyfonts.
Itmaybefuntoguesswhatmakespeopleclick.AndifyouareaDemocrat,itmightbenicetoknowthattestinggotObamamoremoney.ButthereisadarksidetoA/Btesting.In his excellent book Irresistible, Adam Alter writes about the rise of
behavioraladdictionsincontemporarysociety.Manypeoplearefindingaspectsoftheinternetincreasinglydifficulttoturnoff.My favorite dataset, Google searches, can give us some clues as to what
people findmost addictive. According to Google,most addictions remain theonespeoplehavestruggledwithformanydecades—drugs,sex,andalcohol,forexample.But the internet isstarting tomake itspresencefelton the list—with“porn”and“Facebook”nowamongthetoptenreportedaddictions.
TOPADDICTIONSREPORTEDTOGOOGLE,2016
DrugsSexPornAlcohol
SugarLoveGamblingFacebook
A/Btestingmaybeplayingaroleinmakingtheinternetsodarnaddictive.TristanHarris,a“designethicist,”wasquoted in Irresistibleexplainingwhy
peoplehavesuchahardtimeresistingcertainsitesontheinternet:“Thereareathousandpeopleontheothersideofthescreenwhosejobitistobreakdowntheself-regulationyouhave.”AndthesepeopleareusingA/Btesting.Through testing,Facebookmay figureout thatmaking aparticular button a
particularcolorgetspeopletocomebacktotheirsitemoreoften.Sotheychangethe button to that color. Then they may figure out that a particular font getspeopletocomebacktotheirsitemoreoften.Sotheychangethetexttothatfont.Then they may figure out that emailing people at a certain time gets themcomingbacktotheirsitemoreoften.Sotheyemailpeopleatthattime.Prettysoon,Facebookbecomesasiteoptimizedtomaximizehowmuchtime
peoplespendonFacebook.Inotherwords,findenoughwinnersofA/Btestsandyouhave an addictive site. It is the type of feedback that cigarette companiesneverhad.A/Btestingisincreasinglyatoolofthegamingindustry.AsAlterdiscusses,
WorldofWarcraftA/B-testsvariousversionsofitsgame.Onemissionmightaskyoutokillsomeone.Anothermightaskyoutosavesomething.Gamedesignerscangivedifferentsamplesofplayers’differentmissionsandthenseewhichoneskeepmorepeopleplaying.Theymight find, forexample, that themission thataskedyou tosaveapersongotpeople to return30percentmoreoften. If theytestmany,manymissions, theystartfindingmoreandmorewinners.These30percentwinsaddup,untiltheyhaveagamethatkeepsmanyadultmenholedupintheirparents’basement.Ifyouarealittledisturbedbythis,Iamwithyou.Andwewilltalkabitmore
abouttheethicalimplicationsofthisandotheraspectsofBigDataneartheendofthisbook.Butforbetterorworse,experimentationisnowacrucialtoolinthedatascientists’ toolkit.Andthere isanotherformofexperimentationsittinginthattoolkit.Ithasbeenusedtoaskavarietyofquestions,includingwhetherTVadsreallywork.
NATURE’SCRUEL—BUTENLIGHTENING—EXPERIMENTS
It’sJanuary22,2012,and theNewEnglandPatriotsareplaying theBaltimoreRavensintheAFCChampionshipgame.There’saminuteleftinthegame.TheRavensaredown,butthey’vegotthe
ball.Thenext sixty secondswill determinewhich teamwill play in theSuperBowl. The next sixty seconds will help seal players’ legacies. And the lastminute of this game will do something that, for an economist, is far moreprofound: the last sixty secondswill help finally tell us, once and for all,Doadvertisementswork?Thenotionthatadsimprovesalesisobviouslycrucialtooureconomy.Butit
ismaddeninglyhardtoprove.Infact,thisisatextbookexampleofexactlyhowdifficultitistodistinguishbetweencorrelationandcausation.There’snodoubt thatproducts that advertise themost alsohave thehighest
sales. TwentiethCentury Fox spent $150millionmarketing themovieAvatar,whichbecamethehighest-grossingfilmofall time.Buthowmuchof the$2.7billioninAvatarticketsaleswasduetotheheavymarketing?Partofthereason20thCenturyFoxspentsomuchmoneyonpromotionwaspresumablythattheyknewtheyhadadesirableproduct.Firmsbelievetheyknowhoweffectivetheiradsare.Economistsareskeptical
theyreallydo.UniversityofChicagoeconomicsprofessorStevenLevitt,whilecollaboratingwith an electronics company, was underwhelmedwhen the firmtried to convince him they knew how much their ads worked. How, Levittwondered,couldtheybesoconfident?Thecompanyexplainedthat,everyyear,inthedaysprecedingFather’sDay,
they ramp up their TV ad spending. Sure enough, every year, before Father’sDay,theyhavethehighestsales.Uh,maybethat’sjustbecausealotofkidsbuyelectronics for their dads, particularly for Father’s Day gifts, regardless ofadvertising.“They got the causality completely backwards,” saysLevitt in a lecture.At
leasttheymighthave.Wedon’tknow.“It’sareallyhardproblem,”Levittadds.As important as this problem is to solve, firms are reluctant to conduct
rigorous experiments. Levitt tried to convince the electronics company toperform a randomized, controlled experiment to precisely learn how effectivetheirTVadswere.SinceA/Btestingisn’tpossibleontelevisionyet,thiswouldrequireseeingwhathappenswithoutadvertisinginsomeareas.
Here’s how the firm responded: “Are you crazy?We can’t not advertise intwentymarkets.TheCEOwouldkillus.”ThatendedLevitt’scollaborationwiththecompany.WhichbringsusbacktothisPatriots-Ravensgame.Howcantheresultsofa
footballgamehelpusdeterminethecausaleffectsofadvertising?Well,itcan’ttellustheeffectsofaparticularadcampaignfromaparticularcompany.Butitcan give evidence on the average effects of advertisements from many largecampaigns.Itturnsout,thereisahiddenadvertisingexperimentingameslikethis.Here’s
how it works. By the time these championship games are played, companieshave purchased, and produced, their Super Bowl advertisements. Whenbusinessesdecidewhichads to run, theydon’tknowwhich teamswillplay inthegame.But the results of the playoffs will have a huge impact on who actually
watchestheSuperBowl.Thetwoteamsthatultimatelyqualifywillbringwiththem an enormous amount of viewers. If New England, which plays nearBoston,wins,farmorepeopleinBostonwillwatchtheSuperBowlthanfolksinBaltimore.Andviceversa.To the firms, it is theequivalentof acoin flip todeterminewhether tensof
thousands of extra people in Baltimore or Boston will be exposed to theiradvertisement, a flip that will happen after their spots are purchased andproduced.Now, back to the field, where Jim Nantz on CBS is announcing the final
resultsofthisexperiment.
HerecomesBillyCundiff,totiethisgame,and,inalllikelihood,sendittoovertime.The last twoyears,sixteenofsixteenonfieldgoals.Thirty-twoyardstotieit.Andthekick.Lookout!Lookout!It’snogood....AndthePatriots take the knee and will now take the journey to Indianapolis.They’reheadingtoSuperBowlForty-Six.
Two weeks later, Super Bowl XLVI would score a 60.3 audience share inBoston and a 50.2 share in Baltimore. Sixty thousandmore people in Bostonwouldwatchthe2012advertisements.Thenextyear, thesame two teamswouldmeet for theAFCChampionship.
This time, Baltimore would win. The extra ad exposures for the 2013 SuperBowladvertisementswouldbeseeninBaltimore.
Hal Varian, chief economist at Google; Michael D. Smith, economist atCarnegieMellon; and I used these two games and all the other Super Bowlsfrom 2004 to 2013 to test whether—and, if so, howmuch—Super Bowl adswork.SpecificallywelookedatwhetherwhenacompanyadvertisesamovieintheSuperBowl,theyseeabigjumpinticketsalesinthecitiesthathadhigherviewershipforthegame.They indeed do. People in cities of teams that qualify for the Super Bowl
attend movies that were advertised during the Super Bowl at a significantlyhigher rate than do those in cities of teams that justmissed qualifying.Morepeopleinthosecitiessawthead.Morepeopleinthosecitiesdecidedtogotothefilm.One alternative explanationmight be that having a team in theSuperBowl
makesyoumorelikelytogoseemovies.However,wetestedagroupofmoviesthat had similar budgets and were released at similar times but that did notadvertiseintheSuperBowl.TherewasnoincreasedattendanceinthecitiesoftheSuperBowlteams.Okay, as you might have guessed, advertisements work. This isn’t too
surprising.But it’s not just that they work. The ads were incredibly effective. In fact,
when we first saw the results, we double- and triple- and quadruple-checkedthem to make sure they were right—because the returns were so large. Theaveragemovie in our sample paid about $3million for a SuperBowl ad slot.They got $8.3 million in increased ticket sales, a 2.8-to-1 return on theirinvestment.Thisresultwasconfirmedbytwoothereconomists,WesleyR.Hartmannand
Daniel Klapper, who independently and earlier came up with a similar idea.These economists studied beer and soft drink ads run during the SuperBowl,
whilealsoutilizingtheincreasedadexposuresinthecitiesofteamsthatqualify.Theyfounda2.5-to-1returnoninvestment.AsexpensiveastheseSuperBowladsare,ourresultsandtheirssuggesttheyaresoeffectiveinuppingdemandthatcompaniesareactuallydramaticallyunderpayingforthem.And what does all of this mean for our friends back at the electronics
companyLevitt hadworkedwith? It’s possible that SuperBowl ads aremorecost-effective than other forms of advertising. But at the very least our studydoessuggestthatallthatFather’sDayadvertisingisprobablyagoodidea.
One virtue of the Super Bowl experiment is that it wasn’t necessary tointentionallyassignanyonetotreatmentorcontrolgroups.Ithappenedbasedonthe lucky bounces in a football game. It happened, in other words, naturally.Why is that an advantage? Because nonnatural, randomly controlledexperiments,whilesuper-powerfulandeasiertodointhedigitalage,stillarenotalwayspossible.Sometimes we can’t get our act together in time. Sometimes, as with that
electronicscompanythatdidn’twant torunanexperimentonitsadcampaign,wearetooinvestedintheanswertotestit.Sometimesexperimentsareimpossible.Supposeyouareinterestedinhowa
countryresponds to losinga leader.Does itgo towar?Does itseconomystopfunctioning? Does nothing much change? Obviously, we can’t just kill asignificantnumberofpresidentsandprimeministersandseewhathappens.Thatwould be not only impossible but immoral. Universities have built up, overmanydecades, institutional reviewboards (IRBs) that determine if a proposedexperimentisethical.Soifwewanttoknowcausaleffectsinacertainscenarioanditisunethicalor
otherwiseunfeasibletodoanexperiment,whatcanwedo?Wecanutilizewhateconomists—defining nature broadly enough to include football games—callnaturalexperiments.Forbetterorworse(okay,clearlyworse),thereisahugerandomcomponent
tolife.Nobodyknowsforsurewhatorwhoisinchargeoftheuniverse.Butonething is clear:whoever is running the show—the lawsof quantummechanics,God,apimplykidinhisunderwearsimulatingtheuniverseonhiscomputer—they,She,orheisnotgoingthroughIRBapproval.
Natureexperimentsonusallthetime.Twopeoplegetshot.Onebulletstopsjustshortofavitalorgan.Theotherdoesn’t.Thesebadbreaksarewhatmakelifeunfair.But,ifitisanyconsolation,thebadbreaksdomakelifealittleeasierforeconomiststostudy.Economistsusethearbitrarinessoflifetotestforcausaleffects.Of forty-three American presidents, sixteen have been victims of serious
assassinationattempts, and fourhavebeenkilled.The reasons that some livedwereessentiallyrandom.CompareJohnF.KennedyandRonaldReagan.Bothmenhadbulletsheaded
directly for theirmost vulnerable body parts. JFK’s bullet exploded his brain,killing him shortly afterward.Reagan’s bullet stopped centimeters short of hisheart,allowingdoctors tosavehis life.Reagan lived,whileJFKdied,withnorhymeorreason—justluck.Theseattemptsonleaders’livesandthearbitrarinesswithwhichtheyliveor
dieissomethingthathappensthroughouttheworld.CompareAkhmadKadyrov,ofChechyna,andAdolfHitler,ofGermany.Bothmenhavebeen inchesawayfromafullyfunctioningbomb.Kadyrovdied.Hitlerhadchangedhisschedule,woundupleavingthebooby-trappedroomafewminutesearlytocatchatrain,andthussurvived.Andwecanusenature’scoldrandomness—killingKennedybutnotReagan
—toseewhathappens,onaverage,whenacountry’sleaderisassassinated.Twoeconomists,BenjaminF.JonesandBenjaminA.Olken,didjustthat.Thecontrolgroup here is any country in the years immediately after a near-missassassination—forexample, theUnitedStates in themid-1980s.The treatmentgroupisanycountryintheyearsimmediatelyafteracompletedassassination—forexample,theUnitedStatesinthemid-1960s.What, then, is the effect of having your leadermurdered? Jones andOlken
found that successful assassinations dramatically alter world history, takingcountrieson radicallydifferentpaths.Anew leadercausespreviouslypeacefulcountriestogotowarandpreviouslywarringcountriestoachievepeace.Anewleadercauseseconomicallyboomingcountriestostartbustingandeconomicallybustingcountriestostartbooming.Infact,theresultsofthisassassination-basednaturalexperimentoverthrewa
few decades of conventional wisdom on how countries function. Many
economistspreviouslyleanedtowardtheviewthatleaderslargelywereimpotentfigureheads pushed around by external forces.Not so, according to Jones andOlken’sanalysisofnature’sexperiment.Manywouldnotconsiderthisexaminationofassassinationattemptsonworld
leaders an example of Big Data. The number of assassinated or almostassassinatedleadersinthestudywascertainlysmall—aswasthenumberofwarsthat did or did not result.The economic datasets necessary to characterize thetrajectoryofaneconomywerelargebutforthemostpartpredatedigitalization.Nonetheless,suchnaturalexperiments—thoughnowusedalmostexclusively
byeconomists—arepowerfulandwill takeon increasing importance inanerawithmore,better,andlargerdatasets.Thisisatoolthatdatascientistswillnotlongforgo.Andyes,asshouldbeclearbynow,economistsareplayingamajorroleinthe
development of data science. At least I’d like to think so, since that wasmytraining.
Whereelsecanwe findnatural experiments—inotherwords, situationswheretherandomcourseofeventsplacespeopleintreatmentandcontrolgroups?The clearest example is a lottery,which iswhy economists love them—not
playing them,whichwe find irrational,but studying them. IfaPing-Pongballwithathreeonitrisestothetop,Mr.Joneswillberich.Ifit’saballwithasixinstead,Mr.Johnsonwillbe.To test the causal effects ofmonetarywindfalls, economists compare those
whowinlotteriestothosewhobuyticketsbutlose.Thesestudieshavegenerallyfoundthatwinningthelotterydoesnotmakeyouhappyintheshortrunbutdoesinthelongrun.*
Economistscanalsoutilizetherandomnessoflotteriestoseehowone’slifechangeswhenaneighborgetsrich.Thedatashowsthatyourneighborwinningthe lottery can have an impact on your own life. If your neighbor wins thelottery, for example, you are more likely to buy an expensive car, such as aBMW.Why?Almostcertainly,economistsmaintain,thecauseisjealousyafteryour richer neighbor purchased his own expensive car. Chalk it up to humannature.IfMr.JohnsonseesMr.Jonesdrivingabrand-newBMW,Mr.Johnsonwantsone,too.
Unfortunately, Mr. Johnson often can’t afford this BMW, which is whyeconomistsfoundthatneighborsoflotterywinnersaresignificantlymorelikelytogobankrupt.KeepingupwiththeJoneses,inthisinstance,isimpossible.But natural experiments don’t have to be explicitly random, like lotteries.
Onceyoustartlookingforrandomness,youseeiteverywhere—andcanuseittounderstandhowourworldworks.Doctors are part of a natural experiment. Every once in a while, the
government, for essentially arbitrary reasons, changes the formula it uses toreimbursephysiciansforMedicarepatients.Doctors insomecountiessee theirfeesforcertainproceduresrise.Doctorsinothercountiesseetheirfeesdrop.Twoeconomists—JeffreyClemensandJoshuaGottlieb,aformerclassmateof
mine—tested the effects of this arbitrary change. Do doctors always givepatientsthesamecare,thecaretheydeemmostnecessary?Oraretheydrivenbyfinancialincentives?Thedataclearlyshowsthatdoctorscanbemotivatedbymonetaryincentives.
Incountieswithhigherreimbursements,somedoctorsordersubstantiallymoreof the better-reimbursed procedures—more cataract surgeries, colonoscopies,andMRIs,forexample.And then, thebigquestion:do their patients farebetter aftergetting all this
extra care? Clemens and Gottlieb reported only “small health impacts.” Theauthors found no statistically significant impact on mortality. Give strongerfinancial incentives to doctors to order certain procedures, this naturalexperiment suggests, and some will order more procedures that don’t makemuchdifferenceforpatients’healthanddon’tseemtoprolongtheirlives.Natural experiments can help answer life-or-death questions. They can also
helpwithquestionsthat,tosomeyoungpeople,feellikelife-or-death.
StuyvesantHighSchool(knownas“Stuy”)ishousedinaten-floor,$150milliontan,brickbuildingoverlookingtheHudsonRiver,afewblocksfromtheWorldTradeCenter,inlowerManhattan.Stuyis,inaword,impressive.Itoffersfifty-fiveAdvancedPlacement(AP)classes,sevenlanguages,andelectivesinJewishhistory, science fiction, andAsian-American literature.Roughlyone-quarterofits graduates are accepted to an Ivy League or similarly prestigious college.Stuyvesant trained Harvard physics professor Lisa Randall, Obama strategist
DavidAxelrod,AcademyAward–winningactorTimRobbins,andnovelistGaryShteyngart.ItscommencementspeakershaveincludedBillClinton,KofiAnnan,andConanO’Brien.TheonlythingmoreremarkablethanStuyvesant’sofferingsandgraduatesis
itscost:zerodollars.Itisapublichighschoolandprobablythecountry’sbest.Indeed,arecentstudyused27millionreviewsby300,000studentsandparentstorankeverypublichighschoolintheUnitedStates.Stuyrankednumberone.Itis no wonder, then, that ambitious, middle-class New York parents and theirequallyambitiousprogenycanbecomeobsessedwithStuy’sbrand.ForAhmedYilmaz,* the son of an insurance agent and teacher in Queens,
Stuywas“thehighschool.”“Working-class and immigrant families see Stuy as a way out,” Yilmaz
explains. “If your kid goes to Stuy, he is going to go to a legit, top-twentyuniversity.Thefamilywillbeokay.”SohowcanyougetintoStuyvesantHighSchool?Youhavetoliveinoneof
the fiveboroughsofNewYorkCity and score above a certainnumberon theadmissionexam.That’sit.Norecommendations,noessay,nolegacyadmission,noaffirmative action.Oneday,one test, one score. If yournumber is aboveacertainthreshold,you’rein.Each November, approximately 27,000 New York youngsters sit for the
admissionexam.Thecompetitionisbrutal.Fewerthan5percentof thosewhotakethetestgetintoStuy.Yilmazexplainsthathismotherhad“workedherassoff”andputwhatlittle
money she had into his preparation for the test. Aftermonths spending everyweekdayafternoonandfullweekendspreparing,Yilmazwasconfidenthewouldget into Stuy. He still remembers the day he received the envelope with theresults.Hemissedbytwoquestions.Iaskedhimwhatitfeltlike.“Whatdoesitfeellike,”heresponded,“tohave
yourworldfallapartwhenyou’reinmiddleschool?”Hisconsolationprizewashardly shabby—BronxScience, anotherexclusive
and highly ranked public school.But itwas not Stuy.AndYilmaz felt BronxSciencewasmoreaspecialtyschoolmeantfortechnicalpeople.Fouryearslater,hewas rejected fromPrinceton.He attendedTufts and has shuffled through afewcareers.Todayhe is a reasonably successful employeeat a techcompany,
althoughhesayshisjobis“mind-numbing”andnotaswellcompensatedashe’dlike.Morethanadecadelater,Yilmazadmitsthathesometimeswondershowlife
wouldhaveplayedouthadhegonetoStuy.“Everythingwouldbedifferent,”hesays.“Literally,everyoneIknowwouldbedifferent.”HewondersifStuyvesantHigh School would have led him to higher SAT scores, a university likePrincetonorHarvard(bothofwhichheconsiderssignificantlybetterthanTufts),andperhapsmorelucrativeorfulfillingemployment.Itcanbeanythingfromentertainingtoself-tortureforhumanbeingstoplay
outhypotheticals.WhatwouldmylifebelikeifImadethemoveonthatgirlorthat boy? If I took that job? If Iwent to that school?But thesewhat-ifs seemunanswerable. Life is not a video game. You can’t replay it under differentscenariosuntilyougettheresultsyouwant.Milan Kundera, the Czech-born writer, has a pithy quote about this in his
novelTheUnbearableLightnessofBeing: “Human lifeoccursonlyonce, andthereasonwecannotdeterminewhichofourdecisionsaregoodandwhichbadisthatinagivensituationwecanmakeonlyonedecision;wearenotgrantedasecond,thirdorfourthlifeinwhichtocomparevariousdecisions.”Yilmazwillneverexperiencea life inwhichhesomehowmanaged toscore
twopointshigheronthattest.Butperhapsthere’sawaywecangainsomeinsightonhowdifferenthislife
mayormaynothavebeenbydoingastudyoflargenumbersofStuyvesantHighSchoolstudents.Theblunt,naïvemethodologywouldbetocompareallthestudentswhowent
toStuyvesantandallthosewhodidnot.WecouldanalyzehowtheyperformedonAPtestsandSATs—andwhatcollegestheywereacceptedinto.Ifwedidthis,we would find that students who went to Stuyvesant score much higher onstandardized tests and get accepted to substantially better universities. But aswe’ve seen already in this chapter, this kind of evidence, by itself, is notconvincing.MaybethereasonStuyvesantstudentsperformsomuchbetteristhatStuy attractsmuch better students in the first place.Correlation here does notprovecausation.TotestthecausaleffectsofStuyvesantHighSchool,weneedtocomparetwo
groupsthatarealmostidentical:onethatgottheStuytreatmentandonethatdid
not.Weneedanaturalexperiment.Butwherecanwefindit?Theanswer:students, likeYilmaz,whoscoredvery,veryclose to thecutoff
necessary to attend Stuyvesant.* Students who just missed the cutoff are thecontrolgroup;studentswhojustmadethecutarethetreatmentgroup.There is little reason to suspect students on either side of the cutoff differ
muchintalentordrive.What,afterall,causesapersontoscorejustapointortwo higher on a test than another? Maybe the lower-scoring one slept tenminutestoolittleoratealessnutritiousbreakfast.Maybethehigher-scoringonehadrememberedaparticularlydifficultwordonthetestfromaconversationshehadwithhergrandmotherthreeyearsearlier.Infact,thiscategoryofnaturalexperiments—utilizingsharpnumericalcutoffs
—is so powerful that it has its own name among economists: regressiondiscontinuity. Anytime there is a precise number that divides people into twodifferent groups—a discontinuity—economists can compare—or regress—theoutcomesofpeoplevery,veryclosetothecutoff.Twoeconomists,M.KeithChenandJesseShapiro,tookadvantageofasharp
cutoffusedby federalprisons to test theeffectsof roughprisonconditionsonfuturecrime.FederalinmatesintheUnitedStatesaregivenascore,basedonthenature of their crime and their criminal history. The score determines theconditionsoftheirprisonstay.Thosewithahighenoughscorewillgotoahigh-security correctional facility,whichmeans less contactwith other people, lessfreedomofmovement,andlikelymoreviolencefromguardsorotherinmates.Again, itwouldnotbe fair to compare the entireuniverseofprisonerswho
went to high-security prisons to the entire universe of prisoners who went tolow-security prisons. High-security prisons will include more murderers andrapists,low-securityprisonsmoredrugoffendersandpettythieves.But those right above or right below the sharp numerical threshold had
virtually identical criminal histories and backgrounds. This onemeasly point,however,meantaverydifferentprisonexperience.The result? The economists found that prisoners assigned to harsher
conditions were more likely to commit additional crimes once they left. Thetoughprisonconditions, rather thandeterring themfromcrime,hardened themandmadethemmoreviolentoncetheyreturnedtotheoutsideworld.So what did such a “regression discontinuity” show for Stuyvesant High
School? A team of economists from MIT and Duke—Atila Abdulkadirog˘lu,Joshua Angrist, and Parag Pathak—performed the study. They compared theoutcomesofNewYorkpupilsonbothsidesofthecutoff.Inotherwords,theseeconomistslookedathundredsofstudentswho,likeYilmaz,missedStuyvesantbyaquestionor two.Theycompared themtohundredsofstudentswhohadabetter testdayandmadeStuybyaquestionor two.TheirmeasuresofsuccesswereAP scores, SAT scores, and the rankings of the colleges they eventuallyattended.Theirstunningresultsweremadeclearbythetitletheygavethepaper:“Elite
Illusion.” The effects of Stuyvesant High School? Nil. Nada. Zero. Bupkus.StudentsoneithersideofthecutoffendedupwithindistinguishableAPscoresand indistinguishable SAT scores and attended indistinguishably prestigiousuniversities.The entire reason that Stuy students achieve more in life than non-Stuy
students, the researchersconcluded, is thatbetter studentsattendStuyvesant inthefirstplace.StuydoesnotcauseyoutoperformbetteronAPtests,dobetteronyourSATs,orendupatabettercollege.“Theintensecompetitionforexamschoolseats,”theeconomistswrote,“does
notappeartobejustifiedbyimprovedlearningforabroadsetofstudents.”Whymightitnotmatterwhichschoolyougoto?Somemorestoriescanhelp
getattheanswer.Considertwomorestudents,SarahKaufmannandJessicaEng,twoyoungNewYorkerswhobothdreamedfromanearlyageofgoingtoStuy.Kaufmann’sscorewasjustonthecutoff;shemadeitbyonequestion.“Idon’tthinkanythingcouldbethatexcitingagain,”Kaufmannrecalls.Eng’sscorewasjustbelowthecutoff;shemissedbyonequestion.Kaufmannwenttoherdreamschool,Stuy.Engdidnot.Sohowdidtheirlivesendup?Bothhavesincehadsuccessful,andrewarding,
careers—asdomostpeoplewhoscoreinthetop5percentofallNewYorkersontests.Eng,ironically,enjoyedherhighschoolexperiencemore.BronxScience,where she attended,was the only high schoolwith aHolocaustmuseum.EngdiscoveredshelovedcurationandstudiedanthropologyatCornell.Kaufmann felt a little lost in Stuy,where studentswere heavily focused on
gradesandshefelttherewastoomuchemphasisontesting,notonteaching.Shecalledherexperience“definitelyamixedbag.”Butitwasalearningexperience.
Sherealized,forcollege,shewouldonlyapplytoliberalartsschools,whichhadmore emphasis on teaching. She got accepted to her dream school,WesleyanUniversity.Thereshefoundapassionforhelpingothers,andsheisnowapublicinterestlawyer.Peopleadapt to theirexperience,andpeoplewhoaregoing tobesuccessful
findadvantagesinanysituation.Thefactorsthatmakeyousuccessfulareyourtalent and your drive.They are notwho gives your commencement speech orotheradvantagesthatthebiggestname-brandschoolsoffer.Thisisonlyonestudy,anditisprobablyweakenedbythefactthatmostofthe
studentswhojustmissedtheStuyvesantcutoffendedupatanotherfineschool.But there isgrowingevidence that,whilegoing toagoodschool is important,thereislittlegainedfromgoingtothegreatestpossibleschool.Take college.Does itmatter if yougo to oneof the best universities in the
world,suchasHarvard,orasolidschoolsuchasPennState?Onceagain, there is a clear correlationbetween the rankingofone’s school
and howmuchmoney peoplemake. Ten years into their careers, the averagegraduateofHarvardmakes$123,000.TheaveragegraduateofPennStatemakes$87,800.Butthiscorrelationdoesnotimplycausation.Two economists, StacyDale andAlanB.Krueger, thought of an ingenious
waytotestthecausalroleofeliteuniversitiesonthefutureearningpotentialoftheirgraduates.Theyhadalargedatasetthattrackedawholehostofinformationon high school students, including where they applied to college, where theywereacceptedtocollege,wheretheyattendedcollege,theirfamilybackground,andtheirincomeasadults.To get a treatment and control group,Dale andKrueger compared students
with similar backgrounds who were accepted by the same schools but chosedifferent ones. Some students who got into Harvard attended Penn State—perhapstobenearertoagirlfriendorboyfriendorbecausetherewasaprofessortheywantedtostudyunder.Thesestudents,inotherwords,werejustastalented,according to admissions committees, as thosewhowent toHarvard. But theyhaddifferenteducationalexperiences.Sowhen twostudents, fromsimilarbackgrounds,bothgot intoHarvardbut
one chose Penn State, what happened? The researchers’ results were just as
stunning as those on Stuyvesant High School. Those students ended up withmoreor less thesame incomes in theircareers. If futuresalary is themeasure,similarstudentsaccepted tosimilarlyprestigiousschoolswhochoose toattenddifferentschoolsendupinaboutthesameplace.Our newspapers are peppered with articles about hugely successful people
whoattendedIvyLeagueschools:peoplelikeMicrosoftfounderBillGatesandFacebook founders Mark Zuckerberg and Dustin Moskovitz, all of whomattended Harvard. (Granted, they all dropped out, raising additional questionsaboutthevalueofanIvyLeagueeducation.)Therearealsostoriesofpeoplewhoweretalentedenoughtogetacceptedto
an Ivy League school, chose to attend a less prestigious school, and hadextremely successful lives: people like Warren Buffett, who started at theWharton School at the University of Pennsylvania, an Ivy League businessschool, but transferred to the University of Nebraska–Lincoln because it wascheaper,hehatedPhiladelphia,andhethoughttheWhartonclasseswereboring.The data suggests, earnings-wise at least, that choosing to attend a lessprestigiousschoolisafinedecision,forBuffettandothers.
ThisbookiscalledEverybodyLies.By this, Imostlymean thatpeople lie—tofriends,tosurveys,andtothemselves—tomakethemselveslookbetter.Buttheworldalsoliestousbypresentinguswithfaulty,misleadingdata.The
world shows us a huge number of successful Harvard graduates but fewersuccessfulPennStategraduates,andweassumethatthereisahugeadvantagetogoingtoHarvard.By cleverly making sense of nature’s experiments, we can correctly make
senseoftheworld’sdata—tofindwhat’sreallyusefulandwhatisnot.
Natural experiments relate to theprevious chapter, aswell.Theyoften requirezoomingin—onthetreatmentandcontrolgroups: thecities in theSuperBowlexperiment,thecountiesintheMedicarepricingexperiment,thestudentsclosetothecutoffintheStuyvesantexperiment.Andzoomingin,asdiscussedinthepreviouschapter,oftenrequireslarge,comprehensivedatasets—ofthetypethatare increasinglyavailableas theworldisdigitized.Sincewedon’tknowwhen
nature will choose to run her experiments, we can’t set up a small survey tomeasure the results. We need a lot of existing data to learn from theseinterventions.WeneedBigData.Thereisonemoreimportantpointtomakeabouttheexperiments—eitherour
ownorthoseofnature—detailedinthischapter.Muchofthisbookhasfocusedonunderstandingtheworld—howmuchracismcostObama,howmanymenarereally gay, how insecure men and women are about their bodies. But thesecontrolled or natural experiments have a more practical bent. They aim toimproveourdecisionmaking,tohelpuslearninterventionsthatworkandthosethatdonot.Companiescan learnhowtogetmorecustomers.Thegovernmentcan learn
how to use reimbursement to best motivate doctors. Students can learn whatschoolswillprovemostvaluable.Theseexperimentsdemonstrate thepotentialofBigData to replaceguesses, conventionalwisdom, and shoddycorrelationswithwhatactuallyworks—causally.
7
BIGDATA,BIGSCHMATA?WHATITCANNOTDO
“Seth, Lawrence Summers would like to meet with you,” the email said,somewhat cryptically. It was from one ofmy Ph.D. advisers, LawrenceKatz.Katz didn’t tell me why Summers was interested in my work, though I laterfoundoutKatzhadknownallalong.I sat in the waiting room outside Summers’s office. After some delay, the
formerTreasurysecretaryoftheUnitedStates,formerpresidentofHarvard,andwinnerofsomeofthebiggestawardsineconomics,summonedmeinside.Summers began the meeting by reading my paper on racism’s effect on
Obama,whichhissecretaryhadprintedforhim.Summersisaspeedreader.Ashe reads,heoccasionally stickshis tongueoutand to the right,whilehiseyesrapidlyshiftleftandrightanddownthepage.Summersreadingasocialsciencepaper remindsme of a great pianist performing a sonata.He is so focused heseemstolosetrackofallelse.Infewerthanfiveminutes,hehadcompletedmythirty-pagepaper.“You say thatGoogle searches for ‘nigger’ suggest racism,” Summers said.
“Thatseemsplausible.TheypredictwhereObamagetslesssupportthanKerry.Thatisinteresting.CanwereallythinkofObamaandKerryasthesame?”“They were ranked as having similar ideologies by political scientists,” I
responded.“Also,thereisnocorrelationbetweenracismandchangesinHouse
voting. The result stays strong evenwhenwe add controls for demographics,churchattendance,andgunownership.”This ishowweeconomists talk.Ihadgrownanimated.Summerspausedandstaredatme.HebrieflyturnedtotheTVinhisoffice,
whichwastunedtoCNBC,thenstaredatmeagain,thenlookedattheTV,thenbackatme.“Okay,Ilikethispaper,”Summerssaid.“Whatelseareyouworkingon?”Thenextsixtyminutesmayhavebeenthemostintellectuallyexhilaratingof
my life. Summers and I talked about interest rates and inflation, policing andcrime, business and charity. There is a reason so many people who meetSummers are enthralled. I have been fortunate to speakwith some incrediblysmartpeopleinmylife;Summersstruckmeasthesmartest.Heisobsessedwithideas,morethanallelse,whichseemstobewhatoftengetshimintrouble.HehadtoresignhispresidencyatHarvardaftersuggestingthepossibilitythatpartofthereasonfortheshortageofwomeninthesciencesmightbethatmenhavemorevariationintheirIQs.Ifhefindsanideainteresting,Summerstendstosayit,evenifitoffendssomeears.It was now a half hour past the scheduled end time for our meeting. The
conversationwasintoxicating,butIstillhadnoideawhyIwasthere,norwhenIwassupposedtoleave,norhowIwouldknowwhenIwassupposedtoleave.Igotthefeeling,bythispoint,thatSummershimselfmayhaveforgottenwhyhehadsetupthismeeting.And then he asked the million-dollar—or perhaps billion-dollar—question.
“Youthinkyoucanpredictthestockmarketwiththisdata?”Aha.HereatlastwasthereasonSummershadsummonedmetohisoffice.Summers is hardly the first person to ask me this particular question. My
father has generally been supportive of my unconventional research interests.Butonetimehedidbroachthesubject.“Racism,childabuse,abortion,”hesaid.“Can’t you make any money off this expertise of yours?” Friends and otherfamily members have raised the subject, as well. So have coworkers andstrangers on the internet. Everyone seems towant to knowwhether I can useGoogle searches—or other Big Data—to pick stocks. Now it was the formerTreasurysecretaryoftheUnitedStates.Thiswasmoreserious.So can new Big Data sources successfully predict which ways stocks are
headed?Theshortanswerisno.In the previous chapters we discussed the four powers of Big Data. This
chapterisallaboutBigData’slimitations—bothwhatwecannotdowithitand,onoccasion,whatweoughtnotdowithit.AndoneplacetostartisbytellingthestoryofthefailedattemptbySummersandmyselftobeatthemarkets.InChapter3,wenotedthatnewdataismostlikelytoyieldbigreturnswhen
theexistingresearchinagivenfieldisweak.Itisanunfortunatetruthabouttheworldthatyouwillhaveamucheasiertimegettingnewinsightsaboutracism,childabuse,orabortionthanyouwillgettinganew,profitableinsightintohowabusinessisperforming.That’sbecausemassiveresourcesarealreadydevotedtolooking for even the slightest edge in measuring business performance. Thecompetitioninfinanceisfierce.Thatwasalreadyastrikeagainstus.Summers, who is not someone known for effusing about other people’s
intelligence,wascertain thehedge fundswerealreadywayaheadofus. Iwasquite takenduringourconversationbyhowmuchrespecthehadfor themandhowmanyofmysuggestionshewasconvinced they’dbeatenus to. Iproudlyshared with him an algorithm I had devised that allowed me to obtain morecomplete Google Trends data. He said it was clever. When I asked him ifRenaissance,aquantitativehedgefund,wouldhavefiguredout thatalgorithm,hechuckledandsaid,“Yeah,ofcoursetheywouldhavefiguredthatout.”The difficulty of keeping up with the hedge funds wasn’t the only
fundamental problem that Summers and I ran up against in using new, bigdatasetstobeatthemarkets.
THECURSEOFDIMENSIONALITY
Supposeyourstrategyforpredictingthestockmarketistofindaluckycoin—butonethatwillbefoundthroughcarefultesting.Here’syourmethodology:Youlabel one thousand coins—1 to 1,000.Everymorning, for two years, you flipeachcoin, recordwhether itcameupheadsor tails,and thennotewhether theStandard&Poor’sIndexwentupordownthatday.Youpore throughallyourdata.Andvoilà!You’ve found something. It turnsout that70.3percentof thetime Coin 391 came up heads the S&P Index rose. The relationship isstatisticallysignificant,highlyso.Youhavefoundyourluckycoin!
JustflipCoin391everymorningandbuystockswheneveritcomesupheads.YourdaysofTargetT-shirtsandramennoodledinnersareover.Coin391isyourtickettothegoodlife!Ornot.Youhavebecomeanothervictimofoneofthemostdiabolicalaspectsof“the
curseofdimensionality.” It can strikewheneveryouhave lotsofvariables (or“dimensions”)—in this case, one thousand coins—chasing not that manyobservations—inthiscase,504tradingdaysoverthosetwoyears.Oneofthosedimensions—Coin391,inthiscase—islikelytogetlucky.Decreasethenumberofvariables—fliponlyonehundredcoins—anditwillbecomemuchlesslikelythat one of them will get lucky. Increase the number of observations—try topredictthebehavioroftheS&PIndexfortwentyyears—andcoinswillstruggletokeepup.The curse of dimensionality is a major issue with Big Data, since newer
datasets frequently give us exponentially more variables than traditional datasources—every search term, every category of tweet, etc. Many people whoclaim to predict themarket utilizing someBigData source havemerely beenentrapped by the curse.All they’ve really done is find the equivalent ofCoin391.Take,forexample,ateamofcomputerscientistsfromIndianaUniversityand
ManchesterUniversitywhoclaimed theycouldpredictwhichway themarketswouldgobasedonwhatpeopleweretweeting.Theybuiltanalgorithmtocodetheworld’sday-to-daymoodsbasedontweets.TheyusedtechniquessimilartothesentimentanalysisdiscussedinChapter3.However,theycodednotjustonemoodbutmanymoods—happiness,anger,kindness,andmore.Theyfoundthatapreponderanceof tweetssuggestingcalmness,suchas“I feelcalm,”predictsthat theDow Jones IndustrialAverage is likely to rise six days later.Ahedgefundwasfoundedtoexploittheirfindings.What’stheproblemhere?Thefundamentalproblemisthattheytestedtoomanythings.Andifyoutest
enough things, just by random chance, one of them will be statisticallysignificant.Theytestedmanyemotions.Andtheytestedeachemotiononedaybefore,twodaysbefore,threedaysbefore,anduptosevendaysbeforethestockmarket behavior that theywere trying to predict.And all these variableswere
usedtotrytoexplainjustafewmonthsofDowJonesupsanddowns.Calmnesssixdaysearlierwasnotalegitimatepredictorofthestockmarket.
CalmnesssixdaysearlierwastheBigDataequivalentofourhypotheticalCoin391.Thetweet-basedhedgefundwasshutdownonemonthafterstartingduetolacklusterreturns.Hedge funds trying to time the markets with tweets are not the only ones
battling the curse of dimensionality. So are the numerous scientistswho havetriedtofindthegenetickeystowhoweare.Thanks to the Human Genome Project, it is now possible to collect and
analyze the complete DNA of people. The potential of this project seemedenormous.Maybewe could find the gene that causes schizophrenia.Maybewe could
discoverthegenethatcausesAlzheimer’sandParkinson’sandALS.Maybewecouldfind thegene thatcauses—gulp—intelligence. Is thereonegene thatcanaddawholebunchofIQpoints?Isthereonegenethatmakesagenius?In 1998,Robert Plomin, a prominent behavioral geneticist, claimed to have
found the answer. He received a dataset that included the DNA and IQs ofhundredsofstudents.HecomparedtheDNAof“geniuses”—thosewithIQsof160orhigher—totheDNAofthosewithaverageIQs.HefoundastrikingdifferenceintheDNAofthesetwogroups.Itwaslocated
in one small corner of chromosome6, an obscure but powerful gene thatwasusedinthemetabolismofthebrain.Oneversionofthisgene,namedIGF2r,wastwiceascommoningeniuses.“First Gene to Be Linked with High Intelligence Is Reported Found,”
headlinedtheNewYorkTimes.YoumaythinkofthemanyethicalquestionsPlomin’sfindingraised.Should
parents be allowed to screen their kids for IGF2r? Should they be allowed toabortababywith the low-IQvariant?Shouldwegeneticallymodifypeople togivethemahighIQ?DoesIGF2rcorrelatewithrace?Dowewanttoknowtheanswertothatquestion?ShouldresearchonthegeneticsofIQcontinue?Before bioethicists had to tackle any of these thorny questions, therewas a
more basic question for geneticists, including Plomin himself.Was the resultaccurate?WasitreallytruethatIGF2rcouldpredictIQ?Wasitreallytruethatgeniusesweretwiceaslikelytocarryacertainvariantofthisgene?
Nope. A few years after his original study, Plomin got access to anothersampleofpeoplethatalsoincludedtheirDNAandIQscores.Thistime,IGF2rdid not correlate with IQ. Plomin—and this is a sign of a good scientist—retractedhisclaim.This,infact,hasbeenageneralpatternintheresearchintogeneticsandIQ.
First, scientists report that they have found a genetic variant that predicts IQ.Thenscientistsgetnewdataanddiscovertheiroriginalassertionwaswrong.For example, in a recent paper, a team of scientists, led by Christopher
Chabris, examined twelve prominent claims about genetic variants associatedwith IQ. They examined data from ten thousand people. They could notreproducethecorrelationforanyofthetwelve.What’s the issuewith all of these claims?The curse of dimensionality.The
human genome, scientists now know, differs in millions of ways. There are,quitesimply,toomanygenestotest.Ifyou testenough tweets tosee if theycorrelatewith thestockmarket,you
willfindonethatcorrelatesjustbychance.IfyoutestenoughgeneticvariantstoseeiftheycorrelatewithIQ,youwillfindonethatcorrelatesjustbychance.Howcanyouovercomethecurseofdimensionality?Youhavetohavesome
humilityaboutyourworkandnotfallinlovewithyourresults.Youhavetoputthese results through additional tests. For example, before you bet your lifesavingsonCoin391,youwouldwanttoseehowitdoesoverthenextcoupleofyears.Socialscientistscallthisan“out-of-sample”test.Andthemorevariablesyoutry,themorehumbleyouhavetobe.Themorevariablesyoutry,thetoughertheout-of-sampletesthastobe.Itisalsocrucialtokeeptrackofeverytestyouattempt.Thenyoucanknowexactlyhowlikelyitisyouarefallingvictimtothecurseandhowskepticalyoushouldbeofyourresults.WhichbringsusbacktoLarrySummersandme.Here’showwetriedtobeatthemarkets.Summers’s first idea was to use searches to predict future sales of key
products,suchasiPhones,thatmightshedlightonthefutureperformanceofthestock of a company, such as Apple. There was indeed a correlation betweensearches for“iPhones”and iPhones sales.WhenpeopleareGooglinga lot for“iPhones,”youcanbetalotofphonesarebeingsold.However,thisinformationwas already incorporated into theApple stockprice.Clearly,when therewerelotsofGooglesearchesfor“iPhones,”hedgefundshadalsofiguredout that it
wouldbeabigseller, regardlessofwhether theyused thesearchdataorsomeothersource.Summers’snextideawastopredictfutureinvestmentindevelopingcountries.
If a large number of investorswere going to be pouringmoney into countriessuchasBrazilorMexicointhenearfuture,thenstocksforcompaniesinthesecountrieswouldsurelyrise.PerhapswecouldpredictariseininvestmentwithkeyGooglesearches—suchas“investinMexico”or“investmentopportunitiesinBrazil.”Thisprovedadeadend.Theproblem?Thesearcheswere too rare.Instead of revealingmeaningful patterns, this search data jumped all over theplace.Wetriedsearchesforindividualstocks.Perhapsifpeopleweresearchingfor
“GOOG,”thismeanttheywereabouttobuyGoogle.Thesesearchesseemedtopredictthatthestockswouldbetradedalot.Buttheydidnotpredictwhetherthestockswouldriseorfall.Onemajorlimitationisthatthesesearchesdidnottelluswhethersomeonewasinterestedinbuyingorsellingthestock.One day, I excitedly showed Summers a new idea I had: past searches for
“buy gold” seemed to correlate with future increases in the price of gold.SummerstoldmeIshouldtestitgoingforwardtoseeifitremainedaccurate.Itstopped working, perhaps because some hedge fund had found the samerelationship.In the end, over a fewmonths,we didn’t find anything useful in our tests.
Undoubtedly,ifwehadlookedforacorrelationwithmarketperformanceineachof thebillionsofGooglesearch terms,wewouldhavefoundone thatworked,howeverweakly.ButitlikelywouldhavejustbeenourownCoin391.
THEOVEREMPHASISONWHATISMEASURABLE
InMarch 2012, Zoë Chance, a marketing professor at Yale, received a smallwhitepedometer inherofficemailbox indowntownNewHaven,Connecticut.Sheaimedtostudyhowthisdevice,whichmeasuresthestepsyoutakeduringthedayandgivesyoupointsasaresult,caninspireyoutoexercisemore.What happened next, as she recounted in a TEDx talk, was a Big Data
nightmare.Chancebecamesoobsessedandaddictedtoincreasinghernumbersthatshebeganwalkingeverywhere,fromthekitchentothelivingroom,tothe
dining room, to the basement, in her office.Shewalked early in themorning,late at night, at nearly all hours of the day—twenty thousand steps in a giventwenty-fourhourperiod.Shecheckedherpedometerhundredsoftimesperday,andmuchthatremainedofherhumancommunicationwaswithotherpedometerusersonline,discussingstrategiestoimprovescores.Sheremembersputtingthepedometer on her three-year-old daughter when her daughter was walking,becauseshewassoobsessedwithgettingthenumberhigher.Chance became so obsessed with maximizing this number that she lost all
perspective.Sheforgotthereasonsomeonewouldwanttogetthenumberhigher—exercising,nothavingherdaughterwalka fewsteps.Nordid shecompleteany academic research about the pedometer. She finally got rid of the deviceafterfallinglateonenight,exhausted,whiletryingtogetinmoresteps.Thoughshe is a data-driven researcher by profession, the experience affected herprofoundly.“Itmakesmeskepticalofwhetherhavingaccesstoadditionaldataisalwaysagoodthing,”Chancesays.Thisisanextremestory.Butitpointstoapotentialproblemwithpeopleusing
data tomake decisions.Numbers can be seductive.We can grow fixatedwiththem,andinsodoingwecanlosesightofmoreimportantconsiderations.ZoëChancelostsight,moreorless,oftherestofherlife.Evenlessobsessiveinfatuationswithnumberscanhavedrawbacks.Consider
thetwenty-first-centuryemphasisontestinginAmericanschools—andjudgingteachersbasedonhowtheirstudentsscore.Whilethedesireformoreobjectivemeasuresofwhathappensinclassroomsislegitimate,therearemanythingsthatgo on there that can’t readily be captured in numbers. Moreover, all of thattesting pressured many teachers to teach to the tests—and worse. A smallnumber, aswas proven in a paper by Brian Jacob and Steven Levitt, cheatedoutrightinadministeringthosetests.Theproblemisthis:thethingswecanmeasureareoftennotexactlywhatwe
careabout.Wecanmeasurehowstudentsdoonmultiple-choicequestions.Wecan’t easilymeasure critical thinking, curiosity, or personal development. Justtryingtoincreaseasingle,easy-to-measurenumber—testscoresorthenumberofstepstakeninaday—doesn’talwayshelpachievewhatwearereallytryingtoaccomplish.Initsefforts toimproveitssite,Facebookrunsintothisdangeraswell.The
companyhastonsofdataonhowpeopleusethesite.It’seasytoseewhetheraparticularNewsFeedstorywasliked,clickedon,commentedon,orshared.But,according toAlexPeysakhovich, a Facebook data scientistwithwhom I havewritten about these matters, not one of these is a perfect proxy for moreimportant questions:What was the experience of using the site like? Did thestoryconnecttheuserwithherfriends?Diditinformherabouttheworld?Diditmakeherlaugh?Orconsiderbaseball’sdatarevolutioninthe1990s.Manyteamsbeganusing
increasingly intricate statistics—rather than relying on old-fashioned humanscouts—tomakedecisions.Itwaseasytomeasureoffenseandpitchingbutnotfielding, so some organizations ended up underestimating the importance ofdefense.Infact,inhisbookTheSignalandtheNoise,NateSilverestimatesthattheOaklandA’s,adata-drivenorganizationprofiledinMoneyball,weregivingupeighttotenwinsperyearinthemid-ninetiesbecauseoftheirlousydefense.ThesolutionisnotalwaysmoreBigData.Aspecialsauceisoftennecessary
tohelpBigDataworkbest:thejudgmentofhumansandsmallsurveys,whatwemight call small data. In an interview with Silver, Billy Beane, the A’s thengeneralmanagerandthemaincharacterinMoneyball,saidthatheactuallyhadbegunincreasinghisscoutingbudget.To fill in the gaps in its giant data pool, Facebook too has to take an old-
fashionedapproach:askingpeoplewhattheythink.EverydayastheyloadtheirNewsFeed,hundredsofFacebookusersarepresentedwithquestionsaboutthestoriestheyseethere.Facebook’sautomaticallycollecteddatasets(likes,clicks,comments)aresupplemented,inotherwords,bysmallerdata(“DoyouwanttoseethispostinyourNewsFeed?”“Why?”).Yes,evenaspectacularlysuccessfulBig Data organization like Facebook sometimes makes use of the source ofinformationmuchdisparagedinthisbook:asmallsurvey.Indeed,becauseofthisneedforsmalldataasasupplementtoitsmainstay—
massive collections of clicks, likes, and posts—Facebook’s data teams lookdifferent than you might guess. Facebook employs social psychologists,anthropologists,andsociologistspreciselytofindwhatthenumbersmiss.Some educators, too, are becoming more alert to blind spots in Big Data.
There is agrowingnational effort to supplementmass testingwith smalldata.Student surveys have proliferated. So have parent surveys and teacher
observations,whereotherexperiencededucatorswatchateacherduringalesson.“Schooldistrictsrealizetheyshouldn’tbefocusingsolelyontestscores,”says
ThomasKane, a professor of education atHarvard.A three-year study by theBill&MelindaGatesFoundationbearsout thevalue ineducationofbothbigandsmalldata.Theauthorsanalyzedwhether test-score-basedmodels, studentsurveys, or teacher observations were best at measuring which teachers mostimproved student learning.When they put the three measures together into acomposite score, they got the best results. “Each measure adds something ofvalue,”thereportconcluded.Infact,itwasjustasIwaslearningthatmanyBigDataoperationsusesmall
data to fill in theholes that I showedup inOcala,Florida, tomeet JeffSeder.Remember,hewas theHarvard-educatedhorseguruwhoused lessons learnedfromahugedatasettopredictthesuccessofAmericanPharoah.Aftersharingallthecomputerfilesandmathwithme,Sederadmittedthathe
hadanotherweapon:PattyMurray.Murray,likeSeder,hashighintelligenceandelitecredentials—adegreefrom
BrynMawr.ShealsoleftNewYorkCityforrurallife.“Ilikehorsesmorethanhumans,”Murrayadmits.ButMurrayisabitmoretraditionalinherapproachestoevaluatinghorses.She, likemanyhorseagents,personallyexamineshorses,seeing how they walk, checking for scars and bruises, and interrogating theirowners.MurraythencollaborateswithSederastheypickthefinalhorsestheywantto
recommend.Murraysniffsoutproblemswiththehorses,problemsthatSeder’sdata,despitebeingthemostinnovativeandimportantdatasetevercollectedonhorses,stillmisses.I am predicting a revolution based on the revelations ofBigData. But this
doesnotmeanwecan just throwdataatanyquestion.AndBigDatadoesnoteliminate the need for all the other ways humans have developed over themillenniatounderstandtheworld.Theycomplementeachother.
8
MODATA,MOPROBLEMS?WHATWESHOULDN’TDO
Sometimes, thepowerofBigDataissoimpressiveit’sscary.Itraisesethicalquestions.
THEDANGEROFEMPOWEREDCORPORATIONS
Recently,threeeconomists—OdedNetzerandAlainLemaire,bothofColumbia,and Michal Herzenstein of the University of Delaware—looked for ways topredict the likelihood of whether a borrower would pay back a loan. Thescholars utilized data from Prosper, a peer-to-peer lending site. Potentialborrowerswrite abriefdescriptionofwhy theyneeda loanandwhy theyarelikelytomakegoodonit,andpotentiallendersdecidewhethertoprovidethemthemoney.Overall,about13percentofborrowersdefaultedontheirloan.It turnsoutthelanguagethatpotentialborrowersuseisastrongpredictorof
their probability of paying back.And it is an important indicator even if youcontrol for other relevant information lenderswere able to obtain about thosepotentialborrowers,includingcreditratingsandincome.Listed below are ten phrases the researchers found that are commonly used
whenapplyingforaloan.Fiveofthempositivelycorrelatewithpayingbackthe
loan. Five of them negatively correlate with paying back the loan. In otherwords,fivetendtobeusedbypeopleyoucantrust,fivebypeopleyoucannot.Seeifyoucanguesswhicharewhich.
Godpromisedebt-freeminimumpaymentlowerinterestrate
willpaygraduatethankyouafter-taxhospital
Youmightthink—oratleasthope—thatapolite,openlyreligiouspersonwhogiveshiswordwouldbeamongthemostlikelytopaybackaloan.Butinfactthis is not the case. This type of person, the data shows, is less likely thanaveragetomakegoodontheirdebt.Herearethephrasesgroupedbythelikelihoodofpayingback.
TERMSUSEDINLOANAPPLICATIONSBYPEOPLEMOSTLIKELYTOPAYBACK
debt-freelowerinterestrateafter-tax
minimumpaymentgraduate
TERMSUSEDINLOANAPPLICATIONSBYPEOPLEMOSTLIKELYTODEFAULT
Godpromisewillpay
thankyouhospital
Beforewediscuss the ethical implications of this study, let’s think through,withthehelpofthestudy’sauthors,whatitrevealsaboutpeople.Whatshouldwemakeofthewordsinthedifferentcategories?First,let’sconsiderthelanguagethatsuggestssomeoneismorelikelytomake
theirloanpayments.Phrasessuchas“lowerinterestrate”or“after-tax”indicateacertainleveloffinancialsophisticationontheborrower’spart,soit’sperhapsnotsurprisingtheycorrelatewithsomeonemorelikelytopaytheirloanback.Inaddition,ifheorshetalksaboutpositiveachievementssuchasbeingacollege“graduate”andbeing“debt-free,”heorsheisalsolikelytopaytheirloans.Now let’s consider language that suggests someone is unlikely to pay their
loans.Generally,ifsomeonetellsyouhewillpayyouback,hewillnotpayyouback. The more assertive the promise, the more likely he will break it. Ifsomeonewrites“IpromiseIwillpayback,sohelpmeGod,”he isamongtheleastlikelytopayyouback.Appealingtoyourmercy—explainingthatheneedsthemoneybecausehehasarelativeinthe“hospital”—alsomeansheisunlikelytopayyouback.Infact,mentioninganyfamilymember—ahusband,wife,son,daughter,mother,orfather—isasignsomeonewillnotbepayingback.Anotherwordthatindicatesdefaultis“explain,”meaningifpeoplearetryingtoexplainwhytheyaregoingtobeabletopaybackaloan,theylikelywon’t.The authors did not have a theory for why thanking people is evidence of
likelydefault.Insum,accordingto theseresearchers,givingadetailedplanofhowhecan
make his payments andmentioning commitments he has kept in the past areevidencesomeonewillpaybackaloan.Makingpromisesandappealingtoyourmercyisaclearsignsomeonewillgointodefault.Regardlessofthereasons—orwhatittellsusabouthumannaturethatmakingpromisesisasuresignsomeonewill, in actuality, not do something—the scholars found the test was anextremely valuable piece of information in predicting default. Someone whomentionsGodwas2.2timesmorelikelytodefault.Thiswasamongthesinglehighestindicatorsthatsomeonewouldnotpayback.But the authors also believe their study raises ethical questions.While this
was just anacademic study, somecompaniesdo report that theyutilizeonlinedata in approving loans. Is this acceptable?Dowewant to live in aworld inwhichcompaniesusethewordswewritetopredictwhetherwewillpaybackaloan?Itis,ataminimum,creepy—and,quitepossibly,scary.Aconsumer lookingfora loan in thenearfuturemighthave toworryabout
notmerely her financial history but also her online activity. And shemay bejudgedonfactors thatseemabsurd—whethersheusesthephrase“Thankyou”
orinvokes“God,”forexample.Further,whataboutawomanwholegitimatelyneeds tohelphersister inahospitalandwillmostcertainlypaybackher loanafterward?Itseemsawfultopunishherbecause,onaverage,peopleclaimingtoneed help for medical bills have often been proven to be lying. A worldfunctioningthiswaystartstolookawfullydystopian.Thisistheethicalquestion:Docorporationshavetherighttojudgeourfitness
fortheirservicesbasedonabstractbutstatisticallypredictivecriterianotdirectlyrelatedtothoseservices?Leavingbehindtheworldoffinance,let’slookatthelargerimplicationson,
forexample,hiringpractices.Employersareincreasinglyscouringsocialmediawhenconsideringjobcandidates.Thatmaynotraiseethicalquestionsifthey’relookingforevidenceofbad-mouthingpreviousemployersorrevealingpreviousemployers’ secrets. Theremay even be some justification for refusing to hiresomeonewhoseFacebookorInstagrampostssuggestexcessivealcoholuse.Butwhatiftheyfindaseeminglyharmlessindicatorthatcorrelateswithsomethingtheycareabout?ResearchersatCambridgeUniversityandMicrosoftgavefifty-eightthousand
U.S.Facebookusers a varietyof tests about their personality and intelligence.TheyfoundthatFacebooklikesarefrequentlycorrelatedwithIQ,extraversion,and conscientiousness. For example, people who like Mozart, thunderstorms,andcurly friesonFacebook tend tohavehigher IQs.Peoplewho likeHarley-Davidsonmotorcycles,thecountrymusicgroupLadyAntebellum,orthepage“ILoveBeingaMom”tendtohavelowerIQs.Someofthesecorrelationsmaybeduetothecurseofdimensionality.Ifyoutestenoughthings,somewillrandomlycorrelate.ButsomeinterestsmaylegitimatelycorrelatewithIQ.Nonetheless, it would seem unfair if a smart person who happens to like
Harleyscouldn’tgetajobcommensuratewithhisskillsbecausehewas,withoutrealizingit,signalinglowintelligence.Infairness,thisisnotanentirelynewproblem.Peoplehavelongbeenjudged
by factors not directly related to job performance—the firmness of theirhandshakes, the neatness of their dress.But a danger of the data revolution isthat, as more of our life is quantified, these proxy judgments can get moreesoteric yet more intrusive. Better prediction can lead to subtler and morenefariousdiscrimination.
Betterdatacanalsoleadtoanotherformofdiscrimination,whateconomistscall price discrimination. Businesses are often trying to figure out what pricetheyshouldchargeforgoodsorservices.Ideallytheywanttochargecustomersthemaximumtheyarewillingtopay.Thisway,theywillextractthemaximumpossibleprofit.Most businesses usually end up picking one price that everyone pays. But
sometimestheyareawarethatthemembersofacertaingroupwill,onaverage,paymore.Thisiswhymovietheaterschargemoretomiddle-agedcustomers—attheheightoftheirearningpower—thantostudentsorseniorcitizensandwhyairlinesoftenchargemoretolast-minutepurchasers.Theypricediscriminate.Big Data may allow businesses to get substantially better at learning what
customers are willing to pay—and thus gouging certain groups of people.Optimal DecisionsGroupwas a pioneer in using data science to predict howmuchconsumersarewillingtopayforinsurance.Howdidtheydoit?Theyusedamethodologythatwehavepreviouslydiscussedinthisbook.Theyfoundpriorcustomersmost similar to those currently looking to buy insurance—and sawhowhigh a premium theywerewilling to take on. In otherwords, they ran adoppelgangersearch.Adoppelgangersearchisentertainingifithelpsuspredictwhether a baseball playerwill return to his former greatness.A doppelgangersearchisgreatifithelpsuscuresomeone’sdisease.Butifadoppelgangersearchhelpsacorporationextractevery lastpenny fromyou?That’snot socool.Myspendthriftbrotherwouldhavearighttocomplainifhegotchargedmoreonlinethantightwadme.Gambling is one area in which the ability to zoom in on customers is
potentially dangerous. Big casinos are using something like a doppelgangersearchtobetterunderstandtheirconsumers.Theirgoal?Toextractthemaximumpossibleprofit—tomakesuremoreofyourmoneygoesintotheircoffers.Here’showitworks.Everygambler,casinosbelieve,hasa“painpoint.”This
istheamountoflossesthatwillsufficientlyfrightenhersothatsheleavesyourcasinoforanextendedperiodoftime.Suppose,forexample,thatHelen’s“painpoint”is$3,000.Thismeansifsheloses$3,000,you’velostacustomer,perhapsforweeksormonths.IfHelenloses$2,999,shewon’tbehappy.Who,afterall,likestolosemoney?Butshewon’tbesodemoralizedthatshewon’tcomebacktomorrownight.
Imagine for a moment that you are managing a casino. And imagine thatHelen has shownup to play the slotmachines.What is the optimal outcome?Clearly,youwantHelentogetascloseaspossibletoher“painpoint”withoutcrossing it.Youwanther to lose$2,999,enoughthatyoumakebigprofitsbutnotsomuchthatshewon’tcomebacktoplayagainsoon.Howcanyoudothis?Well,therearewaystogetHelentostopplayingonce
shehaslostacertainamount.Youcanofferherfreemeals,forexample.Maketheofferenticingenough,andshewillleavetheslotsforthefood.Butthere’sonebigchallengewiththisapproach.HowdoyouknowHelen’s
“painpoint”?Theproblemis,peoplehavedifferent“painpoints.”ForHelen,it’s$3,000. For John, it might be $2,000. For Ben, it might be $26,000. If youconvinceHelen to stopgamblingwhenshe lost$2,000,you leftprofitson thetable. Ifyouwait too long—aftershehas lost$3,000—youhave losther forawhile. Further,Helenmight notwant to tell you her pain point. Shemay notevenknowwhatitisherself.Sowhatdoyoudo?Ifyouhavemadeitthisfarinthebook,youcanprobably
guesstheanswer.Youutilizedatascience.Youlearneverythingyoucanaboutanumberofyourcustomers—theirage,gender,zipcode,andgamblingbehavior.And, from that gambling behavior—their winnings, losings, comings, andgoings—youestimatetheir“painpoint.”YougatheralltheinformationyouknowaboutHelenandfindgamblerswho
are similar to her—herdoppelgangers,moreor less.Thenyou figureout howmuchpaintheycanwithstand.It’sprobablythesameamountasHelen.Indeed,this is what the casino Harrah’s does, utilizing a Big Data warehouse firm,Terabyte,toassistthem.Scott Gnau, general manager of Terabyte, explains, in the excellent book
SuperCrunchers,what casinomanagers dowhen they see a regular customernearingtheirpainpoint:“Theycomeoutandsay,‘Iseeyou’rehavingaroughday. I know you like our steakhouse. Here, I’d like you to take your wife todinneronusrightnow.’”Thismightseemtheheightofgenerosity:a freesteakdinner.But really it’s
self-serving.Thecasinoisjusttryingtogetcustomerstoquitbeforetheylosesomuch that they’ll leave for an extended period of time. In other words,managementisusingsophisticateddataanalysistotrytoextractasmuchmoney
fromcustomers,overthelongterm,asitcan.We have a right to fear that better and better use of online data will give
casinos, insurance companies, lenders, and other corporate entities too muchpoweroverus.Ontheotherhand,BigDatahasalsobeenenablingconsumerstoscoresome
blowsagainstbusinessesthatoverchargethemordelivershoddyproducts.One important weapon is sites, such as Yelp, that publish reviews of
restaurants and other services. A recent study by economistMichael Luca, ofHarvard, has shown the extent to which businesses are at the mercy of Yelpreviews.Comparing those reviews to salesdata in the stateofWashington,hefoundthatonefewerstaronYelpwillmakearestaurant’srevenuesdrop5to9percent.Consumers are also aided in their struggles with business by comparison
shopping sites—likeKayak andBooking.com.As discussed inFreakonomics,when an internet site began reporting the prices different companies werecharging for term life insurance, these prices fell dramatically. If an insurancecompanywasovercharging,customerswouldknowitandusesomeoneelse.Thetotalsavingstoconsumers?Onebilliondollarsperyear.Dataon the internet, inotherwords, can tell businesseswhichcustomers to
avoidandwhichtheycanexploit.Itcanalsotellcustomersthebusinessestheyshouldavoidandwhoistryingtoexploitthem.BigDatatodatehashelpedbothsidesinthestrugglebetweenconsumersandcorporations.Wehavetomakesureitremainsafairfight.
THEDANGEROFEMPOWEREDGOVERNMENTS
Whenherex-boyfriendshowedupatabirthdayparty,AdrianaDonatoknewhewas upset. She knew that he was mad. She knew that he had struggled withdepression.Asheinvitedherforadrive,therewasonethingDonato,atwenty-year-old zoology student, did not know. She did not know her ex-boyfriend,twenty-two-year-old James Stoneham, had spent the previous three weekssearching for informationonhow tomurder somebodyand aboutmurder law,mixedinwiththeoccasionalsearchaboutDonato.If she had known this, presumably she would not have gotten in the car.
Presumably,shewouldnothavebeenstabbedtodeaththatevening.InthemovieMinorityReport,psychicscollaboratewithpolicedepartmentsto
stop crimes before they happen. ShouldBigData bemade available to policedepartments to stop crimes before they happen? Should Donato have at leastbeenwarned about her ex-boyfriend’s foreboding searches? Should the policehaveinterrogatedStoneham?First, it must be acknowledged that there is growing evidence that Google
searchesrelatedtocriminalactivitydocorrelatewithcriminalactivity.ChristineMa-Kellams, Flora Or, Ji Hyun Baek, and Ichiro Kawachi have shown thatGoogle searches related to suicide correlate strongly with state-level suiciderates. In addition, Evan Soltas and I have shown that weekly Islamophobicsearches—such as “I hate Muslims” or “kill Muslims”—correlate with anti-Muslimhatecrimesthatweek.Ifmorepeoplearemakingsearchessayingtheywanttodosomething,morepeoplearegoingtodothatthing.So what should we do with this information? One simple, fairly
uncontroversialidea:wecanutilizethearea-leveldatatoallocateresources.Ifacityhasahugeriseinsuicide-relatedsearches,wecanupthesuicideawarenessinthiscity.Thecitygovernmentornonprofitsmightruncommercialsexplainingwherepeople canget help, for example.Similarly, if a cityhas a huge rise insearches for “killMuslims,” police departmentsmight bewise to change howthey patrol the streets. Theymight dispatchmore officers to protect the localmosque,forexample.But one step we should be very reluctant to take: going after individuals
beforeanycrimehasbeencommitted.Thisseems,tobeginwith,aninvasionofprivacy.Thereisalargeethicalleapfromthegovernmenthavingthesearchdataofthousandsorhundredsofthousandsofpeopletothepolicedepartmenthavingthesearchdataofanindividual.Thereisa largeethical leapfromprotectingalocalmosquetoransackingsomeone’shouse.Thereisalargeethicalleapfromadvertising suicide prevention to locking someone up in a mental hospitalagainsthiswill.The reason to be extremely cautious using individual-level data, however,
goesbeyondevenethics.Thereisadatareasonaswell.Itisalargeleapfordatasciencetogofromtryingtopredicttheactionsofacitytotryingtopredicttheactionsofanindividual.
Let’sreturntosuicideforamoment.Everymonth,thereareabout3.5millionGooglesearchesintheUnitedStatesrelatedtosuicide,withthemajorityofthemsuggestingsuicidalideation—searchessuchas“suicidal,”“commitsuicide,”and“how to suicide.” In otherwords, everymonth, there ismore than one searchrelatedtosuicideforeveryonehundredAmericans.Thisbringstomindaquotefrom the philosopher Friedrich Nietzsche: “The thought of suicide is a greatconsolation:bymeansofitonegetsthroughmanyadarknight.”Googlesearchdatashowshow true that is,howcommon the thoughtof suicide is.However,everymonth, there are fewer than four thousand suicides in theUnitedStates.Suicidalideationisincrediblycommon.Suicideisnot.Soitwouldn’tmakealotofsenseforcopstobeshowingupatthedoorofeveryonewhohasevermadesomeonlinenoiseaboutwantingtoblowtheirbrainsout—iffornootherreasonthanthatthepolicewouldn’thavetimeforanythingelse.Or consider those incredibly vicious Islamophobic searches. In 2015, there
were roughly 12,000 searches in the United States for “kill Muslims.” Therewere12murdersofMuslimsreportedashatecrimes.Clearly,thevastmajorityof people who make this terrifying search do not go through with thecorrespondingact.There is some math that explains the difference between predicting the
behaviorofanindividualandpredictingthebehaviorinacity.Here’sasimplethought experiment. Suppose there are one million people in a city and onemosque.Suppose,ifsomeonedoesnotsearchfor“killMuslims,”thereisonlya1-in-100,000,000chancethathewillattackamosque.Supposeifsomeonedoessearch for “kill Muslims,” this chance rises sharply, to 1 in 10,000. SupposeIslamophobiahasskyrocketedandsearchesfor“killMuslims”haverisenfrom100to1,000.Inthissituation,mathshowsthatthechancesofamosquebeingattackedhas
risenaboutfivefold,fromabout2percentto10percent.Butthechancesofanindividualwhosearchedfor“killMuslims”actuallyattackingamosqueremainsonly1in10,000.Theproperresponseinthissituationisnottojailallthepeoplewhosearched
for“killMuslims.”Norisittovisittheirhouses.Thereisatinychancethatanyone of these people in particular will commit a crime. The proper response,however,wouldbetoprotectthatmosque,whichnowhasa10percentchanceof
beingattacked.Clearly,manyhorrificsearchesneverleadtohorribleactions.That said, it is at least theoretically possible that there are some classes of
searchesthatsuggestareasonablyhighprobabilityofahorriblefollow-through.Itisatleasttheoreticallypossible,forexample,thatdatascientistscouldinthefuturebuildamodel thatcouldhavefoundthatStoneham’ssearchesrelatedtoDonatoweresignificantcauseforconcern.In 2014, therewere about 6,000 searches for the exact phrase “how to kill
your girlfriend” and 400murders of girlfriends. If all of thesemurderers hadmade this exact search beforehand, that would mean 1 in 15 people whosearched “how to kill your girlfriend”went throughwith it.Of course,many,probablymost, peoplewhomurdered their girlfriends did notmake this exactsearch. Thiswouldmean the true probability that this particular search led tomurderislower,probablyalotlower.Butifdatascientistscouldbuildamodelthatshowedthatthethreatagainsta
particularindividualwas,say,1in100,wemightwanttodosomethingwiththatinformation. At the least, the person under threat might have the right to beinformed that there is a 1-in-100 chance shewill bemurdered by a particularperson.Overall, however,we have to be very cautious using search data to predict
crimesatanindividuallevel.Thedataclearlytellsusthattherearemany,manyhorrifyingsearchesthatrarelyleadtohorribleactions.Andtherehasbeen,asofyet,noproof that thegovernmentcanpredictaparticularhorribleaction,withhigh probability, just from examining these searches. Sowe have to be reallycautious about allowing the government to intervene at the individual levelbasedonsearchdata.Thisisnotjustforethicalorlegalreasons.It’salso,atleastfornow,fordatasciencereasons.
CONCLUSION
HOWMANYPEOPLEFINISHBOOKS?
Aftersigningmybookcontract,Ihadaclearvisionofhowthebookshouldbestructured. Near the start, youmay recall, I described a scene atmy family’sThanksgiving table.Myfamilymembersdebatedmysanityand tried to figureoutwhyI,atthirty-three,couldn’tseemtofindtherightgirl.Theconclusion to thisbook, then,practicallywrote itself. Iwouldmeetand
marrythegirl.Betterstill,IwoulduseBigDatatomeettherightgirl.PerhapsIcould weave in tidbits from the courting process throughout. Then the storywouldall come together in theconclusion,whichwoulddescribemyweddingdayanddoubleasalovelettertomynewwife.Unfortunately, lifedidn’tmatchmyvision.Lockingmyself inmyapartment
andavoidingtheworldwhilewritingabookprobablydidn’thelpmyromanticlife. And I, alas, still need to find a wife. More important, I needed a newconclusion.Iporedovermanyofmyfavoritebooksintryingtofindwhatmakesagreat
conclusion.Thebestconclusions,Iconcluded,bringtothesurfaceanimportantpoint that has been there all along, hovering just beneath the surface. For thisbook, thatbigpoint is this:socialscience isbecomingarealscience.Andthisnew,realscienceispoisedtoimproveourlives.In the beginning of Part II, I discussed Karl Popper’s critique of Sigmund
Freud.Popper,Inoted,didn’tthinkthatFreud’swackyvisionoftheworldwasscientific. But I didn’t mention something about Popper’s critique. It wasactuallyfarbroaderthanjustanattackonFreud.Popperdidn’tthinkanysocialscientist was particularly scientific. Popper was simply unimpressed with therigorofwhattheseso-calledscientistsweredoing.What motivated Popper’s crusade? When he interacted with the best
intellectuals of his day—the best physicists, the best historians, the bestpsychologists—Poppernoteda strikingdifference.When thephysicists talked,Popperbelievedinwhattheyweredoing.Sure,theysometimesmademistakes.Sure, they sometimeswere fooledby their subconsciousbiases.Butphysicistswereengagedinaprocessthatwasclearlyfindingdeeptruthsabouttheworld,culminating inEinstein’sTheoryofRelativity.When theworld’smost famoussocialscientiststalked,incontrast,Popperthoughthewaslisteningtoabunchofgobbledygook.Popper is hardly the only person to have made this distinction. Just about
everybody agrees that physicists, biologists, and chemists are real scientists.They utilize rigorous experiments to find how the physical world works. Incontrast,manypeoplethinkthateconomists,sociologists,andpsychologistsaresoftscientistswhothrowaroundmeaninglessjargonsotheycangettenure.Totheextentthiswasevertrue,theBigDatarevolutionhaschangedthat.If
KarlPopperwerealive todayandattendedapresentationbyRajChetty, JesseShapiro,EstherDuflo,or (humorme)myself, I strongly suspecthewouldnothavethesamereactionhehadbackthen.Tobehonest,hemightbemorelikelyto question whether today’s great string theorists are truly scientific or justengaginginself-indulgentmentalgymnastics.Ifaviolentmoviecomestoacity,doescrimegoupordown?Ifmorepeople
areexposedtoanad,domorepeopleusetheproduct?Ifabaseball teamwinswhenaboyistwenty,willhebemorelikelytorootforthemwhenhe’sforty?Theseareallclearquestionswithclearyes-or-noanswers.Andinthemountainsofhonestdata,wecanfindthem.Thisisthestuffofscience,notpseudoscience.This does notmean the social science revolution will come in the form of
simple,timelesslaws.Marvin Minsky, the late MIT scientist and one of the first to study the
possibilityof artificial intelligence, suggested thatpsychologygotoff trackbytryingtocopyphysics.Physicshadsuccessfindingsimplelawsthatheldinalltimesandallplaces.Humanbrains,Minskysuggested,maynotbesubjecttosuchlaws.Thebrain,
instead, is likely a complex system of hacks—one part correctingmistakes inotherparts.Theeconomyandpoliticalsystemmaybesimilarlycomplex.Forthisreason,thesocialsciencerevolutionisunlikelytocomeintheformof
neatformulas,suchasE=MC2.Infact,ifsomeoneisclaimingasocialsciencerevolutionbasedonaneatformula,youshouldbeskeptical.The revolution, instead, will come piecemeal, study by study, finding by
finding.Slowly,wewillgetabetterunderstandingofthecomplexsystemsofthehumanmindandsociety.
Aproperconclusionsumsup,butitalsopointsthewaytomorethingstocome.For this book, that’s easy. The datasets I have discussed herein are
revolutionary,buttheyhavebarelybeenexplored.Thereissomuchmoretobelearned.Frankly,theoverwhelmingmajorityofacademicshaveignoredthedataexplosion caused by the digital age.Theworld’smost famous sex researchersstickwiththetriedandtrue.Theyaskafewhundredsubjectsabouttheirdesires;they don’t ask sites like PornHub for their data. The world’s most famouslinguists analyze individual texts; they largely ignore the patterns revealed inbillionsofbooks.Themethodologiestaughttograduatestudentsinpsychology,politicalscience,andsociologyhavebeen,for themostpart,untouchedby thedigital revolution. The broad, mostly unexplored terrain opened by the dataexplosion has been left to a small number of forward-thinking professors,rebelliousgradstudents,andhobbyists.Thatwillchange.ForeveryideaIhavetalkedaboutinthisbook,thereareahundredideasjust
asimportantreadytobetackled.Theresearchdiscussedhereisthetipofthetipoftheiceberg,ascratchonthescratchofthesurface.Sowhatelseiscoming?Forone,aradicalexpansionofthemethodologythatwasusedinoneofthe
mostsuccessfulpublichealthstudiesofalltime.Inthemid-nineteenthcentury,John Snow, aBritish physician,was interested inwhatwas causing a cholera
outbreakinLondon.His ingenious idea: hemapped every cholera case in the city.When he did
this, he found the disease was largely clustered around one particular waterpump. This suggested the disease spread through germ-infested water,disprovingthethen-conventionalideathatitspreadthroughbadair.BigData—andthezoominginthatitallows—makesthistypeofstudyeasy.
Foranydisease,wecanexploreGooglesearchdataorotherdigitalhealthdata.Wecanfindifthereareanytinypocketsoftheworldwhereprevalenceofthisdiseaseisunusuallyhighorunusually low.Thenwecanseewhat theseplaceshaveincommon.Istheresomethingintheair?Thewater?Thesocialnorms?Wecandothisformigraines.Wecandothisforkidneystones.Wecandothis
for anxiety and depression and Alzheimer’s and pancreatic cancer and highbloodpressureandbackpainandconstipationandnosebleeds.Wecando thisfor everything.The analysis thatSnowdidonce,wemight be able to do fourhundredtimes(somethingasofthiswritingIamalreadystartingtoworkon).Wemightcallthis—takingasimplemethodandutilizingBigDatatoperform
an analysis several hundred times in a short period of time—science at scale.Yes, the social and behavioral sciences are most definitely going to scale.Zooming in on health conditionswill help these sciences scale.Another thingthatwillhelpthemscale:A/Btesting.WediscussedA/Btestinginthecontextofbusinesses getting users to click on headlines and ads—and this has been thepredominant use of themethodology.ButA/B testing can be used to uncoverthingsmorefundamental—andsociallyvaluable—thananarrowthatgetspeopletoclickonanad.BenjaminF.JonesisaneconomistatNorthwesternwhoistryingtouseA/B
testing tobetterhelpkids learn.Hehashelpedcreateaplatform,EDUSTAR,whichallowsforschoolstorandomlytestdifferentlessonplans.Many companies are in the education software business.With EDUSTAR,
studentslogintoacomputerandarerandomlyexposedtodifferentlessonplans.Thentheytakeshortteststoseehowwelltheylearnedthematerial.Schools,inotherwords,learnwhatsoftwareworksbestforhelpingstudentsgraspmaterial.Already, like all great A/B testing platforms, EDU STAR is yielding
surprisingresults.Onelessonplanthatmanyeducatorswereveryexcitedaboutincludedsoftwarethatutilizedgamestohelpteachstudentsfractions.Certainly,
ifyouturnedmathintoagame,studentswouldhavemorefun,learnmore,anddobetterontests.Right?Wrong.Studentswhoweretaughtfractionsviaagametestedworsethanthosewholearnedfractionsinamorestandardway.Gettingkids to learnmore is anexciting, and sociallybeneficial,useof the
testing thatSiliconValleypioneered to get people to clickonmore ads.So isgettingpeopletosleepmore.TheaverageAmericangets6.7hoursof sleepeverynight.MostAmericans
wanttosleepmore.But11P.M.rollsaround,andSportsCenterisonorYouTubeis calling. So shut-eye waits. Jawbone, a wearable-device company withhundredsof thousandsofcustomers,performs thousandsof tests to try to findinterventions that help get their users to do what they want to do: go to bedearlier.Jawbonescoredahugewinusingatwo-prongedgoal.First,askcustomersto
commit to a not-that-ambitious goal. Send them amessage like this: “It lookslikeyouhaven’tbeensleepingmuchinthelast3days.Whydon’tyouaimtogettobedby11:30tonight?Weknowyounormallygetupat8A.M.”Thentheuserswillhaveanoptiontoclickon“I’min.”Second,when10:30comes,Jawbonewillsendanothermessage:“Wedecided
you’daimtosleepat11:30.It’s10:30now.Whynotstartnow?”Jawbonefoundthisstrategyledtotwenty-threeminutesofextrasleep.They
didn’tgetcustomerstoactuallygettobedat10:30,buttheydidgetthemtobedearlier.Of course, every part of this strategy had to be optimized through lots of
experimentation.Starttheoriginalgoaltooearly—askuserstocommittogoingtobedby11P.M.—andfewwillplayalong.Askuserstogotobedbymidnightandlittlewillbegained.Jawbone used A/B testing to find the sleep equivalent of Google’s right-
pointingarrow.ButinsteadofgettingafewmoreclicksforGoogle’sadpartners,ityieldsafewmoreminutesofrestforexhaustedAmericans.Infact,thewholefieldofpsychologymightutilizethetoolsofSiliconValley
to dramatically improve their research. I’m eagerly anticipating the firstpsychologypaperthat,insteadofdetailingacoupleofexperimentsdonewithafewundergrads,showstheresultsofathousandrapidA/Btests.The days of academics devoting months to recruiting a small number of
undergradstoperformasingletestwillcometoanend.Instead,academicswillutilizedigitaldata to testa fewhundredora few thousand ideas in justa fewseconds.We’llbeabletolearnalotmoreinalotlesstime.Textasdata isgoing to teachusa lotmore.Howdo ideasspread?Howdo
new words form? How do words disappear? How do jokes form? Why arecertain words funny and others not? How do dialects develop? I bet, withintwentyyears,wewillhaveprofoundinsightsonallthesequestions.I think we might consider utilizing kids’ online behavior—appropriately
anonymized—asa supplement to traditional tests to seehow theyare learninganddeveloping.Howistheirspelling?Aretheyshowingsignsofdyslexia?Aretheydevelopingmature, intellectual interests?Dotheyhavefriends?Thereareclues to all these questions in the thousands of keystrokes every childmakeseveryday.Andthereisanother,not-trivialarea,whereplentymoreinsightsarecoming.Inthesong“Shattered,”bytheRollingStones,MickJaggerdescribesallthat
makes NewYork City, the Big Apple, somagical. Laughter. Joy. Loneliness.Rats.Bedbugs.Pride.Greed.Peopledressedinpaperbags.ButJaggerdevotesthemostwordsforwhatmakesthecitytrulyspecial:“sexandsexandsexandsex.”Aswith theBigApple, sowithBigData. Thanks to the digital revolution,
insightsarecominginhealth.Sleep.Learning.Psychology.Language.Plus,sexandsexandsexandsex.OnequestionIamcurrentlyexploring:howmanydimensionsofsexualityare
there?Weusually thinkofsomeoneasgayorstraight.Butsexuality isclearlymore complex than that. Among gay people and straight people, people havetypes—somemen like “blondes,”others “brunettes,” for instance.Might thesepreferencesbeas strongas thepreferences forgender?Anotherquestion I amlookinginto:wheredosexualpreferencescomefrom?Justaswecanfigureoutthekeyyearsthatdeterminebaseballfandomorpoliticalviews,wecannowfindthe key years that determine adult sexual preferences. To learn these answers,youwillhavetobuymynextbook,tentativelytitledEverybody(Still)Lies.The existence of porn—and the data that comeswith it—is a revolutionary
developmentinthescienceofhumansexuality.It took time for the natural sciences to begin changing our lives—to create
penicillin,satellites,andcomputers.ItmaytaketimebeforeBigDataleadsthesocialandbehavioralsciencestoimportantadvancesinthewaywelove,learn,and live.But I believe such advances are coming. I hope you see at least theoutlinesofsuchdevelopmentsfromthisbook.Ihope,infact,thatsomeofyoureadingthisbookhelpcreatesuchadvances.
Toproperlywriteaconclusion,anauthorshouldthinkaboutwhyhewrotethebookinthefirstplace.Whatgoalishetryingtoachieve?I think the largest reason Iwrote thisbook isasa resultofoneof themost
formativeexperiencesofmylife.Yousee,a littlemore thanadecadeago, thebookFreakonomicscameout.Thesurprisebestsellerdescribed theresearchofSteven Levitt, an award-winning economist at the University of Chicagomentionedfrequentlyinthisbook.Levittwasa“rogueeconomist”whoseemedtobeabletousedatatoansweranyquestionhisquirkymindcouldthinktoask:Dosumowrestlerscheat?Docontestantsongameshowsdiscriminate?Dorealestateagentsgetyouthesamedealstheygetforthemselves?Iwasjustoutofcollege,havingmajoredinphilosophy,withlittleideawhatI
wantedtodowithmylife.AfterreadingFreakonomics,Iknew.IwantedtodowhatStevenLevittdid.Iwantedtoporethroughmountainsofdatatofindouthowtheworldreallyworked.Iwouldfollowhim,Idecided,andgetaPh.D.ineconomics.Somuch has changed in the intervening twelve years.A couple of Levitt’s
studieswerefoundtohavecodingerrors.Levittsaidsomepoliticallyincorrectthingsaboutglobalwarming.Freakonomicshasgoneoutoffavorinintellectualcircles.ButIthink,afewmistakesaside,theyearshavebeenkindtothelargerpoint
Levittwastryingtomake.Levittwastellingusthatacombinationofcuriosity,creativity,anddatacoulddramaticallyimproveourunderstandingoftheworld.Therewere storieshidden indata thatwere ready tobe toldand thishasbeenprovenrightoverandoveragain.AndIhopethisbookmighthavethesameeffectonothersthatFreakonomics
hadonme.Ihopethereissomeyoungpersonreadingthisrightnowwhoisabitconfusedonwhat shewants todowithher life. Ifyouhaveabitof statisticalskill,anabundanceofcreativity,andcuriosity,enterthedataanalysisbusiness.
This book, in fact, and if I can be so bold, may be seen as next-levelFreakonomics. A major difference between the studies discussed inFreakonomics and those discussed in this book is the ambition. In the 1990s,whenLevittmadehisname,therewasn’tthatmuchdataavailable.Levittpridedhimselfongoingafterquirkyquestions,wheredatadidexist.Helargelyignoredbigquestionswhere thedatadidnotexist.Today,however,withsomuchdataavailable on just about every topic, it makes sense to go after big, profoundquestionsthatgettothecoreofwhatitmeanstobeahumanbeing.Thefutureofdataanalysisisbright.ThenextKinsey,Istronglysuspect,will
beadatascientist.ThenextFoucaultwillbeadatascientist.ThenextFreudwillbeadatascientist.ThenextMarxwillbeadatascientist.ThenextSalkmightverywellbeadatascientist.
Anyway, those were my attempts to do some of the things that a properconclusion does. But great conclusions, I came to realize, do a lot more. Somuch more. A great conclusion must be ironic. It must be moving. A greatconclusionmustbeprofoundandplayful.Itmustbedeep,humorous,andsad.Agreat conclusion must, in one sentence or two, make a point that sums upeverythingthathascomebefore,everythingthatiscoming.Itmustdosowithaunique, novel point—a twist. A great book must end on a smart, funny,provocativebang.Nowmightbeagoodtimetotalkabitaboutmywritingprocess.Iamnota
particularlyverbosewriter.Thisbookisonlyaboutseventy-fivethousandwords,whichisabitshortforatopicasrichasthisone.ButwhatIlackinbreadth,Imakeupinobsessiveness.Ispentfivemonthson,
andwroteforty-sevendraftsof,myfirstNewYorkTimessexcolumn,whichwastwo thousandwords.Somechapters in thisbook tooksixtydrafts. Icanspendhoursfindingtherightwordforasentenceinafootnote.Ilivedmuchofmypastyearasahermit.Justmeandmycomputer.Ilivedin
thehippestpartofNewYorkCityandwentoutapproximatelynever.Thisis,inmyopinion,mymagnumopus, thebest ideaIwillhaveinmylife.AndIwaswilling to sacrifice whatever it took to make it right. I wanted to be able todefend every word in this book. My phone is filled with emails I forgot torespondto,e-vitesIneveropened,BumblemessagesIignored.*
After thirteen months of hard work, I was finally able to send in a near-completedraft.Onepart,however,wasmissing:theconclusion.Iexplainedtomyeditor,Denise,thatitcouldtakeanotherfewmonths.Itold
hersixmonthswasmymostlikelyguess.Theconclusionis,inmyopinion,themostimportantpartofthebook.AndIwasonlybeginningtolearnwhatmakesagreatconclusion.Needlesstosay,Denisewasnotpleased.Then, one day, a friend of mine emailed me a study by Jordan Ellenberg.
Ellenberg, amathematician at theUniversity ofWisconsin,was curious abouthowmanypeopleactuallyfinishbooks.HethoughtofaningeniouswaytotestitusingBigData.Amazonreportshowmanypeoplequotevariouslinesinbooks.Ellenbergrealizedhecouldcomparehowfrequentlyquoteswerehighlightedatthebeginningofthebookversustheendofthebook.Thiswouldgivearoughguidetoreaders’propensitytomakeittotheend.Byhismeasure,morethan90percentofreadersfinishedDonnaTartt’snovelTheGoldfinch.Incontrast,onlyabout 7 percent made it through Nobel Prize economist Daniel Kahneman’smagnum opus, Thinking, Fast and Slow. Fewer than 3 percent, this roughmethodologyestimated,madeittotheendofeconomistThomasPiketty’smuchdiscussedandpraisedCapital in the21stCentury. Inotherwords,people tendnottofinishtreatisesbyeconomists.OneofthepointsofthisbookiswehavetofollowtheBigDatawherever it
leadsandactaccordingly.Imayhopethatmostreadersaregoingtohangonmyeverywordand try todetectpatterns linking the finalpages towhathappenedearlier.But,nomatterhowhardIworkonpolishingmyprose,mostpeoplearegoingtoreadthefirstfiftypages,getafewpoints,andmoveonwiththeirlives.Thus,Iconcludethisbookintheonlyappropriateway:byfollowingthedata,
whatpeopleactuallydo,notwhattheysay.Iamgoingtogetabeerwithsomefriends and stopworking on this damn conclusion. Too few of you,BigDatatellsme,arestillreading.
ACKNOWLEDGMENTS
Thisbookwasateameffort.TheseideasweredevelopedwhileIwasastudentatHarvard,adatascientist
atGoogle,andawriterfortheNewYorkTimes.HalVarian,withwhomIworkedatGoogle,hasbeenamajorinfluenceonthe
ideasofthisbook.AsbestIcantell,Halisperpetuallytwentyyearsaheadofhistime.HisbookInformationRules,writtenwithCarlShapiro,basicallypredictedthe future. And his paper “Predicting the Present,” with Hyunyoung Choi,largelystartedtheBigDatarevolutioninthesocialsciencesthatisdescribedinthisbook.Heisalsoanamazingandkindmentor,assomanywhohaveworkedunderhimcanattest.AclassicHalmoveis todomostof theworkonapaperyou are coauthoringwith him and then insist that your name goes before his.Hal’s combination of genius and generosity is something I have rarelyencountered.MywritingandideasdevelopedunderAaronRetica,whohasbeenmyeditor
for every singleNew York Times column. Aaron is a polymath. He somehowknows everything about music, history, sports, politics, sociology, economics,andGodonlyknowswhatelse.HeisresponsibleforahugeamountofwhatisgoodabouttheTimescolumnsthathavemynameonthem.OtherplayersontheteamforthesecolumnsincludeBillMarsh,whosegraphicscontinuetoblowmeaway,KevinMcCarthy,andGitaDaneshjoo.Thisbookincludespassagesfromthesecolumns,reprintedwithpermission.StevenPinker,whokindlyagreedtowritetheforeword,haslongbeenahero
ofmine.Hehasset thebarforamodernbookonsocialscience—anengagingexploration of the fundamentals of human nature, making sense of the bestresearchfromarangeofdisciplines.ThatbarisoneIwillbestrugglingtoreachmyentirelife.My dissertation, from which this book has grown, was written under my
brilliant and patient advisers Alberto Alesina, David Cutler, Ed Glaeser, andLawrenceKatz.Denise Oswald is an amazing editor. If you want to know how good her
editingis,comparethisfinaldrafttomyfirstdraft—actually,youcan’tdothatbecauseIamnotgoingtoevershowanyoneelsethatembarrassingfirstdraft.IalsothanktherestoftheteamatHarperCollins,includingMichaelBarrs,LynnGrady,LaurenJaniec,ShelbyMeizlik,andAmberOliver.EricLupfer,myagent,sawpotential in thisproject fromthebeginning,was
instrumentalinformingtheproposal,andhelpedcarryitthrough.Forsuperbfact-checking,IthankMelvisAcosta.OtherpeoplefromwhomIlearnedalotinmyprofessionalandacademiclife
includeSusanAthey,ShlomoBenartzi,JasonBordoff,DanielleBowers,DavidBroockman, Bo Cowgill, Steven Delpome, John Donohue, Bill Gale, ClaudiaGoldin,SuzanneGreenberg,ShaneGreenstein,SteveGrove,MikeHoyt,DavidLaibson,A.J.Magnuson,DanaMaloney, JeffreyOldham,PeterOrszag,DavidReiley, Jonathan Rosenberg, Michael Schwarz, Steve Scott, Rich Shavelson,MichaelD.Smith,LawrenceSummers,JonVaver,MichaelWiggins,andQingWu.IthankTimRequarthandNeuWriteforhelpingmedevelopmywriting.Forhelpininterpretingstudies,IthankChristopherChabris,RajChetty,Matt
Gentzkow,SolomonMessing,andJesseShapiro.I asked Emma Pierson and Katia Sobolski if they might give advice on a
chapter inmybook.Theydecided, for reasons Idonotunderstand, tooffer toreadtheentirebook—andgivewisecounseloneveryparagraph.Mymother, Esther Davidowitz, read the entire book onmultiple occasions
and helped dramatically improve it. She also taught me, by example, that Ishouldfollowmycuriosity,nomatterwhereitled.WhenIwasinterviewingforanacademic job,aprofessorgrilledme:“Whatdoesyourmother thinkof thiswork you do?” The idea was thatmymommight be embarrassed that I wasresearchingsexandothertabootopics.ButIalwaysknewshewasproudofmeforfollowingmycuriosity,whereveritled.Many people read sections and offered helpful comments. I thank Eduardo
Acevedo, Coren Apicella, Sam Asher, David Cutler, Stephen Dubner,ChristopherGlazek,JessicaGoldberg,LaurenGoldman,AmandaGordon,Jacob
Leshno,AlexPeysakhovich,NoahPopp,RamonRoullard,GregSobolski,EvanSoltas, Noah Stephens-Davidowitz, Lauren Stephens-Davidowitz, and JeanYang.Actually,JeanwasbasicallymybestfriendwhileIwrotethis,soIthankherforthat,too.Forhelpincollectingdata,IthankBrettGoldenberg,JamesRogers,andMike
Williams at MindGeek and Rob McQuown and Sam Miller at BaseballProspectus.IamgratefulforfinancialsupportfromtheAlfredSloanFoundation.Atonepoint,whilewriting thisbook, Iwasdeeply stuck, lost, andclose to
abandoning the project. I then went to the country with my dad, MitchellStephens.Overthecourseofaweek,Dadputmebacktogether.Hetookmeforwalksinwhichwediscussedlove,death,success,happiness,andwriting—andthensatmedownsowecouldgoovereverysentenceofthebook.Icouldnothavefinishedthisbookwithouthim.Allremainingerrorsare,ofcourse,myown.
NOTES
Thepaginationofthiselectroniceditiondoesnotmatchtheeditionfromwhichitwascreated.Tolocateaspecificentry,pleaseuseyoure-bookreader’ssearchtools.
INTRODUCTION
2AmericanvoterslargelydidnotcarethatBarackObama:KatieFretland,“Gallup:RaceNotImportanttoVoters,”TheSwamp,ChicagoTribune,June2008.
2Berkeleyporedthrough:AlexandreMasandEnricoMoretti,“RacialBiasinthe2008PresidentialElection,”AmericanEconomicReview99,no.2(2009).
2post-racialsociety:OntheNovember12,2009,episodeofhisshow,LouDobbssaidwelivedina“post-partisan,post-racialsociety.”OntheJanuary27,2010,episodeofhisshow,ChrisMatthewssaidthatPresidentObamawas“post-racialbyallappearances.”Forotherexamples,seeMichaelC.DawsonandLawrenceD.Bobo,“OneYearLaterandtheMythofaPost-RacialSociety,”DuBoisReview:SocialScienceResearchonRace6,no.2(2009).
5IanalyzeddatafromtheGeneralSocialSurvey:Detailsonallthesecalculationscanbefoundonmywebsite,sethsd.com,inthecsvlabeled“SexData.”DatafromtheGeneralSocialSurveycanbefoundathttp://gss.norc.org/.
5fewerthan600millioncondoms:Dataprovidedtotheauthor.7searchesandsign-upsforStormfront:Author’sanalysisofGoogleTrends
data.IalsoscrapeddataonallmembersofStormfront,asdiscussedinSeth
Stephens-Davidowitz,“TheDataofHate,”NewYorkTimes,July13,2014,SR4.Therelevantdatacanbedownloadedatsethsd.com,inthedatasectionheadlined“Stormfront.”
7moresearchesfor“niggerpresident”than“firstblackpresident”:Author’sanalysisofGoogleTrendsdata.ThestatesforwhichthisistrueincludeKentucky,Louisiana,Arizona,andNorthCarolina.
9rejectedbyfiveacademicjournals:ThepaperwaseventuallypublishedasSethStephens-Davidowitz,“TheCostofRacialAnimusonaBlackCandidate:EvidenceUsingGoogleSearchData,”JournalofPublicEconomics118(2014).Moredetailsabouttheresearchcanbefoundthere.Inaddition,thedatacanbefoundatmywebsite,sethsd.com,inthedatasectionheadlined“Racism.”
13singlefactorthatbestcorrelated:“StrongestcorrelateI’vefoundforTrumpsupportisGooglesearchesforthen-word.Othershavereportedthistoo”(February28,2016,tweet).SeealsoNateCohn,“DonaldTrump’sStrongestSupporters:ANewKindofDemocrat,”NewYorkTimes,December31,2015,A3.
13ThisshowsthepercentofGooglesearchesthatincludetheword“nigger(s).”Notethat,becausethemeasureisasapercentofGooglesearches,itisnotarbitrarilyhigherinplaceswithlargepopulationsorplacesthatmakealotofsearches.NotealsothatsomeofthedifferencesinthismapandthemapforTrumpsupporthaveobviousexplanations.TrumplostpopularityinTexasandArkansasbecausetheywerethehomestatesoftwoofhisopponents,TedCruzandMikeHuckabee.
13ThisissurveydatafromCivisAnalyticsfromDecember2015.Actualvotingdataislessusefulhere,sinceitishighlyinfluencedbywhentheprimarytookplaceandthevotingformat.ThemapsarereprintedwithpermissionfromtheNewYorkTimes.
152.5milliontrillionbytesofdata:“BringingBigDatatotheEnterprise,”IBM,https://www-01.ibm.com/software/data/bigdata/what-is-big-data.html.
17needlecomesinanincreasinglylargerhaystack:NassimM.Taleb,“BewaretheBigErrorsof‘BigData,’”Wired,February8,2013,http://www.wired.com/2013/02/big-data-means-big-errors-people.
18neitherracistsearchesnormembershipinStormfront:Iexaminedhow
internetracismchangedinpartsofthecountrywithhighandlowexposuretotheGreatRecession.IlookedatbothGooglesearchratesfor“nigger(s)”andStormfrontmembership.Therelevantdatacanbedownloadedatsethsd.com,inthedatasectionsheadlined“RacialAnimus”and“Stormfront.”
18ButGooglesearchesreflectinganxiety:SethStephens-Davidowitz,“FiftyStatesofAnxiety,”NewYorkTimes,August7,2016,SR2.Note,whiletheGooglesearchesdogivemuchbiggersamples,thispatternisconsistentwithevidencefromsurveys.See,forexample,WilliamC.Reevesetal.,“MentalIllnessSurveillanceAmongAdultsintheUnitedStates,”MorbidityandMortalityWeeklyReportSupplement60,no.3(2011).
18searchforjokes:ThisisdiscussedinSethStephens-Davidowitz,“WhyAreYouLaughing?”NewYorkTimes,May15,2016,SR9.Therelevantdatacanbedownloadedatsethsd.com,inthedatasectionheadlined“Jokes.”
19“myhusbandwantsmetobreastfeedhim”:ThisisdiscussedinSethStephens-Davidowitz,“WhatDoPregnantWomenWant?”NewYorkTimes,May17,2014,SR6.
19pornsearchesfordepictionsofwomenbreastfeedingmen:Author’sanalysisofPornHubdata.
19Womenmakenearlyasmany:ThisisdiscussedinSethStephens-Davidowitz,“SearchingforSex,”NewYorkTimes,January25,2015,SR1.
20“poemasparamiesposaembarazada”:Stephens-Davidowitz,“WhatDoPregnantWomenWant?”
21Friedmansays:IinterviewedJerryFriedmanbyphoneonOctober27,2015.
21samplingofalltheirdata:HalR.Varian,“BigData:NewTricksforEconometrics,”JournalofEconomicPerspectives28,no.2(2014).
CHAPTER1:YOURFAULTYGUT
26Thebestdatascience,infact,issurprisinglyintuitive:IamspeakingaboutthecornerofdataanalysisIknowabout—datasciencethattriestoexplainandpredicthumanbehavior.Iamnotspeakingofartificialintelligencethat
triesto,say,driveacar.Thesemethodologies,whiletheydoutilizetoolsdiscoveredfromthehumanbrain,arelesseasytounderstand.
28whatsymptomspredictpancreaticcancer:JohnPaparrizos,RyanW.White,andEricHorvitz,“ScreeningforPancreaticAdenocarcinomaUsingSignalsfromWebSearchLogs:FeasibilityStudyandResults,”JournalofOncologyPractice(2016).
31Winterclimateswampedalltherest:ThisresearchisdiscussedinSethStephens-Davidowitz,“Dr.GoogleWillSeeYouNow,”NewYorkTimes,August11,2013,SR12.
32biggestdataseteverassembledonhumanrelationships:LarsBackstromandJonKleinberg,“RomanticPartnershipsandtheDispersionofSocialTies:ANetworkAnalysisofRelationshipStatusonFacebook,”inProceedingsofthe17thACMConferenceonComputerSupportedCooperativeWork&SocialComputing(2014).
33peopleconsistentlyrank:DanielKahneman,Thinking,FastandSlow(NewYork:Farrar,StrausandGiroux,2011).
33asthmacausesaboutseventytimesmoredeaths:Between1979and2010,onaverage,55.81Americansdiedfromtornadosand4216.53Americansdiedfromasthma.SeeAnnualU.S.KillerTornadoStatistics,NationalWeatherService,http://www.spc.noaa.gov/climo/torn/fatalmap.phpandTrendsinAsthmaMorbidityandMortality,AmericanLungAssociation,EpidemiologyandStatisticsUnit.
33PatrickEwing:MyfavoriteEwingvideosare“PatrickEwing’sTop10CareerPlays,”YouTubevideo,postedSeptember18,2015,https://www.youtube.com/watch?v=Y29gMuYymv8;and“PatrickEwingKnicksTribute,”YouTubevideo,postedMay12,2006,https://www.youtube.com/watch?v=8T2l5Emzu-I.
34“basketballasamatteroflifeordeath”:S.L.Price,“WhateverHappenedtotheWhiteAthlete?”SportsIllustrated,December8,1997.
34aninternetsurvey:ThiswasaGoogleConsumerSurveyIconductedonOctober22,2013.Iasked,“WherewouldyouguessthatthemajorityofNBAplayerswereborn?”Thetwochoiceswere“poorneighborhoods”and“middle-classneighborhoods”;59.7percentofrespondentspicked“poorneighborhoods.”
36ablackperson’sfirstnameisanindicationofhissocioeconomicbackground:RolandG.FryerJr.andStevenD.Levitt,“TheCausesandConsequencesofDistinctivelyBlackNames,”QuarterlyJournalofEconomics119,no.3(2004).
37AmongallAfrican-Americansborninthe1980s:CentersforDiseaseControlandPrevention,“Health,UnitedStates,2009,”Table9,NonmaritalChildbearing,byDetailedRaceandHispanicOriginofMother,andMaternalAge:UnitedStates,SelectedYears1970–2006.
37ChrisBosh...ChrisPaul:“NotJustaTypicalJock:MiamiHeatForwardChrisBosh’sInterestsGoWellBeyondBasketball,”PalmBeachPost.com,February15,2011,http://www.palmbeachpost.com/news/sports/basketball/not-just-a-typical-jock-miami-heat-forward-chris-b/nLp7Z/;DaveWalker,“ChrisPaul’sFamilytoCompeteon‘FamilyFeud,’nola.com,October31,2011,http://www.nola.com/tv/index.ssf/2011/10/chris_pauls_family_to_compete.html.
38fourinchestaller:“WhyAreWeGettingTallerasaSpecies?”ScientificAmerican,http://www.scientificamerican.com/article/why-are-we-getting-taller/.Interestingly,Americanshavestoppedgettingtaller.AmandaOnion,“WhyHaveAmericansStoppedGrowingTaller?”ABCNews,July3,2016,http://abcnews.go.com/Technology/story?id=98438&page=1.Ihavearguedthatoneofthereasonstherehasbeenahugeincreaseinforeign-bornNBAplayersisthatothercountriesarecatchinguptotheUnitedStatesinheight.ThenumberofAmerican-bornseven-footersintheNBAincreasedsixteenfoldfrom1946to1980asAmericansgrew.Ithassinceleveledoff,asAmericanshavestoppedgrowing.Meanwhile,thenumberofseven-footersfromothercountrieshasrisensubstantially.Thebiggestincreaseininternationalplayers,Ifound,hasbeenextremelytallmenfromcountries,suchasTurkey,Spain,andGreece,wheretherehavebeennoticeableincreasesinchildhoodhealthandadultheightinrecentyears.
38Americansfrompoorbackgrounds:CarmenR.Isasietal.,“AssociationofChildhoodEconomicHardshipwithAdultHeightandAdultAdiposityamongHispanics/Latinos:TheHCHS/SOLSocio-CulturalAncillaryStudy,”PloSOne11,no.2(2016);JaneE.MillerandSandersKorenman,“PovertyandChildren’sNutritionalStatusintheUnitedStates,”American
JournalofEpidemiology140,no.3(1994);HarryJ.Holzer,DianeWhitmoreSchanzenbach,GregJ.Duncan,andJensLudwig,“TheEconomicCostsofChildhoodPovertyintheUnitedStates,”JournalofChildrenandPoverty14,no.1(2008).
38theaverageAmericanmanis5’9”:CherylD.Fryar,QiupingGu,andCynthiaL.Ogden,“AnthropometricReferenceDataforChildrenandAdults:UnitedStates,2007–2010,”VitalandHealthStatisticsSeries11,no.252(2012).
39somethinglikeoneinfivereachtheNBA:PabloS.Torre,“LargerThanRealLife,”SportsIllustrated,July4,2011.
39middle-class,two-parentfamilies:TimKautz,JamesJ.Heckman,RonDiris,BasTerWeel,andLexBorghans,“FosteringandMeasuringSkills:ImprovingCognitiveandNon-CognitiveSkillstoPromoteLifetimeSuccess,”NationalBureauofEconomicResearchWorkingPaper20749,2014.
39Wrennjumpedthehighest:DesmondConner,“ForWrenn,Sky’stheLimit,”HartfordCourant,October21,1999.
39ButWrenn:DougWrenn’sstoryistoldinPercyAllen,“FormerWashingtonandO’DeaStarDougWrennFindsToughTimes,”SeattleTimes,March29,2009.
40“DougWrennisdead”:Ibid.40Jordancouldbeadifficultkid:MelissaIsaacson,“PortraitofaLegend,”
ESPN.com,September9,2009,http://www.espn.com/chicago/columns/story?id=4457017&columnist=isaacson_melissa.AgoodJordanbiographyisRolandLazenby,MichaelJordan:TheLife(Boston:BackBayBooks,2015).
40Hisfatherwas:BarryJacobs,“High-FlyingMichaelJordanHasNorthCarolinaCruisingTowardAnotherNCAATitle,”People,March19,1984.
40Jordan’slifeisfilledwithstoriesofhisfamilyguidinghim:Isaacson,“PortraitofaLegend.”
41speechuponinductionintotheBasketballHallofFame:MichaelJordan’sBasketballHallofFameEnshrinementSpeech,YouTubevideo,postedFebruary21,2012,https://www.youtube.com/watch?v=XLzBMGXfK4c.
ThemostinterestingaspectofJordan’sspeechisnotthatheissoeffusiveabouthisparents;itisthathestillfeelstheneedtopointoutslightsfromearlyinhiscareer.Perhapsalifelongobsessionwithslightsisnecessarytobecomethegreatestbasketballplayerofalltime.
41LeBronJameswasinterviewed:“I’mLeBronJamesfromAkron,Ohio,”YouTubevideo,postedJune20,2013,https://www.youtube.com/watch?v=XceMbPVAggk.
CHAPTER2:WASFREUDRIGHT?
47afood’sbeingshapedlikeaphallus:Icodedfoodsasbeingshapedasaphallusiftheyweresignificantlymorelongthanwideandgenerallyround.Icountedcucumbers,corn,carrots,eggplant,squash,andbananas.Thedataandcodecanbefoundatsethsd.com.
48errorscollectedbyMicrosoftresearchers:Thedatasetcanbedownloadedathttps://www.microsoft.com/en-us/download/details.aspx?id=52418.TheresearchersaskedusersofAmazonMechanicalTurktodescribeimages.Theyanalyzedthekeystrokelogsandnotedanytimesomeonecorrectedaword.MoredetailscanbefoundinYukinoBabaandHisamiSuzuki,“HowAreSpellingErrorsGeneratedandCorrected?AStudyofCorrectedandUncorrectedSpellingErrorsUsingKeystrokeLogs,”ProceedingsoftheFiftiethAnnualMeetingoftheAssociationforComputationalLinguistics,2012.Thedata,code,andafurtherdescriptionofthisresearchcanbefoundatsethsd.com.
51Considerallsearchesoftheform“Iwanttohavesexwithmy”:Thefulldata—warning:graphic—isasfollows:
“IWANTTOHAVESEXWITH...”
MONTHLY GOOGLE SEARCHESWITHTHISEXACTPHRASE
mymom 720
myson 590
mysister 590
mycousin 480
mydad 480
myboyfriend 480
mybrother 320
mydaughter 260
myfriend 170
mygirlfriend 140
52cartoonporn:Forexample,“porn”isoneofthemostcommonwordsincludedinGooglesearchesforvariousextremelypopularanimatedprograms,asseenbelow.
52babysitters:Basedonauthor’scalculations,thesearethemostpopularfemaleoccupationsinpornsearchesbymen,brokendownbytheageofmen:
CHAPTER3:DATAREIMAGINED
56algorithmsinplace:MatthewLeising,“HFTTreasuryTradingHurtsMarketWhenNewsIsReleased,”BloombergMarkets,December16,2014;NathanielPopper,“TheRobotsAreComingforWallStreet,”NewYorkTimesMagazine,February28,2016,MM56;RichardFinger,“HighFrequencyTrading:IsItaDarkForceAgainstOrdinaryHumanTradersandInvestors?”Forbes,September30,2013,http://www.forbes.com/sites/richardfinger/2013/09/30/high-frequency-trading-is-it-a-dark-force-against-ordinary-human-traders-and-investors/#50875fc751a6.
56AlanKrueger:IinterviewedAlanKruegerbyphoneonMay8,2015.57importantindicatorsofhowfasttheflu:TheinitialpaperwasJeremy
Ginsberg,MatthewH.Mohebbi,RajanS.Patel,LynnetteBrammer,MarkS.Smolinski,andLarryBrilliant,“DetectingInfluenzaEpidemicsUsingSearchEngineQueryData,”Nature457,no.7232(2009).TheflawsintheoriginalmodelwerediscussedinDavidLazer,RyanKennedy,GaryKing,andAlessandroVespignani,“TheParableofGoogleFlu:TrapsinBigDataAnalysis,”Science343,no.6176(2014).AcorrectedmodelispresentedinShihaoYang,MauricioSantillana,andS.C.Kou,“AccurateEstimationofInfluenzaEpidemicsUsingGoogleSearchDataViaARGO,”Proceedings
oftheNationalAcademyofSciences112,no.47(2015).58whichsearchesmostcloselytrackhousingprices:SethStephens-
DavidowitzandHalVarian,“AHands-onGuidetoGoogleData,”mimeo,2015.AlsoseeMarcelleChauvet,StuartGabriel,andChandlerLutz,“MortgageDefaultRisk:NewEvidencefromInternetSearchQueries,”JournalofUrbanEconomics96(2016).
60BillClinton:SergeyBrinandLarryPage,“TheAnatomyofaLarge-ScaleHypertextualWebSearchEngine,”SeventhInternationalWorld-WideWebConference,April14–18,1998,Brisbane,Australia.
61pornsites:JohnBattelle,TheSearch:HowGoogleandItsRivalsRewrotetheRulesofBusinessandTransformedOurCulture(NewYork:Penguin,2005).
61crowdsourcetheopinions:AgooddiscussionofthiscanbefoundinStevenLevy,InthePlex:HowGoogleThinks,Works,andShapesOurLives(NewYork:Simon&Schuster,2011).
64“Sellyourhouse”:ThisquotewasalsoincludedinJoeDrape,“AhmedZayat’sJourney:BankruptcyandBigBets,”NewYorkTimes,June5,2015,A1.However,thearticleincorrectlyattributesthequotetoSeder.Itwasactuallymadebyanothermemberofhisteam.
65IfirstmetupwithSeder:IinterviewedJeffSederandPattyMurrayinOcala,Florida,fromJune12,2015,throughJune14,2015.
66Roughlyone-third:ThereasonsracehorsesfailareroughestimatesbyJeffSeder,basedonhisyearsinthebusiness.
66hundredsofhorsesdie:SupplementalTablesofEquineInjuryDatabaseStatisticsforThoroughbreds,http://jockeyclub.com/pdfs/eid_7_year_tables.pdf.
66mostlyduetobrokenlegs:“PostmortemExaminationProgram,”CaliforniaAnimalHealthandFoodLaboratorySystem,2013.
67Still,morethanthree-fourthsdonotwinamajorrace:AvalynHunter,“ACaseforFullSiblings,”Bloodhorse,April18,2014,http://www.bloodhorse.com/horse-racing/articles/115014/a-case-for-full-siblings.
67EarvinJohnsonIII:MelodyChiu,“E.J.JohnsonLoses50Lbs.SinceUndergoingGastricSleeveSurgery,”People,October1,2014.
67LeBronJames,whosemomis5’5”:EliSaslow,“LostStoriesofLeBron,Part1,”ESPN.com,October17,2013,http://www.espn.com/nba/story/_/id/9825052/how-lebron-james-life-changed-fourth-grade-espn-magazine.
68TheGreenMonkey:SeeSherryRoss,“16MillionDollarBaby,”NewYorkDailyNews,March12,2006,andJayPrivman,“TheGreenMonkey,WhoSoldfor$16M,Retired,”ESPN.com,February12,2008,http://www.espn.com/sports/horse/news/story?id=3242341.Avideooftheauctionisavailableat“$16MillionHorse,”YouTubevideo,postedNovember1,2008,https://www.youtube.com/watch?v=EyggMC85Zsg.
71weaknessofGoogle’sattempttopredictinfluenza:SharadGoel,JakeM.Hofman,SébastienLahaie,DavidM.Pennock,andDuncanJ.Watts,“PredictingConsumerBehaviorwithWebSearch,”ProceedingsoftheNationalAcademyofSciences107,no.41(2010).
72StrawberryPop-Tarts:ConstanceL.Hays,“WhatWal-MartKnowsAboutCustomers’Habits,”NewYorkTimes,November14,2004.
74“Itworkedoutgreat”:IinterviewedOrleyAshenfelterbyphoneonOctober27,2016.
80studiedhundredsofheterosexualspeeddaters:DanielA.McFarland,DanJurafsky,andCraigRawlings,“MakingtheConnection:SocialBondinginCourtshipSituations,”AmericanJournalofSociology118,no.6(2013).
82LeonardCohenoncegavehisnephewthefollowingadviceforwooingwomen:JonathanGreenberg,“WhatILearnedFromMyWiseUncleLeonardCohen,”HuffingtonPost,November11,2016.
83thewordsusedinhundredsofthousandsofFacebook:H.AndrewSchwartzetal.,“Personality,Gender,andAgeintheLanguageofSocialMedia:TheOpen-VocabularyApproach,”PloSOne8,no.9(2013).Thepaperalsobreaksdownthewayspeoplespeakbasedonhowtheyscoreonpersonalitytests.Hereiswhattheyfound:
88textofthousandsofbooksandmoviescripts:AndrewJ.Reagan,LewisMitchell,DilanKiley,ChristopherM.Danforth,andPeterSheridanDodds,“TheEmotionalArcsofStoriesAreDominatedbySixBasicShapes,”EPJDataScience5,no.1(2016).
91whattypesofstoriesgetshared:JonahBergerandKatherineL.Milkman,“WhatMakesOnlineContentViral?”JournalofMarketingResearch49,no.2(2012).
95whydosomepublicationsleanleft:ThisresearchisallfleshedoutinMatthewGentzkowandJesseM.Shapiro,“WhatDrivesMediaSlant?
EvidencefromU.S.DailyNewspapers,”Econometrica78,no.1(2010).AlthoughtheyweremerelyPh.D.studentswhenthisprojectstarted,GentzkowandShapiroarenowstareconomists.Gentzkow,nowaprofessoratStanford,wonthe2014JohnBatesClarkMedal,giventothetopeconomistundertheageofforty.Shapiro,nowaprofessoratBrown,isaneditoroftheprestigiousJournalofPoliticalEconomy.Theirjointpaperonmediaslantisamongthemostcitedpapersforeach.
96RupertMurdoch:Murdoch’sownershipoftheconservativeNewYorkPostcouldbeexplainedbythefactthatNewYorkissobig,itcansupportnewspapersofmultipleviewpoints.However,itseemsprettyclearthePostconsistentlylosesmoney.See,forexample,JoePompeo,“HowMuchDoesthe‘NewYorkPost’ActuallyLose?”Politico,August30,2013,http://www.politico.com/media/story/2013/08/how-much-does-the-new-york-post-actually-lose-001176.
97Shapirotoldme:IinterviewedMattGentzkowandJesseShapiroonAugust16,2015,attheRoyalSonestaBoston.
98scannedyearbooksfromAmericanhighschools:KateRakelly,SarahSachs,BrianYin,andAlexeiA.Efros,“ACenturyofPortraits:AVisualHistoricalRecordofAmericanHighSchoolYearbooks,”paperpresentedatInternationalConferenceonComputerVision,2015.Thephotosarereprintedwithpermissionfromtheauthors.
99subjectsinphotoscopiedsubjectsinpaintings:See,forexample,ChristinaKotchemidova,“WhyWeSay‘Cheese’:ProducingtheSmileinSnapshotPhotography,”CriticalStudiesinMediaCommunication22,no.1(2005).
100measureGDPbasedonhowmuchlightthereisinthesecountriesatnight:J.VernonHenderson,AdamStoreygard,andDavidN.Weil,“MeasuringEconomicGrowthfromOuterSpace,”AmericanEconomicReview102,no.2(2012).
101estimatedGDPwasnow90percenthigher:KathleenCaulderwood,“NigerianGDPJumps89%asEconomistsAddinTelecoms,Nollywood,”IBTimes,April7,2014,http://www.ibtimes.com/nigerian-gdp-jumps-89-economists-add-telecoms-nollywood-1568219.
101Reisingersaid:IinterviewedJoeReisingerbyphoneonJune10,2015.103$50million:LeenaRao,“SpaceXandTeslaBackerJustInvested$50
MillioninThisStartup,”Fortune,September24,2015.
CHAPTER4:DIGITALTRUTHSERUM
106importantpaperin1950:HughJ.ParryandHelenM.Crossley,“ValidityofResponsestoSurveyQuestions,”PublicOpinionQuarterly14,1(1950).
106surveyaskedUniversityofMarylandgraduates:FraukeKreuter,StanleyPresser,andRogerTourangeau,“SocialDesirabilityBiasinCATI,IVR,andWebSurveys,”PublicOpinionQuarterly72(5),2008.
107failureofthepolls:ForanarticlearguingthatlyingmightbeaproblemintryingtopredictsupportforTrump,seeThomasB.Edsall,“HowManyPeopleSupportTrumpbutDon’tWanttoAdmitIt?”NewYorkTimes,May15,2016,SR2.Butforanargumentthatthiswasnotalargefactor,seeAndrewGelman,“ExplanationsforThatShocking2%Shift,”StatisticalModeling,CausalInference,andSocialScience,November9,2016,http://andrewgelman.com/2016/11/09/explanations-shocking-2-shift/.
107saysTourangeau:IinterviewedRogerTourangeaubyphoneonMay5,2015.
107somanypeoplesaytheyareaboveaverage:ThisisdiscussedinAdamGrant,Originals:HowNon-ConformistsMovetheWorld(NewYork:Viking,2016).TheoriginalsourceisDavidDunning,ChipHeath,andJerryM.Suls,“FlawedSelf-Assessment:ImplicationsforHealth,Education,andtheWorkplace,”PsychologicalScienceinthePublicInterest5(2004).
108messwithsurveys:AnyaKamenetz,“‘MischievousResponders’ConfoundResearchonTeens,”nprED,May22,2014,http://www.npr.org/sections/ed/2014/05/22/313166161/mischievous-responders-confound-research-on-teens.TheoriginalresearchthisarticlediscussesisJosephP.Robinson-Cimpian,“InaccurateEstimationofDisparitiesDuetoMischievousResponders,”EducationalResearcher43,no.4(2014).
110searchfor“porn”morethantheysearchfor“weather”:https://www.google.com/trends/explore?date=all&geo=US&q=porn,weather.
110admittheywatchpornography:AmandaHess,“HowManyWomenAreNotAdmittingtoPewThatTheyWatchPorn?”Slate,October11,2013,http://www.slate.com/blogs/xx_factor/2013/10/11/pew_online_viewing_study_percentage_of_women_who_watch_online_porn_is_growing.html.
110“cock,”“fuck,”and“porn”:NicholasDiakopoulus,“Sex,Violence,andAutocompleteAlgorithms,”Slate,August2,2013,http://www.slate.com/articles/technology/future_tense/2013/08/words_banned_from_bing_and_google_s_autocomplete_algorithms.html.
1113.6timesmorelikelytotellGoogletheyregret:Iestimate,includingvariousphrasings,thereareabout1,730AmericanGooglesearcheseverymonthexplicitlysayingtheyregrethavingchildren.Thereareonlyabout50expressingaregretnothavingchildren.Thereareabout15.9millionAmericansovertheageofforty-fivewhohavenochildren.Thereareabout152millionAmericanswhohavechildren.Thismeans,amongtheeligiblepopulation,peoplewithchildrenareabout3.6timesaslikelytoexpressaregretonGooglethanpeoplewithoutchildren.Obviously,asmentionedinthetextbutworthemphasizingagain,theseconfessionalstoGoogleareonlymadebyasmall,selectnumberofpeople—presumablythosefeelingastrongenoughregretthattheymomentarilyforgetthatGooglecannothelpthemhere.
113highestsupportforgaymarriage:TheseestimatesarefromNateSilver,“HowOpiniononSame-SexMarriageIsChanging,andWhatItMeans,”FiveThirtyEight,March26,2013,http://fivethirtyeight.blogs.nytimes.com/2013/03/26/how-opinion-on-same-sex-marriage-is-changing-and-what-it-means/?_r=0.
113About2.5percentofmaleFacebookuserswholistagenderofinterestsaytheyareinterestedinmen:Author’sanalysisofFacebookadsdata.IdonotincludeFacebookuserswholist“menandwomen.”Myanalysissuggestsanon-trivialpercentofuserswhosaytheyareinterestedinmenandwomeninterpretthequestionasinterestinfriendshipratherthanromanticinterest.
115about5percentofmalepornsearchesareforgay-maleporn:Asdiscussed,GoogleTrendsdoesnotbreakdownsearchesbygender.GoogleAdWordsbreaksdownpageviewsforvariouscategoriesbygender.However,thisdataisfarlessprecise.Toestimatethesearchesbygender,Ifirstusethesearchdatatogetastatewideestimateofthepercentofgaypornsearchesbystate.IthennormalizethisdatabytheGoogleAdWordsgenderdata.
Anotherwaytogetgender-specificdataisusingPornHubdata.However,PornHubcouldbeahighlyselectedsample,sincemanygaypeoplemightinsteadusesitesfocusedonlyongayporn.PornHubsuggeststhatgaypornuseamongmenislowerthanGooglesearcheswouldsuggest.However,itconfirmsthatthereisnotastrongrelationshipbetweentolerancetowardhomosexualityandmalegaypornuse.Allthisdataandfurthernotesareavailableonmywebsite,atsethsd.com,inthesection“Sex.”
1164percentofthemareopenlygayonFacebook:Author’scalculationofFacebookadsdata:OnFebruary8,2017,roughly300malehighschoolstudentsinSanFrancisco-Oakland-SanJosemediamarketonFacebooksaidtheywereinterestedinmen.Roughly,7,800saidtheywereinterestedinwomen.
119“InIranwedon’thavehomosexuals”:“‘WeDon’tHaveAnyGaysinIran,’IranianPresidentTellsIvyLeagueAudience,”DailyMail.com,September25,2007,http://www.dailymail.co.uk/news/article-483746/We-dont-gays-Iran-Iranian-president-tells-Ivy-League-audience.html.
119“Wedonothavetheminourcity”:BrettLogiurato,“SochiMayorClaimsThereAreNoGayPeopleintheCity,”SportsIllustrated,January27,2014.
119internetbehaviorrevealssignificantinterestingayporninSochiandIran:AccordingtoGoogleAdWords,therearetensofthousandsofsearcheseveryyearfor“гейпорно”(gayporn).ThepercentofpornsearchesforgaypornisroughlysimilarinSochiasintheUnitedStates.GoogleAdWordsdoesnotincludedataforIran.PornHubalsodoesnotreportdataforIran.However,PornMDstudiedtheirsearchdataandreportedthatfiveofthetoptensearchtermsinIranwereforgayporn.Thisincluded“daddylove”and“hotelbusinessman”andisreportedinJosephPatrickMcCormick,“SurveyRevealsSearchesforGayPornAreTopinCountriesBanningHomosexuality,”PinkNews,http://www.pinknews.co.uk/2013/03/13/survey-reveals-searches-for-gay-porn-are-top-in-countries-banning-homosexuality/.AccordingtoGoogleTrends,about2percentofpornsearchesinIranareforgayporn,whichislowerthanintheUnitedStatesbutstillsuggestswidespreadinterest.
122Whenitcomestosex:Stephens-Davidowitz,“SearchingforSex.”Dataforthissectioncanbefoundonmywebsite,sethsd.com,inthesection“Sex.”
12211percentofwomen:CurrentContraceptiveStatusAmongWomenAged15–44:UnitedStates,2011–2013,CentersforDiseaseControlandPrevention,http://www.cdc.gov/nchs/data/databriefs/db173_table.pdf#1.
12210percentofthemtobecomepregnanteverymonth:DavidSpiegelhalter,“Sex:WhatAretheChances?”BBCNews,March15,2012,http://www.bbc.com/future/story/20120313-sex-in-the-city-or-elsewhere.
1221in113womenofchildbearingage:Thereareroughly6.6millionpregnancieseveryyearandthereare62millionwomenbetweenages15and44.
128performingoralsexontheoppositegender:Asmentioned,IdonotknowthegenderofaGooglesearcher.Iamassumingthattheoverwhelmingmajorityofsearcheslookinghowtoperformcunnilingusarebymenandthattheoverwhelmingmajorityofsearcheslookinghowtoperformfellatioarebywomen.Thisisbothbecausethelargemajorityofpeoplearestraightandbecausetheremightbelessofaneedtolearnhowtopleaseasame-sexpartner.
128topfivenegativewords:Author’sanalysisofGoogleAdWordsdata.130killthem:EvanSoltasandSethStephens-Davidowitz,“TheRiseofHate
Search,”NewYorkTimes,December13,2015,SR1.Dataandmoredetailscanbefoundonmywebsite,sethsd.com,inthesection“Islamophobia.”
132seventeentimesmorecommon:Author’sanalysisofGoogleTrendsdata.132MartinLutherKingJr.Day:Author’sanalysisofGoogleTrendsdata.133correlateswiththeblack-whitewagegap:AshwinRodeandAnandJ.
Shukla,“PrejudicialAttitudesandLaborMarketOutcomes,”mimeo,2013.134Theirparents:SethStephens-Davidowitz,“Google,TellMe.IsMySona
Genius?”NewYorkTimes,January19,2014,SR6.ThedataforexactsearchescanbefoundusingGoogleAdWords.EstimatescanalsobefoundwithGoogleTrends,bycomparingsearcheswiththewords“gifted”and“son”versus“gifted”and“daughter.”Compare,forexample,https://www.google.com/trends/explore?date=all&geo=US&q=gifted%20son,gifted%20daughterandhttps://www.google.com/trends/explore?date=all&geo=US&q=overweight%20son,overweight%20daughter.Oneexceptiontothegeneralpatternthattherearemorequestionsaboutsons’
brainsanddaughters’bodiesistherearemoresearchesfor“fatson”than“fatdaughter.”Thisseemstoberelatedtothepopularityofincestporndiscussedearlier.Roughly20percentofsearcheswiththewords“fat”and“son”alsoincludetheword“porn.”
135girlsare9percentmorelikelythanboystobeingiftedprograms:“GenderEquityinEducation:ADataSnapshot,”OfficeforCivilRights,U.S.DepartmentofEducation,June2012,http://www2.ed.gov/about/offices/list/ocr/docs/gender-equity-in-education.pdf.
136About28percentofgirlsareoverweight,while35percentofboysare:DataResourceCenterforChildandAdolescentHealth,http://www.childhealthdata.org/browse/survey/results?q=2415&g=455&a=3879&r=1.
137Stormfrontprofiles:Stephens-Davidowitz,“TheDataofHate.”Therelevantdatacanbedownloadedatsethsd.com,inthedatasectionheadlined“Stormfront.”
139StormfrontduringDonaldTrump’scandidacy:GooglesearchinterestinStormfrontwassimilarinOctober2016tothelevelsitwasduringOctober2015.ThisisinstarkcontrasttothesituationduringObama’sfirstelection.InOctober2008,searchinterestinStormfronthadrisenalmost60percentcomparedtothepreviousOctober.OnthedayafterObamawaselected,GooglesearchesforStormfronthadrisenroughlytenfold.OnthedayafterTrumpwaselected,Stormfrontsearchesroseabouttwo-point-five-fold.ThiswasroughlyequivalenttotherisethedayafterGeorgeW.Bushwaselectedin2004andmaylargelyreflectnewsinterestamongpoliticaljunkies.
141politicalsegregationontheinternet:MatthewGentzkowandJesseM.Shapiro,“IdeologicalSegregationOnlineandOffline,”QuarterlyJournalofEconomics126,no.4(2011).
144friendsonFacebook:EytanBakshy,SolomonMessing,andLadaA.Adamic,“ExposuretoIdeologicallyDiverseNewsandOpiniononFacebook,”Science348,no.6239(2015).Theyfoundthat,amongthe9percentofactiveFacebookuserswhodeclaretheirideology,about23percentoftheirfriendswhoalsodeclareanideologyhavetheopposite
ideologyand28.5percentofthenewstheyseeonFacebookisfromtheoppositeideology.ThesenumbersarenotdirectlycomparablewithothernumbersonsegregationbecausetheyonlyincludethesmallsampleofFacebookuserswhodeclaretheirideology.Presumably,theseusersaremuchmorelikelytobepoliticallyactiveandassociatewithotherpoliticallyactiveuserswiththesameideology.Ifthisiscorrect,thediversityamongalluserswillbemuchgreater.
144onecrucialreasonthatFacebook:Anotherfactorthatmakessocialmediasurprisinglydiverseisthatitgivesabigbonustoextremelypopularandwidelysharedarticles,nomattertheirpoliticalslant.SeeSolomonMessingandSeanWestwood,“SelectiveExposureintheAgeofSocialMedia:EndorsementsTrumpPartisanSourceAffiliationWhenSelectingNewsOnline,”2014.
144morefriendsonFacebookthantheydooffline:SeeBenQuinn,“SocialNetworkUsersHaveTwiceasManyFriendsOnlineasinRealLife,”Guardian,May8,2011.Thisarticlediscussesa2011studybytheCysticFibrosisTrust,whichfoundthattheaveragesocialnetworkuserhas121onlinefriendscomparedwith55physicalfriends.Accordingtoa2014PewResearchstudy,theaverageFacebookuserhadmorethan300friends.SeeAaronSmith,“6NewFactsAboutFacebook,”February3,2014,http://www.pewresearch.org/fact-tank/2014/02/03/6-new-facts-about-facebook/.
144weakties:EytanBakshy,ItamarRosenn,CameronMarlow,andLadaAdamic,“TheRoleofSocialNetworksinInformationDiffusion,”Proceedingsofthe21stInternationalConferenceonWorldWideWeb,2012.
145“doom-and-gloompredictionshaven’tcometrue”:“Study:ChildAbuseonDeclineinU.S.,”AssociatedPress,December12,2011.
146didchildabusereallydrop:SeeSethStephens-Davidowitz,“HowGooglingUnmasksChildAbuse,”NewYorkTimes,July14,2013,SR5,andSethStephens-Davidowitz,“UnreportedVictimsofanEconomicDownturn,”mimeo,2013.
146facinglongwaittimesandgivingup:“StoppingChildAbuse:ItBeginsWithYou,”TheArizonaRepublic,March26,2016.
147off-the-bookswaystoterminateapregnancy:SethStephens-Davidowitz,“TheReturnoftheD.I.Y.Abortion,”NewYorkTimes,March6,2016,SR2.Dataandmoredetailscanbefoundonmywebsite,sethsd.com,inthesection“Self-InducedAbortion.”
150similaraveragecirculations:AllianceforAuditedMedia,ConsumerMagazines,http://abcas3.auditedmedia.com/ecirc/magtitlesearch.asp.
151OnFacebook:Author’scalculations,onOctober4,2016,usingFacebook’sAdsManager.
151toptenmostvisitedwebsites:“ListofMostPopularWebsites,”Wikipedia.AccordingtoAlexa,whichtracksbrowsingbehavior,asofSeptember4,2016,themostpopularpornsitewasXVideos,andthiswasthe57th-most-popularwebsite.AccordingtoSimilarWeb,asofSeptember4,2016,themostpopularpornsitewasXVideos,andthiswasthe17th-most-popularwebsite.Thetopten,accordingtoAlexa,areGoogle,YouTube,Facebook,Baidu,Yahoo!,Amazon,Wikipedia,TencentQQ,GoogleIndia,andTwitter.
153IntheearlymorningofSeptember5,2006:ThisstoryisfromDavidKirkpatrick,TheFacebookEffect:TheInsideStoryoftheCompanyThatIsConnectingtheWorld(NewYork:Simon&Schuster,2010).
155greatbusinessesarebuiltonsecrets:PeterThielandBlakeMasters,ZerotoOne:NotesonStartups,orHowtoBuildtheFuture(NewYork:TheCrownPublishingGroup,2014).
157saysXavierAmatriain:IinterviewedXavierAmatriainbyphoneonMay5,2015.
159topquestionsAmericanshadduringObama’s2014:Author’sanalysisofGoogleTrendsdata.
162thistimeatamosque:“ThePresidentSpeaksattheIslamicSocietyofBaltimore,”YouTubevideo,postedFebruary3,2016,https://www.youtube.com/watch?v=LRRVdVqAjdw.
163hateful,ragefulsearchesagainstMuslimsdroppedinthehoursafterthepresident’saddress:Author’sanalysisofGoogleTrendsdata.Searchesfor“killMuslims”werelowerthanthecomparableperiodaweekbefore.Inaddition,searchesthatincluded“Muslims”andoneofthetopfivenegativewordsaboutthisgroupwerelower.
CHAPTER5:ZOOMINGIN
166howchildhoodexperiencesinfluencewhichbaseballteamyousupport:SethStephens-Davidowitz,“TheyHookYouWhenYou’reYoung,”NewYorkTimes,April20,2014,SR5.Dataandcodeforthisstudycanbefoundonmywebsite,sethsd.com,inthesection“Baseball.”
170thesinglemostimportantyear:YairGhitzaandAndrewGelman,“TheGreatSociety,Reagan’sRevolution,andGenerationsofPresidentialVoting,”unpublishedmanuscript.
173Chettyexplains:IinterviewedRajChettybyphoneonJuly30,2015.176escapethegrimreaper:RajChettyetal.,“TheAssociationBetween
IncomeandLifeExpectancyintheUnitedStates,2001–2014,”JAMA315,no.16(2016).
178Contagiousbehaviormaybedrivingsomeofthis:JuliaBelluz,“IncomeInequalityIsChippingAwayatAmericans’LifeExpectancy,”vox.com,April11,2016.
178whysomepeoplecheatontheirtaxes:RajChetty,JohnFriedman,andEmmanuelSaez,“UsingDifferencesinKnowledgeAcrossNeighborhoodstoUncovertheImpactsoftheEITConEarnings,”AmericanEconomicReview103,no.7(2013).
180IdecidedtodownloadWikipedia:ThisisfromSethStephens-Davidowitz,“TheGeographyofFame,”NewYorkTimes,March23,2014,SR6.Datacanbefoundonmywebsite,sethsd.com,inthesection“WikipediaBirthRate,byCounty.”ForhelpdownloadingandcodingcountyofbirthofeveryWikipediaentrant,IthankNoahStephens-Davidowitz.
183abigcity:Formoreevidenceonthevalueofcities,seeEdGlaeser,TriumphoftheCity(NewYork:Penguin,2011).(Glaeserwasmyadvisoringraduateschool.)
191manyexamplesofreallifeimitatingart:DavidLevinson,ed.,EncyclopediaofCrimeandPunishment(ThousandOaks,CA:SAGE,2002).
191subjectsexposedtoaviolentfilmwillreportmoreangerandhostility:CraigAndersonetal.,“TheInfluenceofMediaViolenceonYouth,”PsychologicalScienceinthePublicInterest4(2003).
192Onweekendswithapopularviolentmovie:GordonDahlandStefano
DellaVigna,“DoesMovieViolenceIncreaseViolentCrime?”QuarterlyJournalofEconomics124,no.2(2009).
195Googlesearchescanalsobebrokendownbytheminute:SethStephens-Davidowitz,“DaysofOurDigitalLives,”NewYorkTimes,July5,2015,SR4.
196alcoholisamajorcontributortocrime:AnnaRichardsonandTraceyBudd,“YoungAdults,Alcohol,CrimeandDisorder,”CriminalBehaviourandMentalHealth13,no.1(2003);RichardA.Scribner,DavidP.MacKinnon,andJamesH.Dwyer,“TheRiskofAssaultiveViolenceandAlcoholAvailabilityinLosAngelesCounty,”AmericanJournalofPublicHealth85,no.3(1995);DennisM.Gorman,PaulW.Speer,PaulJ.Gruenewald,andErichW.Labouvie,“SpatialDynamicsofAlcoholAvailability,NeighborhoodStructureandViolentCrime,”JournalofStudiesonAlcohol62,no.5(2001);TonyH.Grubesic,WilliamAlexPridemore,DominiqueA.Williams,andLoniPhilip-Tabb,“AlcoholOutletDensityandViolence:TheRoleofRiskyRetailersandAlcohol-RelatedExpenditures,”AlcoholandAlcoholism48,no.5(2013).
196lettingallfourofhissonsplayfootball:“EdMcCaffreyKnewChristianMcCaffreyWouldBeGoodfromtheStart—’TheHerd,’”YouTubevideo,postedDecember3,2015,https://www.youtube.com/watch?v=boHMmp7DpX0.
197analyzingpilesofdata:Researchershavefoundmorefromutilizingthiscrimedatabrokendownintosmalltimeincrements.Oneexample?Domesticviolencecomplaintsriseimmediatelyafteracity’sfootballteamlosesagameitwasexpectedtowin.SeeDavidCardandGordonB.Dahl,“FamilyViolenceandFootball:TheEffectofUnexpectedEmotionalCuesonViolentBehavior,”QuarterlyJournalofEconomics126,no.1(2011).
197Here’showBillSimmons:BillSimmons,“It’sHardtoSayGoodbyetoDavidOrtiz,”ESPN.com,June2,2009,http://www.espn.com/espnmag/story?id=4223584.
198howcanwepredicthowabaseballplayerwillperforminthefuture:ThisisdiscussedinNateSilver,TheSignalandtheNoise:WhySoManyPredictionsFail—ButSomeDon’t(NewYork:Penguin,2012).
199“beefysluggers”indeeddo,onaverage,peakearly:RyanCampbell,“How
WillPrinceFielderAge?”October28,2011,http://www.fangraphs.com/blogs/how-will-prince-fielder-age/.
199Ortiz’sdoppelgangers’:ThisdatawaskindlyprovidedtomebyRobMcQuownofBaseballProspectus.
204Kohaneasks:IinterviewedIsaacKohanebyphoneonJune15,2015.205JamesHeywoodisanentrepreneur:IinterviewedJamesHeywoodby
phoneonAugust17,2015.
CHAPTER6:ALLTHEWORLD’SALAB
207February27,2000:Thisstoryisdiscussed,amongotherplaces,inBrianChristian,“TheA/BTest:InsidetheTechnologyThat’sChangingtheRulesofBusiness,”Wired,April25,2012,http://www.wired.com/2012/04/ff_abtesting/.
209Whenteacherswerepaid,teacherabsenteeismdropped:EstherDuflo,RemaHanna,andStephenP.Ryan,“IncentivesWork:GettingTeacherstoCometoSchool,”AmericanEconomicReview102,no.4(2012).
209whenBillGateslearnedofDuflo’swork:IanParker,“ThePovertyLab,”NewYorker,May17,2010.
211GoogleengineersranseventhousandA/Btests:Christian,“TheA/BTest.”211forty-onemarginallydifferentshadesofblue:DouglasBowman,“Goodbye,
Google,”stopdesign,March20,2009,http://stopdesign.com/archive/2009/03/20/goodbye-google.html.
211Facebooknowruns:EytanBakshy,“BigExperiments:BigData’sFriendforMakingDecisions,”April3,2014,https://www.facebook.com/notes/facebook-data-science/big-experiments-big-datas-friend-for-making-decisions/10152160441298859/.Sourcesforinformationonpharmaceuticalstudiescanbefoundat“Howmanyclinicaltrialsarestartedeachyear?”Quorapost,https://www.quora.com/How-many-clinical-trials-are-started-each-year.
211Optimizely:IinterviewedDanSirokerbyphoneonApril29,2015.214nettingthecampaignroughly$60million:DanSiroker,“HowObama
Raised$60MillionbyRunningaSimpleExperiment,”Optimizelyblog,
November29,2010,https://blog.optimizely.com/2010/11/29/how-obama-raised-60-million-by-running-a-simple-experiment/.
214TheBostonGlobeA/B-testsheadlines:TheBostonGlobeA/Btestsandresultswereprovidedtotheauthor.SomedetailsabouttheGlobe’stestingcanbefoundat“TheBostonGlobe:DiscoveringandOptimizingaValuePropositionforContent,”MarketingSherpaVideoArchive,https://www.marketingsherpa.com/video/boston-globe-optimization-summit2.ThisincludesarecordedconversationbetweenPeterDoucetteoftheGlobeandPamelaMarkeyatMECLABS.
217Bensonsays:IinterviewedClarkBensonbyphoneonJuly23,2015.217addedarightward-pointingarrowsurroundedbyasquare:“Enhancing
TextAdsontheGoogleDisplayNetwork,”InsideAdSense,December3,2012,https://adsense.googleblog.com/2012/12/enhancing-text-ads-on-google-display.html.
218Googlecustomerswerecritical:See,forexample,“Largearrowsappearingingoogleads—pleaseremove,”DoubleClickPublisherHelpForum,https://productforums.google.com/forum/#!topic/dfp/p_TRMqWUF9s.
219theriseofbehavioraladdictionsincontemporarysociety:AdamAlter,Irresistible:TheRiseofAddictiveTechnologyandtheBusinessofKeepingUsHooked(NewYork:Penguin,2017).
219TopaddictionsreportedtoGoogle:Author’sanalysisofGoogleTrendsdata.
222saysLevittinalecture:ThisisdiscussedinavideocurrentlyfeaturedontheFreakonomicspageoftheHarryWalkerSpeakersBureau,http://www.harrywalker.com/speakers/authors-of-freakonomics/.
225beerandsoftdrinkadsrunduringtheSuperBowl:WesleyR.HartmannandDanielKlapper,“SuperBowlAds,”unpublishedmanuscript,2014.
226apimplykidinhisunderwear:Forthestrongcasethatwelikelyarelivinginacomputersimulation,seeNickBostrom,“AreWeLivinginaComputerSimulation?”PhilosophicalQuarterly53,no.211(2003).
227Offorty-threeAmericanpresidents:LosAngelesTimesstaff,“U.S.PresidentialAssassinationsandAttempts,”LosAngelesTimes,January22,2012,http://timelines.latimes.com/us-presidential-assassinations-and-attempts/.
227CompareJohnF.KennedyandRonaldReagan:BenjaminF.JonesandBenjaminA.Olken,“DoAssassinsReallyChangeHistory?”NewYorkTimes,April12,2015,SR12.
227Kadyrovdied:Adisturbingvideooftheattackcanbeseenat“Paradesurprise(Chechnya2004),”YouTubevideo,postedMarch31,2009,https://www.youtube.com/watch?v=fHWhs5QkfuY.
227Hitlerhadchangedhisschedule:ThisstoryisalsodiscussedinJonesandOlken,“DoAssassinsReallyChangeHistory?”
228theeffectofhavingyourleadermurdered:BenjaminF.JonesandBenjaminA.Olken,“HitorMiss?TheEffectofAssassinationsonInstitutionsandWar,”AmericanEconomicJournal:Macroeconomics1,no.2(2009).
229winningthelotterydoesnot:ThispointismadeinJohnTierney,“HowtoWintheLottery(Happily),”NewYorkTimes,May27,2014,D5.Tierney’spiecediscussesthefollowingstudies:BénédicteApoueyandAndrewE.Clark,“WinningBigbutFeelingNoBetter?TheEffectofLotteryPrizesonPhysicalandMentalHealth,”HealthEconomics24,no.5(2015);JonathanGardnerandAndrewJ.Oswald,“MoneyandMentalWellbeing:ALongitudinalStudyofMedium-SizedLotteryWins,”JournalofHealthEconomics26,no.1(2007);andAnnaHedenus,“AttheEndoftheRainbow:Post-WinningLifeAmongSwedishLotteryWinners,”unpublishedmanuscript,2011.Tierney’spiecealsopointsoutthatthefamous1978study—PhilipBrickman,DanCoates,andRonnieJanoff-Bulman,“LotteryWinnersandAccidentVictims:IsHappinessRelative?”JournalofPersonalityandSocialPsychology36,no.8(1978)—whichfoundthatwinningthelotterydoesnotmakeyouhappywasbasedonatinysample.
229yourneighborwinningthelottery:SeePeterKuhn,PeterKooreman,AdriaanSoetevent,andArieKapteyn,“TheEffectsofLotteryPrizesonWinnersandTheirNeighbors:EvidencefromtheDutchPostcodeLottery,”AmericanEconomicReview101,no.5(2011),andSumitAgarwal,VyacheslavMikhed,andBarryScholnick,“DoesInequalityCauseFinancialDistress?EvidencefromLotteryWinnersandNeighboringBankruptcies,”workingpaper,2016.
229neighborsoflotterywinners:Agarwal,Mikhed,andScholnick,“Does
InequalityCauseFinancialDistress?”230doctorscanbemotivatedbymonetaryincentives:JeffreyClemensand
JoshuaD.Gottlieb,“DoPhysicians’FinancialIncentivesAffectMedicalTreatmentandPatientHealth?”AmericanEconomicReview104,no.4(2014).Notethattheseresultsdonotmeanthatdoctorsareevil.Infact,theresultsmightbemoretroublingiftheextraproceduresdoctorsorderedwhentheywerepaidmoretoorderthemactuallysavedlives.Ifthiswerethecase,itwouldmeanthatdoctorsneededtobepaidenoughtoorderlifesavingtreatments.ClemensandGottlieb’sresultssuggest,instead,thatdoctorswillorderlifesavingtreatmentsnomatterhowmuchmoneytheyaregiventoorderthem.Forproceduresthatdon’thelpallthatmuch,doctorsmustbepaidenoughtoorderthem.Anotherwaytosaythis:doctorsdon’tpaytoomuchattentiontomonetaryincentivesforlife-threateningstuff;theypayatonofattentiontomonetaryincentivesforunimportantstuff.
231$150million:RobertD.McFaddenandEbenShapiro,“Finally,aFacetoFitStuyvesant:AHighSchoolofHighAchieversGetsaHigh-PricedHome,”NewYorkTimes,September8,1992.
231Itoffers:CourseofferingsareavailableonStuy’swebsite,http://stuy.enschool.org/index.jsp.
231one-quarterofitsgraduatesareaccepted:AnnaBahr,“WhentheCollegeAdmissionsBattleStartsatAge3,”NewYorkTimes,July29,2014,http://www.nytimes.com/2014/07/30/upshot/when-the-college-admissions-battle-starts-at-age-3.html.
231Stuyvesanttrained:SewellChan,“TheObamaTeam’sNewYorkTies,”NewYorkTimes,November25,2008;EvanT.R.Rosenman,“Classof1984:LisaRandall,”HarvardCrimson,June2,2009;“GaryShteyngartonStuyvesantHighSchool:MyNewYork,”YouTubevideo,postedAugust4,2010,https://www.youtube.com/watch?v=NQ_phGkC-Tk;CandaceAmos,“30StarsWhoAttendedNYCPublicSchools,”NewYorkDailyNews,May29,2015.
231Itscommencementspeakershaveincluded:CarlCampanile,“KidsStuyHighOverBubba:He’llAddressGroundZeroSchool’sGraduation,”NewYorkPost,March22,2002;UnitedNationsPressRelease,“Stuyvesant
HighSchool’s‘MulticulturalTapestry’EloquentResponsetoHatred,SaysSecretary-GeneralinGraduationAddress,”June23,2004;“ConanO’Brien’sSpeechatStuyvesant’sClassof2006GraduationinLincolnCenter,”YouTubevideo,postedMay6,2012,https://www.youtube.com/watch?v=zAMkUE9Oxnc.
231Stuyrankednumberone:Seehttps://k12.niche.com/rankings/public-high-schools/best-overall/.
232Fewerthan5percent:PamelaWheaton,“8th-GradersGetHighSchoolAdmissionsResults,”Insideschools,March4,2016,http://insideschools.org/blog/item/1001064-8th-graders-get-high-school-admissions-results.
235prisonersassignedtoharsherconditions:M.KeithChenandJesseM.Shapiro,“DoHarsherPrisonConditionsReduceRecidivism?ADiscontinuity-BasedApproach,”AmericanLawandEconomicsReview9,no.1(2007).
236TheeffectsofStuyvesantHighSchool?AtilaAbdulkadiroğlu,JoshuaAngrist,andParagPathak,“TheEliteIllusion:AchievementEffectsatBostonandNewYorkExamSchools,”Econometrica82,no.1(2014).ThesamenullresultwasindependentlyfoundbyWillDobbieandRolandG.FryerJr.,“TheImpactofAttendingaSchoolwithHigh-AchievingPeers:EvidencefromtheNewYorkCityExamSchools,”AmericanEconomicJournal:AppliedEconomics6,no.3(2014).
238averagegraduateofHarvardmakes:Seehttp://www.payscale.com/college-salary-report/bachelors.
238similarstudentsacceptedtosimilarlyprestigiousschoolswhochoosetoattenddifferentschoolsendupinaboutthesameplace:StacyBergDaleandAlanB.Krueger,“EstimatingthePayofftoAttendingaMoreSelectiveCollege:AnApplicationofSelectiononObservablesandUnobservables,”QuarterlyJournalofEconomics117,no.4(2002).
239WarrenBuffett:AliceSchroeder,TheSnowball:WarrenBuffettandtheBusinessofLife(NewYork:Bantam,2008).
CHAPTER7:BIGDATA,BIGSCHMATA?WHATITCANNOTDO
247claimedtheycouldpredictwhichway:JohanBollen,HuinaMao,andXiaojunZeng,“TwitterMoodPredictstheStockMarket,”JournalofComputationalScience2,no.1(2011).
248Thetweet-basedhedgefundwasshutdown:JamesMackintosh,“HedgeFundThatTradedBasedonSocialMediaSignalsDidn’tWorkOut,”FinancialTimes,May25,2012.
250couldnotreproducethecorrelation:ChristopherF.Chabrisetal.,“MostReportedGeneticAssociationswithGeneralIntelligenceAreProbablyFalsePositives,”PsychologicalScience(2012).
252ZoëChance:ThisstoryisdiscussedinTEDxTalks,“HowtoMakeaBehaviorAddictive:ZoëChanceatTEDxMillRiver,”YouTubevideo,postedMay14,2013,https://www.youtube.com/watch?v=AHfiKav9fcQ.Somedetailsofthestory,suchasthecolorofthepedometer,werefleshedoutininterviews.IinterviewedChancebyphoneonApril20,2015,andbyemailonJuly11,2016,andSeptember8,2016.
253Numberscanbeseductive:ThissectionisfromAlexPeysakhovichandSethStephens-Davidowitz,“HowNottoDrowninNumbers,”NewYorkTimes,May3,2015,SR6.
254cheatedoutrightinadministeringthosetests:BrianA.JacobandStevenD.Levitt,“RottenApples:AnInvestigationofthePrevalenceandPredictorsofTeacherCheating,”QuarterlyJournalofEconomics118,no.3(2003).
255saysThomasKane:IinterviewedThomasKanebyphoneonApril22,2015.
256“Eachmeasureaddssomethingofvalue”:BillandMelindaGatesFoundation,“EnsuringFairandReliableMeasuresofEffectiveTeaching,”http://k12education.gatesfoundation.org/wp-content/uploads/2015/05/MET_Ensuring_Fair_and_Reliable_Measures_Practitioner_Brief.pdf.
CHAPTER8:MODATA,MOPROBLEMS?WHATWESHOULDN’TDO
257Recently,threeeconomists:OdedNetzer,AlainLemaire,andMichalHerzenstein,“WhenWordsSweat:IdentifyingSignalsforLoanDefaultin
theTextofLoanApplications,”2016.258about13percentofborrowers:PeterRenton,“AnotherAnalysisofDefault
RatesatLendingClubandProsper,”October25,2012,http://www.lendacademy.com/lending-club-prosper-default-rates/.
261Facebooklikesarefrequentlycorrelated:MichalKosinski,DavidStillwell,andThoreGraepel,“PrivateTraitsandAttributesArePredictablefromDigitalRecordsofHumanBehavior,”PNAS110,no.15(2013).
265businessesareatthemercyofYelpreviews:MichaelLuca,“Reviews,Reputation,andRevenue:TheCaseofYelp,”unpublishedmanuscript,2011.
266Googlesearchesrelatedtosuicide:ChristineMa-Kellams,FloraOr,JiHyunBaek,andIchiroKawachi,“RethinkingSuicideSurveillance:GoogleSearchDataandSelf-ReportedSuicidalityDifferentiallyEstimateCompletedSuicideRisk,”ClinicalPsychologicalScience4,no.3(2016).
2673.5millionGooglesearches:Thisusesamethodologydiscussedonmywebsiteinthenotesonself-inducedabortion.IcomparesearchesintheGooglecategory“suicide”tosearchesfor“howtotieatie.”Therewere6.6millionGooglesearchesfor“howtotieatie”in2015.Therewere6.5timesmoresearchesinthecategorysuicide.6.5*6.6/12»3.5.
26812murdersofMuslimsreportedashatecrimes:BridgeInitiativeTeam,“WhenIslamophobiaTurnsViolent:The2016U.S.PresidentialElection,”May2,2016,availableathttp://bridge.georgetown.edu/when-islamophobia-turns-violent-the-2016-u-s-presidential-elections/.
CONCLUSION
272WhatmotivatedPopper’scrusade?:KarlPopper,ConjecturesandRefutations(London:Routledge&KeganPaul,1963).
275mappedeverycholeracaseinthecity:SimonRogers,“JohnSnow’sDataJournalism:TheCholeraMapThatChangedtheWorld,”Guardian,March15,2013.
276BenjaminF.Jones:IinterviewedBenjaminJonesbyphoneonJune1,2015.ThisworkisalsodiscussedinAaronChatterjiandBenjaminJones,
“HarnessingTechnologytoImproveK–12Education,”HamiltonProjectDiscussionPaper,2012.
283peopletendnottofinishtreatisesbyeconomists:JordanEllenberg,“TheSummer’sMostUnreadBookIs...,”WallStreetJournal,July3,2014.
INDEX
Thepaginationofthiselectroniceditiondoesnotmatchtheeditionfromwhichitwascreated.Tolocateaspecificentry,pleaseuseyoure-bookreader’ssearchtools.
A/BtestingABCsof,209–21andaddictions,219–20andBostonGlobeheadlines,214–17indigitalworld,210–19downsideto,219–21andeducation/learning,276andFacebook,211futureusesof,276,277,278andgamingindustry,220–21andGoogleadvertising,217–19importanceof,214,217andJawbone,277andpolitics,211–14andtelevision,222
Abdulkadiroglu,Atila,235–36abortion,truthabout,147–50Adamic,Lada,144Adams,John,78addictionsandA/Btesting,219–20Seealsospecificaddiction
advertising
andA/Btesting,217–19causaleffectsof,221–25,273andexamplesofBigDatasearches,22Google,217–19andLevitt-electronicscompany,222,225,226andmovies,224–25andscience,273andSuperBowlgames,221–26TV,221–26
AfricanAmericansandHarvardCrimsoneditorialaboutZuckerberg,155incomeand,175andoriginsofnotableAmericans,182–83andtruthabouthateandprejudice,129,134Seealso“nigger”;race/racism
ageandbaseballfans,165–69,165–66nandlying,108nandoriginsofpoliticalpreferences,169–71andpredictingfutureofbaseballplayers,198–99ofStormfrontmembers,137–38andwordsasdata,85–86Seealsochildren;teenagers
Aiden,Erez,76–77,78–79alcoholasaddiction,219andhealth,207–8
AltaVista(searchengine),60Alter,Adam,219–20Amatriain,Xavier,157Amazon,20,203,283AmericanPharoah(HorseNo.85),22,64,65,70–71,256Angrist,Joshua,235–36anti-Semitism.SeeJews
anxietydataabout,18andtruthaboutsex,123
AOL,andtruthaboutsex,117–18AOLNews,143art,reallifeasimitating,190–97Ashenfelter,Orley,72–74Asher,Sam,202Asians,andtruthabouthateandprejudice,129askingtherightquestions,21–22assassinations,227–28Atlanticmagazine,150–51,152,202Australia,pregnancyin,189auto-complete,110–11,116Avatar(movie),221–22
Bakshy,Eytan,144BaltimoreRavens-NewEnglandPatriotsgames,221,222–24baseballandinfluenceofchildhoodexperiences,165–69,165–66n,171,206andoveremphasisonmeasurability,254–55predictingaplayer’sfuturein,197–200,200n,203andscience,273scoutingfor,254–55zoominginon,165–69,165–66n,171,197–200,200n,203
basketballpedigreesand,67predictingsuccessin,33–41,67andsocioeconomicbackground,34–41
Beane,Billy,255Beethoven,Ludwigvon,zoominginon,190–91behavioralscience,anddigitalrevolution,276,279Belushi,John,185Benson,Clark,217
Berger,Jonah,91–92Bezos,Jeff,203biasimplicit,134languageaskeytounderstanding,74–76omitted-variable,208subconscious,132Seealsohate;prejudice;race/racism
BigDataandamountofinformation,15,21,59,171andaskingtherightquestions,21–22andcausalityexperiments,54,240definitionof,14,15anddimensionality,246–52andexamplesofsearches,15–16andexpansionofresearchmethodology,275–76andfinishingbooks,283–84futureof,279Googlesearchesasdominantsourceof,60honestyof,53–54importance/valueof,17–18,29–33,59,240,265,283limitationsof,20,245,254–55,256powersof,15,17,22,53–54,59,109,171,211,257andpredictingwhatpeoplewilldoinfuture,198–200asrevolutionary,17,18–22,30,62,76,256,274asrightdata,62skepticsof,17andsmalldata,255–56subsetsin,54understandingof,27–28Seealsospecifictopic
Bill&MelindaGatesFoundation,255Billings(Montana)Gazette,andwordsasdata,95Bing(searchengine),andColumbiaUniversity-Microsoftpancreaticcancer
study,28,30Black,Don,137BlackLivesMatter,12Blink(Gladwell),29–30Bloodstock,Incardo,64bodies,asdata,62–74Boehner,John,160Booking.com,265booksconclusionsto,271–72,279,280–84digitalizing,77,79numberofpeoplewhofinish,283–84
borrowingmoney,257–61Bosh,Chris,37BostonGlobe,andA/Btesting,214–17BostonMarathon(2013),19BostonRedSox,197–200brain,Minskystudyof,273Brazil,pregnancyin,190breasts,andtruthaboutsex,125,126Brin,Sergey,60,61,62,103Britain,pregnancyin,189BronxScienceHighSchool(NewYorkCity),232,237Buffett,Warren,239Bullock,Sandra,185Bundy,Ted,181Bush,GeorgeW.,67businessandcomparisonshopping,265reviewsof,265Seealsocorporations
butt,andtruthaboutsex,125–26
Calhoun,Jim,39
CambridgeUniversity,andMicrosoftstudyaboutIQofFacebookusers,261cancer,predictingpancreatic,28–29,30Capitalinthe21stCentury(Piketty),283casinos,andpricediscrimination,263–65causalityA/Btestingand,209–21andadvertising,221–25andBigDataexperiments,54,240collegeand,237–39correlationdistinguishedfrom,221–25andethics,226andmonetarywindfalls,229naturalexperimentsand,226–28andpowerofBigData,54,211andrandomizedcontrolledexperiments,208–9reverse,208andStuyvesantHighSchoolstudy,231–37,240
CentersforDiseaseControlandPrevention,57Chabris,Christopher,250Chance,Zoë,252–53Chaplin,Charlie,19charitablegiving,106,109Chen,M.Keith,235Chetty,Raj,172–73,174–75,176,177,178–80,185,273childrenabuseof,145–47,149–50,161andbenefitsofdigitaltruthserum,161andchildpornography,121decisionsabouthaving,111–12heightandweightdataabout,204–5ofimmigrants,184–85andincomedistribution,176andinfluenceofchildhoodexperiences,165–71,165–66n,206intelligenceof,135
andoriginsofnotableAmericans,184–85parentprejudicesagainst,134–36,135nphysicalappearanceof,135–36Seealsoparents/parenting;teenagers
cholera,Snowstudyabout,275Christians,andtruthabouthateandprejudice,129Churchill,Winston,169cigaretteeconomy,Philippines,102citiesanddangerofempoweredgovernment,267,268–69predictingbehaviorof,268–69zoominginon,172–90,239–40
CivilWar,79Clemens,Jeffrey,230Clinton,Bill,searchesfor,60–62Clinton,Hillary.Seeelections,2016AClockworkOrange(movie),190–91cnn.com,143,145Cohen,Leonard,82ncollegeandcausality,237–39andexamplesofBigDatasearches,22
collegetowns,andoriginsofnotableAmericans,182–83,184,186Colors(movie),191ColumbiaUniversity,Microsoftpancreaticcancerstudyand,28–29,30comparisonshopping,265conclusionsbenefitsofgreat,281–84tobooks,271–72,279,280–84characteristicsofbest,272,274–79importanceof,283aspointingwaytomorethingstocome,274–79purposeof,279–80Stephens-Davidowitz’swritingof,271–72,281–84
condoms,5,122CongressionalRecord,andGentzkow-Shapiroresearch,93conservativesandoriginsofpoliticalpreferences,169–71andparentsprejudiceagainstchildren,136andtruthabouttheinternet,140,141–44,145andwordsasdata,75–76,93,95–96
consumers.Seecustomers/consumerscontagiousbehavior,178conversation,anddating,80–82corporationsconsumersblowsagainst,265dangerofempowered,257–65reviewsof,265
correlationscausationdistinguishedfrom,221–25andpredictingthestockmarket,245–48,251–52
counties,zoominginon,172–90,239–40CountryMusicRadio,202Craigslist,117creativity,andunderstandingtheworld,280,281crimealcoholascontributorto,196anddangerofempoweredgovernment,266–70andprisonconditions,235violentmoviesand,193,194–95,273
Cundiff,Billy,223curiosityandbenefitsofdigitaltruthserum,162,163Levittviewsabout,280aboutnumberofpeoplewhofinishbooks,283–84andunderstandingtheworld,280,281
cursing,andwordsasdata,83–85customers/consumers
blowsagainstbusinessesby,265andpricediscrimination,265truthabout,153–57
Cutler,David,178
Dahl,Gordon,191–93,194–96,196–97n,197Dale,Stacy,238Dallas,Texas,“LargeandComplexDatasets”conference(1977)in,20–21dataamount/sizeof,15,20–21,30–31,53,171benefitsofexpansionof,16bodiesas,62–74collectingtheright,62government,149–50,266–70importanceof,26individual-level,266–70asintimidating,26Levittviewsabout,280asmoney-maker,103nontraditionalsourcesof,74picturesas,97–102,103reimaginingofwhatqualifiesas,55–103sourcesof,14,15speedfortransmitting,55–59andunderstandingtheworld,280whatcountsas,74wordsas,74–97SeealsoBigData;datascience;smalldata;specificdata
datascienceaschangingviewofworld,34andcounterintuitiveresults,37–38economistsroleindevelopmentof,228futureof,281goalof,37–38
asintuitive,26–33andwhoisadatascientist,27
datingandexamplesofBigDatasearches,22physicalappearanceand,82,120nandrejection,120nandStormfrontmembers,138–39andtruthabouthateandprejudice,138–39andtruthaboutsex,120nandwordsasdata,80–86,103
DawnoftheDead(movie),192death,andmemorablestories,33DellaVigna,Stefano,191–93,194–96,196–97n,197Democratscoreprinciplesof,94andoriginsofpoliticalpreferences,170–71andwordsasdata,93–97Seealsospecificpersonorelection
depressionGooglesearchsfor,31,110andhandlingthetruth,158andlying,109,110andparentsprejudiceagainstchildren,136
developingcountrieseconomiesof,101–2,103investingin,251
digitaltruthserumabortionand,147–50andchildabuse,145–47,149–50andcustomers,153–57andFacebookfriends,150–53andhandlingthetruth,158–63andhateandprejudice,128–40andignoringwhatpeopletellyou,153–57
incentivesand,109andinternet,140–45sexand,112–28sitesas,54Seealsolying;truth
digitalworld,randomizedexperimentsin,210–19dimensionality,curseof,246–52discriminationandoriginsofnotableAmericans,182–83price,262–65Seealsobias;prejudice;race/racism
DNA,248–50Dna88(Stormfrontmember),138doctors,financialincentivesfor,230,240Donato,Adriana,266,269doppelgangersbenefitsof,263andhealth,203–5andhuntingonsocialmedia,201–3andpredictingfutureofbaseballplayers,197–200,200n,203andpricediscrimination,262–63,264zoominginon,197–205
dreams,phallicsymbolsin,46–48drugs,asaddiction,219Duflo,Esther,208–9,210,273
EarnedIncomeTaxCredit,178,179economistsandnumberofpeoplefinishingbooks,283roleindatasciencedevelopmentof,228assoftscientists,273Seealsospecificperson
economy/economicscomplexityof,273
ofdevelopingcountries,101–2,103ofPhilippinescigaretteeconomy,102andpicturesasdata,99–102andspeedofdata,56–57andtruthabouthateandprejudice,139Seealsoeconomists;specifictopic
Edmonton,waterconsumptionin,206EDUSTAR,276educationandA/Btesting,276anddigitalrevolution,279andoveremphasisonmeasurability,253–54,255–56inruralIndia,209,210smalldatain,255–56statespendingon,185andusingonlinebehaviorassupplementtotesting,278Seealsohighschoolstudents;tests/testing
Eisenhower,DwightD.,170–71electionsandorderofsearches,10–11predictionsabout,9–14voterturnoutin,9–10
elections,2008andA/Btesting,211–12racismin,2,6–7,12,133,134andStormfrontmembership,139
elections,2012andA/Btesting,211–12predictionsabout,10racismin,2–3,8,133,134Trumpand,7
elections,2016andlying,107mappingof,12–13
pollsabout,1predictingoutcomeof,10–14andracism,8,11,12,14,133Republicanprimariesfor,1,13–14,133andStormfrontmembership,139voterturnoutin,11
electronicscompany,andadvertising,222,225,226“EliteIllusion”(Abdulkadiroglu,Angrist,andPathak),236Ellenberg,Jordan,283Ellerbee,William,34Eng,Jessica,236–37environment,andlifeexpectancy,177EPCORutilitycompany,193,194EQB,63–64equalityofopportunity,zoominginon,173–75ErrorBot,48–49ethicsandBigData,257–65anddangerofempoweredgovernment,267doppelgangersearchesand,262–63empoweredcorporationsand,257–65andexperiments,226hiringpracticesand,261–62andIQDNAstudyresults,249andpayingbackloans,257–61andpricediscrimination,262–65andstudyofIQofFacebookusers,261
Ewing,Patrick,33experimentsandethics,226andrealscience,272–73Seealsotypeofexperimentorspecificexperiment
andA/Btesting,211andaddictions,219,220andhiringpractices,261andignoringwhatpeopletellyou,153–55,157andinfluenceofchildhoodexperiencesdata,166–68,171IQofusersof,261Microsoft-CambridgeUniversitystudyofusersof,261“NewsFeed”of,153–55,255andoveremphasisonmeasurability,254,255andpicturesasdata,99and“secretsaboutpeople,”155–56andsizeofBigData,20andsmalldata,255assourceofinformation,14,32andtruthaboutcustomers,153–55truthaboutfriendson,150–53andtruthaboutsex,113–14,116andtruthabouttheinternet,144,145andwordsasdata,83,85,87–88
TheFacebookEffect:TheInsideStoryoftheCompanyThatIsConnectingtheWorld(Kirkpatrick),154
Facemash,156facesblack,133andpicturesasdata,98–99andtruthabouthateandprejudice,133
Farook,Rizwan,129–30Father’sDayadvertising,222,22550ShadesofGray,157financialincentives,fordoctors,230,240FirstLawofViticulture,73–74foodandphallicsymbolsindreams,46–48predictionsabout,71–72
andpregnancy,189–90footballandadvertising,221–25zoominginon,196–97n
Freakonomics(Levitt),265,280,281Freud,Sigmund,22,45–52,272,281Friedman,Jerry,20,21Fryer,Roland,36
Gabriel,Stuart,9–10,11Galluppolls,2,88,113gambling/gamingindustry,220–21,263–65“GangnamStyle”video,Psy,152Garland,Judy,114,114nGates,Bill,209,238–39gaysincloset,114–15,116,117,118–19,161anddimensionsofsexuality,279andexamplesofBigDatasearches,22andhandlingthetruth,159,161inIran,119andmarriage,74–76,93,115–16,117mobilityof,113–14,115populationof,115,116,240andpornography,114–15,114n,116,117,119inRussia,119stereotypeof,114nsurveysabout,113teenagersas,114,116andtruthabouthateandprejudice,129andtruthaboutsex,112–19andwivessuspicionsofhusbands,116–17womenas,116andwordsasdata,74–76,93
Gelles,Richard,145Gelman,Andrew,169–70genderandlifeexpectancy,176andparentsprejudiceagainstchildren,134–36,135nofStormfrontmembers,137Seealsogays
GeneralSocialSurvey,5,142genetics,andIQ,249–50genitalsandtruthaboutsex,126–27Seealsopenis;vagina
Gentzkow,Matt,74–76,93–97,141–44geographyzoominginby,172–90Seealsocities;counties
Germany,pregnancyin,190Ghana,pregnancyin,188Ghitza,Yair,169–70Ginsberg,Jeremy,57girlfriends,killing,266,269girls,parentsprejudiceagainstyoung,134–36Gladwell,Malcolm,29–30Gnau,Scott,264gold,priceof,252TheGoldfinch(Tartt),283GoldmanSachs,55–56,59Googleadvertisementsabout,217–19andamountofdata,21anddigitalizingbooks,77MountainViewcampusof,59–60,207Seealsospecifictopic
GoogleAdWords,3n,115,125
GoogleCorrelate,57–58GoogleFlu,57,57n,71GoogleNgrams,76–77,78,79Googlesearchesadvantagesofusing,60–62auto-completein,110–11differentiationfromothersearchenginesof,60–62asdigitaltruthserum,109,110–11asdominantsourceofBigData,60andtheforbidden,51foundingof,60–62andhiddenthoughts,110–12andhonesty/plausibilityofdata,9,53–54importance/valueof,14,21pollscomparedwith,9popularityof,62powerof,4–5,53–54andspeedofdata,57–58andwordsasdata,76,88SeealsoBigData;specificsearch
GoogleSTD,71GoogleTrends,3–4,3n,6,246Gottlieb,Joshua,202,230governmentdangerofempowered,266–70andpredictingactionsofindividuals,266–70andprivacyissues,267–70spendingby,93,94andtrustofdata,149–50andwordsasdata,93,94
“GreatBody,GreatSex,GreatBlowjob”(video),152,153GreatRecession,andchildabuse,145–47TheGreenMonkey(HorseNo.153),68grossdomesticproduct(GDP),andpicturesasdata,100–101
GrossNationalHappiness,87,88GuttmacherInstitute,148,149
Hannibal(movie),192,195happinessandpicturesasdata,99Seealsosentimentanalysis
Harrah’sCasino,264Harris,Tristan,219–20HarryPotterandtheDeathlyHallows(Rowling),88–89,91Hartmann,WesleyR.,225HarvardCrimson,editorialaboutZuckerbergin,155HarvardUniversity,incomeofgraduatesof,237–39hateanddangerofempoweredgovernments,266–67,268–69truthabout,128–40,162–63Seealsoprejudice;race/racism
healthandalcohol,207–8andcomparisonofsearchengines,71anddigitalrevolution,275–76,279andDNA,248–49anddoppelgangers,203–5methodologyforstudiesof,275–76andspeedofdatatransmission,57zoominginon,203–5,275Seealsolifeexpectancy
healthinsurance,177Henderson,J.Vernon,99–101TheHerdwithColinCowherd,McCaffreyinterviewon,196nHerzenstein,Michal,257–61Heywood,James,205highschoolstudentstestingof,231–37,253–54
andtruthaboutsex,114,116highschoolyearbooks,98–99hiringpractices,261–62Hispanics,andHarvardCrimsoneditorialaboutZuckerberg,155Hitler,Adolf,227hockeymatch,Olympic(2010),193,194HorseNo.85.SeeAmericanPharoahHorseNo.153(TheGreenMonkey),68horsesandBartlebysyndrome,66andexamplesofBigDatasearches,22internalorgansof,69–71pedigreesof,66–67,69,71predictingsuccessof,62–74,256searchesabout,62–74
hours,zoominginon,190–97housing,priceof,58HumanGenomeProject,248–49HumanRightsCampaign,161humankind,dataasmeansforunderstanding,16humor/jokes,searchesfor,18–19HurricaneFrances,71–72HurricaneKatrina,132husbandswivesdescriptionsof,160–61,160–61nandwivessuspicionsaboutgayness,116–17
Hussein,Saddam,93,94
ignoringwhatpeopletellyou,153–57immigrants,andoriginsofnotableAmericans,184,186implicitassociationtest,132–34incentives,108,109incest,50–52,54,121incomedistribution,174–78,185
Indiaeducationinrural,209,210pregnancyin,187,188–89andsex/pornsearches,19
IndianaUniversity,anddimensionalitystudy,247–48individuals,predictingtheactionsof,266–70influenza,dataabout,57,71information.SeeBigData;data;smalldata;specificsourceorsearchInstagram,99,151–52,261InternalRevenueService(IRS),172,178–80.Seealsotaxesinternetasaddiction,219–20browsingbehavioron,141–44asdominatedbysmut,151segregationon,141–44truthaboutthe,140–45SeealsoA/Btesting;socialmedia;specificsite
intuitionandA/Btesting,214andcounterintuitiveresults,37–38datascienceas,26–33andthedramatic,33aswrong,31,32–33
IQ/intelligenceandDNA,249–50ofFacebookusers,261andparentsprejudiceagainstchildren,135
Iran,gaysin,119IraqWar,94Irresistible(Alter),219–20Islamophobiaanddangerofempoweredgovernments,266–67,268–69SeealsoMuslims
IvyLeagueschools
incomeofgraduatesfrom,237–39Seealsospecificschool
Jacob,Brian,254James,Bill,198–99James,LeBron,34,37,41,67Jawbone,277Jews,129,138JiHyunBaek,266Jobs,Steve,185Johnson,EarvinIII,67Johnson,LyndonB.,170,171Johnson,“Magic,”67jokesanddating,80–81andlying,109nigger,6,15,132,133,134andtruthabouthateandprejudice,132,133,134
Jones,BenjaminF.,227,228,276Jordan,Jeffrey,67Jordan,Marcus,67Jordan,Michael,40–41,67Jurafsky,Dan,80
Kadyrov,Akhmad,227Kahneman,Daniel,283Kane,Thomas,255Katz,Lawrence,243Kaufmann,Sarah,236–37Kawachi,Ichiro,266Kayak(website),265Kennedy,JohnF.,170,171,227Kerry,John,8,244KingJohn(Shakespeare),89–90
King,MartinLutherJr.,132King,WilliamLyonMackenzie(alias),138–39Kinsey,Alfred,113Kirkpatrick,David,154Klapper,Daniel,225Knight,Phil,157Kodak,andpicturesasdata,99Kohane,Isaac,203–5Krueger,AlanB.,56,238KuKluxKlan,12,137Kubrick,Stanley,190–91Kundera,Milan,233
languageanddigitalrevolution,274,279emphasisin,94askeytounderstandingbias,74–76andpayingbackloans,259–60andtraditionalresearchmethods,274andU.S.asunitedordivided,78–79Seealsowords
learning.SeeeducationLemaire,Alain,257–61Levitt,Steven,36,222,254,280,281.SeealsoFreakonomicsliberalsandoriginsofpoliticalpreferences,169–71andparentsprejudiceagainstchildren,136andtruthabouttheinternet,140,141–45andwordsasdata,75–76,93,95–96
librarycards,andlying,106life,asimitatingart,190–97lifeexpectancy,176–78Linden,Greg,203listening,anddating,82n
loans,payingback,257–61LosAngelesTimes,andObamaspeechaboutterrorism,130lotteries,229,229nLuca,Michael,265Lycos(searchengine),60lyingandage,108nandincentives,108andjokes,109toourselves,107–8,109andpolls,107andpornography,110prevalenceof,21,105–12,239andracism,109reasonsfor,106,107,108,108nandreimagingdata,103andsearchinformation,5–6,12andsex,112–28byStephens-Davidowitz,282nandsurveys,105–7,108,108nandtaxes,180andvotingbehavior,106,107,109–10“white,”107Seealsodigitaltruthserum;truth;specifictopic
Ma-Kellams,Christine,266MaconCounty,Alabama,successful/notableAmericansfrom,183,186–87Malik,Tashfeen,129–30ManchesterUniversity,anddimensionalitystudy,247–48MassachusettsInstituteofTechnology,Pantheonprojectof,184–85Matthews,Dylan,202–3McCaffrey,Ed,196–97nMcFarland,Daniel,80McPherson,James,79
measurability,overemphasison,252–56“MeasuringEconomicGrowthfromOuterSpace”(Henderson,Storygard,and
Weil),99–101mediabiasof,22,74–77,93–97,102–3andexamplesofBigDatasearches,22ownersof,96andtruthabouthateandprejudice,130,131andtruthabouttheinternet,143andwordsasdata,74–77,93–97Seealsospecificorganization
Medicare,anddoctorsreimbursements,230,240medicine.Seedoctors;healthMessing,Solomon,144MetaCrawler(searchengine),60Mexicans,andtruthabouthateandprejudice,129Michel,Jean-Baptiste,76–77,78–79MicrosoftandCambridgeUniversitystudyaboutIQofFacebookusers,261ColumbiaUniversitypancreaticcancerstudyand,28–29,30andtypingerrorsbysearchers,48–50
Milkman,KatherineL.,91–92MinorityReport(movie),266Minsky,Marvin,273minutes,zoominginon,190–97Moneyball,OaklandA’sprofilein,254,255Moore,Julianne,185Moskovitz,Dustin,238–39moviesandadvertising,224–25andcrime,193,194–95,273violent,190–97,273zoominginon,190–97Seealsospecificmovie
msnbc.com,143murderanddangerofempoweredgovernment,266–67,268–69Seealsoviolence
Murdoch,Rupert,96Murray,Patty,256Muslimsanddangerofempoweredgovernments,266–67,268–69andtruthabouthateandprejudice,129–31,162–63
Nantz,Jim,223NationalCenterforHealthStatistics,181NationalEnquirermagazine,150–51,152nationalidentity,78–79naturalexperiments,226–28,229–30,234–37,239–40NBA.Seebasketballneighbors,andmonetarywindfalls,229Netflix,156–57,203,212Netzer,Oded,257–61NewEnglandPatriots-BaltimoreRavensgames,221,222–24NewJackCity(movie),191NewYorkCity,RollingStonessongabout,278NewYorkmagazine,andA/Btesting,212NewYorkMets,165–66,167,169,171NewYorkPost,andwordsasdata,96NewYorkTimesClinton(Bill)searchin,61andIQDNAstudyresults,249andObamaspeechaboutterrorism,130Stephens-Davidowitz’sfirstcolumnaboutsexin,282Stormfrontusersand,137,140,145andtruthaboutinternet,145typesofstoriesin,92vaginalodorsstoryin,161
andwordsasdata,95–96NewYorkTimesCompany,andwordsasdata,95–96NewYorkermagazineDuflostudyin,209andStephens-Davidowitz’sdoppelgangersearch,202
NewsCorporation,96newslibrary.com,95Nielsensurveys,5Nietzsche,Friedrich,268Nigeria,pregnancyin,188,189,190“nigger”andhateandprejudice,6,7,131–34,244jokes,6,15,132,133,134motivationforsearchesabout,6andObama’selection,7,244andpowerofBigData,15prevalenceofsearchesabout,6andTrump’selection,14
nightlight,andpicturesasdata,100–101Nike,157Nixon,RichardM.,170,171numbers,obsessiveinfatuationwith,252–56
Obama,BarackandA/Btesting,211–14campaignhomepagefor,212–14electionsof2008and,2,6–7,133,134,211–12electionsof2012and,8–9,10,133,134,211–12andracisminAmerica,2,6–7,8–9,12,134,240,243–44StateoftheUnion(2014)speechof,159–60andtruthabouthateandprejudice,130–31,133,134,162–63
Ocalahorseauction,65–66,67,69Oedipalcomplex,Freudtheoryof,50–51OkCupid(datingsite),139
Olken,BenjaminA.,227,228127Hours(movie),90,91OptimalDecisionsGroup,262Or,Flora,266Ortiz,David“BigPapi,”197–200,200n,203“out-of-sample”tests,250–51
Page,Larry,60,61,62,103pancreaticcancer,ColumbiaUniversity-Microsoftstudyof,28–29Pandora,203Pantheonproject(MassachusettsInstituteofTechnology),184–85parents/parentingandchildabuse,145–47,149–50,161andexamplesofBigDatasearches,22andprejudiceagainstchildren,134–36,135n
Parks,Rosa,93,94Parr,Ben,153–54Pathak,Parag,235–36PatientsLikeMe.com,205patterns,anddatascienceasintuitive,27,33Paul,Chris,37payingbackloans,257–61PECOTAmodel,199–200,200npedigreesofbasketballplayers,67ofhorses,66–67,69,71
pedometer,Chanceemphasison,252–53penisandFreud’stheories,46andphallicsymbolsindreams,46–47sizeof,17,19,123–24,124n,127
“penistrian,”45,46,48,50PennsylvaniaStateUniversity,incomeofgraduatesof,237–39Peysakhovich,Alex,254
phallicsymbols,indreams,46–48PhiladelphiaDailyNews,andwordsasdata,95Philippines,cigaretteeconomyin,102physicalappearanceanddating,82,120nandparentsprejudiceagainstchildren,135–36andtruthaboutsex,120,120n,125–26,127
physics,asscience,272–73pictures,asdata,97–102,103Pierson,Emma,160nPiketty,Thomas,283PinkyPizwaanski(horse),70pizza,informationabout,77PlentyOfFish(datingsite),139Plomin,Robert,249–50politicalscience,anddigitalrevolution,244,274politicsandA/Btesting,211–14complexityof,273andignoringwhatpeopletellyou,157andoriginofpoliticalpreferences,169–71andtruthabouttheinternet,140–44andwordsasdata,95–97Seealsoconservatives;Democrats;liberals;Republicans
pollsGooglesearchescomparedwith,9andlying,107reliabilityof,12Seealsospecificpollortopic
Pop-Tarts,72Popp,Noah,202Popper,Karl,45,272,273PornHub(website),14,50–52,54,116,120–22,274pornography
asaddiction,219andbiasofsocialmedia,151andbreastfeeding,19cartoon,52child,121anddigitalrevolution,279andgays,114–15,114n,116,117,119honestyofdataabout,53–54andincest,50–52inIndia,19andlying,110popularvideoson,152popularityof,53,151andpowerofBigData,53searchenginesfor,61nandtruthaboutsex,114–15,117unemployedand,58,59
Posada,Jorge,200povertyandlifeexpectancy,176–78andwordsasdata,93,94Seealsoincomedistribution
predictionsanddatascienceasintuitive,27andgettingthenumbersright,74andwhatcountsasdata,74andwhatvs.whyitworks,71Seealsospecifictopic
pregnancy,20,187–90prejudiceimplicit,132–34ofparentsagainstchildren,134–36,135nsubconscious,134,163truthabout,128–40,162–63
Seealsobias;hate;race/racism;StormfrontPremise,101–2,103pricediscrimination,262–65prisonconditions,andcrime,235privacyissues,anddangerofempoweredgovernment,267–70propertyrights,andwordsasdata,93,94proquest.com,95Prosper(lendingsite),257Psy,“GangnamStyle”videoof,152psychics,266psychologyanddigitalrevolution,274,277–78,279asscience,273assoftscience,273andtraditionalresearchmethods,274
Quantcast,137questionsaskingtheright,21–22anddating,82–83
race/racismcausesof,18–19electionsof2008and,2,6–7,12,133electionsof2012and,2–3,8,133electionsof2016and,8,11,12,14,133explicit,133,134andHarvardCrimsoneditorialaboutZuckerberg,155andlying,109mapof,7–9andObama,2,6–7,8–9,12,133,240,243–44andpredictingsuccessinbasketball,35,36–37andRepublicans,3,7,8Stephens-Davidowitz’sstudyof,2–3,6–7,12,14,243–44
andTrump,8,9,11,12,14,133andtruthabouthateandprejudice,129–34,162–63SeealsoMuslims;“nigger”
randomizedcontrolledexperimentsandA/Btesting,209–21andcausality,208–9
rape,121–22,190–91Rawlings,Craig,80“rawtube”(pornsite),59Reagan,Andy,88,90,91Reagan,Ronald,227regressiondiscontinuity,234–36Reisinger,Joseph,101–2,103relationships,lasting,31–33religion,andlifeexpectancy,177Renaissance(hedgefund),246Republicanscoreprinciplesof,94andoriginsofpoliticalpreferences,170–71andracism,3,7,8andwordsasdata,93–97Seealsospecificpersonorelection
researchandexpansionofresearchmethodology,275–76Seealsospecificresearcherorresearch
reviews,ofbusinesses,265“RocketTube”(gaypornsite),115RollingStones,278Romney,Mitt,10,212RoseauCounty,Minnesota,successful/notableAmericansfrom,186,187RunawayBride(movie),192,195
sabermetricians,198–99SanBernardino,California,shootingin,129–30
Sands,Emily,202scienceandBigData,273andexperiments,272–73real,272–73atscale,276soft,273
searchenginesdifferentiationofGooglefromother,60–62forpornography,61nreliabilityof,60word-count,71Seealsospecificengine
searchers,typingerrorsby,48–50searchesnegativewordsusedin,128–29Seealsospecificsearch
“secretsaboutpeople,”155–56Seder,Jeff,63–66,68–70,71,74,155,256segregation,141–44.Seealsobias;discrimination;race/racismself-employedpeople,andtaxes,178–80sentimentanalysis,87–92,247–48sexasaddiction,219andbenefitsofdigitaltruthserum,158–59,161andchildhoodexperiences,50–52condomsand,5,122anddigitalrevolution,274,279anddimensionsofsexuality,279duringmarriage,5–6andfetishes,120andFreud,45–52Googlesearchesabout,5–6,51–52,114,115,117,118,122–24,126,127–28andhandlingthetruth,158–59,161
andHarvardCrimsoneditorialaboutZuckerberg,155howmuch,122–23,124–25,127inIndia,19newinformationabout,19oral,128andphysicalappearance,120,120n,125–26,127andpowerofBigData,53pregancyandhaving,189RollingStonessongabout,278andsexorgans,123–24Stephens-Davidowitz’sfirstNewYorkTimescolumnabout,282andtraditionalresearchmethods,274truthabout,5–6,112–28,114n,117andtypingerrors,48–50andwomen’sgenitals,126–27Seealsoincest;penis;pornography;rape;vagina
Shadow(app),47Shakespeare,William,89–90Shapiro,Jesse,74–76,93–97,141–44,235,273“Shattered”(RollingStonessong),278shoppinghabits,predictionsabout,71–74TheSignalandtheNoise(Silver),254Silver,Nate,10,12–13,133,199,200,254,255Simmons,Bill,197–98Singapore,pregnancyin,190Siroker,Dan,211–12sleepanddigitalrevolution,279Jawboneand,276–77andpregnancy,189
“Slutload,”58smalldata,255–56smiles,andpicturesasdata,99Smith,MichaelD.,224
Snow,John,275Sochi,Russia,gaysin,119socialmediabiasofdatafrom,150–53doppelgangerhuntingon,201–3andwivesdescriptionsofhusbands,160–61,160–61nSeealsospecificsiteortopic
socialscience,272–74,276,279socialsecurity,andwordsasdata,93socioeconomicbackgroundandpredictingsuccessinbasketball,34–41Seealsopedigrees
sociology,273,274Soltas,Evan,130,162,266–67SouthAfrica,pregnancyin,189SouthernPovertyLawCenter,137Spain,pregnancyin,190SpartanburgHerald-Journal(SouthCarolina),andwordsasdata,96specialization,extreme,186speed,fortransmittingdata,56–59“SpiderSolitaire,”58Stephens-Davidowitz,Noah,165–66,165–66n,169,206,263Stephens-Davidowitz,Sethambitionsof,33lyingby,282nmatechoicefor,25–26,271motivationsof,2obsessivenessof,282,282nprofessionalbackgroundof,14andwritingconclusions,271–72,279,280–84
Stern,Howard,157stockmarketdatafor,55–56andexamplesofBigDatasearches,22
Summers-Stephens-Davidowitzattempttopredictthe,245–48,251–52Stone,Oliver,185Stoneham,James,266,269Storegard,Adam,99–101storiescategories/typesof,91–92viral,22,92andzoomingin,205–6Seealsospecificstory
Stormfront(website),7,14,18,137–40stretchmarks,andpregnancy,188–89StuyvesantHighSchool(NewYorkCity),231–37,238,240suburbanareas,andoriginsofnotableAmericans,183–84successful/notableAmericansfactorsthatdrive,185–86zoominginon,180–86
suffering,andbenefitsofdigitaltruthserum,161suicide,anddangerofempoweredgovernment,266,267–68Summers,LawrenceandObama-racismstudy,243–44andpredictingthestockmarket,245,246,251–52Stephens-Davidowitz’smeetingwith,243–45
Sunstein,Cass,140SuperBowlgames,advertisingduring,221–25,239SuperCrunchers(Gnau),264SupremeCourt,andabortion,147Surowiecki,James,203surveysin-person,108internet,108andlying,105–7,108,108nandpicturesasdata,97skepticismabout,171telephone,108
andtruthaboutsex,113,116andzoominginonhoursandminutes,193Seealsospecificsurveyortopic
Syrianrefugees,131
Taleb,Nassim,17Tartt,Donna,283TaskRabbit,212taxescheatingon,22,178–80,206andexamplesofBigDatasearches,22andlying,180andself-employedpeople,178–80andwordsasdata,93–95zoominginon,172–73,178–80,206
teachers,usingteststojudge,253–54teenagersadopted,108nasgay,114,116lyingby,108nandoriginsofpoliticalpreferences,169andtruthaboutsex,114,116Seealsochildren
televisionandA/Btesting,222advertisingon,221–26
Terabyte,264terrorism,18,129–31tests/testingofhighschoolstudents,231–37,253–54andjudgingteacher,253–54andobsessiveinfatuationswithnumbers,253–54onlinebehaviorassupplementto,278andsmalldata,255–56
SeealsospecifictestorstudyThiel,Peter,155ThinkProgress(website),130Thinking,FastandSlow(Kahneman),283Thome,Jim,200Tourangeau,Roger,107,108towns,zoominginon,172–90ToyStory(movie),192Trump,Donaldelectionsof2012and,7andignoringwhatpeopletellyou,157andimmigration,184issuespropagatedby,7andoriginsofnotableAmericans,184pollsabout,1predictionsabout,11–14andracism,8,9,11,12,14,133,139Seealsoelections,2016
truthbenefitsofknowing,158–63handlingthe,158–63Seealsodigitaltruthserum;lying;specifictopic
TuskegeeUniversity,183TwentiethCenturyFox,221–22Twitter,151–52,160–61n,201–3typingerrorsbysearchers,48–50
TheUnbearableLightnessofBeing(Kundera),233Uncharted(AidenandMichel),78–79unemploymentandchildabuse,145–47dataabout,56–57,58–59
unintendedconsequences,197UnitedStates
andCivilWar,79asunitedordivided,78–79
UniversityofCalifornia,Berkeley,racismin2008electionstudyat,2UniversityofMaryland,surveyofgraduatesof,106–7urbanareasandlifeexpectancy,177andoriginsofnotableAmericans,183–84,186
vagina,smellsof,19,126–27,161Varian,Hal,57–58,224Vikingmaiden88,136–37,140–41,145violenceandrealscience,273zoominginon,190–97Seealsomurder
voterregistration,106voterturnout,9–10,109–10votingbehavior,andlying,106,107,109–10Vox,202
Walmart,71–72WashingtonPost,andwordsasdata,75,94WashingtonTimes,andwordsasdata,75,94–95wealthandlifeexpectancy,176–77Seealsoincomedistribution
weather,andpredictionsaboutwine,73–74Weil,DavidN.,99–101Weiner,Anthony,234nwhitenationalism,137–40,145.SeealsoStormfrontWhitepride26,139Wikipedia,14,180–86wine,predictionsabout,72–74wives
anddescriptionsofhusbands,160–61,160–61nandsuspicionsaboutgaynessofhusbands,116–17
womenbreastsof,125,126buttof,125–26genitalsof,126–27violenceagainst,121–22Seealsogirls;wives;specifictopic
wordsandbias,74–76,93–97andcategories/typesofstories,91–92asdata,74–97anddating,80–86anddigitalrevolution,278anddigitalizationofbooks,77,79andgaymarriage,74–76andsentimentanalysis,87–92andU.S.asunitedordivided,78–79
workers’rights,93,94WorldBank,102WorldofWarcraft(game),220Wrenn,Doug,39–40,41
YahooNews,140,143yearbooks,highschool,98–99Yelp,265Yilmaz,Ahmed(alias),231–33,234,234nYouTube,152
Zayat,Ahmed,63–64,65ZerotoOne(Thiel),155zoominginonbaseball,165–69,165–66n,171,197–200,200n,203,206,239benefitsof,205–6
oncounties,cities,andtowns,172–90,239–40anddatasize,171,172–73ondoppelgangers,197–205onequalityofopportunity,173–75ongambling,263–65onhealth,203–5,275onincomedistribution,174–76,185andinfluenceofchildhoodexperiences,165–71,165–66n,206onlifeexpectancy,176–78onminutesandhours,190–97andnaturalexperiments,239–40andoriginofpoliticalpreferences,169–71onpregnancy,187–90storiesfrom,205–6onsuccessful/notableAmericans,180–86ontaxes,172–73,178–80,206
Zuckerberg,Mark,154–56,157,158,238–39
ABOUTTHEAUTHOR
Seth Stephens-Davidowitz is aNew York Times op-ed contributor, a visitinglectureratTheWhartonSchool,andaformerGoogledatascientist.HereceivedaBAinphilosophyfromStanford,wherehegraduatedPhiBetaKappa,andaPhD in economics from Harvard. His research—which uses new, big datasourcestouncoverhiddenbehaviorsandattitudes—hasappearedintheJournalofPublicEconomics andotherprestigiouspublications.He lives inNewYorkCity.
DiscoverGreatAuthors,ExclusiveOffers,andmoreathc.com.
COPYRIGHT
EVERYBODYLIES.Copyright©2017bySethStephens-Davidowitz.Copyright©2017bySethStephens-Davidowitz.AllrightsreservedunderInternationalandPan-AmericanCopyrightConventions.Bypaymentoftherequiredfees,you
havebeengrantedthenonexclusive,nontransferablerighttoaccessandreadthetextofthise-bookon-screen.Nopartofthistextmaybereproduced,
transmitted,downloaded,decompiled,reverse-engineered,orstoredinorintroducedintoanyinformationstorageandretrievalsystem,inanyformorbyanymeans,whetherelectronicormechanical,nowknownorhereafterinvented,
withouttheexpresswrittenpermissionofHarperCollinse-books.
FIRSTEDITION
CoverdesignbyLisaAmorosoCoverphotographofelephant/zebra©VisualsUnlimited,Inc./VictorHabbick
Otherzebras©Shutterstock/AaronAmat
ISBN978-0-06-239085-1
EPubEditionMay2017ISBN9780062390875
ABOUTTHEPUBLISHER
AustraliaHarperCollinsPublishers(Australia)Pty.Ltd.Level13,201ElizabethStreetSydney,NSW2000,Australiawww.harpercollins.com.au
CanadaHarperCollinsCanada2BloorStreetEast-20thFloorToronto,ONM4W1A8,Canadawww.harpercollins.ca
NewZealandHarperCollinsPublishersNewZealandUnitD1,63ApolloDriveRosedale0632Auckland,NewZealandwww.harpercollins.co.nz
UnitedKingdomHarperCollinsPublishersLtd.1LondonBridgeStreetLondonSE19GF,UKwww.harpercollins.co.uk
UnitedStatesHarperCollinsPublishersInc.195BroadwayNewYork,NY10007www.harpercollins.com
*GoogleTrendshasbeenasourceofmuchofmydata.However,sinceitonlyallowsyoutocomparetherelativefrequencyofdifferentsearchesbutdoesnotreporttheabsolutenumberofanyparticularsearch,IhaveusuallysupplementeditwithGoogleAdWords,whichreportsexactlyhowfrequentlyeverysearchismade.InmostcasesIhavealsobeenabletosharpenthepicturewiththehelpofmyownTrends-basedalgorithm,whichIdescribeinmydissertation,“EssaysUsingGoogleData,”andinmyJournalofPublicEconomicspaper,“TheCostofRacialAnimusonaBlackCandidate:EvidenceUsingGoogleSearchData.”Thedissertation,alinktothepaper,andacompleteexplanationofthedataandcodeusedinalltheoriginalresearchpresentedinthisbookareavailableonmywebsite,sethsd.com.
*Fulldisclosure:ShortlyafterIcompletedthisstudy,ImovedfromCaliforniatoNewYork.Usingdatatolearnwhatyoushoulddoisofteneasy.Actuallydoingitistough.
*WhiletheinitialversionofGoogleFluhadsignificantflaws,researchershaverecentlyrecalibratedthemodel,withmoresuccess.
*In1998,ifyousearched“cars”onapopularpre-Googlesearchengine,youwereinundatedwithpornsites.Thesepornsiteshadwrittentheword“cars”frequentlyinwhitelettersonawhitebackgroundtotrickthesearchengine.Theythengotafewextraclicksfrompeoplewhomeanttobuyacarbutgotdistractedbytheporn.
*OnetheoryIamworkingon:BigDatajustconfirmseverythingthelateLeonardCoheneversaid.Forexample,LeonardCohenoncegavehisnephewthefollowingadviceforwooingwomen:“Listenwell.Thenlistensomemore.Andwhenyouthinkyouaredonelistening,listensomemore.”Thatseemstoberoughlysimilartowhatthesescientistsfound.
*Anotherreasonforlyingissimplytomesswithsurveys.Thisisahugeproblemforanyresearchregardingteenagers,fundamentallycomplicatingourabilitytounderstandthisagegroup.Researchersoriginallyfoundacorrelationbetweenateenager’sbeingadoptedandavarietyofnegativebehaviors,suchasusingdrugs,drinkingalcohol,andskippingschool.Insubsequentresearch,theyfoundthiscorrelationwasentirelyexplainedbythe19percentofself-reportedadoptedteenagerswhoweren’tactuallyadopted.Follow-upresearchhasfoundthatameaningfulpercentofteenagerstellsurveystheyaremorethansevenfeettall,weighmorethanfourhundredpounds,orhavethreechildren.Onesurveyfound99percentofstudentswhoreportedhavinganartificiallimbtoacademicresearcherswerekidding.
*SomemayfinditoffensivethatIassociateamalepreferenceforJudyGarlandwithapreferenceforhavingsexwithmen,eveninjest.AndIcertainlydon’tmeantoimplythatall—orevenmost—gaymenhaveafascinationwithdivas.Butsearchdatademonstratesthatthereissomethingtothestereotype.IestimatethatamanwhosearchesforinformationaboutJudyGarlandisthreetimesmorelikelytosearchforgaypornthanstraightporn.Somestereotypes,BigDatatellsus,aretrue.
*Ithinkthisdataalsohasimplicationsforone’soptimaldatingstrategy.Clearly,oneshouldputoneselfoutthere,getrejectedalot,andnottakerejectionpersonally.Thisprocesswillallowyou,eventually,tofindthematewhoismostattractedtosomeonelikeyou.Again,nomatterwhatyoulooklike,thesepeopleexist.Trustme.
*IwantedtocallthisbookHowBigIsMyPenis?WhatGoogleSearchesTeachUsAboutHumanNature,butmyeditorwarnedmethatwouldbeatoughsell,thatpeoplemightbetooembarrassedtobuyabookwiththattitleinanairportbookstore.Doyouagree?
*Tofurthertestthehypothesisthatparentstreatkidsofdifferentgendersdifferently,Iamworkingonobtainingdatafromparentingwebsites.Thiswouldincludeamuchlargernumberofparentsthanthosewhomaketheseparticular,specificsearches.
*IanalyzedTwitterdata.IthankEmmaPiersonforhelpdownloadingthis.Ididnotincludedescriptorsofwhatone’shusbandisdoingrightnow,whichareprevalentonsocialmediabutwouldn’treallymakesenseonsearch.Eventhesedescriptionstilttowardthefavorable.Thetopwaystodescribewhatahusbandisdoingrightnowonsocialmediaare“working”and“cooking.”
*Fulldisclosure:WhenIwasfact-checkingthisbook,NoahdeniedthathishatredofAmerica’spastimeisakeypartofhispersonality.Hedoesadmittohatingbaseball,buthebelieveshiskindness,loveofchildren,andintelligencearethecoreelementsofhispersonality—andthathisattitudesaboutbaseballwouldnotevenmakethetopten.However,Iconcludedthatit’ssometimeshardtoseeone’sownidentityobjectivelyand,asanoutsideobserver,IamabletoseethathatingbaseballisindeedfundamentaltowhoNoahis,whetherhe’sabletorecognizeitornot.SoIleftitin.
*Thisstoryshowshowthingsthatseembadmaybegoodiftheypreventsomethingworse.EdMcCaffrey,aStanford-educatedformerwidereceiver,usesthisargumenttojustifylettingallfourofhissonsplayfootball:“Theseguyshaveenergy.And,so,ifthey’renotplayingfootball,they’reskateboarding,they’reclimbingtrees,they’replayingtaginthebackyard,they’redoingpaintball.Imean,they’renotgoingtositthereanddonothing.And,so,thewayIlookatitis,hey,atleastthere’sruleswithinthesportoffootball....Mykidshavebeentotheemergencyroomforfallingoffdecks,gettinginbikecrashes,skateboarding,fallingoutoftrees.Imean,younameit...Yea,it’saviolentcollisionsport.But,also,myguysjusthavethepersonality,where,atleastthey’renotsquirrel-jumpingoffmountainsanddoingcrazystufflikethat.So,it’sorganizedaggression,Iguess.”McCaffrey’sargument,madeinaninterviewonTheHerdwithColinCowherd,isoneIhadneverheardbefore.AfterreadingtheDahl/DellaVignapaper,Itaketheargumentseriously.Anadvantageofhugereal-worlddatasets,ratherthanlaboratorydata,isthattheycanpickupthesekindsofeffects.
*YoucanprobablytellbythispartofthebookItendtobecynicalaboutgoodstories.Iwantedonefeel-goodstoryinhere,soIamleavingmycynicismtoafootnote.IsuspectPECOTAjustfoundoutthatOrtizwasasteroiduserwhostoppedusingsteroidsandwouldstartusingthemagain.Fromthestandpointofprediction,itisactuallyprettycoolifPECOTAwasabletodetectthat—butitmakesitalessmovingstory.
*InlookingforpeoplelikeYilmazwhoscorednearthecutoff,Iwasblownawaybythenumberofpeople—intheirtwentiesthroughtheirfifties—whorememberthistest-takingexperiencefromtheirearlyteensandspeakaboutmissingacutoffindramaticterms.ThisincludesformercongressmanandNewYorkCitymayoralcandidateAnthonyWeiner,whosayshemissedStuybyasinglepoint.“Theydidn’twantme,”hetoldme,inaphoneinterview.
*Sinceeverybodylies,youshouldquestionmuchofthisstory.MaybeI’mnotanobsessiveworker.MaybeIdidn’tworkextraordinarilyhardonthisbook.MaybeI,likelotsofpeople,canexaggeratejusthowmuchIwork.Maybemythirteenmonthsof“hardwork”includedfullmonthsinwhichIdidnoworkatall.MaybeIdidn’tliveasahermit.Maybe,ifyoucheckedmyFacebookprofile,you’dseepicturesofmeoutwithfriendsduringthissupposedhermitperiod.OrmaybeIwasahermit,butitwasnotself-imposed.MaybeIspentmanynightsalone,unabletowork,hopinginvainthatsomeonewouldcontactme.Maybenobodye-vitesmetoanything.MaybenobodymessagesmeonBumble.Everybodylies.Everynarratorisunreliable.