INRIASAC: Simple Hypernym Extraction Methods


INRIASAC: Simple Hypernym Extraction Methods:

a baseline

Gregory Grefenstette Inria Saclay

… yacht, four-wheel_drive, boat, sled, mountain_bike, recreational_vehicle, bicycle, tricycle, rocket, bulldozer, tugboat, jeep …

yacht < boat
tugboat < boat
mountain_bike < bicycle
airplane < aircraft
sedan < car
barge < boat
roadster < car
airplane < airship
boat < vessel
…

SemEval-2015 : Taxonomy Extraction Evaluation


8 domain lists (370 to 17584 terms each)

chemical: agarose, nickel sulfate heptahydrate, aminoglycan, pinoquercetin, lupanine, alkadiyne, …

equipment: storage equipment, strapping, traveling microscope, minneapolis-moline, aerial straps…

food: sauce gribiche, botifarra, phitti, food colouring, bean, limequat, kalach, ice cream, acini di pepe …

science: biological and physical, history of religions of eastern origins, linguistic anthropology, religion, …

WN_chemical: abo antibodies, acaricide, acaroid resin, acceptor, acetal, acetaldehyde, acetaldol, …

WN_equipment: acoustic modem, aerator, air search radar, amplifier, anti submarine rocket, apishamore…

WN_food: absinth, acidophilus milk, adobo, agar, aioli, alcohol, ale, alfalfa, allemande, allergy diet, …

WN_science: abnormal psychology, acoustics, aerology, aeromechanics, aeronautics, agrobiology …

Terms: one to nine words

Some terms were very short and ambiguous
• ga, os, tu, ada, aji, …

Some very long terms are rare
• udp-n-acetyl-alpha-d-muramoyl-l-alanyl-gamma-d-glutamyl-l-lysyl-d-alanyl-d-alanine
• korea advanced institute of science and technology satellite 4

Easy Part: substring inclusion

Suffix:
bicycle helmet < helmet
boar's tusk helmet < helmet
boeotian helmet < helmet
bok telescope < telescope
bradford robotic telescope < robotic telescope
bradford robotic telescope < telescope
broad band x-ray telescope < telescope
broad band x-ray telescope < x-ray telescope
bucket conveyor < conveyor
bulgarian m36 helmet < helmet

Prefix, A of B:
ats 56 g < ats
caterpillar d2 < caterpillar
caterpillar d8 < caterpillar
caterpillar d9 < caterpillar
dennis ds series < dennis
helmet of coţofeneşti < helmet
history of the telescope < history
list of agricultural equipment < list

Around ¼ of submissions

Type of substring errors

licorice < rice
surface to air missile system < surface
fruit 'n fibre < fruit

Omissions:
rataniaphenol iii < rataniaphenol
rataniaphenol i < rataniaphenol
rataniaphenol ii < rataniaphenol
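
The substring step can be illustrated with a minimal Python sketch. It is not the submitted system: the function name `substring_hypernyms` and the choice to match on whole words are assumptions made here for illustration. Word-level matching would avoid an error like licorice < rice, but it would still propose surface to air missile system < surface.

```python
def substring_hypernyms(term, vocabulary):
    """Return domain terms that are a proper word-level suffix or prefix of `term`.

    Hypothetical helper for illustration; `vocabulary` is the set of domain terms.
    """
    words = term.split()
    found = []
    # Suffixes, longest first: "bradford robotic telescope" -> "robotic telescope", "telescope"
    for i in range(1, len(words)):
        suffix = " ".join(words[i:])
        if suffix in vocabulary:
            found.append(suffix)
    # Prefixes, longest first: "helmet of coţofeneşti" -> "helmet"
    for i in range(len(words) - 1, 0, -1):
        prefix = " ".join(words[:i])
        if prefix in vocabulary and prefix not in found:
            found.append(prefix)
    return found


vocabulary = {"telescope", "robotic telescope", "helmet", "rice", "surface"}
print(substring_hypernyms("bradford robotic telescope", vocabulary))
# ['robotic telescope', 'telescope']
print(substring_hypernyms("licorice", vocabulary))
# []
```

Matching on whole words trades recall for precision: a single-word term never receives a candidate from this step and has to be handled by another method.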

Main Intuition

Hypernyms and hyponyms often occur together
- The poodle is a group of formal dog breeds…

Hypernyms are more common than hyponyms
- Though canid is rarer than dog

co-occurrence statistics, term frequencies in a collection of documents

Reference document collection

Wikipedia (but could have been web collection)
- Only text (no categories, redirects, titles, …)
- Sentencized, 125M sentences
- Porter stemmed

' Anarchism ' is a political philosophy that advocates stateless societies often defined as self-governed voluntary institutions , but that several authors have defined as more specific institutions based on non-hierarchical free associations.

anarch _ _ _ polit philosophi _ advoc stateless societi _ defin _ self-govern voluntari institut _ _ _ sever author _ defin _ _ specif institut base _ non-hierarch free associ

0 electro-mechanical systems
1 biological and physical
2 history of religions of eastern origins
3 linguistic anthropology
4 metaphysics

0 electro-mechan system
1 biolog _ physic
2 histori _ religion _ eastern origin
3 linguist anthropolog
4 metaphys
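
The normalization shown above (stop words replaced by "_", remaining words stemmed) might be sketched as follows. This is only an illustration under stated assumptions: the stop-word list is a small hypothetical sample, the tokenizer is a guess, and the Porter stemmer is not reimplemented here. Instead, `stem` is a pluggable function that defaults to the identity, so the output below is unstemmed.

```python
import re

# Assumption: a tiny illustrative stop-word list; the slides do not give the real one.
STOPWORDS = {"a", "and", "as", "but", "have", "is", "more", "of", "often", "that", "the"}


def normalize(text, stem=lambda word: word, stopwords=STOPWORDS):
    """Lowercase, tokenize, replace stop words by "_" and stem the other tokens.

    Keeping a "_" placeholder preserves word positions, so a multiword term
    such as "history of religions" still matches the same span in a sentence.
    """
    tokens = re.findall(r"\w+(?:[-']\w+)*", text.lower())
    return " ".join("_" if token in stopwords else stem(token) for token in tokens)


print(normalize("history of religions of eastern origins"))
# history _ religions _ eastern origins
```

Applying the same function to the domain terms and to the Wikipedia sentences means that term lookup reduces to matching token sequences.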

SemEval-2015 Task 17: Taxonomy Extraction Evaluation

Gather text statistics from Wikipedia dump
• Counts of term co-occurrence in the same sentence
• Document frequency of terms

Method:
• Consider all domain terms B co-occurring in the same Wikipedia sentences as a domain term A as possible hypernyms of A
• Eliminate any candidate B that appears in fewer documents than A (i.e., “less general”)
• Retain the N=3 most frequent remaining candidates as “hypernyms” of A

SemEval-2015 Task 17: Taxonomy Extraction Evaluation

Gather text statistics from Wikipedia dump
• presence of terms in the same sentence
• presence in the same document
• term frequency
• document frequency

Consider
• theory of inheritance < theory

Method:
• Consider all domain terms B co-occurring in the same Wikipedia sentences as a domain term A as possible hypernyms of A
• Eliminate any candidate B that appears in fewer documents than A (i.e., “less general”)
• Retain the N=3 most frequent candidates as “hypernyms” of A

#t1&t2  #doc1  #doc2   t1            t2
215     887    21977   biblic studi  theologi
111     887    383927  biblic studi  histori
50      887    64044   biblic studi  religion
43      887    412791  biblic studi  music
42      887    224983  biblic studi  scienc
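
The co-occurrence method can be sketched in Python as below. This is a minimal illustration, not the submitted system: the function names `gather_counts` and `select_hypernyms`, the data layout (a document as a list of already normalized sentences) and the alphabetical tie-break are assumptions made here. The usage example applies the selection step to the counts in the table above.

```python
from collections import Counter
from itertools import combinations


def gather_counts(documents, terms):
    """Count sentence-level co-occurrence and document frequency of domain terms.

    `documents` is a list of documents, each a list of normalized sentences.
    """
    cooc, doc_freq = Counter(), Counter()
    for sentences in documents:
        seen_in_doc = set()
        for sentence in sentences:
            padded = f" {sentence} "  # pad so that terms match on word boundaries
            present = sorted(t for t in terms if f" {t} " in padded)
            seen_in_doc.update(present)
            for a, b in combinations(present, 2):
                cooc[(a, b)] += 1
                cooc[(b, a)] += 1
        doc_freq.update(seen_in_doc)
    return cooc, doc_freq


def select_hypernyms(term, cooc, doc_freq, n=3):
    """Keep the n most frequent co-occurring terms that are not rarer than `term`."""
    candidates = [
        (count, b)
        for (a, b), count in cooc.items()
        if a == term and doc_freq[b] >= doc_freq[term]  # drop "less general" terms
    ]
    candidates.sort(key=lambda cb: (-cb[0], cb[1]))  # ties broken alphabetically
    return [b for _, b in candidates[:n]]


# Counts copied from the table above (stemmed forms).
A = "biblic studi"
cooc = {(A, "theologi"): 215, (A, "histori"): 111, (A, "religion"): 50,
        (A, "music"): 43, (A, "scienc"): 42}
doc_freq = {A: 887, "theologi": 21977, "histori": 383927, "religion": 64044,
            "music": 412791, "scienc": 224983}
print(select_hypernyms(A, cooc, doc_freq))
# ['theologi', 'histori', 'religion']
```

All five candidates in the table appear in more documents than biblic studi, so none is eliminated and the ranking by co-occurrence count decides the result.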

SemEval-2015 Task 17: Taxonomy Extraction Evaluation

Gather text statistics from Wikipedia dump
• presence of terms in the same sentence
• presence in the same document
• term frequency
• document frequency

Consider
• Subsequences (25% of correct answers found)
- source code < code
- theory of inheritance < theory

Method:
1. Align subterms
2. Otherwise,
• Consider all domain terms B co-occurring in the same Wikipedia sentences as a domain term A as possible hypernyms of A
• Eliminate any candidate B that appears in fewer documents than A (i.e., “less general”)
• Retain the N=3 most frequent candidates as “hypernyms” of A

#t1&t2  #doc1  #doc2   t1            t2
215     887    21977   biblic studi  theologi
111     887    383927  biblic studi  histori
50      887    64044   biblic studi  religion
43      887    412791  biblic studi  music
42      887    224983  biblic studi  scienc

biblical studies<theology
biblical studies<history
biblical studies<religion

science:
aerodynamics<engineering
aerodynamics<manufacturing
aerodynamics<science
aeronautical vehicles<engineering
aeronautical vehicles<manufacturing
aerospace engineering<engineering
aerospace engineering<mechanical engineering
aerospace engineering<science
african history<communication
african history<economics
african history<history
agricultural and resource economics<economics
agricultural and resource economics<nutrition
agricultural and resource economics<science
agronomy<economics
agronomy<engineering
agronomy<science
air traffic control<communication
air traffic control<engineering
air traffic control<instrumentation
algebra<analysis
algebra<mathematics
algebra<physics
algebraic geometry<algebra
algebraic geometry<mathematics
algebraic geometry<number theory
algorithms<analysis
algorithms<mathematics
algorithms<networking

food:
alpha-bits<cereal
alpha-bits<marshmallow
alpha-bits<sugar
alphabet pasta<pasta
alphabet pasta<sauce
alphabet pasta<soup
amandine<almond
amandine<cake
amandine<potato
amaranth<bean
amaranth<greens
amaranth<seed
amish friendship bread<bread
amish friendship bread<cake
amish friendship bread<yeast
amsterdam ossenworst<beef
amsterdam ossenworst<sausage
anadama bread<bread
anadama bread<cornmeal
anadama bread<yeast
anchovy essence<mayonnaise
anchovy essence<mustard
anchovy essence<sauce
andouille<pepper
andouille<pork
andouille<sausage
andouillette<egg
andouillette<pork
andouillette<sausage
anellini<barley
anellini<flour
anellini<pasta
angel food cake<cake
angel food cake<food
angel food cake<fruit

Missing terms

              Nb Terms   Missing   Percentage
WN_chemical       1350        23           2%
WN_equipment       475        10           2%
WN_food           1486        40           3%
WN_science         370         4           1%
chemical         17584      5210          30%
equipment          612       104          17%
food              1555        55           4%
science            452        37           8%

Results

http://alt.qcri.org/semeval2015/task17/

Future work

• Eliminate cycles
• Crawl web for missing terms when needed
• Compare results using a Web collection rather than Wikipedia