INRIASAC: Simple Hypernym Extraction Methods
… yacht four-wheel_drive boat sled mountain_bike recreational_vehicle bicycle tricycle rocket bulldozer tugboat jeep ..
yacht < boat
tugboat < boat
mountain_bike < bicycle
airplane < aircraft
sedan < car
barge < boat
roadster < car
airplane < airship
boat < vessel
…
SemEval-2015 : Taxonomy Extraction Evaluation
8 domain lists (370 to 17584 terms each)
chemical: agarose, nickel sulfate heptahydrate, aminoglycan, pinoquercetin, lupanine, alkadiyne, …
equipment: storage equipment, strapping, traveling microscope, minneapolis-moline, aerial straps…
food: sauce gribiche, botifarra, phitti, food colouring, bean, limequat, kalach, ice cream, acini di pepe …
science: biological and physical, history of religions of eastern origins, linguistic anthropology, religion, …
WN_chemical: abo antibodies, acaricide, acaroid resin, acceptor, acetal, acetaldehyde, acetaldol, …
WN_equipment: acoustic modem, aerator, air search radar, amplifier, anti submarine rocket, apishamore…
WN_food: absinth, acidophilus milk, adobo, agar, aioli, alcohol, ale, alfalfa, allemande, allergy diet, …
WN_science: abnormal psychology, acoustics, aerology, aeromechanics, aeronautics, agrobiology …
Terms: one to nine words
Some terms were very short and ambiguous
• ga, os, tu, ada, aji, …
Some were very long and rare
• udp-n-acetyl-alpha-d-muramoyl-l-alanyl-gamma-d-glutamyl-l-lysyl-d-alanyl-d-alanine
• korea advanced institute of science and technology satellite 4
Easy Part: substring inclusion
Suffix:
bicycle helmet < helmet
boar's tusk helmet < helmet
boeotian helmet < helmet
bok telescope < telescope
bradford robotic telescope < robotic telescope
bradford robotic telescope < telescope
broad band x-ray telescope < telescope
broad band x-ray telescope < x-ray telescope
bucket conveyor < conveyor
bulgarian m36 helmet < helmet
Prefix, including "A of B":
ats 56 g < ats
caterpillar d2 < caterpillar
caterpillar d8 < caterpillar
caterpillar d9 < caterpillar
dennis ds series < dennis
helmet of coţofeneşti < helmet
history of the telescope < history
list of agricultural equipment < list
Around ¼ of submissions
Types of substring errors
licorice < rice
surface to air missile system < surface
fruit 'n fibre < fruit
Omissions:
rataniaphenol iii < rataniaphenol
rataniaphenol i < rataniaphenol
rataniaphenol ii < rataniaphenol
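The substring heuristic and one of its error types can be illustrated with a minimal sketch. It is not the authors' code: the function names and the small `domain` list are hypothetical, and it assumes that an error like "licorice < rice" comes from matching on characters rather than on whole words.

```python
def char_substring_hypernyms(term, domain_terms):
    """Naive variant: any other domain term contained as a character substring."""
    return [other for other in domain_terms if other != term and other in term]


def word_subterm_hypernyms(term, domain_terms):
    """Word-level variant: other domain terms that are a proper word prefix
    or word suffix of `term`."""
    words = term.split()
    found = []
    for other in domain_terms:
        o = other.split()
        if 0 < len(o) < len(words) and (words[:len(o)] == o or words[-len(o):] == o):
            found.append(other)
    return found


# Hypothetical domain list built from terms that appear on the slides.
domain = [
    "helmet", "telescope", "robotic telescope", "rice", "licorice",
    "bradford robotic telescope", "surface", "surface to air missile system",
]

print(char_substring_hypernyms("licorice", domain))
print(word_subterm_hypernyms("licorice", domain))
print(word_subterm_hypernyms("bradford robotic telescope", domain))
print(word_subterm_hypernyms("surface to air missile system", domain))
```

Matching on whole words avoids "licorice < rice", but it still proposes "surface to air missile system < surface", so a word-level match is not sufficient on its own.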
Main Intuition
Hypernyms and hyponyms often occur together
- The poodle is a group of formal dog breeds…
Hypernyms are more common than hyponyms
- Though canid is rarer than dog
co-occurrence statistics, term frequencies in a collection of documents
Reference document collection
Wikipedia (but could have been a web collection)
- Only text (no categories, redirects, titles, …)
- Sentencized, 125M sentences
- Porter stemmed
' Anarchism ' is a political philosophy that advocates stateless societies often defined as self-governed voluntary institutions , but that several authors have defined as more specific institutions based on non-hierarchical free associations.
anarch _ _ _ polit philosophi _ advoc stateless societi _ defin _ self-govern voluntari institut _ _ _ sever author _ defin _ _ specif institut base _ non-hierarch free associ
0 electro-mechanical systems
1 biological and physical
2 history of religions of eastern origins
3 linguistic anthropology
4 metaphysics
0 electro-mechan system
1 biolog _ physic
2 histori _ religion _ eastern origin
3 linguist anthropolog
4 metaphys
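The normalization shown above, where function words become "_" and the remaining words are stemmed, can be sketched as follows. This is an illustration under assumptions: the stopword list is invented (the slides do not say which one was used), and `toy_stem` is a lookup table standing in for a Porter stemmer, filled only with stems that appear on the slide. A real run would pass an actual Porter implementation as `stem`, for example NLTK's `PorterStemmer().stem`, whose output may differ slightly.

```python
# Assumed stopword list: the slides do not say which list was used.
STOPWORDS = {"a", "and", "as", "is", "of", "that", "the"}

# Toy stand-in for a Porter stemmer, covering only the words used below.
# The stems are copied from the slide, not computed.
TOY_STEMS = {
    "biological": "biolog",
    "physical": "physic",
    "history": "histori",
    "religions": "religion",
    "origins": "origin",
}


def toy_stem(word):
    """Return the slide's stem for known words, the word itself otherwise."""
    return TOY_STEMS.get(word, word)


def normalize(text, stem=toy_stem, stopwords=STOPWORDS):
    """Lowercase, split on whitespace, mask stopwords with "_", stem the rest."""
    tokens = text.lower().split()
    return " ".join("_" if t in stopwords else stem(t) for t in tokens)


print(normalize("biological and physical"))
# biolog _ physic
print(normalize("history of religions of eastern origins"))
# histori _ religion _ eastern origin
```

Applying the same function to both the domain terms and the Wikipedia sentences keeps the two in the same form, so that a term can be looked up in a sentence by plain string matching.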
SemEval-2015 Task 17: Taxonomy Extraction Evaluation
Gather text statistics from Wikipedia dump
• Counts of term co-occurrence in the same sentence
• Document frequency of terms
Method:
• Consider all domain terms B co-occurring in the same Wikipedia sentences as a domain term A as possible hypernyms
• Eliminate any candidate B that appears in fewer documents than A (i.e., “less general”)
• Retain N=3 most frequent remaining candidates as “hypernyms” of A
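The method above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names and the toy corpus are hypothetical, each sentence is assumed to be already reduced to the set of domain terms it contains, and "most frequent" is read as the highest sentence co-occurrence count with A.

```python
from collections import Counter
from itertools import permutations


def gather_stats(documents):
    """documents: list of documents; each document is a list of sentences;
    each sentence is the set of domain terms found in it."""
    cooc = Counter()      # (a, b) -> number of sentences containing both
    doc_freq = Counter()  # term -> number of documents containing it
    for doc in documents:
        seen_in_doc = set()
        for sentence_terms in doc:
            seen_in_doc |= sentence_terms
            for a, b in permutations(sorted(sentence_terms), 2):
                cooc[(a, b)] += 1
        doc_freq.update(seen_in_doc)
    return cooc, doc_freq


def select_hypernyms(term, cooc, doc_freq, n=3):
    """Keep co-occurring terms that are at least as widespread as `term`,
    then return the n with the highest co-occurrence count."""
    candidates = [
        (count, b)
        for (a, b), count in cooc.items()
        if a == term and doc_freq[b] >= doc_freq[term]
    ]
    candidates.sort(key=lambda cb: (-cb[0], cb[1]))
    return [b for _, b in candidates[:n]]


# Toy corpus (hypothetical): four documents.
documents = [
    [{"poodle", "dog"}, {"dog", "animal"}],
    [{"poodle", "dog", "animal"}],
    [{"dog"}, {"animal"}],
    [{"animal"}],
]

cooc, doc_freq = gather_stats(documents)
print(select_hypernyms("poodle", cooc, doc_freq))  # ['dog', 'animal']
print(select_hypernyms("dog", cooc, doc_freq))     # ['animal']
print(select_hypernyms("animal", cooc, doc_freq))  # []
```

In the toy corpus "poodle" is never proposed as a hypernym of "dog", because it appears in fewer documents; that is the "less general" filter at work.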
Gather text statistics from Wikipedia dump
• presence of terms in the same sentence
• presence in the same document
• term frequency
• document frequency
Consider
• Subsequences (25% of correct answers found)
- source code < code
- theory of inheritance < theory
Method:
1. Align subterms
2. Otherwise,
• Consider all domain terms B co-occurring in the same Wikipedia sentences as a domain term A as possible hypernyms
• Eliminate any candidate B that appears in fewer documents than A (i.e., “less general”)
• Retain N=3 most frequent candidates as “hypernyms” of A
#t1&t2  #doc1  #doc2   t1            t2
215     887    21977   biblic studi  theologi
111     887    383927  biblic studi  histori
50      887    64044   biblic studi  religion
43      887    412791  biblic studi  music
42      887    224983  biblic studi  scienc

biblical studies<theology
biblical studies<history
biblical studies<religion
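The two-step method can be traced on the "biblical studies" counts above. The sketch below is an illustration under assumptions: the function names and the `DOMAIN` list are hypothetical, step 2 is assumed to run only when step 1 finds nothing, and "theori _ inherit" is a guess at the normalized form of "theory of inheritance". Ranking uses the sentence co-occurrence count (#t1&t2), which is the reading consistent with the three relations listed above.

```python
# (co-occurrence count, doc freq of t1, doc freq of t2, t1, t2), from the table.
ROWS = [
    (215, 887, 21977, "biblic studi", "theologi"),
    (111, 887, 383927, "biblic studi", "histori"),
    (50, 887, 64044, "biblic studi", "religion"),
    (43, 887, 412791, "biblic studi", "music"),
    (42, 887, 224983, "biblic studi", "scienc"),
]

# Hypothetical domain list in normalized form.
DOMAIN = ["theori", "theori _ inherit", "biblic studi", "theologi",
          "histori", "religion", "music", "scienc"]


def subterm_hypernyms(term, domain_terms):
    """Step 1: shorter domain terms whose words occur contiguously in `term`."""
    words = term.split()
    hits = []
    for other in domain_terms:
        o = other.split()
        if 0 < len(o) < len(words) and any(
            words[i:i + len(o)] == o for i in range(len(words) - len(o) + 1)
        ):
            hits.append(other)
    return hits


def cooccurrence_hypernyms(term, rows, n=3):
    """Step 2: drop candidates seen in fewer documents than `term`,
    then keep the n with the highest co-occurrence count."""
    kept = [(count, t2) for count, d1, d2, t1, t2 in rows
            if t1 == term and d2 >= d1]
    kept.sort(key=lambda ct: -ct[0])
    return [t2 for _, t2 in kept[:n]]


def hypernyms(term, domain_terms, rows, n=3):
    """Use subterm alignment when it finds something, co-occurrence otherwise."""
    return subterm_hypernyms(term, domain_terms) or cooccurrence_hypernyms(term, rows, n)


print(hypernyms("theori _ inherit", DOMAIN, ROWS))  # ['theori']
print(hypernyms("biblic studi", DOMAIN, ROWS))      # ['theologi', 'histori', 'religion']
```

All five candidates in the table appear in more documents than "biblic studi" (887), so none is eliminated and the top three by co-occurrence are kept. Ranking by document frequency instead would have selected music, history and science.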
science:
aerodynamics<engineering
aerodynamics<manufacturing
aerodynamics<science
aeronautical vehicles<engineering
aeronautical vehicles<manufacturing
aerospace engineering<engineering
aerospace engineering<mechanical engineering
aerospace engineering<science
african history<communication
african history<economics
african history<history
agricultural and resource economics<economics
agricultural and resource economics<nutrition
agricultural and resource economics<science
agronomy<economics
agronomy<engineering
agronomy<science
air traffic control<communication
air traffic control<engineering
air traffic control<instrumentation
algebra<analysis
algebra<mathematics
algebra<physics
algebraic geometry<algebra
algebraic geometry<mathematics
algebraic geometry<number theory
algorithms<analysis
algorithms<mathematics
algorithms<networking

food:
alpha-bits<cereal
alpha-bits<marshmallow
alpha-bits<sugar
alphabet pasta<pasta
alphabet pasta<sauce
alphabet pasta<soup
amandine<almond
amandine<cake
amandine<potato
amaranth<bean
amaranth<greens
amaranth<seed
amish friendship bread<bread
amish friendship bread<cake
amish friendship bread<yeast
amsterdam ossenworst<beef
amsterdam ossenworst<sausage
anadama bread<bread
anadama bread<cornmeal
anadama bread<yeast
anchovy essence<mayonnaise
anchovy essence<mustard
anchovy essence<sauce
andouille<pepper
andouille<pork
andouille<sausage
andouillette<egg
andouillette<pork
andouillette<sausage
anellini<barley
anellini<flour
anellini<pasta
angel food cake<cake
angel food cake<food
angel food cake<fruit
Missing terms
Domain        Nb Terms  Missing  Percentage
WN_chemical   1350      23       2%
WN_equipment  475       10       2%
WN_food       1486      40       3%
WN_science    370       4        1%
chemical      17584     5210     30%
equipment     612       104      17%
food          1555      55       4%
science       452       37       8%