ONTOLOGY LEARNING

Transcript of Hui Yang ( 杨慧 ) Language Technologies Institute School of Computer Science Carnegie Mellon...

Page 1: Hui Yang ( 杨慧 ) Language Technologies Institute School of Computer Science Carnegie Mellon University huiyang@cs.cmu.edu 5 Sep 2008 @ Xi’an Jiaotong University.

ONTOLOGY LEARNING

Page 2:

A LITTLE BIT OF CONTEXT… THE LANGUAGE TECHNOLOGIES INSTITUTE

Carnegie Mellon University’s School of Computer Science: 1 undergraduate program, 7 graduate departments (CSD, HCI, LTI, RI, SEI, MLD, etc.). The Language Technologies Institute is a graduate department in the School of Computer Science: about 25 faculty, about 125 graduate students (~85 PhD, ~40 MS), and about 30 visitors, post-docs, staff programmers, …

© 2005 JAMIE CALLAN


Page 3:

A LITTLE BIT OF CONTEXT… THE LANGUAGE TECHNOLOGIES INSTITUTE

LTI courses and research focus on:
- Machine translation, especially high-accuracy MT
- Natural language processing & computational linguistics
- Information retrieval & text mining
- Speech recognition & synthesis
- Computer-assisted language learning & intelligent tutoring
- Computational biology (“the language of the human genome”)
- … and combinations of the above: speech-to-speech MT, open-domain question answering, …


Page 4:

A LITTLE BIT ABOUT ME

My Research Interests: Text Mining, Information Retrieval, Natural Language Processing, Statistical Machine Learning

My Earlier Work: Question Answering, Multimedia Information Retrieval, Near-duplicate Detection, Opinion Detection

Page 5:

TODAY’S TALK

ONTOLOGY LEARNING

Page 6:

ROADMAP

Introduction

Subtasks in Ontology Learning

Human-Guided Ontology Learning

User Study

Metric-Based Ontology Learning

Experimental Results

Conclusions

Page 7:

INFORMATION RETRIEVAL TECHNOLOGIES

Web Search Engines have changed our life

Google’s great achievement

But, have Search Engines fulfilled Information Needs?

Some, only some

What does search bring to us?

Overwhelming Information in Search Results

Tedious manual judgment still needed

Page 8:

FIND A GOOD KINDERGARTEN IN THE PITTSBURGH AREA

Page 9:

BUY A USED CAR IN THE PITTSBURGH AREA

Page 10:

IT WILL BE GREAT TO HAVE

A process to:
- Crawl related documents
- Sort through relevant documents
- Identify important concepts/topics
- Organize materials

Page 11:

THIS IS EXACTLY THE TASK OF INFORMATION TRIAGE, OR PERSONAL ONTOLOGY LEARNING

Page 12:

INTRODUCTION

An ontology is a data model that represents a set of concepts within a domain and the pairwise relationships between those concepts.

Page 13:

EXAMPLE: A SIMPLE ONTOLOGY

[Figure: a simple ontology over the concepts “ball”, “table”, and “Game Equipment”]

Page 14:

EXAMPLE: WORDNET

Page 15:

EXAMPLE: ODP


Page 16:

INTRODUCTION

Ontology learning is the task of constructing a well-defined ontology given a text corpus or a set of concept terms.

Page 17:

INTRODUCTION

An ontology offers a concise way to summarize the important topics in a domain or collection.

An ontology facilitates knowledge sharing and reuse.

An ontology offers relational associations for reasoning and inference.

Page 18:

ROADMAP

Introduction

Subtasks in Ontology Learning

Human-Guided Ontology Learning

User Study

Metric-Based Ontology Learning

Experimental Results

Conclusions

Page 19:

SUBTASKS IN ONTOLOGY LEARNING

Concept Extraction

Synonym Detection

Relationship Formulation by Clustering

Cluster Labeling

Page 20:

SUBTASKS IN ONTOLOGY LEARNING

Concept Extraction

Synonym Detection

Relationship Formulation by Clustering

Cluster Labeling

Page 21:

CONCEPT EXTRACTION

Two Steps:

Noun N-gram and Named Entity Mining

Web-based Concept Filtering

Page 22:

NOUN N-GRAM MINING

I/PRP strongly/RB urge/VBP you/PRP to/TO cut/VB mercury/NN emissions/NNS from/IN power/NN plants/NNS by/IN 90/CD percent/NN by/IN 2008/CD ./.

Page 23:

NOUN N-GRAM MINING

I/PRP strongly/RB urge/VBP you/PRP to/TO cut/VB mercury/NN emissions/NNS from/IN power/NN plants/NNS by/IN 90/CD percent/NN by/IN 2008/CD ./.

Extracted Bi-grams

Mercury emissions

Power plants
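The bigram extraction above can be sketched as follows; a minimal illustration assuming Penn Treebank `word/TAG` input as shown on the slide:

```python
# Extract contiguous noun n-grams (bigrams and longer) from a
# POS-tagged sentence in "word/TAG" format (Penn Treebank tags).
def noun_ngrams(tagged_sentence, min_len=2):
    tokens = [t.rsplit("/", 1) for t in tagged_sentence.split()]
    ngrams, run = [], []
    for word, tag in tokens:
        if tag.startswith("NN"):      # NN, NNS, NNP, NNPS
            run.append(word)
        else:
            if len(run) >= min_len:
                ngrams.append(" ".join(run))
            run = []
    if len(run) >= min_len:
        ngrams.append(" ".join(run))
    return ngrams

sent = ("I/PRP strongly/RB urge/VBP you/PRP to/TO cut/VB mercury/NN "
        "emissions/NNS from/IN power/NN plants/NNS by/IN 90/CD "
        "percent/NN by/IN 2008/CD ./.")
print(noun_ngrams(sent))   # ['mercury emissions', 'power plants']
```

Running it on the slide’s sentence recovers exactly the two bigrams shown.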

Page 24:

Concept Filtering

Web-based POS error detection. Assumption: among the first 10 Google snippets, a valid concept appears more than a threshold number of times (4 in our case).

Remove POS errors, e.g., protect/NN polar/NN bear/NN.

Remove spelling errors, e.g., “Pullution”, “polor bear”.
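A sketch of the snippet-count filter described above; the `snippets` list is a stand-in for real top-10 Google results, and the threshold of 4 follows the slide:

```python
import re

# Web-based concept filtering: a candidate concept is kept only if it
# occurs, as an exact phrase, in enough of the top search snippets.
# The snippets below stand in for the top-10 Google results.
def is_valid_concept(concept, snippets, threshold=4):
    pattern = re.compile(re.escape(concept), re.IGNORECASE)
    hits = sum(1 for s in snippets if pattern.search(s))
    return hits > threshold   # slide: "appears more than a threshold"

snippets = (["Polar bear populations are declining in the Arctic."] * 6
            + ["Arctic wildlife conservation news."] * 4)
print(is_valid_concept("polar bear", snippets))   # True
print(is_valid_concept("polor bear", snippets))   # False: a misspelling rarely matches
```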

Page 25:

CONCEPT EXTRACTION

Page 26:

SUBTASKS IN ONTOLOGY LEARNING

Concept Extraction

Synonym Detection

Relationship Formulation by Clustering

Cluster Labeling

Page 27:

CLUSTERING

Hierarchical Clustering

Different Strategies for Concepts at Different Abstraction Levels

Page 28:

EXAMPLE: A SIMPLE ONTOLOGY

[Figure: the simple ontology (“ball”, “table”, “Game Equipment”) with the abstract level and the concrete level marked]

Page 29:

BOTTOM-UP HIERARCHICAL CLUSTERING

Concept candidates are organized into groups based on the 1st sense of the head noun in WordNet.

One of their common head nouns is selected as the parent concept for the group; e.g., “pollution” subsumes “water pollution” and “air pollution”.

This creates high-accuracy concept forests at the lower levels of the ontology.

Start from concrete concepts.
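The head-noun grouping step can be sketched as below; this toy version groups by the literal last token and skips the WordNet first-sense check the talk relies on:

```python
from collections import defaultdict

# Group concept candidates by their head noun (the last token of an
# English noun compound) and pick that shared head noun as the parent.
# A real implementation would disambiguate the head noun against its
# first WordNet sense, as the talk describes; this sketch just groups.
def head_noun_groups(concepts):
    groups = defaultdict(list)
    for c in concepts:
        head = c.split()[-1]          # head noun of the compound
        groups[head].append(c)
    # parent -> children fragments of the ontology forest
    return {head: kids for head, kids in groups.items() if len(kids) > 1}

concepts = ["water pollution", "air pollution", "mercury emissions",
            "power plants", "coal plants"]
print(head_noun_groups(concepts))
# {'pollution': ['water pollution', 'air pollution'],
#  'plants': ['power plants', 'coal plants']}
```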

Page 30:

ONTOLOGY FRAGMENTS

Different fragments are grouped.

Page 31:

CONTINUE TO BE BOTTOM-UP

Problem: still a forest; many concepts at the top level are not grouped.

Solution: clustering. But any clustering algorithm needs a metric, and it is hard to know the right metric for measuring distance between those top-level nodes.

Page 32:

HUMAN-GUIDED ONTOLOGY LEARNING

Learn what? A distance metric function.

Learn from what? Concepts at lower levels (since they are highly accurate), and user feedback.

After learning, then what? Apply the distance metric function to concepts at the higher level to get distance scores for them, then use any clustering algorithm to group them based on those scores.

Page 33:

TRAINING DATA FROM LOWER LEVELS

A set of concepts x(i) on the i-th level of the ontology hierarchy, and a distance matrix y(i). The entry corresponding to concepts x(i)_j and x(i)_k is y(i)_jk ∈ {0, 1}:

y(i)_jk = 0, if x(i)_j and x(i)_k are in the same group;
y(i)_jk = 1, otherwise.

Page 34:

TRAINING DATA FROM LOWER LEVELS

y(i) =
  0 0 1 1
  0 0 1 1
  1 1 0 0
  1 1 0 0
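A small sketch of how this training matrix can be built from group labels; `distance_matrix` and the 0/1 labels are illustrative names:

```python
import numpy as np

# Build the binary distance matrix y for one level from group labels:
# y[j, k] = 0 when concepts j and k share a group, 1 otherwise.
def distance_matrix(labels):
    labels = np.asarray(labels)
    return (labels[:, None] != labels[None, :]).astype(int)

# Four concepts, the first two in one group and the last two in
# another, reproducing the 4x4 matrix on the slide.
y = distance_matrix([0, 0, 1, 1])
print(y)
```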

Page 35:

LEARNING THE DISTANCE METRIC

The distance metric is represented as a Mahalanobis-style distance:

d(x(i)_j, x(i)_k) = Φ(x(i)_j, x(i)_k)^T A Φ(x(i)_j, x(i)_k)

where Φ(x(i)_j, x(i)_k) represents a set of pairwise underlying feature functions, and A is a positive semi-definite matrix, the parameter we need to learn.

Parameter estimation is done by minimizing squared errors.
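A toy numpy sketch of this estimation, assuming synthetic pairwise features and targets; it solves the least-squares problem for A and then projects onto the positive semi-definite cone, a simplification of the SDP formulation:

```python
import numpy as np

# Toy sketch of learning the Mahalanobis-style metric
#   d(x_j, x_k) = phi(x_j, x_k)^T A phi(x_j, x_k)
# by minimizing squared error against the 0/1 training distances and
# then projecting A onto the PSD cone. Feature vectors `phi` and
# targets `y` here are synthetic illustrations.
rng = np.random.default_rng(0)
n_pairs, n_feat = 40, 3
phi = rng.normal(size=(n_pairs, n_feat))        # pairwise feature vectors
y = rng.integers(0, 2, size=n_pairs)            # 0/1 target distances

# d = phi^T A phi is linear in the entries of A, so vectorize:
# d = (phi outer phi) . vec(A), and solve ordinary least squares.
X = np.einsum("ni,nj->nij", phi, phi).reshape(n_pairs, -1)
a, *_ = np.linalg.lstsq(X, y, rcond=None)
A = a.reshape(n_feat, n_feat)
A = (A + A.T) / 2                               # symmetrize
w, V = np.linalg.eigh(A)
A_psd = V @ np.diag(np.clip(w, 0, None)) @ V.T  # PSD projection

print(np.linalg.eigvalsh(A_psd).min() >= -1e-9)  # True: A is PSD
```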

Page 36:

SOLVE THE OPTIMIZATION PROBLEM

Optimization can be done by:
- Newton’s method
- Interior-point method
- Any standard semi-definite programming (SDP) solver, e.g., SeDuMi or YALMIP

Page 37:

GENERATE DISTANCE SCORES

We have learned A!

For any pair of concepts at the higher level (x(i+1)_l, x(i+1)_m), the corresponding entry in the distance matrix y(i+1) is obtained by applying the learned metric:

y(i+1)_lm = Φ(x(i+1)_l, x(i+1)_m)^T A Φ(x(i+1)_l, x(i+1)_m)

Page 38:

K-MEDOIDS CLUSTERING

Flat clustering at a level

Use one of the concepts as the cluster center

Estimate the number of clusters by the gap statistic [Tibshirani et al. 2000]
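A minimal k-medoids sketch over a precomputed distance matrix; the data and k here are illustrative, and a real run would choose k with the gap statistic:

```python
import numpy as np

# K-medoids over a precomputed distance matrix: cluster centers are
# actual concepts, as the talk requires. Plain alternating updates.
def k_medoids(D, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    n = D.shape[0]
    medoids = rng.choice(n, size=k, replace=False)
    for _ in range(n_iter):
        assign = np.argmin(D[:, medoids], axis=1)
        # new medoid of each cluster: member minimizing total distance
        new = np.array([
            members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
            for c in range(k)
            for members in [np.flatnonzero(assign == c)]
        ])
        if np.array_equal(new, medoids):
            break
        medoids = new
    assign = np.argmin(D[:, medoids], axis=1)
    return medoids, assign

# Two obvious groups: indices 0-2 close together, 3-5 close together.
pts = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(pts[:, None] - pts[None, :])
medoids, assign = k_medoids(D, k=2)
print(len(set(assign[:3])) == 1 and len(set(assign[3:])) == 1)  # True
```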

Page 39:

HUMAN-COMPUTER INTERACTION

Page 40:

SUBTASKS IN ONTOLOGY LEARNING

Concept Extraction

Synonym Detection

Relationship Formulation by Clustering

Cluster Labeling

Page 41:

CLUSTER LABELING

Problem: concepts are grouped together, but the group is nameless.

Solution: a web-based approach.
- Send a query formed by concatenating the child concepts to Google
- Parse the top 10 snippets
- The most frequent word is selected to be the parent of this group
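A sketch of this labeling step; the snippets are stand-ins for real Google results, and excluding the child terms themselves from the count is an added tweak for illustration:

```python
import re
from collections import Counter

# Web-based cluster labeling: query the concatenated child concepts,
# then take the most frequent content word across the top snippets as
# the parent label.
STOP = {"the", "a", "of", "and", "in", "is", "to", "are", "by", "their"}

def label_cluster(children, snippets):
    # Excluding the children's own words is an illustrative tweak so
    # the label names the group rather than repeating a member.
    child_words = {w for c in children for w in c.lower().split()}
    counts = Counter(
        w for s in snippets for w in re.findall(r"[a-z]+", s.lower())
        if w not in STOP and w not in child_words
    )
    return counts.most_common(1)[0][0]

children = ["water pollution", "air pollution", "noise pollution"]
snippets = [
    "Environmental contamination includes water and air pollution.",
    "Types of environmental pollution and causes.",
    "Environmental problems: pollution of water, air and soil.",
]
print(label_cluster(children, snippets))  # 'environmental'
```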

Page 42:

ROADMAP

Introduction

Subtasks in Ontology Learning

Human-Guided Ontology Learning

User Study

Metric-Based Ontology Learning

Experimental Results

Conclusions

Page 43:

USER STUDY

12 graduate students from political science at the University of Pittsburgh, divided into two groups: manual group (4), interactive group (8).

Task: construct an ontology hierarchy for 4 datasets: Mercury, Polar bear, Wolf, Toxics Release Inventory (TRI).

90-minute limit, or until the user is satisfied.

Page 44:

DATASETS

Page 45:

SOFTWARE USED FOR USER STUDY

Page 46:

QUALITY OF MANUAL VS. INTERACTIVE RUNS

Manual users show moderate agreement (0.4–0.6).

Interactive runs produce results of similar quality.

The difference between manual and interactive runs is NOT statistically significant.

Page 47:

COSTS OF MANUAL VS. INTERACTIVE RUNS

Interactive users make 40% fewer edits (statistically significant).

Interactive runs save 30–60 minutes per ontology.

Within interactive runs, a human spends 64% less time than in manual runs.

Page 48:

CONTRIBUTIONS OF HUMAN-GUIDED ONTOLOGY LEARNING

Effectively combines the strengths of automatic systems and human knowledge.

Combines many techniques into a unified framework: pattern-based (concept mining), knowledge-based (use of WordNet), Web-based (concept filtering and cluster naming), and machine learning.

A detailed independent user study.

Page 49:

WHAT TO IMPROVE?

Is bottom-up the best way to do it? Maybe not; incremental clustering saves the most effort.

We have used different techniques for concepts at different levels; how can we formally generalize this? Model concept abstractness explicitly.

We have tested on domain-specific corpora; what about more general-purpose corpora? Can we reconstruct WordNet or ODP?

Page 50:

ROADMAP

Introduction

Subtasks in Ontology Learning

Human-Guided Ontology Learning

User Study

Metric-Based Ontology Learning

Experimental Results

Conclusions

Page 51:

CHALLENGES

Hard to find a good name for a new group in a bottom-up clustering framework.

Formally model concept abstractness: intelligently use different techniques for concepts at different abstraction levels.

Flexibly incorporate heterogeneous features: state-of-the-art systems either use one type of semantic evidence to infer all relationships, or use one type of feature for a particular subtask.

Page 52:

CHALLENGES

Solution: incremental clustering.

Page 53:

CHALLENGES

Solution: learn statistical models for each abstraction level.

Page 54:

CHALLENGES

Solution: separate metric learning and ontology construction.

Page 55:

A UNIFIED SOLUTION

Metric-based Ontology Learning

Page 56:

LET’S BEGIN WITH SOME IMPORTANT DEFINITIONS

An ontology is a data model defined by a concept set, a relationship set, and a domain.

Page 57:

MORE DEFINITIONS

A Full Ontology

[Figure: “ball”, “table”, and “Game Equipment” all connected in a single hierarchy]

Page 58:

MORE DEFINITIONS

A Partial Ontology

[Figure: “ball” and “Game Equipment” connected, with “table” not yet attached]

Page 59:

MORE DEFINITIONS: ONTOLOGY METRIC

[Figure: the ontology annotated with edge weights (1.5, 2, 1, 1) and example pairwise distances: d(·, ·) = 2, d(·, ball) = 1, d(·, table) = 4.5]

Page 60:

MORE DEFINITIONS

[Figure: the same weighted ontology and its pairwise distances, labeled “Information in an Ontology T”]

Page 61:

MORE DEFINITIONS

[Figure: pairwise distances among the concepts on a single level, labeled “Information in a Level L”]

Page 62:

ASSUMPTIONS OF ONTOLOGY

Minimum Evolution Assumption: the optimal ontology is the one that introduces the least information change!

Page 63:

ASSUMPTIONS OF ONTOLOGY: Minimum Evolution Assumption

Page 64:

ASSUMPTIONS OF ONTOLOGY: Minimum Evolution Assumption

Page 65:

ASSUMPTIONS OF ONTOLOGY: Minimum Evolution Assumption

Page 66:

ASSUMPTIONS OF ONTOLOGY: Minimum Evolution Assumption

[Figure: “ball” added to the ontology]

Page 67:

ASSUMPTIONS OF ONTOLOGY: Minimum Evolution Assumption

[Figure: “ball” and “table” added to the ontology]

Page 68:

ASSUMPTIONS OF ONTOLOGY: Minimum Evolution Assumption

[Figure: “ball”, “table”, and “Game Equipment” in the ontology]

Page 69:

ASSUMPTIONS OF ONTOLOGY: Minimum Evolution Assumption

Page 70:

ASSUMPTIONS OF ONTOLOGY: Minimum Evolution Assumption

Page 71:

ASSUMPTIONS OF ONTOLOGY

Abstractness Assumption: each abstraction level has its own information function.

Page 72:

ASSUMPTIONS OF ONTOLOGY: Abstractness Assumption

[Figure: the simple ontology with “ball”, “table”, and “Game Equipment”]

Page 73:

FORMAL FORMULATION OF ONTOLOGY LEARNING

The task of ontology learning is defined as the construction of a full ontology T, given a set of concepts C and an initial partial ontology T0 (note: T0 could be empty).

Keep adding concepts from C into T0 until a full ontology is formed.

Page 74:

GOAL OF ONTOLOGY LEARNING

Find the optimal full ontology such that the information change since T0 is least.

Note that this follows from the Minimum Evolution Assumption.

Page 75:

GET TO THE GOAL

Goal: since the optimal set of concepts is always C, concepts are added incrementally.

Page 76:

GET TO THE GOAL

Plug in the definition of information change, then transform it into a minimization problem: the Minimum Evolution objective function.

Page 77:

EXPLICITLY MODEL ABSTRACTNESS

Model abstractness for each level by a least-squares fit.

Plug in the definition of the amount of information for an abstraction level: the Abstractness objective function.

Page 78:

MULTIPLE-CRITERION OPTIMIZATION FUNCTION

The Minimum Evolution objective function and the Abstractness objective function are combined via a scalarization variable.

Page 79:

THE OPTIMIZATION ALGORITHM
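The algorithm itself did not survive in this transcript; what follows is only a hedged sketch, consistent with the incremental formulation above, that attaches each concept where the objective changes least. The `cost` function is a purely illustrative stand-in for the scalarized objective:

```python
# Hedged sketch of an incremental construction loop consistent with the
# minimum-evolution formulation: insert each remaining concept at the
# candidate position that changes the objective least. `cost` stands in
# for the scalarized Minimum Evolution + Abstractness objective.
def build_ontology(concepts, partial_edges, cost):
    edges = list(partial_edges)                     # T0, possibly empty
    placed = {c for e in partial_edges for c in e} or set(concepts[:1])
    for c in concepts:
        if c in placed:
            continue
        # try attaching c under every already-placed concept
        best = min(((cost(edges + [(p, c)]), p) for p in placed),
                   key=lambda t: t[0])
        edges.append((best[1], c))
        placed.add(c)
    return edges

# Toy cost: prefer shallow trees (sum of node depths), purely illustrative.
def depth(edges, node):
    parents = {child: parent for parent, child in edges}
    d = 0
    while node in parents:
        node, d = parents[node], d + 1
    return d

toy_cost = lambda edges: sum(depth(edges, c) for _, c in edges)
tree = build_ontology(["Equipment", "ball", "table"], [], toy_cost)
print(tree)  # [('Equipment', 'ball'), ('Equipment', 'table')]
```

With the toy cost, both “ball” and “table” attach directly under “Equipment”, since nesting one under the other would deepen the tree.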

Page 80:

ESTIMATING ONTOLOGY METRIC

Assume the ontology metric is a linear interpolation of underlying feature functions.

Use ridge regression to estimate and predict the ontology metric.
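A minimal ridge-regression sketch with synthetic features and distances; `ridge_fit` implements the standard closed form, not the talk's exact setup:

```python
import numpy as np

# Ridge regression for the ontology metric: distances modeled as a
# linear combination of underlying feature functions, with an L2
# penalty. Closed form: w = (X^T X + lam * I)^{-1} X^T y.
def ridge_fit(X, y, lam=1.0):
    n_feat = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_feat), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                # synthetic pairwise features
w_true = np.array([0.5, -1.0, 2.0, 0.0])    # synthetic true weights
y = X @ w_true + 0.01 * rng.normal(size=50)

w = ridge_fit(X, y, lam=0.1)
print(np.allclose(w, w_true, atol=0.1))     # True
```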

Page 81:

FEATURES

Google KL-Divergence

Wikipedia KL-Divergence

Google Minipar Syntactic distance

Lexico-Syntactic Patterns

Term Co-occurrence

Word Length Difference
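As an illustration of the first two features, a smoothed KL divergence between two concepts' context-word distributions; the contexts here are tiny stand-ins for retrieved Google or Wikipedia text:

```python
import math
from collections import Counter

# KL divergence between the word distributions of two concepts'
# search contexts. Real contexts would come from retrieved snippets;
# these strings are stand-ins.
def kl_divergence(text_p, text_q, alpha=0.01):
    p = Counter(text_p.lower().split())
    q = Counter(text_q.lower().split())
    vocab = set(p) | set(q)
    def prob(counts, w):  # add-alpha smoothing keeps KL finite
        return (counts[w] + alpha) / (sum(counts.values()) + alpha * len(vocab))
    return sum(prob(p, w) * math.log(prob(p, w) / prob(q, w))
               for w in vocab)

same = kl_divergence("mercury emissions from plants",
                     "mercury emissions from plants")
diff = kl_divergence("mercury emissions from plants",
                     "kindergarten schools in pittsburgh")
print(same < diff)  # True: related contexts diverge less
```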

Page 82:

EVALUATION

Reconstruct subdirectories from WordNet and ODP

50 WordNet subdirectories are from 12 topics: gathering, professional, people, building, place, milk, meal, water, beverage, alcohol, dish and herb

50 ODP subdirectories are from 16 topics: computers, robotics, intranet, mobile computing, database, operating system, linux, tex, software, computer science, data communication, algorithms, data formats, security, multimedia and artificial intelligence

Page 83:

DATASETS

Page 84:

ONTOLOGY RECONSTRUCTION

An absolute gain of 10% compared to the state-of-the-art system developed at Stanford University (the ACL 2006 Best Paper system)

Page 85:

INTERACTION OF ABSTRACTION LEVELS AND FEATURES

Abstract concepts are sensitive to the explicit modeling: good modeling of abstract concepts greatly boosts performance.

Contributions from different features vary for abstract concepts; for concrete concepts they are largely indifferent.

Simple features (term co-occurrence, word length) work best.

A combination of heterogeneous features works better than individual features.

Page 86:

CONTRIBUTIONS OF METRIC-BASED ONTOLOGY LEARNING

Avoids, and hence solves, the problem of unknown group names.

Tackles the problem of no control over concept abstractness: experiments show that concepts at different abstraction levels behave differently and are sensitive to different features.

Provides a solution for incorporating heterogeneous features.

An absolute gain of 10% in precision on both WordNet and ODP over a state-of-the-art system.

Page 87:

WE HAVE TALKED ABOUT

The Task of Information Triage and Personal Ontology Learning

Human-guided Ontology Learning

Metric-based Ontology Learning

Page 88:

AT THE BEGINNING, WE SAID:

IT WILL BE GREAT TO HAVE

A process to:
- Crawl related documents
- Sort through relevant documents
- Identify important concepts/topics
- Organize materials

Page 89:

FIND A GOOD KINDERGARTEN IN THE PITTSBURGH AREA

Are we there yet?

Page 90:

THE KINDERGARTEN EXAMPLE

We are DONE with the organization! However, does it support further inference for sound decision making? Maybe not. Future work!

More future work: model multiple relationships simultaneously; more efficient distance metric learning.

Page 91:

THANK YOU AND QUESTIONS