A Linguistic Approach to Hindi Name Entity Disambiguation

BBSSES Volume 5 Issue 4 [Year - 2014] ISSN 2321 – 9726 (online)

Bhartiya Bhasha, Shiksha, Sahitya evam Shodh www.bhartiyashodh.com Page 41

Brijesh Kumar Yadav
PhD Scholar

Dept. of Language Technology
Mahatma Gandhi International Hindi University, Wardha (MS)

brijesh.bhu08@gmail.com

Abstract

Named Entities (NEs) remain an unsolved, open-ended problem for Natural Language Processing (NLP). Recognizing an NE is as crucial as classifying it: before NEs can be extracted, we must first decide how to recognize them. Named entities are often mined for marketing initiatives. Considerable work has been done in this area from a Machine Translation (MT) perspective for the major languages of the world, but very little for Indian languages. Yet a great deal of information can be retrieved from a name alone, such as personal identity, caste, dynasty, religion and locality. NEs also include geographic locations, ages, addresses, phone numbers and companies; in other words, proper nouns. For Hindi, a major Indo-Aryan language of the Indian subcontinent, this area was first addressed by IIT-Bombay and IIIT-Hyderabad. This paper discusses the problem of Hindi NEs and describes the ambiguity they cause when they are not properly identified.

Along with simple NE data, the paper also discusses culture-specific NEs and their relations, which make up the larger share of problematic Hindi NEs, and offers some linguistic inputs for recognizing them.

Introduction

The term “Named Entity” (NE) is in current use in Information Extraction (IE) applications. It was coined at the sixth Message Understanding Conference (MUC-6) (Grishman & Sundheim 1996), which influenced IE research in the 1990s. At the time, MUC was focusing on IE tasks in which structured information on company and defense-related activities is extracted from unstructured text, such as newspaper articles. In defining IE tasks, people noticed that it is essential to recognize information units such as names, including person, organization and location names, and numeric expressions, including time, date, money and percentages. Identifying references to these entities in text was acknowledged as one of IE's important sub-tasks and was called “Named Entity Recognition (NER)”. Even before the NER field was recognized in 1996, significant research had been conducted on extracting proper names from texts; a paper by Lisa F. Rau (1991) is often cited as the root of the field. For more than fifteen years, a dynamic research community has advanced the fundamental knowledge and the engineered solutions needed to create NER systems. In its canonical form, the input of an NER system is a text and the output is information on the boundaries and types of the NEs found in the text. The vast majority of proposed systems fall into two categories: handmade rule-based systems and supervised learning-based systems. In both approaches, large collections of documents are analyzed by hand to obtain sufficient knowledge for designing rules or for feeding machine learning algorithms. Expert linguists must carry out this substantial amount of work, which in turn limits the building and maintenance of large-scale NER systems.

NER involves identifying proper names in texts and classifying them into a set of predefined categories of interest, such as:

o Person names

o Organization names (companies, political parties etc.)

o Locations (cities, countries, towns, villages etc.)

o Dates and monetary expressions

Note that Named Entities are proper names and are usually not found in dictionaries.

Language is the primary tool with which human beings talk about things in the world and about what those things do. To know a language is therefore to know the world. As the Austrian-born philosopher Wittgenstein put it, “The limits of my language mean the limits of my world.” Because we share this knowledge with our fellow beings, language is also a means of communication. Modern linguistics also studies how knowledge about language can be applied in computer applications (e.g. machine translation). Finally, we can still say that language is an expression of thought, whether written, spoken or symbolic, through which people communicate with one another.

A named entity is a phrase that clearly identifies one item from a set of other items that have similar attributes. Examples of named entities are first and last names, geographic locations, ages, addresses, phone numbers and companies. Named entities are often mined for marketing initiatives.

A name is the most important identifier of a thing or a person; identifying either without a name is very hard. If a name is simple and distinctive, identifying it is easy. A distinct name sets a person apart, and seeing a person's name helps one picture him or her. Sometimes these names occur as separate words and sometimes as compound words. Here we discuss names of Indian origin that are ambiguous in meaning: the same word may be used as a proper name or as a common name, which is problematic for MT.

Methodology:

Data were collected from magazines, newspapers, the voter lists of the Election Commission of India, and social contacts (for example, students' names), all of which can be recognized as Hindi Named Entities.

Data Example:

Main Name: aanand, khushboo, gambheer, gulaab, khushee, nainaa, sheetal, dheeraj etc.

(आनंद, खुशबू, गंभीर, गुलाब, खुशी, नैना, शीतल, धीरज)

Multi-word Name: raam prasaad (राम प्रसाद), soorya prakaash (सूर्य प्रकाश), vijay saagar (विजय सागर), phool badan (फूल बदन), raam din (राम दिन), prakaash singh baadal (प्रकाश सिंह बादल) etc.

Use in sentences:

1. khushboo aa rahee hai. (खुशबू आ रही है।)

Khushboo is coming. Or: A fragrance is coming.

2. aanand aa gaya. (आनंद आ गया।)

Anand came. Or: (Someone) felt overjoyed.

3. raam prasaad kha rahaa hai. (राम प्रसाद खा रहा है।)

Raam Prasad is eating. Or: Raam is eating ambrosia (prasaad).

To describe the problem briefly: in sentence 1 it is difficult to determine what khushboo means. Is it a person's name, or does it denote a fragrance? The same holds for aanand in sentence 2. In the multi-word case (sentence 3) the difficulty concerns the second element of the name: is it part of the named entity (the subject) or part of the predicate? The sentence is ambiguous.

Looking at the Hindi names (names of Indian origin) above, we can say that they are ambiguous named entities. A great deal of information can be retrieved from named entities, such as personal identity, caste, dynasty, (sometimes) religion and locality. Sometimes this information can be identified from the main name, and sometimes from the surname or title. Identifying names in Roman script (English) is comparatively easy because names are written with capital letters, whereas the Devanagari script used for Hindi makes no distinction between capital and small letters. This property of Devanagari is problematic for computers and sometimes for human readers as well. Hindi translation is currently a focus of attention in the global market and many institutes are working in this area, yet despite a great deal of work MT has not reached satisfactory results. This paper is intended to help in understanding the complexity of named entities. It highlights and discusses these types of ambiguity and how a computer can overcome them.

In this paper we try to identify where and how a main name or a multi-word (compound) name functions as a proper name (PPN) or a common name (CN). The program finds each candidate NE and sorts it by examining three words to the left and three words to the right of the target word. If, among the three words before the target, the system finds a PPN, a CN, an honorific marker or a post name, and among the three words after the target it finds a postposition, then the target word is classed as a proper name; otherwise it is classed as a common name.

Syntax for computer:

W3 <W2 <W1< WORD >W1> W2> W3 . . .

In the above syntax we identify a name from its sentence context, so the sentence must first be parsed with POS tagging. The system looks at the two or three words before the target word and the two or three words after it, and then decides whether the word is a name.
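The window rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the system used for this paper: the tag labels (`HON`, `POST_NAME`, `PSP`, `UNK`), the function name `classify_target` and the first example sentence are hypothetical, and the input is assumed to be already POS-tagged.

```python
from typing import List, Tuple

# Hypothetical tag labels; the paper does not fix a tagset.
LEFT_CUES = {"PPN", "CN", "HON", "POST_NAME"}  # cues looked for before the target
RIGHT_CUE = "PSP"                              # postposition, looked for after the target


def classify_target(tagged: List[Tuple[str, str]], i: int, window: int = 3) -> str:
    """Return "PPN" if the word at index i looks like a proper name, else "CN".

    Rule from the text: a left cue within `window` words before the target
    AND a postposition within `window` words after it => proper name.
    """
    left = tagged[max(0, i - window):i]       # W3 < W2 < W1
    right = tagged[i + 1:i + 1 + window]      # W1 > W2 > W3
    has_left_cue = any(tag in LEFT_CUES for _, tag in left)
    has_right_cue = any(tag == RIGHT_CUE for _, tag in right)
    return "PPN" if has_left_cue and has_right_cue else "CN"


# Hypothetical sentence: an honorific before and a postposition after the target.
honorific_example = [("shree", "HON"), ("aanand", "UNK"), ("ne", "PSP"), ("kahaa", "VERB")]
# Sentence 1 from the data: no cue on either side of "khushboo".
bare_example = [("khushboo", "UNK"), ("aa", "VERB"), ("rahee", "AUX"), ("hai", "AUX")]

honorific_result = classify_target(honorific_example, 1)
bare_result = classify_target(bare_example, 0)
```

Under this rule a bare sentence such as sentence 1 falls back to the common-name reading, because neither the left nor the right window contains a cue; resolving it would need context beyond the window.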

Overviews and Conclusion

Recent successful approaches to NER are based on Conditional Random Fields (CRF), a method for sequence tagging. This method has already been applied to English, Polish, Bulgarian, Arabic and many other languages. Compared with Hidden Markov Models (HMM) or Maximum Entropy Markov Models (MEMM), Conditional Random Fields can make use of additional features attached to a sequence of words. New features must be constructed, and the best subset of all possible features selected, in order to obtain optimal results.

Every language has some names that are ambiguous by nature, but in some languages, such as Hindi, names present additional complexities, as shown in the data above. When Hindi source text is translated into English target text, or processed by a machine translation system, such names create problems and the results are unsatisfactory. If we can solve these problems, the output will be correct or nearly correct. Very little work has been done in this area so far, and the papers that exist are either unsatisfactory or only loosely related.

o A noun tagger based on the nature of the Hindi script was developed

o Its output is used as one of the features for a CRF-based NER system

o Gazetteers play an important role in improving the performance of a CRF-based NER system

o The majority tag concept has shown some improvement in the performance of a CRF-based NER system

o Performance of the system can be improved using a POS tagger and chunkers

o Current work is limited to recognizing single and double word NEs.
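As a rough illustration of how the points above could feed a CRF, the sketch below builds one feature dictionary per token from the noun tagger output, a gazetteer lookup and POS tags. It is a hedged sketch, not the system described here: the function name `token_features`, the feature names, the tag labels and the tiny gazetteer are all hypothetical, and the actual feature set is not specified in this paper.

```python
from typing import Dict, List, Set, Tuple


def token_features(
    tagged: List[Tuple[str, str]],  # (word, POS tag) pairs for one sentence
    i: int,                         # index of the token to describe
    gazetteer: Set[str],            # hypothetical list of known name words
    noun_tags: List[str],           # hypothetical noun tagger output, one per token
) -> Dict[str, object]:
    """Build a feature dictionary for token i, including its neighbours."""
    word, pos = tagged[i]
    feats: Dict[str, object] = {
        "word": word,
        "pos": pos,
        "noun_tag": noun_tags[i],
        "in_gazetteer": word in gazetteer,
    }
    if i > 0:
        feats["prev_word"], feats["prev_pos"] = tagged[i - 1]
    else:
        feats["BOS"] = True  # beginning of sentence
    if i < len(tagged) - 1:
        feats["next_word"], feats["next_pos"] = tagged[i + 1]
    else:
        feats["EOS"] = True  # end of sentence
    return feats


# Sentence 3 from the data, with made-up tags for illustration only.
sentence = [("raam", "UNK"), ("prasaad", "UNK"), ("kha", "VERB"),
            ("rahaa", "AUX"), ("hai", "AUX")]
noun_tags = ["N", "N", "O", "O", "O"]
gazetteer = {"raam", "prasaad"}

features = [token_features(sentence, i, gazetteer, noun_tags)
            for i in range(len(sentence))]
```

A list of such dictionaries per sentence, paired with a list of NE labels, is the general shape of input that feature-based sequence taggers work from; which features actually help would have to be decided by experiment.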
