BBSSES Volume 5 Issue 4 [Year - 2014] ISSN 2321-9726 (online)
Bhartiya Bhasha, Shiksha, Sahitya evam Shodh, www.bhartiyashodh.com, Page 41
A Linguistic Approach to Hindi Name Entity Disambiguation
Brijesh Kumar Yadav
PhD Scholar, Dept. of Language Technology
Mahatma Gandhi International Hindi University, Wardha (MS)
Abstract
Handling Named Entities (NEs) remains an unsolved, open-ended problem for Natural Language
Processing (NLP). Recognizing NEs is as crucial a task as classifying them: before NEs can be
extracted, we must first decide how to recognize them. Named entities are often
mined for marketing initiatives. Considerable work has been done in this area from a
Machine Translation (MT) perspective for the major languages of the world, but very little
for Indian languages. Yet a great deal of information
can be retrieved from a name alone, such as personal identity, caste, dynasty, religion, and
locality. NEs also include geographic locations, ages, addresses, phone numbers, and
companies; in other words, proper nouns. For Hindi, a major Indo-Aryan
language of the Indian subcontinent, early work in this area was carried out at IIT Bombay and
IIIT Hyderabad. This paper discusses the problem of Hindi NEs and describes the
ambiguity they cause when they are not properly identified.
Along with simple NE data, the paper also discusses culture-specific NEs and their
relations, and offers some linguistic cues for recognizing them, since they make up a
large share of the problematic Hindi NEs.
Introduction
The term “Named Entity” (NE) is in current use in Information Extraction (IE)
applications. It was coined at the sixth Message Understanding Conference (MUC-6)
(Grishman & Sundheim 1996), which influenced IE research in the 1990s. At the time,
MUC was focusing on IE tasks in which structured information on company and
defense-related activities is extracted from unstructured text, such as newspaper articles. In
defining IE tasks, people noticed that it is essential to recognize information units such as
names, including person, organization, and location names, and numeric expressions,
including time, date, money, and percentages. Identifying references to these entities in
text was acknowledged as one of IE’s important sub-tasks and was called “Named Entity
Recognition (NER).” Before the NER field was recognized in 1996, significant research
had been conducted on extracting proper names from texts. A paper by Lisa F.
Rau (1991) is often cited as the root of the field. For more than fifteen years, a
dynamic research community has advanced the fundamental knowledge and the engineered
solutions needed to create NER systems. In its canonical form, the input of an NER system is a
text and the output is information on the boundaries and types of the NEs found in the text. The
vast majority of proposed systems fall into two categories: handmade rule-based
systems and supervised learning-based systems. In both approaches, large collections of
documents are analyzed by hand to obtain sufficient knowledge for designing rules or for
feeding machine learning algorithms. Expert linguists must carry out this substantial amount
of work, which in turn limits the building and maintenance of large-scale NER systems.
NER involves identifying proper names in texts and classifying them into a set of
predefined categories of interest, such as:
o Person names
o Organization names (companies, political parties, etc.)
o Locations (cities, countries, towns, villages, etc.)
o Date and monetary expressions
Note that named entities are proper names and are usually not found in dictionaries.
Language is the primary tool with which human beings talk about things in the
world and what those things do. To know a language is therefore to know the world.
As the Austrian-born philosopher Wittgenstein put it, “The limits of my language mean the limits of
my world.” But because we share this knowledge with our fellow beings, language is also a
means of communication. Modern linguistics also studies how knowledge about
language can be applied in computer applications (e.g. machine translation). Finally, we
can still say that language is an expression of thought, whether written, spoken, or
symbolic, through which people communicate with each other.
A named entity is a phrase that clearly identifies one item from a set of other
items that have similar attributes. Examples of named entities are first and last names,
geographic locations, ages, addresses, phone numbers, and companies. Named
entities are often mined for marketing initiatives.
A name is the most important identifier of a thing or a person;
identifying a person or thing without a name is very hard. If a name is simple and
distinctive, recognizing it is easy. A distinctive name sets a person apart,
and seeing a name often helps us
picture the person. Sometimes names occur as single words and sometimes as
compound words. Here we discuss names of Indian origin that are
ambiguous in meaning: the same word can be read as a proper name or as a common name,
which is problematic for MT.
Methodology:
Data were collected from magazines, newspapers, the voter lists of the Election
Commission of India, and social contacts (for example, students' names), wherever a name could be recognized as
a Hindi named entity.
Data Example:
Main names: aanand, khushboo, gambheer, gulaab, khushee, nainaa, sheetal, dheeraj, etc.
(आनंद, खुशबू, गंभीर, गुलाब, खुशी, नैना, शीतल, धीरज)
Multi-word names: raam prasaad (राम प्रसाद), soorya prakaash (सूर्य प्रकाश),
vijay saagar (विजय सागर), phool badan (फूल बदन),
raam din (राम दिन), prakaash singh baadal (प्रकाश सिंह बादल), etc.
Use in sentences:
1. khushboo aa rahee hai. (खुशबू आ रही है।)
Khushboo is coming. Or:
A fragrance is drifting in.
2. aanand aa gayaa. (आनंद आ गया।)
Anand came. Or:
(I) felt overjoyed.
3. raam prasaad khaa rahaa hai. (राम प्रसाद खा रहा है।)
Raam Prasaad is eating. Or:
Raam is eating prasaad (a consecrated food offering).
Consider sentence 1: it is hard to decide what khushboo means. Is it the name
of a person, or does it denote a fragrance? The same difficulty arises with
aanand in sentence 2, which may be a person's name or the feeling of joy.
Sentence 3 raises a further question: is the second element, prasaad,
part of the named entity (the subject) or a separate constituent (part of the predicate)? The
sentence is ambiguous.
The Hindi names above (names of Indian origin) are thus
ambiguous named entities. A great deal of
information can be retrieved from such names, such as personal identity, caste,
dynasty, religion (sometimes), and locality. Sometimes this information can be read from
the main name and sometimes from the surname or title. Identifying a name in
Roman script (English) is comparatively easy, because proper names there begin with a capital letter
as a convention of the writing system, whereas Devanagari, the script of Hindi, makes no
distinction between capital and small letters. This property of Devanagari is
problematic for computers and sometimes for human readers as well. Hindi translation is
currently a focus of attention in the world market, and many institutes are working in this area;
yet despite a great deal of work, MT has not reached satisfactory results. This
paper aims to help in understanding the complexity of named entities. It
highlights and discusses these kinds of ambiguity and how a computer can overcome
them.
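The point about capitalization can be checked directly. The following is a minimal sketch, not part of the original study; the helper names starts_with_capital and has_letter_case are hypothetical. It shows that a capital-letter cue is available for a romanized name but carries no information for the same name written in Devanagari.

```python
def starts_with_capital(word: str) -> bool:
    """Return True if the first character is an uppercase letter."""
    return word[:1].isupper()


def has_letter_case(word: str) -> bool:
    """Return True if the word's script distinguishes upper and lower case."""
    return word.upper() != word.lower()


# The same name in Roman script (capitalized and not) and in Devanagari.
for word in ["Anand", "anand", "आनंद"]:
    print(word, starts_with_capital(word), has_letter_case(word))
```

For a Devanagari token both checks are false, so a recognizer for Hindi has to rely on contextual cues instead.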
In this paper we try to identify where and how a main name or a multi-word
name (compound name) functions as a proper name (PPN) or a common name (CN). To find the NEs, the program
examines the three words to the left of the target word and, in the same way, the
three words to the right of it. If the three words before
the target word include a PPN, a CN, an honorific marker, or a post name, and the three words after
it include a postposition, the target word is marked as a proper name; otherwise the target
word is treated as a common name.
Syntax for computer:
W3 < W2 < W1 < WORD > W1 > W2 > W3 . . .
The syntax above identifies a name from its
sentence context, so the sentence must first be parsed and POS-tagged. The system then looks at the
two or three words before and the two or three words after the target word and decides whether that word is a name.
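The window rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the author's program: the tag labels (PPN, CN, HON, POSTNAME, PSP, OTHER, AMBIG), the function name classify_target, and the two example sentences are hypothetical, and the input is assumed to be already POS-tagged.

```python
from typing import List, Tuple

# Hypothetical coarse tags; the paper names the categories but gives no tagset.
PPN, CN, HON, POST_NAME, PSP, OTHER = "PPN", "CN", "HON", "POSTNAME", "PSP", "OTHER"

# Cues looked for in the left window: proper name, common name,
# honorific marker, or post name.
LEFT_CUES = {PPN, CN, HON, POST_NAME}


def classify_target(tagged: List[Tuple[str, str]], i: int, window: int = 3) -> str:
    """Label the ambiguous word at position i as PPN or CN.

    The word is a proper name only if a cue tag occurs within `window`
    words to its left AND a postposition occurs within `window` words
    to its right; otherwise it is a common name.
    """
    left = [tag for _, tag in tagged[max(0, i - window):i]]
    right = [tag for _, tag in tagged[i + 1:i + 1 + window]]
    has_left_cue = any(tag in LEFT_CUES for tag in left)
    has_right_postposition = any(tag == PSP for tag in right)
    return PPN if (has_left_cue and has_right_postposition) else CN


# Illustrative sentences (not from the paper's data); the target is "aanand".
# "shree aanand ne khaanaa khaayaa" (Mr. Anand ate food)
s1 = [("shree", HON), ("aanand", "AMBIG"), ("ne", PSP), ("khaanaa", CN), ("khaayaa", OTHER)]
# "mujhe bahut aanand aayaa" (I felt great joy)
s2 = [("mujhe", OTHER), ("bahut", OTHER), ("aanand", "AMBIG"), ("aayaa", OTHER)]

print(classify_target(s1, 1))
print(classify_target(s2, 2))
```

In the first sentence the honorific on the left and the postposition on the right satisfy both conditions, so the word is read as a proper name; in the second sentence neither cue is present, so it falls back to a common name.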
Overviews and Conclusion
The recent successful approaches to NER are based on Conditional
Random Fields (CRF), a method for sequence tagging. This method has already been
applied to English, Polish, Bulgarian, Arabic, and many other languages. The advantage of CRF over Hidden Markov
Models (HMM) or Maximum Entropy Markov Models (MEMM) is that CRF
can make use of additional features attached to a sequence of words.
New features must be constructed, and the best subset of all possible features
selected, in order to obtain optimal results.
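To make "additional features attached to a sequence of words" concrete, here is a minimal sketch of per-token feature extraction. It produces one feature dictionary per token, a representation that CRF toolkits commonly accept; no CRF is trained here. The feature names, the two-word context, and the POS labels in the example are assumptions chosen for illustration, not the feature set of any system cited in this paper.

```python
from typing import Dict, List


def token_features(tokens: List[str], pos_tags: List[str], i: int) -> Dict[str, object]:
    """Build a feature dictionary for the token at position i."""
    word = tokens[i]
    feats: Dict[str, object] = {
        "bias": 1.0,
        "word": word,
        "suffix2": word[-2:],  # last two characters of the token
        "pos": pos_tags[i],
        "is_first": i == 0,
        "is_last": i == len(tokens) - 1,
    }
    # Neighbouring words and their POS tags, up to two positions away.
    for offset in (-2, -1, 1, 2):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"word[{offset:+d}]"] = tokens[j]
            feats[f"pos[{offset:+d}]"] = pos_tags[j]
    return feats


def sentence_features(tokens: List[str], pos_tags: List[str]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token of one sentence."""
    return [token_features(tokens, pos_tags, i) for i in range(len(tokens))]


# Sentence 3 from the data, with hypothetical POS labels.
tokens = ["raam", "prasaad", "khaa", "rahaa", "hai"]
pos_tags = ["N", "N", "V", "AUX", "AUX"]
features = sentence_features(tokens, pos_tags)
print(features[1])
```

Feature selection then amounts to adding or removing keys in this dictionary and comparing tagging accuracy on held-out data.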
Some names are ambiguous by nature in any language, but
some languages, such as Hindi, pose additional complexities with names, as
shown in the data above. When a Hindi source text is translated into an English target text,
particularly by a machine translation system, named entities create problems and the
results are unsatisfactory. If these problems can be solved, the output
will be correct or nearly correct. Very little work has been done in this
area so far, and the research papers that exist are either unsatisfactory or only
indirectly related.
o A noun tagger was developed according to the nature of the Hindi script
o Its output is used as one of the features for a CRF-based NER system
o Gazetteers play an important role in improving the performance of a CRF-based NER system
o The majority-tag concept has shown some improvement in the performance of the CRF-based NER system
o Performance of the system can be improved using a POS tagger and chunkers
o Current work is limited to recognizing single- and double-word NEs.
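The gazetteer and majority-tag points in the list above can be sketched as two lookup features. This is an assumption-laden illustration: "majority tag" is taken here to mean the tag a word receives most often in annotated training data, and the gazetteer entries, tag labels (PER, O), and function names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_majority_tags(tagged_sentences: Iterable[List[Tuple[str, str]]]) -> Dict[str, str]:
    """Map each word to the tag it carries most often in the training data."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}


def lookup_features(word: str, gazetteer: Set[str], majority_tags: Dict[str, str]) -> Dict[str, object]:
    """Gazetteer membership and majority tag, usable as extra CRF features."""
    return {
        "in_gazetteer": word in gazetteer,
        "majority_tag": majority_tags.get(word, "UNK"),  # UNK for unseen words
    }


# Toy training data and gazetteer, invented for illustration only.
training = [
    [("aanand", "PER"), ("aayaa", "O")],
    [("aanand", "O")],
    [("aanand", "PER")],
]
person_gazetteer = {"aanand", "khushboo", "sheetal"}

majority = build_majority_tags(training)
print(lookup_features("aanand", person_gazetteer, majority))
print(lookup_features("gulaab", person_gazetteer, majority))
```

Neither feature resolves the ambiguity on its own, since an ambiguous word such as aanand is in the gazetteer in both of its readings; they are meant to be combined with contextual features.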