Application of the Near Miss Strategy and Edit Distance to Handle Dirty Data




Cihan Varol1, Coskun Bayrak1, Rick Wagner2 and Dana Goff2

1Computer Science Department, University of Arkansas at Little Rock, 2801 S. University Ave., Little Rock, AR 72204, USA and 2Acxiom Corporation, Little Rock, AR, USA

1 Introduction

In today’s information age, processing customer information in a standardized and accurate manner is known to be a difficult task. Data collection methods vary from source to source by format, volume, and media type. Therefore, it is advantageous to deploy customized data hygiene techniques to standardize the data for meaningfulness and usefulness based on the organization. A standardized and accurate data set can have the following advantages [25]:

• Lowering cost by limiting erroneous data used in marketing.

• Increasing performance since post-processing can assume that the input field is formatted the same every time.

• Allowing multiple data sets to be run through one standardized process.

It is also important to understand which techniques can be used to improve data accuracy. Because of human errors, data collection methods can often produce incorrect and meaningless data. Extracting the most relevant information is a complex process because data can be freely formatted and voluminous. Moreover, understanding the impact on customer satisfaction caused by ill-defined/dirty data (misspelled or mistyped data) is more challenging. Most of the time, the use of data administrators or a tool that has limited capabilities to correct the mistyped information can cause many problems. Therefore, the more accurate the selected word is, the more useful the information that is retrieved. This is why one goal of the data processing industry is to make source data more meaningful. This can be accomplished by utilizing more effective tools and sequences or statistical approaches to provide a suggestion table. For these reasons, the Personal Name Recognition Strategy (PNRS) was developed to provide the closest match for a misspelled name. In Section 2, some relevant definitions and background of the system are introduced. The strategies and methodology that are used in PNRS to overcome the problem are discussed in Section 3. In the final section, the paper is concluded by summarizing the test results.

2 Background

Before discussing the techniques that are being used, it is necessary to point out the different types of errors targeted in this research in order to judge the effectiveness of the solutions. The study presented here deals primarily with four types of errors. The primary type of error is the isolated-word error [14]: a single misspelled or mistyped word that can be captured with simple techniques. As the name suggests, isolated-word errors are invalid strings, properly identified and isolated as incorrect representations of a valid word [3]. There are many isolated-word error correction applications (for an exhaustive list, see [14]) and different issues and techniques may pertain to specific applications. The primary isolated errors are as follows:

• Typographic errors

• Cognitive errors

• Phonetic errors

• OCR errors

Typographic errors (also known as fat-fingering) occur when one letter is accidentally typed in place of another, for example, typing “teh” while trying to type “the”. These errors are based on the assumption that the writer or typist knows how to spell the word, but may have typed the word in a rush [14]. Cognitive errors refer to situations where the writer or typist chooses an incorrect spelling due to lack of knowledge of the correct one, for example, the incorrect spelling of “piece” as “piexe” [14]. Phonetic errors can be seen as a subset of cognitive errors. These errors are made when the writer substitutes letters into a word where the sound of it is mistakenly believed to be correct, which in fact leads to a misspelling, for example, spelling “naïve” as “nyeve” [2]. Optical Character Recognition (OCR) errors arise from OCR misinterpretations of the original document [21]. These errors include the merging and splitting of words and characters, or incorrect framing of characters that usually results in one-to-many mappings, insertions of characters, deletions of characters, and rejections of characters due to low confidence levels in recognition. OCR-based errors are as follows:

• Threshold error: Since each OCR device has a minimal recognition threshold that determines the confidence level in recognizing a particular character or symbol, some characters may not be understood.

• Substitution error: This error appears when one character or symbol in a document is translated as another, i.e., different from the intended character or symbol. For example, an i can become a 1, or an o can become a 0 (zero).

• Insertion or Deletion error: This error happens when the OCR device picks up an additional character or deletes a character from the original document. The frequent insertion of white space characters, due to inconsistent font spacing, is a good example of this type of error.

• Framing error: This is the case when the mapping between a character and its output is not one to one. For example, an m might become iii or the cl character set might become a d [4].

The distinctions among these categories of spelling errors are quite subtle and, in fact, the categories overlap, making it impossible to categorize all errors precisely. Fortunately, spelling correction techniques generally do not require that errors be placed precisely into one of the above categories.

2.1 Techniques used for General Spelling Error Correction

Once a potential problem associated with a word has been detected, the correction of that word is another issue in the spelling correction process. As stated, there are many isolated-word error correction applications, and these techniques decompose the problem into a sequence of three sub-problems [14]: detecting the error, generating the candidate corrections, and ranking the candidate solutions. Five main categories can be defined for the techniques used for isolated-word error correction, as follows:

2.1.1 Minimum edit distance techniques

The edit distance between two strings is defined as the smallest number of insertions, deletions, and substitutions required to transform one string into the other [13]. Minimum edit distance techniques have been applied to virtually all spelling correction tasks, including text editing and natural language interfaces, and the spelling correction accuracy varies with applications and algorithms. Shortly after the definition of the edit distance, Damerau [8] reported a 95 percent correction rate for single-error misspellings on a test set of 964 misspellings of medium and long words (length 5 or more characters) while using a lexicon of 1,593 words. However, his overall correction rate was 84 percent when multi-error misspellings were counted. On the other hand, Durham et al. [9] report an overall 27 percent correction rate for a very simple, fast, and plain single-error correction algorithm accessing a keyword lexicon of about 100 entries. Although the rate seems low, the authors report a high degree of user satisfaction for this command language interface application due to their algorithm’s unobtrusiveness.
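To make the metric concrete, the edit distance defined above can be computed with the classic dynamic-programming recurrence. This is a standard textbook sketch in Python; the function name is ours, not from the paper:

```python
def levenshtein(a, b):
    """Smallest number of insertions, deletions, and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))            # distances from "" to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]                            # deleting the first i characters of a
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution (or match)
        prev = curr
    return prev[-1]

# The cognitive error from Section 2 is one substitution away:
print(levenshtein("piexe", "piece"))  # → 1
```

The two-row formulation keeps memory linear in the length of the shorter string, which matters when scoring a misspelling against every entry of a large lexicon.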

In recent work, Brill and Moore [5] report experiments with modeling more powerful edit operations, allowing generic string-to-string edits. Moreover, additional heuristics are also used to complement techniques based on edit distance. For instance, in the case of typographic errors, the keyboard layout is very important: it is much more common to accidentally substitute one key for another if they are placed near each other on the keyboard.

AGREP [27, 28], a tool based on an extension of the edit distance algorithm to find the best match, uses several different algorithms for optimal performance with different search criteria. For simple patterns with errors, AGREP uses the Boyer-Moore algorithm with a partition scheme (see [28] for details of partitioning). AGREP essentially uses arrays of binary vectors and pattern matching, comparing each character of the query word in order to determine the best matching lexicon word.

Moreover, similar measures are used to compute a distance between DNA sequences (strings over {A, C, G, T}) or proteins. The measures are used to find genes or proteins that may have shared functions or properties and to infer family relationships and evolutionary trees over different organisms [17].

2.1.2 Soundex and Phonetic Strategy

The SOUNDEX [19], which is used to correct phonetic spellings, maps a string into a key consisting of its first letter followed by a sequence of digits. It takes an English word and produces a four-character representation, which is a primitive way to preserve the salient features of the phonetic pronunciation of the word. On the other hand, the metaphone algorithm is also a system for transforming words into codes based on phonetic properties [19, 20]. Unlike Soundex, which operates on a letter-by-letter scheme, metaphone analyzes both single consonants and groups of letters called diphthongs according to a set of rules for grouping consonants and then maps groups to metaphone codes [14]. The disadvantage of the metaphone algorithm is that it is specific to the English language.
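For illustration, here is a minimal Python sketch of the American Soundex mapping described above: the first letter is kept, the remaining letters are mapped to digit classes, adjacent duplicate codes are collapsed (H and W do not break a run, vowels do), and the key is padded or truncated to four characters. This is the standard formulation, not code from the paper:

```python
def soundex(name):
    """American Soundex: first letter followed by three digits."""
    name = "".join(c for c in name.upper() if c.isalpha())
    if not name:
        return ""
    mapping = {}
    for group, digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
        for ch in group:
            mapping[ch] = digit
    result = name[0]
    prev = mapping.get(name[0], "")       # code of the letter just seen
    for ch in name[1:]:
        if ch in "HW":
            continue                       # H and W do not reset the previous code
        code = mapping.get(ch, "")
        if code and code != prev:
            result += code
        prev = code                        # vowels reset prev to ""
    return (result + "000")[:4]

print(soundex("Robert"), soundex("Rupert"))  # → R163 R163
```

Because “Robert” and “Rupert” collapse to the same key, a phonetic misspelling of one would still retrieve the other as a candidate, which is exactly how the phonetic strategy in Section 3 pools suggestions.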

A significant and meaningful study by Veronis [26] devised a modified dynamic-programming algorithm that differentially weights edit distances based on phonemic similarity. This modification is necessary because phonetic misspellings frequently result in greater deviation from the correct orthographic spelling [14].

2.1.3 Rule-based techniques

Rule-based techniques attempt to use the knowledge gained from spelling error patterns and write heuristics that take advantage of this knowledge. For example, if it is known that many errors occur from the letters “ie” being typed “ei”, then we may write a rule that represents this [29].


2.1.4 N-gram-based techniques

The character n-gram-based technique coincides with the character n-gram analysis in non-word detection. However, instead of observing certain bi-grams and tri-grams of letters that never or rarely occur, this technique can calculate the likelihood of one character following another and use this information to find possible correct word candidates [24].
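A small sketch of the idea, assuming a plain word list as training data: character bi-gram counts over a lexicon give a rough plausibility score for a candidate's letter sequence. The function names and boundary markers are illustrative choices, not from the paper:

```python
from collections import Counter

def bigram_counts(lexicon):
    """Count character bi-grams over a word list, with ^ and $ as
    word-boundary markers so starts and ends are scored too."""
    counts = Counter()
    for word in lexicon:
        padded = "^" + word.lower() + "$"
        counts.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return counts

def bigram_score(word, counts):
    """Sum of bi-gram counts: higher means a more plausible letter sequence."""
    padded = "^" + word.lower() + "$"
    return sum(counts[padded[i:i + 2]] for i in range(len(padded) - 1))

counts = bigram_counts(["the", "then", "they", "ten"])
# "teh" contains the rare bi-grams "eh" and "h$", so it scores lower than "the":
print(bigram_score("the", counts), bigram_score("teh", counts))
```

In a real corrector the raw counts would be normalized into transition probabilities, but even unnormalized counts separate common sequences from improbable ones.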

2.1.5 Probabilistic techniques and Neural Nets

Naturally, n-grams can be used to calculate probabilities, and this has led to the probabilistic techniques demonstrated by Lee [12]. In particular, transition probabilities can be trained using n-grams from a large corpus, and these n-grams can then represent the likelihood of one character following another. Confusion probabilities state the likelihood of one character being mistaken for another [10]. Neural net techniques have emerged as likely candidates for spelling correctors due to their ability to do associative recall based on incomplete and noisy data. This means that they are trained on the spelling errors themselves and carry the ability to adapt to the specific spelling error patterns that they are trained upon [23].

2.2 Domain-Specific Correction

An approach to spelling error correction that has received relatively less attention is the domain-specific approach. Corpus-specific approaches may fall in the same category, although text found in the same corpus does not necessarily imply that it falls in the same domain. For example, the British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, and these samples do not pertain to any particular domain. In contrast to this, the ACL Anthology is a corpus in the domain of Computational Linguistics. Tillenius [22] reports on a method for generating and ranking spelling errors. This approach investigated an efficient way to combine the edit distance metric between two strings with a word frequency dictionary and word bi-grams to achieve better results for spelling error corrections. After having analyzed both a corpus of unedited news articles and a corpus of student essays, Tillenius [22] finds that, when using both edit distance and word frequencies, word frequencies should be more important than edit distance. Tillenius [22] concludes that both modified edit distance and word frequencies gave a good ranking of valid candidate words for the spelling errors, with a combined correction percentage of 76%. More recently, Mihov et al. [15] report on using the web as a dynamic secondary dictionary, based on the assumption that the content of web pages belonging to a specific thematic area will provide a better basis for a relevant dictionary. They experiment with building the dictionary from words gathered from the body of web pages that were retrieved by sending a relevant query to the AllTheWeb search engine. The resulting top 100 web pages returned would then be used to build the dynamic dictionary. Using this method, they report a slightly increased accuracy rate in error corrections when using the dynamically built domain-specific dictionary combined with a conventional one.

3 Individual Name Spelling Correction Algorithm: the Personal Name Recognition Strategy (PNRS)

The PNRS, introduced in this study, combines not only the strategies discussed above but also new ones to find the closest match (Figure 1). Before applying any techniques to suggest a valid word for a particular field, the information in the proper place needs to be free of non-ASCII characters. The PNRS approach for removing the non-ASCII characters requires (1) keeping the original blank spaces in the data and later using them as delimiters, (2) removing the non-ASCII characters from the records, and (3) consolidating the partitioned word pieces. After removing the non-ASCII characters from the input data, PNRS invokes certain algorithms to produce the alternative suggestions.

Fig. 1. PNRS Data Correction Process
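The three-step cleanup above can be sketched as follows. This is one minimal reading of steps (1)-(3), with an assumed helper name, not the tool's actual implementation:

```python
def strip_non_ascii(record):
    """Sketch of the PNRS cleanup: (1) keep the original blank spaces as
    delimiters, (2) drop non-ASCII characters, (3) consolidate the pieces."""
    tokens = record.split(" ")                              # (1) blanks delimit fields
    cleaned = ["".join(ch for ch in tok if ord(ch) < 128)   # (2) remove non-ASCII
               for tok in tokens]
    return " ".join(t for t in cleaned if t)                # (3) rejoin word pieces

print(strip_non_ascii("Jo\u00e9hn Smith"))  # → John Smith
```

Removing the stray character inside a token, rather than splitting on it, is what keeps a word partitioned by noise (“Jo·hn”) consolidated as one piece.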

The near miss strategy is a fairly simple way to generate suggestions. Two words are considered near if they can be made identical by inserting a blank space, interchanging two adjacent letters, changing one letter, deleting one letter, or adding one letter [16]. If a valid word is generated using these techniques, then it is added to the temporary suggestion list. However, the near miss strategy does not provide the best list of suggestions when a word is truly misspelled. That is where the phonetic strategy takes place. As discussed above, a phonetic code is a rough approximation of how the word sounds [14]. The English written language is a truly phonetic code, which means each sound in a word is represented by a symbol or sound picture. The applied phonetic strategy compares the phonetic code of the misspelled word to all the words in the word list. If the phonetic codes match, then the word is added to the temporary suggestion list.
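The near miss generation described above admits a straightforward sketch: produce every string one blank-space insertion, adjacent swap, single change, deletion, or addition away, and keep those found in the lexicon. An illustrative Python version, not the PNRS source:

```python
import string

def near_miss_candidates(word, lexicon):
    """Two words are 'near' if one blank-space insertion, adjacent swap,
    changed letter, deleted letter, or added letter makes them identical."""
    cands = set()
    for i in range(len(word)):
        cands.add(word[:i] + word[i + 1:])                   # delete one letter
        for c in string.ascii_lowercase:
            cands.add(word[:i] + c + word[i + 1:])           # change one letter
    for i in range(len(word) + 1):
        for c in string.ascii_lowercase:
            cands.add(word[:i] + c + word[i:])               # add one letter
    for i in range(len(word) - 1):
        cands.add(word[:i] + word[i + 1] + word[i] + word[i + 2:])  # swap adjacent
    valid = {c for c in cands if c in lexicon}
    for i in range(1, len(word)):                            # insert a blank space
        if word[:i] in lexicon and word[i:] in lexicon:
            valid.add(word[:i] + " " + word[i:])
    return sorted(valid)

print(near_miss_candidates("teh", {"the", "ten", "tea", "note", "book"}))
```

Only candidates that survive the lexicon check enter the temporary suggestion list, mirroring the validity test in the text.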

If the input data contain names, possibly of international scope, it is difficult to standardize the phonetic equivalents of certain letters. Therefore, an intelligent decision mechanism is placed before the ranking algorithm is applied. The input data is compared with a domain-specific English names dictionary.

• If the input data contains an English name, the algorithm moves the results of the near miss and phonetic strategies into the permanent suggestion pool at the same time and gives them equal weight.

• If the input data contains an international name, then the phonetic results are omitted and the permanent suggestion pool consists only of the results from the near miss strategy.

Once the PNRS has a list of suggestions (based on the data type), an edit distance algorithm is used to rank the results in the pool. In order to provide meaningful suggestions, the threshold value is defined as 2. In case the first and last characters of a word do not match, we modified our approach to include an extra edit distance. The main idea behind this is that people can generally get the first character and last character correct when trying to spell a word.
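One plausible reading of this ranking step is plain edit distance plus one extra unit for a mismatched first or last character, filtered by the threshold of 2. The exact penalty scheme in PNRS may differ; this sketch only illustrates the mechanism:

```python
def edit_distance(a, b):
    """Standard Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def rank_suggestions(mistyped, pool, threshold=2):
    """Rank pooled suggestions by edit distance, adding one extra unit of
    distance when the first or last character differs (assumed reading of
    the paper's penalty), and keep those within the threshold."""
    scored = []
    for cand in pool:
        d = edit_distance(mistyped, cand)
        if cand and mistyped:
            d += (mistyped[0] != cand[0]) + (mistyped[-1] != cand[-1])
        if d <= threshold:
            scored.append((d, cand))
    return [c for d, c in sorted(scored)]

print(rank_suggestions("jhon", ["john", "joan", "dhow"]))
```

The penalty pushes candidates like “dhow” (distance 2, but wrong first and last letters) past the threshold, while “john” and “joan” survive.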

At the final stage it is possible to see several candidate names which have an edit distance of one or two from the original mistyped word. Relying on edit distance alone often does not provide the desired result. Therefore, we designed our decision mechanism based on the content of the input information and added the U.S. Census Bureau decision mechanism to it. The decision mechanism data is a compiled list of popular first and last names which are scored based on the frequency of those names within the United States [7]. This allows the tool to choose a “best fit” suggestion and eliminate the need for user interaction. The strategies are applied to a custom dictionary which is designed particularly for the workflow automation. A sample of the popular names file is shown in Table 1. For the sake of this investigation, the rank associated with a name will be called the census score. Names with lower census scores occur more frequently in the Census data.

Table 1. Sample from U.S. Census popular names file

Name        Census Score
James       1
Miller      7
Margaret    9
Eric        33
Morgan      57
Annie       97
Robinette   2428
Kulaga      40211

The census score portion of the algorithm is implemented using the following steps:

1. Compare the current suggested names to the census file. If there is a match, store the census score associated with the suggestion.

2. Choose the name with the lowest census score as the “best fit” suggestion.

3. If all suggestions are unmatched in the census file, choose the suggestion with the lowest edit distance.

4. If several suggestions share the lowest edit distance, choose the first suggestion on the list.
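The four steps above can be sketched as follows, assuming suggestions arrive as (name, edit distance) pairs in pool order. This is an illustrative reading, not the production code:

```python
def best_fit(suggestions, census_score):
    """Pick the 'best fit' suggestion.

    suggestions: list of (name, edit_distance) pairs, in pool order.
    census_score: dict mapping names to Census rank (lower = more common).
    """
    # Steps 1-2: among census-matched names, the lowest census score wins.
    matched = [(census_score[n], n) for n, _ in suggestions if n in census_score]
    if matched:
        return min(matched)[1]
    # Steps 3-4: no census match; lowest edit distance, first on the list breaks ties.
    best = min(d for _, d in suggestions)
    return next(n for n, d in suggestions if d == best)

census = {"James": 1, "Miller": 7, "Annie": 97}
print(best_fit([("Jamas", 1), ("James", 2)], census))  # → James
```

Note that the census match overrides edit distance entirely: “James” wins here despite being one edit farther away, which is exactly the frequency-first behavior the steps describe.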

3.1 Experiment Results

The experimental data is a real-life sample which involves personal and associated company names, addresses including zip code, city, and state, and phone numbers of 1,850 individuals. Dirty data is present in a total of 129 records, including misspelled names and some non-ASCII characters. In order to evaluate the effectiveness of the tool, experiments are conducted not only on the current correction algorithm (NameCheck1) and PNRS but also on the well-known ASPELL [2], JSpell HTML [11], and Ajax Spell Checkers [1]. As illustrated in Figure 2 and Table 2, PNRS achieved a 68 percent correction rate, while the runner-up algorithm was only able to fix 47 percent of the records. Although JSPELL proposed the largest number of corrections, its exact match rate is under 48 percent. While PNRS failed to provide any other suggestions for 28 individual names out of 129, only 5 of them were omitted and 23 of them qualified as valid names.

1 NameCheck is an algorithm currently being used by Acxiom.


Fig. 2. Comparison of the Tools

Table 2. Statistical Results

Algorithms            Fixed | Matched1    Fixed | No Match2    Match | No Match3
NameCheck             19                  0                    110
PNRS                  88                  13                   28
ASPELL                61                  46                   22
JSPELL                53                  59                   17
Ajax Spell Checkers   57                  51                   21

1 Exact correction of the misspelled name
2 Corrections that provide no match with the original name
3 Either the input is accepted as a valid name or the system failed to provide any suggestions

4 Conclusion

In today's environment of information explosion, it is very difficult to retrieve relevant data. As data sources become larger, current techniques of information retrieval fail in terms of their efficiency. Also, extracting the most relevant information in a system is a complex process: data is free-formatted, voluminous, and consists of multiple media types. Although quite a number of spelling algorithms exist for general purpose use, it is hard to distinguish and correct individual names with them. In this study, a custom name spelling checking algorithm is presented in order to overcome the lack of spelling correction tools for personal names. The results are compared with other known tools to reflect the success of the study. Not only did PNRS provide the most Fixed | Matched corrections, but it also achieved 21 percent more exact corrections than the runner-up algorithm. Although the success rate of PNRS was about 68 percent, another 10 percent were fixed but provided different results (Figure 2). In future work, a combination of Kullback-Leibler divergence [18] and information theoretic distortion measures [6] will be applied to predict how close the obtained results are to the expected ones.

5 Problems

1. List the main sources of isolated-word errors and explain them briefly.

2. What are the main techniques used to correct spelling errors?

3. How important is the phonetic strategy in spelling correction? Discuss it in terms of the English language and individual names.

4. Discuss the pros and cons of including an extra edit distance when the first and last characters of a word do not match.

5. Why is matching of personal names more challenging than matching of general text?

6. What improvements can be made to strengthen the PNRS algorithm?

6 References

1. AJAX Spell Checker, http://www.broken-notebook.com/spell_checker/

2. ASPELL, http://aspell.net/metaphone/

3. Becchetti C, Ricotti LP (1999) Speech Recognition: Theory and C++ Implementation. John Wiley & Sons

4. Beitzel SM, Jensen EC, Grossman DA (2002) Retrieving OCR text: A survey of current approaches. White Paper

5. Brill E, Moore RC (2000) An improved error model for noisy channel spelling correction. In: Proceedings of ACL-2000, the 38th Annual Meeting of the Association for Computational Linguistics, pp 286-293

6. Cardinal J (2002) Quantization with an information-theoretic distortion measure. Technical Report 491, ULB

7. Census Bureau Home Page, www.census.gov

8. Damerau FJ (1990) Evaluating computer generated domain-oriented vocabularies. Information Processing & Management 26:791-801

9. Durham I, Lamb DA, Saxe JB (1983) Spelling correction in user interfaces. Communications of the ACM 26:764-773

10. Golding A, Schabes Y (1996) Combining trigram-based and feature-based methods for context-sensitive spelling correction. In: Joshi A, Palmer M (eds) Proceedings of the 34th Annual Meeting of the ACL, San Francisco

11. JSPELL HTML, http://www.thesolutioncafe.com/html-spell-checker.html

12. Lee L (1999) Measures of distributional similarity. In: Proceedings of the 37th Annual Meeting of the ACL

13. Levenshtein VI (1965) Binary codes capable of correcting deletions, insertions and reversals. Doklady Akademii Nauk SSSR 163:845-848; also (1966) Soviet Physics Doklady 10:707-710

14. Kukich K (1992) Techniques for Automatically Correcting Words in Text. ACM Computing Surveys 24(4)

15. Mihov S, Ringlstetter C, Schulz KU, Strohmaier C (2003) Lexical post-correction of OCR-results: The web as a dynamic secondary dictionary? In: Document Analysis and Recognition Proceedings, Volume 2, pp 03-06

16. Near Miss Strategy, http://www.codeproject.com/csharp/NetSpell.asp?print=true

17. Needleman SB, Wunsch CD (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology 48:443-453

18. Pedro JM, Purdy PH, Vasconcelos N (2004) A Kullback-Leibler divergence based kernel for SVM classification in multimedia applications. In: Thrun S, Saul L, Scholkopf B (eds) Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA

19. Philips L (1990) Hanging on the metaphone. Computer Language 7(12):39-43

20. Philips L (2000) The double-metaphone search algorithm. C/C++ Users Journal 18(6)

21. Taghva K, Stofsky E (2001) OCRSpell: an interactive spelling correction system for OCR errors in text. IJDAR 3:125-137

22. Tillenius M (1996) Efficient generation and ranking of spelling error corrections. Master's thesis, Royal Institute of Technology, Stockholm, Sweden

23. Trenkle JM, Vogt RC (1994) Disambiguation and spelling correction for a neural network based character recognition system. In: Proceedings of SPIE, Volume 2181, pp 322-333

24. Ullman JR (1977) A Binary n-Gram Technique for Automatic Correction of Substitution, Deletion, Insertion, and Reversal Errors in Words. Computer Journal 20(2):141-147

25. Varol C, Robinette C, Kulaga J, Bayrak C, Wagner R, Goff D (2006) Application of Near Miss Strategy and Edit Distance to Handle Dirty Data. In: ALAR Conference on Applied Research in Information Technology, March 3, Conway, Arkansas, USA

26. Veronis J (1988) Morphosyntactic correction in natural language interfaces. In: Proceedings of the 12th International Conference on Computational Linguistics, Budapest, Hungary, pp 708-713

27. Wu S, Manber U (1992) AGREP - A Fast Approximate Pattern Matching Tool. In: Proceedings of the Usenix Winter 1992 Technical Conference, pp 153-162

28. Wu S, Manber U (1992) Fast Text Searching With Errors. Communications of the ACM, Vol. 35

29. Yannakoudakis EJ, Fawthrop D (1983) The rules of spelling errors. Information Processing & Management 19(2):87-99