Text-Induced Spelling Correction

Authors

Martin Reynaert

Abstract

In this talk we present an overview of our PhD work on Text-Induced Spelling Correction. The work presents a novel approximate string matching algorithm for indexed text search. The algorithm is based on a hashing function that uniquely identifies, by means of a single numeric value, all strings composed of the same characters, i.e. anagrams. This numeric value makes it possible to search for character strings that differ from a given string by a predefined number of characters, which forms an ideal basis for a novel spelling error detection and correction algorithm that we call Text-Induced Spelling Correction, or TISC.

Our system uses nothing but lexical and word co-occurrence information derived from a corpus, i.e. a very large collection of texts in a particular language, to perform context-sensitive spelling error correction of non-words. Non-words are word strings produced unintentionally by a typist that deviate from the convention on how words must be spelled to count as real words in the language. We will highlight the differences between our character-based similarity key and language-specific similarity keys such as those employed in the well-known Soundex and Phonix phonetic spelling systems.

The spelling error detection and correction mechanism we propose uses not only isolated-word information but also context information. It performs context-sensitive error correction by deriving useful knowledge from the text being spell-checked. This enables our system to correct typos for which it does not have the correct word in its dictionary. In addition, some typos are ambiguous in that they may resolve into two or more different words. We investigate in depth the relationship between a typo and its context and propose a new algorithm for ranking correction candidates that specifically makes use of the typo's context.

We further discuss the tension between the wish of developers of spelling correction systems to cater for phonetic spelling errors and the cost of doing so in terms of the system's precision. Extensive evaluations on both English and Dutch allow us to illustrate this by discussing the performance of Aspell and the Microsoft Proofing Tools in this regard.
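
The anagram hashing idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the hashing function, so the key used here (the sum of each character's code point raised to the fifth power) is an assumption, as are the names anagram_key, build_index and candidates and the toy lexicon. Because the key is additive, adding or subtracting the value of a single character reaches the keys of strings that differ by one character, whatever the character order.

```python
from collections import defaultdict

POWER = 5  # assumed exponent; the abstract does not state the actual function


def char_value(ch: str) -> int:
    """Numeric value of one character (assumption: code point ** POWER)."""
    return ord(ch) ** POWER


def anagram_key(word: str) -> int:
    """Order-independent key: all anagrams of a word share the same value."""
    return sum(char_value(ch) for ch in word)


def build_index(lexicon):
    """Map each anagram key to the set of lexicon words that have it."""
    index = defaultdict(set)
    for word in lexicon:
        index[anagram_key(word)].add(word)
    return index


def candidates(word, index, alphabet):
    """Lexicon words whose characters differ from `word` by at most one
    insertion, deletion or substitution, ignoring character order."""
    key = anagram_key(word)
    found = set(index.get(key, ()))              # anagrams, e.g. transpositions
    for added in alphabet:
        found |= index.get(key + char_value(added), set())   # one extra char
    for removed in set(word):
        base = key - char_value(removed)
        found |= index.get(base, set())                      # one char fewer
        for added in alphabet:
            found |= index.get(base + char_value(added), set())  # one swapped
    return found


if __name__ == "__main__":
    lexicon = ["the", "then", "hen", "ten", "than", "spelling"]
    index = build_index(lexicon)
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    print(sorted(candidates("teh", index, alphabet)))
```

In this sketch the lookup only narrows the search to words with a matching character content. A real system would still have to filter and rank the retrieved candidates, for instance with the context information mentioned above.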

Publication type

Presentation

Year of publication

2006

Conference location

Nijmegen

Conference name

Summer Meeting on Corpus-based Research 2006

Publisher

Nederlandse Vereniging voor Fonetische Wetenschappen