
Spelling Correction

Due October 30, 2012

Problem 1 (5 points)

An automatic spelling corrector detects beeds as a typo in a document. Generate candidate corrections according to the following edits (one per type of edit):

(a) using one deletion

(b) using one substitution

(c) using one insertion

Also, do the following:

(d) give two candidates that are at least two edits of any kind (and not more than two edits) from beeds.
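The candidate strings above can be generated mechanically. A minimal sketch of a one-edit generator (the dictionary lookup that filters these strings down to real words is left out):

```python
# Sketch: generate all strings one edit away from a typo.
# Actual candidate corrections are the members of this set that
# also appear in a dictionary; that filtering step is omitted here.

import string

def edits1(word):
    """All strings reachable from `word` by one deletion,
    one substitution, or one insertion."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    substitutes = {left + c + right[1:]
                   for left, right in splits if right
                   for c in string.ascii_lowercase}
    inserts = {left + c + right
               for left, right in splits
               for c in string.ascii_lowercase}
    return deletes | substitutes | inserts

candidates = edits1("beeds")
# "beds" (one deletion), "beads" (one substitution), and
# "breeds" (one insertion) are all in this set.
```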

Problem 2 (30 points)

Calculate the minimum edit distance from the string gudmint to each of the following words:

  • giant
  • judgment

Use only insertions, deletions, and substitutions. The costs you should use are as follows:

  • insertions: cost 1
  • deletions: cost 1
  • substitutions
    • no change (e.g. a→a, t→t) : cost 0
    • vowel → vowel: cost 0.5
    • consonant → consonant: cost 0.5
    • vowel → consonant: cost 1
    • consonant → vowel: cost 1

For each of the pairs, write down:

  • the minimum edit distance, and
  • the acyclic graph which you use to calculate the minimum edit distance, with annotated costs for each node.
You do not need to include the string representation for each node in the graph (though feel free to include it if it is helpful to you).

Don't try to cram both graphs onto a single page – use a full page for each pair and give yourself plenty of space to write.
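The cost scheme above plugs straight into the standard dynamic-programming recurrence for minimum edit distance. A minimal sketch (the vowel set and the demo pair below are my own illustrative choices, not part of the assignment):

```python
# Sketch: minimum edit distance with the cost scheme above,
# computed with the standard dynamic-programming table.

VOWELS = set("aeiou")  # treating 'y' as a consonant is an assumption here

def sub_cost(a, b):
    """Substitution cost: 0 for no change, 0.5 within the vowel class or
    within the consonant class, 1 across the vowel/consonant boundary."""
    if a == b:
        return 0.0
    if (a in VOWELS) == (b in VOWELS):
        return 0.5
    return 1.0

def min_edit_distance(source, target):
    m, n = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dist[i][0] = float(i)   # i deletions, cost 1 each
    for j in range(1, n + 1):
        dist[0][j] = float(j)   # j insertions, cost 1 each
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dist[i][j] = min(
                dist[i - 1][j] + 1,                                    # deletion
                dist[i][j - 1] + 1,                                    # insertion
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[m][n]

# e.g. "cat" -> "cot" is one vowel-for-vowel substitution: cost 0.5
```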

Problem 3 (10 points)

A person wanting to write straight errs and produces straihgtt instead. A spelling corrector that is trying to find candidates for the typo could generate candidates by exhaustively testing all possible character strings that are within one, two, three, or more edits.

(a) How many edits would it need to find straight in the list of candidates? 

Note: this does not mean the edit distance (cost) as defined in the previous problem. This means how many actual edit steps straight is from straihgtt, like if you were to type the edits on a keyboard.

(b) Do you think the n-gram approach to candidate generation would work for this case? And would it be preferable to using edits to find straight as a candidate?

Note: recall that we can talk about n-grams as sequences of characters or sequences of words. Here, we are referring to sequences of characters, as discussed in class. Recall that what we do is take a typo like straihgtt, add start-of-word and end-of-word boundaries, e.g. #straihgtt#, and then look at all trigrams (3-grams) of characters, e.g. #st, str, tra, rai, aih, ihg, hgt, gtt, tt#. To find candidates based on this, we look for words in the dictionary that have a high degree of overlap in their character trigrams. Think about which dictionary words are likely to have trigrams similar to this typo, and whether the misspelled part of the word will have high similarity to them. Then consider whether that will work out reasonably well in terms of finding straight as a candidate (you don't need to do a calculation), and contrast it with the edit distance approach to finding candidates.
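The trigram comparison described above can be sketched directly (raw set intersection is used as the overlap measure here; choosing an actual similarity measure is up to you):

```python
# Sketch: character trigrams with word boundaries, and the raw
# overlap between a typo's trigrams and a dictionary word's trigrams.

def char_trigrams(word):
    """Trigrams of '#word#', e.g. 'cat' -> {'#ca', 'cat', 'at#'}."""
    padded = "#" + word + "#"
    return {padded[i:i + 3] for i in range(len(padded) - 2)}

typo_grams = char_trigrams("straihgtt")
word_grams = char_trigrams("straight")
shared = typo_grams & word_grams
# Only the trigrams from the undamaged prefix survive the typo:
# #st, str, tra, rai
```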

For the edit distance approach, recall that: (1) We enumerate all candidates that are one edit away. Call this set Edit1. (2) We get all strings that are two edits away from straihgtt by looking at ALL strings that are one edit away from EVERYTHING in the set Edit1. Call this Edit2. (3) And so on, in order to generate candidates that are farther away. Keep in mind that the set Edit1 could be quite large, so Edit2 will be *much* larger, and it keeps growing.

Note: The use of n-grams (of characters) here is *different* from using them as part of an n-gram language model (based on words). Don't confuse the two. The term "n-gram" just means "a sequence of n things" and doesn't imply a particular model or approach.

Problem 4 (10 points)

Consider the following (well-known) text:

would you eat them in a box ? would you eat them with a fox ? not in a box . not with a fox . not in a house . not with a mouse .

Given this text, and ignoring punctuation, what are the values for the following:

(a) P(not)

(b) P(box)

(c) P(box|a)  (the probability of box given that the previous word is a)

(d) P(box|in,a) (the probability of box given that the previous word is a, and that the word two before it is in)

Note: this is a very simple, easy problem, so don't overthink it. Look at the slides and it has the same structure as what you see there. The point is that these probabilities really are just coming from counting stuff!
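Counting really is all there is to it, and the counting can be sketched in a few lines. The toy corpus below is a stand-in of my own, not the assignment's text, so the arithmetic for (a)-(d) is still yours to do:

```python
# Sketch: unigram and bigram probabilities come straight from counts.
# The toy corpus here is made up for illustration.

from collections import Counter
from fractions import Fraction

tokens = "the cat sat on the mat and the cat ran".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p(word):
    """P(word) = count(word) / total tokens."""
    return Fraction(unigrams[word], len(tokens))

def p_given(word, prev):
    """P(word | prev) = count(prev word) / count(prev)."""
    return Fraction(bigrams[(prev, word)], unigrams[prev])

# e.g. p("the") is 3/10 and p_given("cat", "the") is 2/3
# for this ten-token toy corpus.
```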

Problem 5 (30 points)

You are building a spelling corrector for English and have already built an excellent error detection component. You are given the sentence “John's good gudmint helps him make great decisions.” and the detector spots ‘gudmint’ as a typo. You also have a candidate generator that gives you three candidates as possible corrections: giant, judgment, and garment. Now you need to rank these candidates by using a model trained from data (given below).

You are given a corpus of text which has the following counts:
  • 21,354 word tokens
  • 272 tokens of giant
  • 61 tokens of judgment
  • 32 tokens of garment

Also assume that we know the following from an error model:

  • P(gudmint | giant) = .009
  • P(gudmint | judgment) = .039
  • P(gudmint | garment) = .058

(a) (15 points) Rank the candidates from most likely to least likely, assuming that we calculate P(candidate|typo=gudmint) using a unigram language model based on the above training material. Show your work, and the values for each candidate.

(b) (15 points) The unigram model doesn't use the context of the sentence to choose the best candidate. Let's make it more sensitive to context by using a bigram language model, which for the above sentence means computing P(candidate|typo=gudmint,previous_word=good). In our training material, we observe the following:

  • good occurs 923 times
  • giant occurs after good 2 times
  • judgment occurs after good 21 times
  • garment occurs after good 4 times

Using the bigram probabilities for each candidate being preceded by good, which is the best candidate? Show your work and the values for each candidate.

Note: You will use the same error model probabilities for this calculation.
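Both parts are the same noisy-channel computation, score(c) = P(typo|c) · P(c); only the language model behind P(c) changes. A sketch with made-up candidate names and counts (deliberately not the assignment's numbers, so the actual calculation is still yours):

```python
# Sketch: noisy-channel ranking of candidate corrections.
# score(c) = P(typo | c) * P(c), where P(c) comes from the language model.
# All names and numbers below are hypothetical, for illustration only.

total_tokens = 10_000
unigram_counts = {"alpha": 500, "beta": 50}    # hypothetical candidates
error_model = {"alpha": 0.001, "beta": 0.02}   # P(typo | candidate)

def unigram_score(candidate):
    """Error model times unigram language model."""
    p_candidate = unigram_counts[candidate] / total_tokens
    return error_model[candidate] * p_candidate

ranking = sorted(unigram_counts, key=unigram_score, reverse=True)
# alpha: 0.001 * 0.05 = 5e-5; beta: 0.02 * 0.005 = 1e-4,
# so "beta" outranks "alpha" despite being the rarer word.

# For part (b), only P(c) changes: swap in the bigram estimate
# count(prev, candidate) / count(prev).
bigram_counts = {"alpha": 1, "beta": 30}   # times candidate follows prev word
prev_count = 400                           # count of the previous word

def bigram_score(candidate):
    """Error model times bigram language model."""
    return error_model[candidate] * (bigram_counts[candidate] / prev_count)
```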

Problem 6 (15 points)

Automatic spelling correction can be used in many different contexts, and the balance between precision and recall might be quite different for them. Consider the following two uses:

  1. A search engine uses automatic spelling correction on web pages before it indexes them so that people searching for terms can find them even when the author of the web page misspells those terms. (So, this doesn't change the web page itself obviously, but it changes the words that the search engine stores in its index as being relevant to that page.)
  2. A phone doesn't perform interactive real-time spelling correction as the user is creating a text message but can automatically correct the message after it is typed and is being sent. (So, the user cannot correct any errors introduced by the spelling corrector.)

Assume that these spelling correctors can choose not to fix some words that they have detected as typos; that is, they can abstain when they have low confidence in their possible candidate corrections. Discuss whether precision or recall is more important for each case. If precision is more important, discuss how the kind of error (see the slides) would influence whether the classifier should abstain (not make a correction) or not.

Note: There are no slides specifically addressing this question. The point of it is for you to apply the reasoning about precision and recall that we discussed for classification to the specific problem of automatic spelling correction.