
Homework 1: Text Encoding

Due: Thursday, September 13

Problem 1 (20 points)

Do exercise 1 from chapter 1 (Encoding Language) of Dickinson, Brew and Meurers' book draft. However, instead of taking a paragraph from a novel, transliterate the following lines from Walt Whitman's poem A Song of the Rolling Earth:

Were you thinking that those were the words, those upright lines?
     those curves, angles, dots?
No, those are not the words, the substantial words are in the
     ground and sea...
They are in the air, they are in you.

Make sure to state the syllabary you chose, and write the transliteration on four lines as in the above excerpt. Then provide your answers to (a) and (b), making sure to mark them clearly.

Problem 2 (20 points)

Do exercise 2 (parts a, b, c, and d) from chapter 1 (Encoding Language) of Dickinson, Brew and Meurers' book draft. For part (a), give three things you would change, and briefly state why. For part (b), name two problems. For part (c), give an example of how you would write the word “language”. For part (d), you do not need to write down 100 words – just say what kinds of words, and give a couple of examples.

Problem 3 (10 points)

Give the base-10 numbers for the following binary numbers. They are written in standard order, i.e., big-endian (most significant bit first).

  • (a) 11000101
  • (b) 01011110

Be sure to show your work.

Tip: there are many, many tutorials for binary-decimal conversion online that you can use in addition to the course slides.
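If you want to check a hand conversion, the positional method from the slides can be sketched in a few lines of Python. (The number below is a practice value, not one of the problem's numbers.)

```python
# Sketch: binary-to-decimal conversion by summing positional values.
# Each bit contributes bit * 2**position, counting positions from the right.
bits = "10110010"  # practice number, not from the problem

value = 0
for position, bit in enumerate(reversed(bits)):
    value += int(bit) * 2 ** position

print(value)         # 178
print(int(bits, 2))  # 178 -- Python's built-in conversion agrees
```

The built-in `int(bits, 2)` is a quick way to confirm the sum you computed by hand.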

Problem 4 (10 points)

Write out the word Bayes using ASCII code, in both ordinary numbers (decimal) and binary (base 2, in standard order). Keep in mind that lowercase and uppercase letters have different ASCII codes!

As an example, here is what this looks like for Austin.

  Letter   ASCII number   bit notation
  A        65             01000001
  u        117            01110101
  s        115            01110011
  t        116            01110100
  i        105            01101001
  n        110            01101110

Be sure to show your work.
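To double-check a table like the one above, Python's built-in ord() and format() can reproduce it. This is a verification sketch, not a substitute for showing your work:

```python
# Sketch: decimal ASCII code and 8-bit binary for each letter of a word.
word = "Austin"
for letter in word:
    code = ord(letter)          # decimal ASCII number
    bits = format(code, "08b")  # 8-bit binary, standard (big-endian) order
    print(letter, code, bits)   # first line printed: A 65 01000001
```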

Problem 5 (10 points)

(a) What is the largest base-10 number that can be encoded in 4 bytes using UTF-8?

(b) What is the UTF-8 representation for the Devanagari character म ("ma")? (Hint: You'll need to look up its decimal value, convert that to binary, and then embed it into the UTF-8 scheme described in the slides.)

Make sure to show your work.
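As a sanity check on your hand computation, Python's str.encode will show the UTF-8 bytes of any character. The sketch below uses the euro sign as a practice character, not the character from the problem:

```python
# Sketch: inspecting the UTF-8 encoding of a character.
# The euro sign (U+20AC) is a practice example, not the problem's character.
ch = "\u20ac"               # the euro sign
print(ord(ch))              # 8364 -- its decimal code point
print(ch.encode("utf-8"))   # b'\xe2\x82\xac' -- three UTF-8 bytes
print([format(b, "08b") for b in ch.encode("utf-8")])
# ['11100010', '10000010', '10101100']
# Note the 1110xxxx 10xxxxxx 10xxxxxx pattern of a three-byte UTF-8 sequence.
```

Compare the bit patterns printed in the last line against the ones you embed by hand into the UTF-8 scheme.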

Problem 6 (10 points)

The following table (which is from the first edition of Jurafsky and Martin's textbook) provides bigram probabilities. For example, P(want | i) = 0.22.

(a) Ignoring start and end probabilities, calculate the probabilities for the following sentences using a bigram model. So, don't worry about P(I) -- start with P(want|I) and work through to P(food|Chinese) for (i) and P(to|food) for (ii).

(i) I want to eat Chinese food
(ii) Chinese eat want I food to

Be sure to show your work.

(b) Which is more probable? Does it make sense?
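If you want to check your arithmetic, the chain of bigram multiplications can be sketched in Python. Only P(want | i) below is quoted in the problem; the other probabilities are made-up placeholders, so substitute the values from the table:

```python
# Sketch: sentence probability under a bigram model (no start/end probabilities).
# Only ("i", "want") is the value quoted in the problem; the rest are
# hypothetical placeholders -- replace them with the table's values.
bigram_prob = {
    ("i", "want"): 0.22,  # P(want | i), quoted in the problem
    ("want", "to"): 0.5,  # hypothetical
    ("to", "eat"): 0.3,   # hypothetical
}

sentence = ["i", "want", "to", "eat"]
prob = 1.0
for prev, word in zip(sentence, sentence[1:]):
    prob *= bigram_prob[(prev, word)]  # multiply P(word | prev) for each bigram
print(prob)  # 0.22 * 0.5 * 0.3, i.e. about 0.033
```

Writing out each factor in the product is exactly the "work" you should show on paper.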

Problem 7 (20 points)

Do exercise 10 from chapter 1 of the draft textbook (p. 42). Make sure to clearly rank the 10 bigrams you've chosen, such that the bigram which you think has the most predictable next word is first. Do this before doing part (a). For part (b), compare the results you got from others to your first ranking.

As an alternative to asking friends for part (a), you can do a search for the bigram on a search engine (e.g., by searching for "to the" or "the United", with the quotes) and then listing the first word that follows from the snippet given for each search result.