extract sentences and numbers

Part I – download a book
1.    Download a copy of “The Adventures of Sherlock Holmes” by Arthur Conan Doyle from the Project Gutenberg website – https://www.gutenberg.org/ .

Make sure that the file you download is a UTF-8 format file. Using a UTF-8 format file will ensure that any “special” characters that fall outside of the standard English ASCII character set will be displayed correctly within R.

2.    Read the text of the book into R with the readLines function, specifying  the encoding as “UTF-8”. Note that “UTF-8” must be in ALL CAPS and must have a dash before the number 8.  For example:

    book <- readLines("myfile.txt", encoding="UTF-8")

At this point the variable book will contain a character vector that contains the lines of the book.

Part II – convert the vector of lines of the book into a vector of sentences
1.    Write a function called sentenceify that takes an argument named lines. The lines argument is expected to be a character vector that contains lines of text as they might appear in a book. The function should return a new vector that contains each sentence in a separate element of the vector. For example:

> linesOfABook <- c("This is my wonderful book. I really", "hope you like it. Did you like", "it? Bye bye!")

> linesOfABook
[1] "This is my wonderful book. I really"
[2] "hope you like it. Did you like"
[3] "it? Bye bye!"

> sentences <- sentenceify(linesOfABook)
> sentences
[1] "This is my wonderful book."
[2] "I really hope you like it."
[3] "Did you like it?"
[4] "Bye bye!"
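
One possible starting point for step 1 is sketched below. It is only an illustration, not the required solution: the name sentenceifySketch is hypothetical, and the rule "a sentence ends at a period, exclamation mark or question mark followed by whitespace" is an assumption that the later steps refine.

```r
# Hypothetical helper name; a starting point, not a complete solution.
sentenceifySketch <- function(lines) {
  # Join the lines so that sentences spanning line breaks become contiguous.
  text <- paste(lines, collapse = " ")
  # Split on whitespace that directly follows ".", "!" or "?".
  # The lookbehind (?<=...) keeps the punctuation and requires perl = TRUE.
  strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
}

sentenceifySketch(c("This is my wonderful book. I really",
                    "hope you like it. Did you like",
                    "it? Bye bye!"))
```

Joining the lines first is what allows a sentence that starts on one line and ends on the next to come out as a single element.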

2.    Keep in mind that abbreviations that end with a period could confuse your function. To address this, your function should take another argument, named abbreviationsVec. The default value of this argument should be NULL. If the argument is not specified then the function should continue to work as above. (To check whether the user supplied a value for this argument, your code can use the is.null function.)  If a value for abbreviationsVec is specified then it is expected to be a vector of abbreviations (without the periods). If one of the abbreviations in abbreviationsVec is followed by a period in the lines vector, then the function should NOT consider that period as the end of a sentence. For example:

> linesOfABook <-    c(    "Mt. Everest is the tallest mountain in",
            "the world.  Capt. Kirk and Mr. Spock are two",
            "characters from the original Star Trek TV",
            "series.")

# Without specifying the abbreviationsVec argument
> sentences <- sentenceify(linesOfABook)
> sentences
[1] "Mt."
[2] "Everest is the tallest mountain in the world."
[3] "Capt."
[4] "Kirk and Mr."
[5] "Spock are two characters from the original Star Trek TV series."

# Supply a vector of abbreviations to the function to fix the problem
> abbr <- c("Mr", "Mrs", "Maj", "Capt", "Mt")
> sentences <- sentenceify(lines = linesOfABook, abbreviationsVec = abbr)
> sentences
[1] "Mt. Everest is the tallest mountain in the world."
[2] "Capt. Kirk and Mr. Spock are two characters from the original Star Trek TV series."

Your function should treat the abbreviations in a case-insensitive way. For example, if Mt is specified as an abbreviation then both Mt. and mt. should be treated as abbreviations in the text.
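
One way to approach the abbreviation handling is to "protect" the period after each abbreviation with a placeholder before splitting, and to restore it afterwards. The sketch below illustrates that idea only. The function name, the "<DOT>" placeholder, and the assumption that the abbreviations contain no regular-expression metacharacters are all hypothetical choices, not part of the assignment.

```r
# Hypothetical sketch: protect abbreviation periods, split, then restore them.
sentenceifyAbbrSketch <- function(lines, abbreviationsVec = NULL) {
  text <- paste(lines, collapse = " ")
  marker <- "<DOT>"  # assumed placeholder that does not occur in the text
  if (!is.null(abbreviationsVec) && length(abbreviationsVec) > 0) {
    # Matches e.g. "Mt." or "mt." as a whole word followed by a period.
    pattern <- paste0("\\b(", paste(abbreviationsVec, collapse = "|"), ")\\.")
    text <- gsub(pattern, paste0("\\1", marker), text,
                 ignore.case = TRUE, perl = TRUE)
  }
  sentences <- strsplit(text, "(?<=[.!?])\\s+", perl = TRUE)[[1]]
  # Put the protected periods back.
  gsub(marker, ".", sentences, fixed = TRUE)
}

sentenceifyAbbrSketch(
  c("Mt. Everest is the tallest mountain in",
    "the world.  Capt. Kirk and Mr. Spock are two",
    "characters from the original Star Trek TV",
    "series."),
  abbreviationsVec = c("Mr", "Mrs", "Maj", "Capt", "Mt"))
```

The \\b word boundary keeps a short abbreviation such as st from matching the end of a longer word such as first.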

3.    Since there may be many abbreviations that you need to consider it would be a lot easier and more maintainable to add these abbreviations to a file than to type them into a vector. To support this option, add one more argument to the sentenceify function. The new argument, abbreviationsFile, is expected to be the name of a file that contains abbreviations, one per line. If abbreviationsFile is not specified, then the function should work as above. If abbreviationsFile is specified, then the function should use the readLines function to read the contents of the file and add those abbreviations to any that were specified in the abbreviationsVec argument. For example:

Contents of myfile.txt:
Mr
Mrs
Dr
Ms
ave
blvd
rd

> linesOfABook <-    c(    "Mt. Everest is the tallest mountain in",
            "the world.", "Capt. Kirk and Mr. Spock are two",
            "characters from the original Star Trek TV",
            "series.")

# All values that are either in the file or in the variable that is being passed to the
# abbreviationsVec argument will be considered abbreviations.
> sentences <- sentenceify(linesOfABook, abbreviationsVec = c("Mt", "Capt", "st"), abbreviationsFile = "myfile.txt")
> sentences
[1] "Mt. Everest is the tallest mountain in the world."
[2] "Capt. Kirk and Mr. Spock are two characters from the original Star Trek TV series."

You can now easily modify the abbreviations file to add new abbreviations as necessary. If you want to quickly try an abbreviation, you can also specify it in the abbreviationsVec variable that is passed to the sentenceify function.
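
Combining the two sources of abbreviations could look roughly like the sketch below. The helper name collectAbbreviationsSketch is hypothetical, and trimming whitespace and dropping blank lines from the file are assumptions about how forgiving you want the file format to be.

```r
# Hypothetical helper: merge abbreviations from a vector and from a file.
collectAbbreviationsSketch <- function(abbreviationsVec = NULL,
                                       abbreviationsFile = NULL) {
  abbr <- abbreviationsVec
  if (!is.null(abbreviationsFile)) {
    # One abbreviation per line; strip stray whitespace and skip blank lines.
    fromFile <- trimws(readLines(abbreviationsFile, encoding = "UTF-8"))
    abbr <- c(abbr, fromFile[nzchar(fromFile)])
  }
  unique(abbr)  # returns NULL when neither argument was supplied
}
```

Inside sentenceify, the result of such a helper could then be used wherever abbreviationsVec alone was used before.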

4.    On occasion, some texts may include several consecutive spaces or tabs. This is not a problem for people who are reading the text. However, it can make writing regular expressions to parse this type of text more complex. This generally shouldn’t be an issue for most published material but it couldn’t hurt to address the issue anyway.

Modify your function so that it replaces multiple consecutive spaces and tabs with a single space. You can use the regex “\s+” (written “\\s+” within R) to match one or more whitespace characters such as spaces or tabs. If you do this, you will be able to assume that there is a single space between every word when designing regular expressions to process the text later. Similarly, your function should remove any spaces or tabs that appear before the first word or after the last word of each sentence.
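
A minimal sketch of this clean-up step, using only base R, is shown below. The helper name normalizeSpacesSketch is hypothetical.

```r
# Hypothetical helper: collapse runs of whitespace and trim both ends.
normalizeSpacesSketch <- function(x) {
  x <- gsub("\\s+", " ", x)  # any run of spaces/tabs becomes one space
  trimws(x)                  # drop leading and trailing whitespace
}

normalizeSpacesSketch("  Hello \t  world.  ")
```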

5.    In many texts published on Project Gutenberg’s website there are leading and trailing underscores, _like this_    or    _this_ , throughout the text. This is usually done to indicate that those words were emphasized (for example italicized or bold) in the printed edition. Sometimes leading and trailing asterisks, *like this*  or  *this* , are used for the same purpose. If your text contains these underscores or asterisks, you should modify your sentenceify function so that it removes all the leading and trailing underscores and asterisks from words. This will make it easier to work with the text later when you write other regular expressions.
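
The marker removal might be sketched as follows. The helper name is hypothetical, and the rule used here (remove a run of underscores or asterisks only when it sits at the edge of a word, judged by ASCII letters and digits) is an assumption that leaves something like snake_case untouched.

```r
# Hypothetical helper: strip emphasis markers at the start or end of words.
stripEmphasisSketch <- function(x) {
  # First alternative: markers NOT preceded by a letter, digit or marker (word start).
  # Second alternative: markers NOT followed by a letter, digit or marker (word end).
  gsub("(?<![A-Za-z0-9_*])[_*]+|[_*]+(?![A-Za-z0-9_*])", "", x, perl = TRUE)
}

stripEmphasisSketch("It was _very_ *cold* that day.")
```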

6.    Keep in mind that processing text in this fashion is not foolproof. Use your function to convert the book into a vector of sentences. Skim the results and look for possible mistakes. See if you can improve your function based on the results you are seeing by modifying the approaches taken above. Document and report on any changes that you make to the above algorithm.

Part II – finding numbers
1.    Numbers can be written in many different forms. For example, the following are all different ways of writing numbers:
a.    1234        (no comma)
b.    1,234        (with comma – note that 1,234 is a single number, as are 123,456 and 12,345,678.
        However, 12,34 is two numbers.)
c.    1.234        (with decimal point)
d.    $1,234.00    (with dollar sign)
e.    one thousand two hundred and thirty four    (all lowercase,  uses “and”)
f.    One Thousand Two Hundred Thirty Four        (Mixed Case,  no “and”)
g.    MCCXXXIV    (roman numerals for 1,234)
h.    Other formats ???

Create a function named findNumbers. The function should take a single argument, named inputText, that is a character vector. The function should return a new character vector that contains only those elements of the inputText that contain a number. For example:

> stuff <- c("My favorite numbers are one and  two, 3, 1,234 and fifty seven.",
    "Do you have any favorite numbers?",
    "Roman numerals such as VIII and viii are interesting.",
    "I also like colors such as blue, green and yellow.",
    "This pen costs $1.99.")

> findNumbers(stuff)
[1] "My favorite numbers are one and  two, 3, 1,234 and fifty seven."
[2] "Roman numerals such as VIII and viii are interesting."
[3] "This pen costs $1.99."

Note that it would be very difficult to get this function to work perfectly. For example, it would be almost impossible using simple algorithms to determine whether the capital letter “I” by itself stands for the English word “I” or the Roman numeral for “1”. For this particular case, I would assume that most of the time that “I” appears by itself in an English book it is the English word “I” and not the Roman numeral for the number “1”. Try to get your function to work in as many cases as possible.

Run your function on the book from “Part I” of this assignment. Scan your results and try to fix the function as much as possible. This will be an iterative process.
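
To illustrate the comma rule from forms (a) through (d), the sketch below extracts only the numbers written with digits. It is an assumption-laden illustration: the pattern and the helper name are hypothetical, only the dollar sign is handled as a currency symbol, and number words and Roman numerals are ignored here.

```r
# Hypothetical pattern: comma-grouped digits first, then plain digits.
# A comma counts as part of the number only when exactly three digits follow it.
digitPattern <- "[$]?\\d{1,3}(?:,\\d{3})+(?:\\.\\d+)?|[$]?\\d+(?:\\.\\d+)?"

# Hypothetical helper: all digit-style numbers found in ONE string.
extractDigitNumbersSketch <- function(x) {
  regmatches(x, gregexpr(digitPattern, x, perl = TRUE))[[1]]
}

extractDigitNumbersSketch("1234 1,234 12,34 1.234 $1,234.00")
```

Because the comma-grouped alternative is tried first, 1,234 is matched as one number, while 12,34 falls through to the plain alternative and is matched as 12 and 34.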

2.    Modify the findNumbers function from the previous step. In the new version of the function, add an argument named addHighlights that has a default value of FALSE. If you use the function without specifying anything for addHighlights then the function should continue to work as in the previous question. However, if you set addHighlights=TRUE then the function should work as follows:

If addHighlights=TRUE then the function should add << before the number and >> after the number. Make sure that if a number is made up of multiple words, you surround the COMPLETE number and not just part of it. If an element of the character vector contains more than one number then you should use multiple << and >> markers. If the number represents an amount of money, include the dollar sign or other currency symbol inside the <<  and  >> markers. For example:

> stuff <- c("My favorite numbers are one and  two, 3, 1,234 and fifty seven.",
    "Do you have any favorite numbers?",
    "Roman numerals such as VIII and viii are interesting.",
    "I also like colors such as blue, green and yellow.",
    "This pen costs $1.99.")

> findNumbers(stuff, addHighlights=TRUE)
[1] "My favorite numbers are <<one>> and  <<two>>, <<3>>, <<1,234>> and <<fifty seven>>."
[2] "Roman numerals such as <<VIII>> and <<viii>> are interesting."
[3] "This pen costs <<$1.99>>."

Run your function on the book from “Part I” of this assignment. Scan your results and try to fix the function as much as possible. This will be an iterative process.
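
The sketch below shows one way the highlighting could be put together with a single combined regular expression. Everything in it is an assumption made for illustration: the function name, the list of number words, the rule that "and" only joins words after hundred, thousand or million, and the rule that a Roman numeral must have at least two letters (so a lone "I" is never treated as a number). Expect false positives, for example the word "mix" is a valid lowercase Roman numeral.

```r
# Hypothetical sketch of findNumbers with an addHighlights argument.
findNumbersSketch <- function(inputText, addHighlights = FALSE) {
  words <- c("zero", "one", "two", "three", "four", "five", "six", "seven",
             "eight", "nine", "ten", "eleven", "twelve", "thirteen",
             "fourteen", "fifteen", "sixteen", "seventeen", "eighteen",
             "nineteen", "twenty", "thirty", "forty", "fifty", "sixty",
             "seventy", "eighty", "ninety", "hundred", "thousand", "million")
  w <- paste(words, collapse = "|")
  # Digits, optionally with a dollar sign, comma groups and a decimal part.
  digits <- "[$]?\\d{1,3}(?:,\\d{3})+(?:\\.\\d+)?|[$]?\\d+(?:\\.\\d+)?"
  # One or more number words joined by a space or hyphen; " and " may join
  # words only directly after hundred/thousand/million. Case-insensitive.
  wordNumber <- paste0("(?i:\\b(?:", w, ")(?:(?:[ -]|",
                       "(?<=hundred|thousand|million) and )(?:", w, "))*\\b)")
  # Roman numerals of two or more letters, all upper case or all lower case.
  romanUpper <- paste0("\\b(?=[MDCLXVI]{2,}\\b)M{0,3}(?:CM|CD|D?C{0,3})",
                       "(?:XC|XL|L?X{0,3})(?:IX|IV|V?I{0,3})\\b")
  pattern <- paste(digits, wordNumber, romanUpper, tolower(romanUpper),
                   sep = "|")
  # Keep only the elements that contain at least one number.
  result <- inputText[grepl(pattern, inputText, perl = TRUE)]
  if (addHighlights) {
    # Wrap every complete match in << and >>.
    result <- gsub(paste0("(", pattern, ")"), "<<\\1>>", result, perl = TRUE)
  }
  result
}
```

Matching the whole multi-word number in one pattern is what lets the markers surround "fifty seven" as a unit instead of highlighting "fifty" and "seven" separately.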

3.    Modify the findNumbers function from the previous step. Add an argument named includeIndexNumbers, whose default value is FALSE. If includeIndexNumbers is FALSE then the function should continue to work as in the previous step. However, if includeIndexNumbers is TRUE, then the function should return a data.frame. The first column of the data.frame should be named “index” and should contain the index number of each matching element in the original vector. The second column of the data.frame should be named “text” and should contain the matched text as described in the previous step. For example:

> stuff <- c("My favorite numbers are one and  two, 3, 1,234 and fifty seven.",
    "Do you have any favorite numbers?",
    "Roman numerals such as VIII and viii are interesting.",
    "I also like colors such as blue, green and yellow.",
    "This pen costs $1.99.")

> findNumbers(stuff, addHighlights=TRUE, includeIndexNumbers=TRUE)

  index text
1     1 My favorite numbers are <<one>> and  <<two>>, <<3>>, <<1,234>> and <<fifty seven>>.
2     3 Roman numerals such as <<VIII>> and <<viii>> are interesting.
3     5 This pen costs <<$1.99>>.
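
The index bookkeeping can be sketched independently of the number pattern itself. In the illustration below, the helper name and its pattern argument are hypothetical. The point is only that grep returns the positions of the matching elements, which become the index column.

```r
# Hypothetical helper: return matching elements together with their positions.
matchesWithIndexSketch <- function(inputText, pattern) {
  idx <- grep(pattern, inputText, perl = TRUE)  # positions in the original vector
  data.frame(index = idx, text = inputText[idx], stringsAsFactors = FALSE)
}

matchesWithIndexSketch(c("Room 3 is free.", "No digits here.", "Call 555 now."),
                       "[0-9]")
```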

Part III – finding people’s names
1.    It can be challenging to come up with rules for searching for people’s names. It is relatively straightforward to search for a specific name such as “Sherlock” or “Sherlock Holmes”. It is much more difficult to find all of the names that exist in a large set of text if you don’t know what names to look for. However, you might come up with some rules that help. For example,

a.    If two capitalized words appear in a row with no intervening punctuation (e.g. periods), they often indicate a person’s first and last names. However, five or six capitalized words in a row are probably NOT a person’s name. With three or four words it’s hard to say, e.g. Mr. John W. Smith.

b.    A capitalized word that precedes a past-tense verb such as “ate”, “sat”, “ran”, etc. is often a person’s name. If there are two such capitalized words in a row before the verb, without punctuation such as commas, it is reasonable to assume that the words make up a single person’s first and last names.

c.    Capitalized words that follow titles, such as Mr., Mrs., Captain, Capt., Dr., etc., are probably people’s names.

2.    Create a function named findPersonNames. The function should take the same arguments as the findNumbers function that you wrote in the previous part of this assignment – i.e. inputText, addHighlights and includeIndexNumbers. The function should work in a similar fashion to the findNumbers function, however, instead of finding and highlighting numbers, this function will find and highlight people’s names.

You should also include the arguments, titlesFile and titlesVec that work similarly to the abbreviationsFile and abbreviationsVec that were described above regarding the sentenceify function. These arguments allow you to specify a list of titles (e.g. Mr, Mrs, Dr, etc) that could be used by the findPersonNames function to help identify names.

For this function use the highlight markers [[ and ]] instead of << and >>.  The markers should surround the COMPLETE name of a person. Any titles, such as Mr, Mrs., Dr., etc. should be considered to be part of the name.

For example:

> text <-    c(    "I am going to the store.",
        "I think that I might meet Dr. Jones, Bob and Lisa when I'm there.",
        "They all seem to go to that store a lot.",
        "The store owner is Mr. Larry S. Higgins, a really nice guy." )

> findPersonNames(text, addHighlights=TRUE, titlesVec=c("Mr","Mrs","Dr"))
[1] "I think that I might meet [[Dr. Jones]], [[Bob]] and [[Lisa]] when I'm there."
[2] "The store owner is [[Mr. Larry S. Higgins]], a really nice guy."

Use the rules listed above and/or other rules that you come up with. You may add additional arguments to the function if you like. For example, you may add an argument that accepts a vector and/or a file of verbs to use in implementing suggestion “b” listed above.
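
As an illustration of rule “c” only, the sketch below highlights a title followed by capitalized words or initials. It is deliberately incomplete: the function name is hypothetical, titlesVec is assumed to be non-empty and free of regex metacharacters, names are assumed to use plain ASCII letters, and names without a title (such as Bob and Lisa in the example above) are not found because rules “a” and “b” are not implemented.

```r
# Hypothetical sketch: find names that are introduced by a title (rule "c").
findTitledNamesSketch <- function(inputText, titlesVec, addHighlights = FALSE) {
  titles <- paste(titlesVec, collapse = "|")
  # A title, an optional period, then one or more capitalized words or
  # single-letter initials such as "S.", each preceded by one space.
  pattern <- paste0("\\b(?:", titles, ")\\.?(?: (?:[A-Z][a-z]+|[A-Z]\\.))+")
  result <- inputText[grepl(pattern, inputText, perl = TRUE)]
  if (addHighlights) {
    result <- gsub(paste0("(", pattern, ")"), "[[\\1]]", result, perl = TRUE)
  }
  result
}

findTitledNamesSketch(
  c("I am going to the store.",
    "The store owner is Mr. Larry S. Higgins, a really nice guy."),
  titlesVec = c("Mr", "Mrs", "Dr"), addHighlights = TRUE)
```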

Part IV – get the unique names and numbers
1.    Write R functions getUniqueNames(vec) and getUniqueNumbers(vec), where vec is assumed to be a character vector that contains the lines of a book. These functions should return the sorted unique names of people and the sorted unique numbers in the book, respectively. You can use the findPersonNames and findNumbers functions that you created to help implement these functions.
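
Both functions need the matched pieces of text themselves rather than the whole sentences. One way to get them, sketched below under the assumption that you already have a regular expression for names or numbers, is regmatches with gregexpr. The helper name uniqueMatchesSketch and its pattern argument are hypothetical.

```r
# Hypothetical helper: every distinct match of a pattern, sorted.
uniqueMatchesSketch <- function(vec, pattern) {
  # regmatches returns one character vector of matches per element of vec.
  allMatches <- unlist(regmatches(vec, gregexpr(pattern, vec, perl = TRUE)))
  sort(unique(allMatches))
}

uniqueMatchesSketch(c("Holmes met Watson.", "Watson left."), "[A-Z][a-z]+")
```

Note that numbers extracted this way are character strings, so sort orders them as text; convert them first if you want numeric order.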

Part V – testing your code and what to submit
1.    In addition to the code for the functions you should add code to your script that does the following:
a.    Run your findNumbers and findPersonNames functions on the full text of “The Adventures of Sherlock Holmes”.  Use the addHighlights=TRUE and includeIndexNumbers=TRUE options.
Include the output of these function calls in the Word document.

b.    Run the functions getUniqueNumbers and getUniqueNames on the text of “The Adventures of Sherlock Holmes”.

c.    Perform the same steps on the following books and paste your results in the Word document.
i.    “The Hound of the Baskervilles” (i.e. another Sherlock Holmes book).
ii.    “Gulliver’s Travels into Several Remote Nations of the World” by Jonathan Swift

2.    Submit a Word document that contains
a.    all of the code for the functions you wrote as well as the commands for item 1 above.
b.    the output produced by running all of your code.

3.    Add an “Executive Summary” section at the beginning of the Word document in which you very clearly summarize the following for “The Adventures of Sherlock Holmes”:
a.    A list of the unique names and a list of the unique numbers (just cut and paste the output of the code from above).
b.    How many unique numbers and person names you found.
c.    How many were valid (based on your reading through the results).
d.    How many were incorrectly identified as names or numbers (it’s ok, you are expected to have a few errors).