I wrote this script for a code challenge on TreeHouse. I have already submitted my code, but I am looking for after-the-fact feedback.
The challenge was to open a file of text, remove specific words (provided in another file) and do the following:
Output to the screen in a readable fashion:
1. total word count after filtering
2. highest occurring word
3. longest word(s) and its / their length
4. sort the list from most occurring to least occurring then output that data to the screen as an unordered list
5. Bonus: Return the word list in JSON.
I decided to split the solution into two classes:
- One class to sanitise the string into words only
- Another class to do the analysis
This is the sanitiser class. Given text it should return a string of words only. It should remove anything that is not a letter or an apostrophe, and compress all white space into a single space character.
I accept that this has problems. For example, I found an instance of an apostrophe as a word. Perhaps it was a leading or a trailing one. But defining the boundaries of a word is a tricky proposition.
class WordsOnlySanitizer
  # Allow for diacritics, hence \p{Alpha} and not \w
  # We should not split words on apostrophes either
  WORDS_ONLY_REGEX = /[^\p{Alpha}']/i

  # We want to reduce all white space into a single space
  SPACE_ONLY_REGEX = /\s+/

  def self.to_words(text)
    text.gsub(WORDS_ONLY_REGEX, ' ').gsub(SPACE_ONLY_REGEX, ' ')
  end
end
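As a rough sketch of one way to deal with the stray apostrophes mentioned above: treat an apostrophe as part of a word only when it has a letter on both sides. `strip_stray_apostrophes` and `STRAY_APOSTROPHE_REGEX` are hypothetical names, not part of the code I submitted, and the rule itself is an assumption about where a word ends.

```ruby
# Hypothetical helper, not part of the submitted solution.
# Assumption: an apostrophe belongs to a word only when it sits between two letters.
STRAY_APOSTROPHE_REGEX = /(?<!\p{Alpha})'|'(?!\p{Alpha})/

def strip_stray_apostrophes(text)
  # Replace leading, trailing and free-standing apostrophes with a space,
  # then compress the white space again and trim both ends.
  text.gsub(STRAY_APOSTROPHE_REGEX, ' ').gsub(/\s+/, ' ').strip
end

puts strip_stray_apostrophes("'tis the dogs' bone ' don't")
# => tis the dogs bone don't
```

The trade-off is that possessives such as dogs' and contractions such as 'tis lose their apostrophe, so this only moves the word-boundary problem rather than solving it.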
This is the analyser class. It's self-explanatory. There is duplication in the longest_words and highest_occurring_words methods, but I'm not sure how to remove it without making the code less readable. The html_list method also looks a little suspect, but I can't tell why.
require 'json'

class Analyser
  def initialize(text, filter)
    @words = text.split
    @filter = filter.split
  end

  def word_count
    filtered_words.size
  end

  def word_occurrences
    @word_occurrences ||= filtered_words.inject(Hash.new(0)) do |result, word|
      result[word] += 1
      result
    end
  end

  def highest_occurring_words
    word_occurrences.group_by { |key, value| value }.max_by { |key, value| key }.last
  end

  def longest_words
    filtered_words.inject({}) do |result, word|
      result[word] = word.length
      result
    end.group_by { |key, value| value }.max_by { |key, value| key }.last
  end

  def html_list
    list = ""
    word_occurrences.sort_by { |key, value| value }.reverse.each do |key, value|
      list << " <li>#{key}: #{value}</li>\n"
    end
    "<ul>\n" + list + "</ul>"
  end

  def json_list
    JSON.parse(word_occurrences.to_json)
  end

  private

  def filtered_words
    @filtered_words ||= @words.reject do |word|
      # Downcase so that Hello and hello count as two occurrences of the same word
      word.downcase!
      @filter.include?(word)
    end
  end
end
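On the duplication between longest_words and highest_occurring_words: a minimal sketch of how the shared "find the maximum value, keep every pair that has it" step might be pulled out. `pairs_with_max_value` is a hypothetical name, not part of the submitted code, and the sample data is made up for illustration.

```ruby
# Hypothetical helper, not part of the submitted solution.
# Accepts a Hash or an Array of [key, value] pairs and returns
# every pair whose value equals the maximum value.
def pairs_with_max_value(pairs)
  pairs = pairs.to_a
  max = pairs.map { |_key, value| value }.max
  pairs.select { |_key, value| value == max }
end

# Made-up sample data standing in for Analyser#word_occurrences.
occurrences = { 'whale' => 3, 'sea' => 3, 'ship' => 1 }
lengths = occurrences.keys.map { |word| [word, word.length] }

p pairs_with_max_value(occurrences) # => [["whale", 3], ["sea", 3]]
p pairs_with_max_value(lengths)     # => [["whale", 5]]
```

With something like this, both methods could reduce to building their pairs and handing them to the helper. Whether that reads better than the two explicit chains is a judgment call.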
Usage
Here's how you would use this:
text = WordsOnlySanitizer.to_words(File.read('words.txt'))
filter = WordsOnlySanitizer.to_words(File.read('filter_words.txt'))
analyser = Analyser.new(text, filter)

puts "Word count after filtering is: #{analyser.word_count}"
puts "\n"

puts "The most frequent words are:"
analyser.highest_occurring_words.each do |key, value|
  puts " - #{key}: #{value} occurrences"
end
puts "\n"

puts "The longest words are:"
analyser.longest_words.each do |word|
  puts " - #{word.first}: #{word.last} characters"
end
puts "\n"

puts "Word list:"
puts analyser.html_list

puts "JSON object:"
puts analyser.json_list
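One thing worth noting about the last line: json_list round-trips the hash through JSON.parse, so it hands back a Ruby Hash and not a JSON string. A small sketch of the difference, using made-up sample data in place of the real word counts:

```ruby
require 'json'

# Made-up sample data standing in for Analyser#word_occurrences.
occurrences = { 'whale' => 3, 'sea' => 2 }

round_tripped = JSON.parse(occurrences.to_json) # back to a Ruby Hash
json_string   = occurrences.to_json             # an actual JSON String

puts round_tripped.class # => Hash
puts json_string         # => {"whale":3,"sea":2}

# JSON.pretty_generate gives an indented, more readable JSON document.
puts JSON.pretty_generate(occurrences)
```

If the bonus task means "return a JSON document", returning word_occurrences.to_json (or JSON.pretty_generate for screen output) may be closer to the intent. That is my reading of the task, not something the challenge text spells out.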
Here's a gist with all the files. Warning: There are large text files.
WORDS_ONLY_REGEX? If so, it's because things like "!' and i. and ii. get included as words, when they are not. My intention was to clean up anything that's not alpha, compress all white space into one space, then split into an array. So in a sense, I am doing that. Perhaps I don't understand your question fully, though. – Mohamad Apr 3 '14 at 16:07
\p{Alpha} would still include stuff i. and ii., grabbing them without the period, correct? – Jonah Apr 3 '14 at 19:47
' character is not removed, regardless whether it is an apostrophe or not. – Mohamad Apr 3 '14 at 21:04