Number of Occurrences of Each Different Word in a Line of Text

Question

I am looking for feedback on a solution to the following problem posed from a book that I'm working through (Java: How To Program 9th Edition) :-

Write an application that reads a line of text and prints a table indicating the number of occurrences of each different word in the text. The application should include the words in the table in the same order in which they appear in the text. For example, the lines

To be, or not to be: that is the question: Whether 'tis nobler in the mind to suffer

contain the word “to” three times, the word “be” two times, the word “or” once, etc.

import java.util.Scanner;
import java.util.ArrayList;
import java.util.Arrays;

public class TextAnalysisC {

/**
 * @param args the command line arguments
 */
public static void main(String[] args) {
    Scanner sc = new Scanner( System.in );
    System.out.println( "Please enter a line of text" );
    String userInput = sc.nextLine();

    userInput = userInput.toLowerCase();

    userInput = userInput.replaceAll( "\\W", " " );     // strip out any non words.
    userInput = userInput.replaceAll( "  ", " " );      // strip out any double spaces
                                                        //   created from stripping out non words
                                                        //   in the first place!
    String[] tokens = userInput.split( " " );
    System.out.println( userInput );

    ArrayList< String > items = new ArrayList< String >();

    items.addAll( Arrays.asList( tokens ) );

    int count = 0;

    for( int i = 0; i < items.size(); i++ )
    {
        System.out.printf( "%s: ", items.get( i ) );
        for( int j = 0; j < items.size(); j++ )
        {
            if( items.get( i ).equals( items.get( j ) ) )
                count++;
            if( items.get( i ).equals( items.get( j ) ) && count > 1 )
                items.remove( j );                      // after having counted at least
        }                                               // one, remove duplicates from List        
        System.out.printf( "%d\n", count );
        count = 0;
    }
}
}

I'm interested in all forms of feedback, can this be simplified? Is this easy to understand? What should I improve?

Edit(30/7/2013) - I regret to report that I have been remiss on the current scope of my knowledge as I work through my textbook (Java: 9th Edition) solutions should be within the limits of what I have covered so far, that being -

Introduction to Computers and Java
Introduction to Java Applications
Introduction to Classes, Objects, Methods and Strings
Control Statements: Part 1
Control Statements: Part 2
Methods: A Deeper Look
Arrays and ArrayLists
Classes and Objects: A Deeper Look
Object-Oriented Programming: Inheritance
Object-Oriented Programming: Polymorphism
Exception Handling: A Deeper Look
GUI Components: Part 1 (Swing)
Strings, Characters and Regular Expressions

your code has a bug. Try to give a b b b c as input and see the result. — tintinmj, Jul 28 '13 at 18:52
oh hell yes, well spotted @tintinmj, I think it has something to do with the line items.remove( j ) reordering the sequence of letters in the ArrayList. I think it could be confusing things, I'll try and fix it in the morning. A better algorithm might be in order. — cryptocurry, Jul 28 '13 at 19:39

bowmore · Accepted Answer · 2013-07-30 22:04:42Z

The easiest way would be to use a LinkedHashMap<String,AtomicInteger>, adding the words in order.

LinkedHashMap is a Map, so duplicate keys can easily be detected. AtomicInteger will allow incrementing the value without having to replace the value in the Map. Yet AtomicInteger isn't exactly intended to be used as a dumb counter, making your own Counter class is also a good option, as long as it has an increment() and a get() method, it'll make your interaction with the Map a lot smoother.

You can use String.split() to break up the input String in words.

Quick pseudo code:

split input String into words
loop the words
   if word not in map
      add word as key, with new counter
   get counter for word and increment
loop map entries
   print word and number of occurrences

Edit

Given the scope of your knowledge, I'll make a suggestion that only uses what you should know already. The idea is to replace the Map by an ArrayList of words and a matching array of int to hold the counts.

split input String into words
initialize an int[] array with the size equal to the number of words // max needed capacity
initialize an ArrayList of Strings
loop the words
   determine index of the word in ArrayList.
   if word not in ArrayList
      add word to ArrayList
   increment int in array at matching index as the word in the ArrayList
loop ArrayList
   print word and number of occurrences (from same index in array)

as a beginner I haven't covered these topics yet, (see edit above) my apologies for the lack of clarity. However I shall revisit your contribution when I come to the appropriate chapter and absorb your solution at that time. — cryptocurry, Jul 30 '13 at 17:57

user1542234 · Answer 2 · 2013-07-28 20:10:41Z

Both calls to replaceAll and the call to split could be replaced by 1 call to split. Remember that split accepts a regex and that regex patterns can endlessly combine matches for "one or more occurrences" of another pattern with matches for "one pattern or another", etc.

In any case, your regex usage would have to be a little more complicated if you wanted to support words like "don't".

Your counting algorithm does not scale up well for large inputs -- it's "O(n-squared)", and it uses a lot of custom code to make an ArrayList do the work of a more functional collection class. What you want is one scan over your tokens to build up a smarter collection of tokens with their running counts. For that, you want a collection that lets you efficiently find the running count for an "old" token without serially scanning for the token among all the tokens so far. After that, you can make one pass over the collection to print your output. That means that you want the collection to preserve the order in which you added each token.

There's a collection class for that. It's name currently escapes me. It'd probably be instructive for you to research it, anyway, since this is an educational project.

hey user1542234, the book I'm using doesn't seem to show how to do this with split(), I tried split.(" \\s*") to no avail. Searching for more info. I've not fully covered Collections yet, still a few chapters away, sorry for the lack of clarity (see edit above). — cryptocurry, Jul 30 '13 at 18:07

cryptocurry · Answer 3 · 2013-07-30 17:51:16Z

import java.util.Scanner;
import java.util.ArrayList;
import java.util.Arrays;

public class TextAnalysisC {

/**
 * @param args the command line arguments
 */
public static void main(String[] args) {
    Scanner sc = new Scanner( System.in );
    System.out.println( "Please enter a line of text" );
    String userInput = sc.nextLine();

    userInput = userInput.toLowerCase();

    userInput = userInput.replaceAll( "\\W", " " );     // strip out any non words.
    userInput = userInput.replaceAll( "  ", " " );      // strip out any double spaces
                                                        //   created from stripping out non words
                                                        //   in the first place!
    String[] tokens = userInput.split( " " );

    ArrayList< String > items = new ArrayList< String >();

    items.addAll( Arrays.asList( tokens ) );

    int count = 0;

    for( int i = 0; i < items.size(); i++ )
    {
        if( items.get( i ).equals( "***" ) )    // skip words marked as duplicate
            continue;                           //   not essential, only in the interests 
                                                //   of reducing clock cycles
        for( int j = 0; j < items.size(); j++ )
        {
            if( items.get( i ).equals( items.get( j ) ) )
                count++;
            if( items.get( i ).equals( items.get( j ) ) && count > 1 )
            {
                items.remove( j );
                items.add( j, "***" );                   
            }
        }                                                     
        count = 0;
    }

    for( int i = 0; i < items.size(); i++ )     // time to strip out "***"
        if( items.get( i ).equals( "***" ) )
            items.remove( i-- );            // remove then post decrement i
                                            //   due to shifting ArrayList indexes

    int[] countArray = new int[ items.size() ];

    for( int i = 0; i < items.size(); i++ )     // count frequency of words into countArray
        for( int j = 0; j < tokens.length; j++ )
            if( items.get( i ).equals( tokens[ j ]) )
                countArray[ i ]++;

    System.out.println();

    for( int i = 0; i < items.size(); i++ )     // print results into crude table
        System.out.printf( "%s: %d\n", items.get( i ), countArray[ i ] );
}
}

OK so this is the solution I came up with after a second look. It seems to work without breaking. I'm not so sure if it is good practice to mess with a for loop variable within the scope of said for loop? I did it anyway at the point where I was stripping out markers (see comments, post decrement) and it seems to work a treat. This was needed due to the shifting ArrayList indexes that occur when an item is removed from the list.

asked	10 months ago
viewed	3179 times
active	10 months ago

current community

your communities

more stack exchange communities

Number of Occurrences of Each Different Word in a Line of Text

3 Answers

Your Answer

Not the answer you're looking for? Browse other questions tagged java readability beginner or ask your own question.

Hot Network Questions

current community

your communities

more stack exchange communities

Number of Occurrences of Each Different Word in a Line of Text

3 Answers

Your Answer

Sign up or log in

Post as a guest

Not the answer you're looking for? Browse other questions tagged java readability beginner or ask your own question.

Related

Hot Network Questions