Use LINQ to select specific words from a file to build a dictionary in C#

Recently I needed a big word list, so I searched for public domain dictionaries. I found one that was close to what I needed in the file 6of12.txt in the 12Dicts package available here. That file has several problems that make it not quite perfect for my use:

  • It contains words that are too short or too long for my purposes.
  • It includes non-alphabetic characters at the end of some words to give extra information about them.
  • Some words contain embedded non-alphabetic characters as in A-bomb and bric-a-brac.

This example reads the file, removes the unwanted trailing characters, discards words that still contain non-alphabetic characters, selects words whose lengths fall within a specified range, and saves the result in a new file. The following code shows how the program does this.

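// Requires using directives for System.IO, System.Linq,
// and System.Text.RegularExpressions.

// Select words whose lengths are between the given minimum and maximum.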
private void btnSelect_Click(object sender, EventArgs e)
{
    // Remove non-alphabetic characters at the ends of words.
    Regex end_regex = new Regex("[^a-zA-Z]*$");
    string[] all_lines = File.ReadAllLines("6of12.txt");
    var end_query =
        from string word in all_lines
        select end_regex.Replace(word, "");

    // Remove words that still contain non-alphabetic characters.
    Regex middle_regex = new Regex("[^a-zA-Z]");
    var middle_query =
        from string word in end_query
        where !middle_regex.IsMatch(word)
        select word;

    // Make a query to select words of the desired length.
    int min_length = (int)nudMinLength.Value;
    int max_length = (int)nudMaxLength.Value;
    var length_query =
        from string word in middle_query
        where (word.Length >= min_length) &&
              (word.Length <= max_length)
        select word;

    // Write the selected lines into a new file.
    string[] selected_lines = length_query.ToArray();
    File.WriteAllLines("Words.txt", selected_lines);

    MessageBox.Show("Selected " + selected_lines.Length +
        " words out of " + all_lines.Length + ".");
}

The code starts by using a LINQ query to remove non-alphabetic characters from the ends of words.

It then uses a second LINQ query to select words that now contain no non-alphabetic characters. (That eliminates A-bomb and bric-a-brac.)
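To see what these first two steps do, here is a minimal sketch that applies the same two regular expressions to a handful of sample words. The class name CleanupDemo, the Clean helper, and the sample words with their trailing markers are made up for illustration and are not taken from the 6of12.txt file. It uses LINQ method syntax rather than query syntax, but the logic is meant to be the same.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class CleanupDemo
{
    // The same patterns used in the article's code.
    private static readonly Regex EndRegex = new Regex("[^a-zA-Z]*$");
    private static readonly Regex MiddleRegex = new Regex("[^a-zA-Z]");

    // Strip trailing non-letters, then drop any word
    // that still contains a non-letter.
    public static string[] Clean(string[] words)
    {
        return words
            .Select(word => EndRegex.Replace(word, ""))
            .Where(word => !MiddleRegex.IsMatch(word))
            .ToArray();
    }

    public static void Main()
    {
        // Hypothetical sample lines. The trailing markers are invented.
        string[] sample = { "apple", "zebra:", "A-bomb", "bric-a-brac", "cat&#" };

        // Prints: apple, zebra, cat
        Console.WriteLine(string.Join(", ", Clean(sample)));
    }
}
```

The trailing markers disappear from zebra: and cat&#, while A-bomb and bric-a-brac are dropped because their hyphens are not at the end of the word.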

A third LINQ query then selects words whose lengths are between the minimum and maximum values selected by the user, inclusive.

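Next, the code invokes the final query's ToArray method to convert the results into an array of words. Because LINQ queries use deferred execution, none of the three queries actually runs until this call. The code then uses File.WriteAllLines to write the words into a new file named Words.txt.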

The code finishes by displaying the number of words in the new and original files.
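If you want to try the same idea outside of a Windows Forms program, the following sketch folds the three queries into one method-syntax chain in a console program. This is one possible arrangement, not the example program's actual code. The WordFilter class, the SelectWords method, the command-line arguments, and the length limits of 4 and 8 are all assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;

public static class WordFilter
{
    private static readonly Regex EndRegex = new Regex("[^a-zA-Z]*$");
    private static readonly Regex MiddleRegex = new Regex("[^a-zA-Z]");

    // Return the cleaned words whose lengths are between
    // minLength and maxLength, inclusive. The result is a
    // deferred query; nothing runs until it is enumerated.
    public static IEnumerable<string> SelectWords(
        IEnumerable<string> lines, int minLength, int maxLength)
    {
        return lines
            .Select(line => EndRegex.Replace(line, ""))
            .Where(word => !MiddleRegex.IsMatch(word))
            .Where(word => word.Length >= minLength &&
                           word.Length <= maxLength);
    }

    public static void Main(string[] args)
    {
        // Hypothetical defaults; pass other file names on the command line.
        string inputPath = args.Length > 0 ? args[0] : "6of12.txt";
        string outputPath = args.Length > 1 ? args[1] : "Words.txt";
        int minLength = 4;   // Assumed limits for illustration.
        int maxLength = 8;

        string[] allLines = File.ReadAllLines(inputPath);

        // ToArray forces the deferred query to execute.
        string[] selected =
            SelectWords(allLines, minLength, maxLength).ToArray();
        File.WriteAllLines(outputPath, selected);

        Console.WriteLine("Selected " + selected.Length +
            " words out of " + allLines.Length + ".");
    }
}
```

Keeping the filtering in a method that takes and returns IEnumerable<string> makes it easy to test with an in-memory array instead of a file.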

   

 
