
Follow-up question to: Correcting duplicate names in array

I have an array of file names. For example:

FileContent[] files =
{
    new FileContent() {Content = threeItems, Name = "one.zip" },
    new FileContent() {Content = fiveItems, Name = "one.zip" },
    new FileContent() {Content = sevenItems, Name = "one.zip" },
    new FileContent() {Content = threeItems, Name = "two.zip" },
    new FileContent() {Content = fiveItems, Name = "two.zip" },
    new FileContent() {Content = sevenItems, Name = "two.zip" },
};

Model:

public sealed class FileContent
{
    public byte[] Content { get; set; }

    public string Name { get; set; }
}

I've developed the following method to correct duplicate names. All it does is rename the duplicates: it appends an incremented value to each subsequent duplicate. For example, my method ChangingDuplicateNames(FileContent[] files) corrects the previous array to:

FileContent[] files =
{
    new FileContent() {Content = threeItems, Name = "one.zip" },
    new FileContent() {Content = fiveItems, Name = "one(1).zip" },
    new FileContent() {Content = sevenItems, Name = "one(2).zip" },
    new FileContent() {Content = threeItems, Name = "two.zip" },
    new FileContent() {Content = fiveItems, Name = "two(1).zip" },
    new FileContent() {Content = sevenItems, Name = "two(2).zip" },
};

The implementation of ChangingDuplicateNames(FileContent[] files) is:

private FileContent[] ChangingDuplicateNames(FileContent[] files)
{
    //Creating a dictionary to store duplicated values. The dictionary "Key"
    //is the duplicated name, the "Value" is the number to append to the name
    Dictionary<string, int> duplicateNames = files.GroupBy(x => x.Name)
            .Where(group => group.Count() > 1)
            .ToDictionary(grouped => grouped.Key, grouped => 0);

    if (duplicateNames.Count == 0)
       return files;

    int namesLength = files.Length;
    string actualName = string.Empty;
    for (int indexArray = 0; indexArray < namesLength; indexArray++)
    {
        int value;
        bool isDuplicate = duplicateNames
           .TryGetValue(files[indexArray].Name, out value);
        if (isDuplicate)
        {
           actualName = files[indexArray].Name;
           if (value == 0)
              files[indexArray].Name = files[indexArray].Name;
           else
           {
              //Adding increment to the next duplicate name
              string fileNameWithoutExtension = Path
                  .GetFileNameWithoutExtension(files[indexArray].Name);
              string fileExtension = Path
                  .GetExtension(files[indexArray].Name);
              files[indexArray].Name = fileNameWithoutExtension
                  + "(" + value + ")" + fileExtension;
           }
           duplicateNames[actualName] = ++value;
        }
    }
    return files;
}

My question is: is it possible to improve this algorithm? Could this code be smaller?

Maybe I should not iterate through the whole array of names, but I cannot figure out how to change the names without iterating through the entire array. Thanks in advance.

  • files[indexArray].Name = files[indexArray].Name; - really? (Then, I'd prefer current or name over currentName over actualName. Rather than assigning a descriptively named (conceptual) const (isDuplicate), I'd define a predicate…) Don't know about improve … smaller, but why iterate the names twice? Commented May 5, 2017 at 17:37

2 Answers


Because we are now dealing with objects, we need to create a copy of each FileContent to achieve the same result as in your previous question.

Instead of var currentFile = file, where file was a string, we now use var currentFile = new FileContent() { Content = file.Content, Name = file.Name };.

In addition, we use the HashSet<T> only for lookups and return an IEnumerable<FileContent> instead of an array.

private IEnumerable<FileContent> ChangingDuplicateNames(FileContent[] files)
{
    var hashSet = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    foreach (var file in files)
    {
        var currentFile = new FileContent() { Content = file.Content, Name = file.Name };
        int counter = 0;
        while (!hashSet.Add(currentFile.Name))
        {
            currentFile.Name = CreateFileName(file.Name, ref counter);
        }
        yield return currentFile;
    }
} 

and call it like so

FileContent[] result = ChangingDuplicateNames(files).ToArray();
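The loop above calls a CreateFileName helper that is not shown in the answer (it presumably comes from the previous question). A minimal sketch of what such a helper might look like, assuming the "name(n).ext" format from the question:

```csharp
using System;
using System.IO;

public static class FileNameHelper
{
    // Hypothetical reconstruction of the CreateFileName helper referenced
    // in the answer: increments the counter and inserts "(n)" between the
    // base name and the extension, e.g. "one.zip" -> "one(1).zip".
    public static string CreateFileName(string name, ref int counter)
    {
        counter++;
        return Path.GetFileNameWithoutExtension(name)
            + "(" + counter + ")"
            + Path.GetExtension(name);
    }
}
```

On the first collision the while loop would then produce "one(1).zip", matching the expected output in the question.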
  • Why do we use yield here? Commented May 5, 2017 at 10:13
  • Because we don't want to have a second hashset on which we call ToArray() later on. If you have an IEqualityComparer<FileContent> we could use a HashSet<FileContent> and then we could change the method back to returning FileContent[] by calling hashSet.ToArray(). Commented May 5, 2017 at 10:17
  • 1. Why do you prefer the return type IEnumerable<FileContent> over FileContent[] for ChangingDuplicateNames(...)? 2. Why don't you prefer IEnumerable<FileContent> over FileContent[] as the parameter type? Commented May 5, 2017 at 10:25
  • 1. It's easier and cleaner this way (IMO). 2. I guess you mean passing as a parameter, which isn't a problem. Just change FileContent[] to IEnumerable<FileContent>. Commented May 5, 2017 at 10:29
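The IEqualityComparer<FileContent> mentioned in the comments is not spelled out anywhere. A minimal sketch, assuming equality is defined by name only (case-insensitive, to match the StringComparer.OrdinalIgnoreCase used for the HashSet<string>):

```csharp
using System;
using System.Collections.Generic;

public sealed class FileContent
{
    public byte[] Content { get; set; }
    public string Name { get; set; }
}

// Hypothetical comparer: two FileContent instances are considered equal
// when their names match case-insensitively, mirroring the
// StringComparer.OrdinalIgnoreCase used elsewhere in the answer.
public sealed class FileNameComparer : IEqualityComparer<FileContent>
{
    public bool Equals(FileContent x, FileContent y) =>
        StringComparer.OrdinalIgnoreCase.Equals(x?.Name, y?.Name);

    public int GetHashCode(FileContent obj) =>
        StringComparer.OrdinalIgnoreCase.GetHashCode(obj.Name);
}
```

With this comparer, new HashSet<FileContent>(new FileNameComparer()) rejects a second file with the same name, and hashSet.ToArray() (via System.Linq) returns a FileContent[] directly.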

Edit: I ran the original code, the other answer, and my solution here over 2,400 records. I didn't find any significant performance differences, but more rigorous profiling should be done if the number of records is orders of magnitude larger. My improvements are based on the following observations:

  • You do a lot of work to make a unique name. Unless that format is required, I would go with appending a Guid to those duplicates or something. It is simply less code that I have to write.

  • You have a lot of variables capturing intermediate state: duplicateNames, namesLength, actualName, incrementedValue, isDuplicate, fileNameWithoutExtension, fileExtension, indexArray. If we could eliminate those, the code would be easier to read. I ended up with uniqueNames and i.

  • I like that you remove duplicates in place in the original files list.

  • Another goal of mine was to see if I could reduce the amount of nesting of code.

Assuming we want to keep the original list in the same order, I use a HashSet to store unique names. Assuming we only care about making names unique, I create them using GUIDs. Here is the code:

private static FileContent[] ChangingDuplicateNamesIII(FileContent[] files)
{
    var uniqueNames = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    for (int i = 0; i < files.Length; i++)
    {
        if (uniqueNames.Contains(files[i].Name))
        {
            // Duplicate
            files[i].Name = Path.GetFileNameWithoutExtension(files[i].Name)
                + "_"
                + Guid.NewGuid().ToString()
                + Path.GetExtension(files[i].Name);
        }
        else
        {
            uniqueNames.Add(files[i].Name);
        }
    }
    return files;
}

Given the following input array:

FileContent[] filesSmall =
{
    new FileContent() {Content = "1", Name = "one.zip" },
    new FileContent() {Content = "2", Name = "one.zip" },
    new FileContent() {Content = "3", Name = "two.zip" },
    new FileContent() {Content = "4", Name = "two.zip" },
    new FileContent() {Content = "5", Name = "two.zip" },
    new FileContent() {Content = "6", Name = "one.zip" },
    new FileContent() {Content = "7", Name = "one.zip" },
    new FileContent() {Content = "8", Name = "one.zip" },
    new FileContent() {Content = "9", Name = "two.zip" },
    new FileContent() {Content = "10", Name = "two.zip" },
    new FileContent() {Content = "11", Name = "two.zip" },
};

It produces the following output, with the first column representing the Content (in our simple case a string).

[Image: output from ChangingDuplicateNamesIII]

  • I don't think attaching a GUID instead of a counter is such a good idea. With this solution there is also no way of knowing that a file with a certain name exists multiple times, because of the long meaningless GUID. Besides, the review part is missing, and this is virtually the same code as in the other answer. Commented May 6, 2017 at 6:19
  • Thanks for the feedback. This was my first time answering a question here and I could have done a better job. I'm looking forward to learning from others and doing better, because that is a significant part of what code reviews should be about. Post edited with more details. My solution is very different from the answer, although I did read and run it - no negations, no IEnumerable with a yield, no counter variable, no method to create a file name, no foreach and, last but not least, I replace duplicate names inline in the original array. I hope this helps and appreciate the feedback! Commented May 6, 2017 at 20:07
  • Downvote undone ;-) Commented May 6, 2017 at 20:10
