
Follow-up question to: Correcting duplicate names in array

I have an array of file names. For example:

FileContent[] files =
{
    new FileContent() {Content = threeItems, Name = "one.zip" },
    new FileContent() {Content = fiveItems, Name = "one.zip" },
    new FileContent() {Content = sevenItems, Name = "one.zip" },
    new FileContent() {Content = threeItems, Name = "two.zip" },
    new FileContent() {Content = fiveItems, Name = "two.zip" },
    new FileContent() {Content = sevenItems, Name = "two.zip" },
};

Model:

public sealed class FileContent
{
    public byte[] Content { get; set; }

    public string Name { get; set; }
}

I've developed the following method to correct duplicate names. All it does is rename the duplicates: it appends an incremented value to each subsequent duplicate. For example, my method ChangingDuplicateNames(FileContent[] files) corrects the previous array to:

FileContent[] files =
{
    new FileContent() {Content = threeItems, Name = "one.zip" },
    new FileContent() {Content = fiveItems, Name = "one(1).zip" },
    new FileContent() {Content = sevenItems, Name = "one(2).zip" },
    new FileContent() {Content = threeItems, Name = "two.zip" },
    new FileContent() {Content = fiveItems, Name = "two(1).zip" },
    new FileContent() {Content = sevenItems, Name = "two(2).zip" },
};

The implementation of ChangingDuplicateNames(FileContent[] files) is:

private FileContent[] ChangingDuplicateNames(FileContent[] files)
{
    //Creating a dictionary to store duplicated values. The dictionary "Key"
    //is the duplicated name, the "Value" is the number to append to the name
    Dictionary<string, int> duplicateNames = files.GroupBy(x => x.Name)
            .Where(group => group.Count() > 1)
            .ToDictionary(grouped => grouped.Key, grouped => 0);

    if (duplicateNames.Count == 0)
       return files;

    int namesLength = files.Length;
    string actualName = string.Empty;
    for (int indexArray = 0; indexArray < namesLength; indexArray++)
    {
        int value;
        bool isDuplicate = duplicateNames
           .TryGetValue(files[indexArray].Name, out value);
        if (isDuplicate)
        {
           actualName = files[indexArray].Name;
           if (value == 0)
              files[indexArray].Name = files[indexArray].Name;
           else
           {
              //Adding increment to the next duplicate name
              string fileNameWithoutExtension = Path
                  .GetFileNameWithoutExtension(files[indexArray].Name);
              string fileExtension = Path
                  .GetExtension(files[indexArray].Name);
              files[indexArray].Name = fileNameWithoutExtension
                  + "(" + value + ")" + fileExtension;
           }
           duplicateNames[actualName] = ++value;
        }
    }
    return files;
}

My question is: is it possible to improve this algorithm? Could this code be smaller?

Maybe I should not iterate through the whole array of names, but I cannot figure out how to change the names without iterating through the entire array. Thanks in advance.

  • files[indexArray].Name = files[indexArray].Name; - really? (Then, I'd prefer current or name over currentName over actualName. Rather than assigning a descriptively named (conceptual) const (isDuplicate), I'd define a predicate…) Don't know about improve … smaller, but why iterate the names twice? Commented May 5, 2017 at 17:37

2 Answers


Because we are now dealing with objects, we need to create a copy of each FileContent to achieve the same result as in your previous question.

Instead of var currentFile = file, where file was a string, we now use var currentFile = new FileContent() { Content = file.Content, Name = file.Name };.

In addition, we use the HashSet<T> only for lookups and return an IEnumerable<FileContent> instead of an array.

private IEnumerable<FileContent> ChangingDuplicateNames(FileContent[] files)
{
    var hashSet = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    foreach (var file in files)
    {
        var currentFile = new FileContent() { Content = file.Content, Name = file.Name };
        int counter = 0;
        while (!hashSet.Add(currentFile.Name))
        {
            currentFile.Name = CreateFileName(file.Name, ref counter);
        }
        yield return currentFile;
    }
} 

and call it like so

FileContent[] result = ChangingDuplicateNames(files).ToArray();
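The loop above calls a CreateFileName helper that is not shown in the answer (it presumably comes from the previous question). A minimal sketch of what such a helper might look like, assuming the "name(n).ext" format from the question:

```csharp
using System;
using System.IO;

public static class FileNameHelper
{
    // Hypothetical reconstruction of the CreateFileName helper referenced
    // in the answer: increments the counter and inserts "(n)" between the
    // base name and the extension, e.g. "one.zip" -> "one(1).zip".
    public static string CreateFileName(string name, ref int counter)
    {
        counter++;
        return Path.GetFileNameWithoutExtension(name)
            + "(" + counter + ")"
            + Path.GetExtension(name);
    }
}
```

On the first collision the while loop would then produce "one(1).zip", matching the expected output in the question.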
  • Why do we use yield here? Commented May 5, 2017 at 10:13
  • Because we don't want to have a second hashset on which we call ToArray() later on. If you have an IEqualityComparer<FileContent> we could use a HashSet<FileContent> and then we could change the method back to returning FileContent[] by calling hashSet.ToArray(). Commented May 5, 2017 at 10:17
  • 1. Why do you prefer the return type IEnumerable<FileContent> over FileContent[] for ChangingDuplicateNames(...)? 2. Why don't you prefer IEnumerable<FileContent> over FileContent[] as the parameter type? Commented May 5, 2017 at 10:25
  • 1. It's easier and cleaner this way (IMO). 2. I guess you mean passing as a parameter, which isn't a problem. Just change FileContent[] to IEnumerable<FileContent>. Commented May 5, 2017 at 10:29
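The IEqualityComparer<FileContent> mentioned in the comments is not spelled out anywhere. A minimal sketch, assuming equality is defined by name only (case-insensitive, to match the StringComparer.OrdinalIgnoreCase used for the HashSet<string>):

```csharp
using System;
using System.Collections.Generic;

public sealed class FileContent
{
    public byte[] Content { get; set; }
    public string Name { get; set; }
}

// Hypothetical comparer: two FileContent instances are considered equal
// when their names match case-insensitively, mirroring the
// StringComparer.OrdinalIgnoreCase used elsewhere in the answer.
public sealed class FileNameComparer : IEqualityComparer<FileContent>
{
    public bool Equals(FileContent x, FileContent y) =>
        StringComparer.OrdinalIgnoreCase.Equals(x?.Name, y?.Name);

    public int GetHashCode(FileContent obj) =>
        StringComparer.OrdinalIgnoreCase.GetHashCode(obj.Name);
}
```

With this comparer, new HashSet<FileContent>(new FileNameComparer()) rejects a second file with the same name, and hashSet.ToArray() (via System.Linq) returns a FileContent[] directly.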

Edit: I ran the original code, the other answer, and my solution here over 2,400 records. I didn't find any significant performance differences, but more rigorous profiling should be done if the number of records is orders of magnitude larger. My improvements are based on the following observations:

  • You do a lot of work to make a unique name. Unless that format is required, I would go with appending a Guid to those duplicates or something. It is simply less code that I have to write.

  • You have a lot of variables capturing intermediate state: duplicateNames, namesLength, actualName, incrementedValue, isDuplicate, fileNameWithoutExtension, fileExtension, indexArray. If we could eliminate those, the code would be easier to read. I ended up with uniqueNames and i.

  • I like that you remove duplicates in place in the original files list.

  • Another goal of mine was to see if I could reduce the amount of nesting of code.

Assuming we want to keep the original list in the same order, I use a HashSet to store unique names. Assuming we only care about making names unique, I create them using GUIDs. Here is the code:

private static FileContent[] ChangingDuplicateNamesIII(FileContent[] files)
{
    var uniqueNames = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
    for (int i = 0; i < files.Length; i++)
    {
        if (uniqueNames.Contains(files[i].Name))
        {
            // Duplicate
            files[i].Name = Path.GetFileNameWithoutExtension(files[i].Name)
                + "_"
                + Guid.NewGuid().ToString()
                + Path.GetExtension(files[i].Name);
        }
        else
        {
            uniqueNames.Add(files[i].Name);
        }
    }
    return files;
}

Given the following input array:

FileContent[] filesSmall =
{
    new FileContent() {Content = "1", Name = "one.zip" },
    new FileContent() {Content = "2", Name = "one.zip" },
    new FileContent() {Content = "3", Name = "two.zip" },
    new FileContent() {Content = "4", Name = "two.zip" },
    new FileContent() {Content = "5", Name = "two.zip" },
    new FileContent() {Content = "6", Name = "one.zip" },
    new FileContent() {Content = "7", Name = "one.zip" },
    new FileContent() {Content = "8", Name = "one.zip" },
    new FileContent() {Content = "9", Name = "two.zip" },
    new FileContent() {Content = "10", Name = "two.zip" },
    new FileContent() {Content = "11", Name = "two.zip" },
};

It produces the following output, with the first column representing the Content (in our simple case a string).

[Image: output from ChangingDuplicateNamesIII]

  • I don't think attaching a GUID instead of a counter is such a good idea. With this solution there is also no way of knowing that a file with a certain name exists multiple times, because of the long meaningless GUID. Besides, the review part is missing, and this is virtually the same code as in the other answer. Commented May 6, 2017 at 6:19
  • Thanks for the feedback. This was my first time answering a question here and I could have done a better job. I'm looking forward to learning from others and doing better, because that is a significant part of what code reviews should be about. Post edited with more details. My solution is very different from the answer, although I did read and run it - no negations, no IEnumerable with a yield, no counter variable, no method to create a file name, no foreach and, last but not least, I replace duplicate names inline in the original array. I hope this helps and appreciate the feedback! Commented May 6, 2017 at 20:07
  • Downvote undone ;-) Commented May 6, 2017 at 20:10
