Take the 2-minute tour ×
Programmers Stack Exchange is a question and answer site for professional programmers interested in conceptual questions about software development. It's 100% free, no registration required.

I have a set of n elements (1,000 <= n <= 100,000) and I can compute the grade of similarity between each pair, that is a value from 0 (very similar) to 1 (very different). I would like to cluster the elements based on their grade of similarity.

I thought about representing them as a graph, the elements are the vertices and the weighted edges are the similarity between them. I read about the MCL algorithm but I think it isn't the best approach since my graph is complete.

On the other hand, as there are a lot of elements, maybe computing the similarity between each pair is not the best practice (I want a fast algorithm). I also read something about leader clustering algorithms but, again, I am not sure if it is the best approach because, as far as I know, it is quite prone to fail due to its greediness (I would like something more robust).

Edit: I forgot to mention that I know a threshold for which when the comparison between two elements is higher than it, then I know that they belong to different clusters.

share|improve this question
1  
If I know Similarity(a,b) and Similarity(b,c), does that tell me (or allow me to bound) Similarity(a,c)? –  Ben Aaronson Mar 19 at 13:49
    
No. Sometimes it does, but I never know when it is likely to occur –  ibci Mar 19 at 14:03
3  
The type of clustering algorithms you might want to use (given your requirements) is probably complete-linkage clustering which is necessary to avoid the chaining problem. If it is more complex than that, you may have to reconsider how that "similarity" is defined - e.g. if it is originally computed by a vector of scores, you may have to use that whole vector instead of a single similarity value. –  rwong Mar 19 at 15:33

Your Answer

 
discard

By posting your answer, you agree to the privacy policy and terms of service.

Browse other questions tagged or ask your own question.