Deciding input values to DBSCAN algorithm

Question

I have written Python code that implements the DBSCAN clustering algorithm. My dataset consists of 14k users, each represented by 10 features. I cannot decide what values to use for the Min_samples and epsilon inputs. How should I choose them? The similarity measure is Euclidean distance, which makes the choice even harder. Any pointers?

Evaluate the Euclidean distance on your data set. Does it work? What is a sensible similarity threshold? Then use this threshold as epsilon for DBSCAN. — Anony-Mousse, Apr 15 '12 at 18:57
@Anony-Mousse: I was thinking of this: would it make sense to normalize the Euclidean distances to the range 0-1? Right now the distances can go up to something like 10k+, which makes it difficult to decide on a threshold. But I am not sure how to normalize them. Any ideas? — Maxwell, Apr 15 '12 at 21:58
You might want to read up on the curse of dimensionality, and use some entirely different distance function. Euclidean distance makes sense in the physical world, but not in arbitrary spaces. — Anony-Mousse, Apr 16 '12 at 5:25
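One way to make the threshold discussion in the comments above concrete, sketched here under assumptions: rescale each feature to [0, 1] (so no distance can exceed the square root of the number of features), then look at the sorted distance of every point to its k-th nearest neighbour. A visible "knee" in that sorted list is a commonly used starting point for epsilon. The data below is a made-up stand-in, the helper names `min_max_scale` and `k_distances` are hypothetical, and the brute-force loop is only meant for small samples, not for all 14k users at once.

```python
import math
import random

def min_max_scale(rows):
    """Rescale every feature (column) to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, lo, hi in zip(row, lows, highs)
        ])
    return scaled

def k_distances(rows, k):
    """Distance from each point to its k-th nearest neighbour, sorted ascending."""
    result = []
    for i, p in enumerate(rows):
        dists = sorted(math.dist(p, q) for j, q in enumerate(rows) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Hypothetical stand-in data: 200 users, 10 features on very different scales.
rng = random.Random(0)
users = [[rng.uniform(0, 10 ** (f % 4 + 1)) for f in range(10)] for _ in range(200)]

scaled_users = min_max_scale(users)
kd = k_distances(scaled_users, k=4)

# After scaling, no distance can exceed sqrt(number of features).
max_possible = math.sqrt(len(scaled_users[0]))
print(f"upper bound on any distance: {max_possible:.3f}")
print(f"median 4-distance: {kd[len(kd) // 2]:.3f}")
print(f"90th percentile 4-distance: {kd[int(0.9 * len(kd))]:.3f}")
```

Dividing every distance by `max_possible` would give values in 0-1, but scaling the features first is usually the more important step, since otherwise the feature with the largest range dominates the Euclidean distance.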

Charles Menguy · Answer 1 · 2012-04-14 17:15:10Z

DBSCAN's parameters are often hard to estimate.

Have you considered the OPTICS algorithm? In that case you only need Min_samples, which would correspond to the minimal cluster size.

Otherwise, for DBSCAN I have done it in the past by trial and error: try some values and see what happens. A general rule to follow is that if your dataset is noisy, you should use a larger Min_samples; a suitable value is also correlated with the number of dimensions (10 in this case).
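The trial-and-error approach can be sketched as a small parameter sweep that reports, for each combination, how many clusters were found and how many points were labelled as noise. This is only an illustration: the `dbscan` function below is a minimal brute-force version written for this example (not the asker's code and not a library API), and the data is a made-up mix of two dense groups plus scattered points in 10 dimensions.

```python
import math
import random

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    """Minimal DBSCAN; returns one label per point, NOISE (-1) for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_samples:
                queue.extend(j_neighbours)  # j is a core point: keep expanding
    return labels

# Hypothetical data: two dense groups and some scattered points, 10 features each.
rng = random.Random(42)
points = []
for centre in (0.25, 0.75):
    points += [[rng.gauss(centre, 0.05) for _ in range(10)] for _ in range(60)]
points += [[rng.random() for _ in range(10)] for _ in range(20)]

results = []
for min_samples in (5, 20):
    for eps in (0.05, 0.1, 0.2, 0.4, 0.8):
        labels = dbscan(points, eps, min_samples)
        n_clusters = len(set(labels) - {NOISE})
        n_noise = labels.count(NOISE)
        results.append((min_samples, eps, n_clusters, n_noise))
        print(f"min_samples={min_samples:2d} eps={eps:4.2f} "
              f"clusters={n_clusters} noise={n_noise}")
```

Reading the printed table, combinations where almost everything is noise suggest epsilon is too small, and combinations where everything collapses into one cluster suggest it is too large; the useful range usually lies in between.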

