Grouping clusters using Mahalanobis distances

Question

I need to find the number of clusters in a dataset.
To do this, I am fitting a Gaussian mixture (GM) model.
Bear with me.
Because the underlying distributions (the individual clusters) are not Gaussian, the GM tends to give very bad fits: it tries to compensate for the skewness of the data by, for example, inflating the variance of the Gaussians it fits. I figured I might be able to solve this by fitting more Gaussians than there are clusters expected in the data and then, based on the distances between the fitted Gaussians, working out which ones are true clusters and which ones are multiple fits to the same cluster.
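A minimal sketch of this over-fitting step, assuming scikit-learn's `GaussianMixture` as the fitting tool (the question does not say which implementation is used). The synthetic gamma-distributed data is a stand-in for the real recordings and is purely an assumption for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (an assumption): two well-separated
# 2-D clusters with skewed, non-Gaussian (gamma-distributed) spread.
cluster_a = rng.gamma(shape=2.0, scale=1.0, size=(500, 2))
cluster_b = rng.gamma(shape=2.0, scale=1.0, size=(500, 2)) + np.array([15.0, 15.0])
X = np.vstack([cluster_a, cluster_b])

# Deliberately fit more components (6) than there are true clusters (2).
gmm = GaussianMixture(n_components=6, covariance_type="full", random_state=0)
gmm.fit(X)

means = gmm.means_        # shape (6, 2): one mean per fitted Gaussian
covs = gmm.covariances_   # shape (6, 2, 2): one full covariance per Gaussian
print(means.round(2))
```

With skewed clusters like these, several of the six fitted means would be expected to land on the same true cluster, which is the situation the rest of the question deals with.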

Now my problem is this: I have a matrix of Mahalanobis distances between the Gaussians fitted by the mixture model, but I have no reliable way of counting the clusters from it.

To make this a bit clearer: if there are two real clusters in the dataset and I fit 6 Gaussians, I expect between 1 and 5 of them to fall on top of one real cluster and the remaining 5 to 1 on top of the other. Looking at the distance matrix, I therefore expect to see a few large distances (between-cluster distances; 5, 8, or 9 of the 15 pairs, depending on how the Gaussians split) and the remaining distances to be small (within-cluster distances).
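For concreteness, one way such a distance matrix could be computed is sketched below. The question does not say which covariance is used when measuring the distance between two fitted Gaussians, so this sketch assumes the average of the two covariances for each pair; the function name `pairwise_mahalanobis` and the three-component example are hypothetical.

```python
import numpy as np

def pairwise_mahalanobis(means, covs):
    """Symmetric matrix of Mahalanobis distances between Gaussian components.

    Assumption: each pair (i, j) is measured under the average of the two
    covariances; the question does not specify which covariance is used.
    """
    k = len(means)
    dist = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            diff = means[i] - means[j]
            pooled = 0.5 * (covs[i] + covs[j])
            # solve() avoids forming the explicit inverse of the covariance
            d2 = diff @ np.linalg.solve(pooled, diff)
            dist[i, j] = dist[j, i] = np.sqrt(d2)
    return dist

# Hypothetical example: three components, two close together and one far away.
means = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
covs = np.array([np.eye(2)] * 3)
D = pairwise_mahalanobis(means, covs)
print(D)
```

With identity covariances the result reduces to plain Euclidean distances, which makes the small example easy to check by hand: one small entry (1) and two large ones (9 and 10).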

Here is a sample of the distance matrices I have. The dendrograms are only there to help "see" the structure; they carry no additional information.

[Image: sample Mahalanobis distance matrices with dendrograms]
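The grouping that a dendrogram makes visible can also be read off programmatically. The sketch below assumes single linkage and counts groups of components connected by distances below a cutoff, which is equivalent to cutting a single-linkage dendrogram at that height. The function name `count_groups`, the toy 6x6 matrix, and the cutoff of 4.0 are all hypothetical; choosing the cutoff reliably is exactly the open part of the question.

```python
def count_groups(dist, threshold):
    """Count groups of components linked by distances below `threshold`.

    Equivalent to cutting a single-linkage dendrogram at `threshold`.
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the group representative (path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] < threshold:
                parent[find(i)] = find(j)  # merge the two groups
    return len({find(i) for i in range(n)})

# Hypothetical 6x6 matrix: components 0-3 sit on one cluster, 4-5 on another.
small, large = 1.0, 12.0
groups = [0, 0, 0, 0, 1, 1]
D = [[0.0 if i == j else (small if groups[i] == groups[j] else large)
      for j in range(6)] for i in range(6)]

print(count_groups(D, threshold=4.0))  # prints 2
```

The count depends entirely on the cutoff: in this toy matrix any value between 1 and 12 gives two groups, but real distance matrices may not show such a clean gap.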

If anyone is interested in the raw (or rawer) data, I can provide it, but without the domain-specific knowledge (this is a spike sorting problem in a neurophysiology context) it will not be easy for me to describe the data, even in a few pages.

Any comments?

Can you provide examples of your data? It might be easier to assess your approach if we knew exactly what you used it for, and also easier to give suggestions for different approaches. — penelope, Jan 07 '13 at 16:05
@penelope Thanks. Do you mean actual data? This is a "spike sorting" problem to start with, but this particular stage can be described as follows: in a multidimensional space, I have clusters with distinct means plus a non-Gaussian source of noise. I want to count the clusters (and eventually do the clustering as well). To do so, I fit a mixture-of-Gaussians model with 6 components to the data (the actual number of clusters in the data is 1-4). I end up with a 6 by 6 Mahalanobis distance matrix, and I want to count the clusters using this 6x6 matrix. — Ali, Jan 07 '13 at 16:18
No offence, but TLDR. Give actual data: it's always good. Or your own hand-made examples (minimal examples that still exhibit the problem). Also, while I would classify your question formatting as "good" and not "bad", I still encourage you to put more effort into it: bullets, numbered lists, emphasized parts, and generally structured text are more appealing to read and so to answer as well. A nice outline of your a) problem and b) approach, plus examples, is a sure way to get better and faster help ;) — penelope, Jan 07 '13 at 16:34
@Ali Do you know exactly how many clusters you're expecting? — Phonon, Jan 08 '13 at 19:18
@Phonon No, that is the question: I want to "count" the clusters, which number between 1 and 4, using the distance matrices. — Ali, Jan 09 '13 at 16:52
