Suitable clustering method for 1- or 2-dimensional data

I have a dataset of extracted mass (strictly m/z, but the distinction is not important here) and time values. Because repeated measurements can be recorded while I extract the data from a file, the dataset contains a lot of redundancy. I am looking for a way to cluster the rows so that related ones are grouped together, based on similarity in mass alone or in both mass and time.

Example of data that should be grouped together:

m/z     time
337.65  1524.6
337.65  1524.6
337.65  1604.3

However, I have no way of knowing in advance how many clusters there will be. Does anyone know an effective way to achieve this, perhaps using a simple distance metric? I am not very familiar with clustering algorithms.

+3
3 answers

http://en.wikipedia.org/wiki/Cluster_analysis

http://en.wikipedia.org/wiki/DBSCAN

Read the section on hierarchical clustering, and also look at DBSCAN if you really don't want to specify the number of clusters in advance. You will need to define a distance metric, and in doing so you decide which feature or combination of features you are clustering on.
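To make the DBSCAN suggestion concrete, here is a minimal pure-Python sketch of the algorithm applied to (m/z, time) rows. It is an illustration, not a tuned implementation: the tolerances `mz_tol` and `time_tol`, the helper names, and the fourth data row are all assumptions added for the example. Each column is divided by its tolerance first so that one `eps` value can cover both dimensions.

```python
from math import hypot

def scale(rows, mz_tol, time_tol):
    """Divide each column by its tolerance so both axes share one unit."""
    return [(mz / mz_tol, t / time_tol) for mz, t in rows]

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if hypot(xi - xj, yi - yj) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)          # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:       # not a core point
            labels[i] = NOISE
            continue
        cluster += 1                       # start a new cluster at this core point
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:         # border point: claim it, do not expand
                labels[j] = cluster
                continue
            if labels[j] is not None:      # already assigned to a cluster
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep growing
    return labels

# The three rows from the question plus one hypothetical unrelated row.
rows = [(337.65, 1524.6), (337.65, 1524.6), (337.65, 1604.3), (512.30, 1530.0)]

# Assumed tolerances: 0.1 in m/z and 100 time units.
points = scale(rows, mz_tol=0.1, time_tol=100.0)

# min_pts=1 means every row forms or joins a cluster, so nothing is noise.
labels = dbscan(points, eps=1.0, min_pts=1)
print(labels)
```

The number of clusters falls out of `eps` and `min_pts` rather than being chosen up front, which is the property the question asks for. To cluster on mass alone, one option is to pass a constant for the time column.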

+2

Why don't you just set a threshold?

If consecutive values (in time) differ by less than ±0.1 (in m/z), group them together. Alternatively, use a relative threshold: group values that differ by less than ±0.1%. Set these thresholds according to your domain knowledge.
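A rough sketch of this thresholding idea, assuming the rows are (m/z, time) tuples: sort by time, then start a new group whenever m/z differs from the previous row by more than the tolerance. The function name `group_by_threshold`, its parameters, and the fourth data row are made up for the illustration.

```python
def group_by_threshold(rows, abs_tol=0.1, rel_tol=None):
    """Group (mz, time) rows that are consecutive in time and close in m/z.

    abs_tol is an absolute m/z tolerance; if rel_tol is given (e.g. 0.001
    for 0.1%), a tolerance relative to the previous m/z is used instead.
    """
    groups = []
    for mz, t in sorted(rows, key=lambda r: r[1]):   # order by time
        if groups:
            prev_mz = groups[-1][-1][0]              # m/z of the previous row
            tol = abs_tol if rel_tol is None else rel_tol * prev_mz
            if abs(mz - prev_mz) <= tol:
                groups[-1].append((mz, t))           # same group as before
                continue
        groups.append([(mz, t)])                     # m/z jumped: new group
    return groups

# The three rows from the question plus one hypothetical unrelated row.
rows = [(337.65, 1524.6), (337.65, 1524.6), (337.65, 1604.3), (512.30, 1650.0)]

absolute_groups = group_by_threshold(rows, abs_tol=0.1)
relative_groups = group_by_threshold(rows, rel_tol=0.001)
print(len(absolute_groups), len(relative_groups))
```

Because only neighbouring rows are compared, this runs in a single pass after sorting. If different masses interleave in time, sorting by m/z instead of time may fit the data better.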


+1

K-means (http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm).

You can easily extend K-means to multidimensional data, but beware of how the individual dimensions are scaled. For instance, of the points (1 kg, 1 km) and (2 kg, 2 km), the one nearest to (1.7 kg, 1.4 km) is (2 kg, 2 km) with these units. But as soon as you express the second component in metres, the opposite is true: the nearest point becomes (1 kg, 1000 m).
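The effect of units on the kilograms and kilometres example can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper names `nearest` and `to_metres` are made up for the illustration.

```python
from math import dist  # Euclidean distance, available since Python 3.8

def nearest(query, candidates):
    """Return the candidate with the smallest Euclidean distance to query."""
    return min(candidates, key=lambda p: dist(query, p))

def to_metres(point):
    """Convert the second component from kilometres to metres."""
    return (point[0], point[1] * 1000.0)

candidates_km = [(1.0, 1.0), (2.0, 2.0)]   # (kg, km)
query_km = (1.7, 1.4)

# In (kg, km): distances are about 0.81 and 0.67, so (2, 2) is nearer.
nearest_km = nearest(query_km, candidates_km)

# In (kg, m): the distance column dominates (about 400 vs 600),
# so (1 kg, 1000 m) is nearer.
nearest_m = nearest(to_metres(query_km), [to_metres(p) for p in candidates_km])

print(nearest_km, nearest_m)
```

One common way around this is to rescale every dimension before clustering, for example by dividing each column by a tolerance that is meaningful in your domain.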

0