Implementing a custom agglomerative clustering algorithm from scratch

I understand how agglomerative clustering works: it starts with each data point as its own cluster and then repeatedly merges clusters.

Now I have an n-dimensional space and several data points that have a value in each of these dimensions. I want to merge two points/clusters based on business rules, for example:

  • Merge two clusters c1 and c2 if the distance between them along dimension 1 is < T1, the distance along dimension 2 is < T2, ..., and the distance along dimension n is < Tn.
  • If the rule for dimension 1 is met and the rule for dimension 2 is met, merge them without considering the other dimensions.

... and similar custom rules.
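To make the two example rules concrete, here is a minimal sketch of how such rules could be represented. The names MergeRule, MergeRules, allows and canMerge are hypothetical (they come from no library), and the sketch assumes the per-dimension distances between the two clusters have already been computed into an array.

```java
import java.util.List;

/** Hypothetical rule: the pair may merge if every listed dimension is under its threshold. */
final class MergeRule {
    private final int[] dims;          // indices of the dimensions this rule checks
    private final double[] thresholds; // thresholds[i] is the limit T for dims[i]

    MergeRule(int[] dims, double[] thresholds) {
        if (dims.length != thresholds.length) {
            throw new IllegalArgumentException("one threshold per checked dimension");
        }
        this.dims = dims.clone();
        this.thresholds = thresholds.clone();
    }

    /** perDimDistance[d] holds the distance between the two clusters along dimension d. */
    boolean allows(double[] perDimDistance) {
        for (int i = 0; i < dims.length; i++) {
            if (!(perDimDistance[dims[i]] < thresholds[i])) {
                return false; // a NaN distance is rejected as well
            }
        }
        return true;
    }
}

final class MergeRules {
    /** Assumption: a pair may merge as soon as any single rule allows it. */
    static boolean canMerge(List<MergeRule> rules, double[] perDimDistance) {
        for (MergeRule rule : rules) {
            if (rule.allows(perDimDistance)) {
                return true;
            }
        }
        return false;
    }
}
```

The first example rule would list all n dimensions; the second would list only dimensions 1 and 2 (indices 0 and 1), so the remaining dimensions are simply never looked at.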

In addition, I have my own way of defining and measuring the distance between any two clusters along any particular dimension. One dimension may contain only strings, and I want to define my own string distance for it. Another dimension may contain location names, and the distance between two points along this dimension is the geographical distance between the named locations, and so on for the other dimensions.
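As an illustration of per-dimension distances, the sketch below uses Levenshtein edit distance for a string dimension and the haversine great-circle formula for a location dimension. Both are stand-ins chosen for the example, not the asker's actual measures; the class name DimensionDistances, the coordinate lookup table and the Earth radius of 6371 km are assumptions.

```java
import java.util.Map;

final class DimensionDistances {
    /** Levenshtein edit distance between two strings (stand-in for a custom string distance). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }

    /** Great-circle distance in km (haversine), assuming a spherical Earth of radius 6371 km. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double earthRadiusKm = 6371.0;
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double h = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * earthRadiusKm * Math.asin(Math.sqrt(h));
    }

    /** Distance between two named places; coords maps a name to {latitude, longitude}. */
    static double placeDistanceKm(Map<String, double[]> coords, String p1, String p2) {
        double[] c1 = coords.get(p1);
        double[] c2 = coords.get(p2);
        if (c1 == null || c2 == null) {
            throw new IllegalArgumentException("unknown place name");
        }
        return haversineKm(c1[0], c1[1], c2[0], c2[1]);
    }
}
```

Each dimension gets its own function, and the results can be collected into one array of per-dimension distances for the rules to inspect.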

Is there a framework or software package that lets me define custom distance metrics in this way and then run agglomerative clustering? Of course, the agglomerative clustering should stop once the business rules are no longer satisfied, leaving the clusters formed in the n-dimensional space.
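The stopping behaviour described here can be sketched without any library: keep merging the closest pair of clusters and stop as soon as no pair has a finite distance. This is only a naive illustration under stated assumptions: points are identified by their index, a forbidden pair is signalled by a distance of Double.POSITIVE_INFINITY, and complete linkage is used so that a single forbidden pair of points blocks the merge of their clusters. The class name RuleStoppedAgglomeration is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

final class RuleStoppedAgglomeration {
    /** Clusters point indices 0..n-1; stops when no pair of clusters has a finite distance. */
    static List<List<Integer>> cluster(int n, ToDoubleBiFunction<Integer, Integer> pointDistance) {
        List<List<Integer>> clusters = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            List<Integer> single = new ArrayList<>();
            single.add(i); // every point starts as its own cluster
            clusters.add(single);
        }
        while (true) {
            int bestA = -1;
            int bestB = -1;
            double best = Double.POSITIVE_INFINITY;
            for (int a = 0; a < clusters.size(); a++) {
                for (int b = a + 1; b < clusters.size(); b++) {
                    double d = completeLinkage(clusters.get(a), clusters.get(b), pointDistance);
                    if (d < best) { // an infinite distance never wins
                        best = d;
                        bestA = a;
                        bestB = b;
                    }
                }
            }
            if (bestA < 0) {
                return clusters; // no mergeable pair is left
            }
            clusters.get(bestA).addAll(clusters.remove(bestB)); // bestA < bestB, so bestA stays valid
        }
    }

    /** Complete linkage: the largest point-to-point distance between the two clusters. */
    static double completeLinkage(List<Integer> c1, List<Integer> c2,
                                  ToDoubleBiFunction<Integer, Integer> pointDistance) {
        double max = 0.0;
        for (int x : c1) {
            for (int y : c2) {
                max = Math.max(max, pointDistance.applyAsDouble(x, y));
            }
        }
        return max;
    }
}
```

The loop recomputes every linkage in each round, which is fine for small data sets; a real implementation would cache the distance matrix.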

Thanks,
Abhishek S


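You can do this in Weka.

Implement a DistanceFunction and pass it to the hierarchical clusterer with setDistanceFunction(DistanceFunction distanceFunction).

Other clusterers available in Weka: Cobweb, EM, FarthestFirst, FilteredClusterer, MakeDensityBasedClusterer, RandomizableClusterer, RandomizableDensityBasedClusterer, RandomizableSingleClustererEnhancer, SimpleKMeans, SingleClustererEnhancer.

An example of a distance function, excerpted from NormalizableDistance: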

  /** Index in ranges for MIN. */
  public static final int R_MIN = 0;

  /** Index in ranges for MAX. */
  public static final int R_MAX = 1;

  /** Index in ranges for WIDTH. */
  public static final int R_WIDTH = 2;

  /** the instances used internally. */
  protected Instances m_Data = null;

  /** True if normalization is turned off (default false).*/
  protected boolean m_DontNormalize = false;

  /** The range of the attributes. */
  protected double[][] m_Ranges;

  /** The range of attributes to use for calculating the distance. */
  protected Range m_AttributeIndices = new Range("first-last");

  /** The boolean flags, whether an attribute will be used or not. */
  protected boolean[] m_ActiveIndices;

  /** Whether all the necessary preparations have been done. */
  protected boolean m_Validated;


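  /**
   * Calculates the distance between two instances, attribute by attribute.
   * Returns Double.POSITIVE_INFINITY as soon as the running distance exceeds cutOffValue.
   */
  public double distance(Instance first, Instance second, double cutOffValue, PerformanceStats stats) {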
    double distance = 0;
    int firstI, secondI;
    int firstNumValues = first.numValues();
    int secondNumValues = second.numValues();
    int numAttributes = m_Data.numAttributes();
    int classIndex = m_Data.classIndex();

    validate();

    for (int p1 = 0, p2 = 0; p1 < firstNumValues || p2 < secondNumValues; ) {
      if (p1 >= firstNumValues)
        firstI = numAttributes;
      else
        firstI = first.index(p1); 

      if (p2 >= secondNumValues)
        secondI = numAttributes;
      else
        secondI = second.index(p2);

      if (firstI == classIndex) {
        p1++; 
        continue;
      }
      if ((firstI < numAttributes) && !m_ActiveIndices[firstI]) {
        p1++; 
        continue;
      }

      if (secondI == classIndex) {
        p2++; 
        continue;
      }
      if ((secondI < numAttributes) && !m_ActiveIndices[secondI]) {
        p2++;
        continue;
      }

      double diff;

      if (firstI == secondI) {
        diff = difference(firstI,
                  first.valueSparse(p1),
                  second.valueSparse(p2));
        p1++;
        p2++;
      }
      else if (firstI > secondI) {
        diff = difference(secondI, 
                  0, second.valueSparse(p2));
        p2++;
      }
      else {
        diff = difference(firstI, 
                  first.valueSparse(p1), 0);
        p1++;
      }
      if (stats != null)
        stats.incrCoordCount();

      distance = updateDistance(distance, diff);
      if (distance > cutOffValue)
        return Double.POSITIVE_INFINITY;
    }

    return distance;
  }

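This shows that the individual dimensions (called attributes in Weka) can be handled separately, so you can define a different distance for each dimension/attribute.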

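As for the business rules: I think you can write a distance function that returns Double.POSITIVE_INFINITY when the business rules are not satisfied.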
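A minimal sketch of the infinity idea, kept independent of Weka so that it stays self-contained: it is not an implementation of Weka's DistanceFunction interface. The class name RuleGuardedDistance is hypothetical, points are assumed to be plain numeric vectors, and the per-dimension distance is assumed to be the absolute difference, combined into a Euclidean distance.

```java
import java.util.function.Predicate;

final class RuleGuardedDistance {
    private final Predicate<double[]> rule; // business rule over the per-dimension distances

    RuleGuardedDistance(Predicate<double[]> rule) {
        this.rule = rule;
    }

    /** Euclidean distance if the rule accepts the pair, otherwise positive infinity. */
    double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("points must have the same number of dimensions");
        }
        double[] perDim = new double[a.length];
        double sumSq = 0.0;
        for (int d = 0; d < a.length; d++) {
            perDim[d] = Math.abs(a[d] - b[d]); // assumed per-dimension distance
            sumSq += perDim[d] * perDim[d];
        }
        return rule.test(perDim) ? Math.sqrt(sumSq) : Double.POSITIVE_INFINITY;
    }
}
```

For example, new RuleGuardedDistance(d -> d[0] < 1.0 && d[1] < 2.0) yields a finite distance only for pairs that are close enough in both dimensions. The same logic could be placed inside a custom distance function class for whichever clustering library is used.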


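ELKI is another option. It offers many more clustering algorithms than Weka (which is geared mostly toward classification). Its wiki has a tutorial on implementing custom distance functions (which you should then be able to use in hierarchical clustering).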


