.NET Distinct() and complex conditions

Suppose I have a class

public class Audio
{
    public string artist   { get; set; }
    public string title    { get; set; }
    // etc.
}

Now I want to filter duplicates out of a list of such Audio objects using a similarity condition, not exact equality. Basically, it is the Levenshtein distance with a threshold corrected for the total length of the string. The problem is that the general advice about IEqualityComparer is "always implement both GetHashCode and Equals." I obviously cannot calculate a distance between strings in GetHashCode, because it is not a comparison method at all. But then even similar Audio objects return different hashes, Distinct() treats them as different objects, and Equals() is never called.
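To make the condition concrete, here is a minimal sketch of a length-relative Levenshtein check. The class name Similarity, the method AreSimilar and the 20% ratio are assumptions for illustration only; the exact correction used in the question is not shown.

```csharp
using System;

public static class Similarity
{
    // Classic two-row dynamic-programming Levenshtein distance.
    public static int Levenshtein(string a, string b)
    {
        var prev = new int[b.Length + 1];
        var curr = new int[b.Length + 1];
        for (int j = 0; j <= b.Length; j++) prev[j] = j;

        for (int i = 1; i <= a.Length; i++)
        {
            curr[0] = i;
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                curr[j] = Math.Min(Math.Min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            var tmp = prev; prev = curr; curr = tmp;   // reuse the two rows
        }
        return prev[b.Length];
    }

    // Hypothetical rule: similar when the distance is at most
    // maxRatio of the longer string's length.
    public static bool AreSimilar(string a, string b, double maxRatio = 0.2)
    {
        int longer = Math.Max(a.Length, b.Length);
        if (longer == 0) return true;
        return Levenshtein(a, b) <= maxRatio * longer;
    }
}
```

A rule like this is not transitive (A can be similar to B and B to C without A being similar to C), which is one more reason it does not fit the equality contract that Distinct() expects.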

I tried making GetHashCode always return 0, so that Equals is called for every pair of objects in the collection, but that is slow. So, finally, the question: is there anything I can do with .NET out of the box, or do I need to find a good filtering algorithm?
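For reference, the constant-hash workaround described above looks roughly like the following sketch. The comparer name and the injected similarity delegate are assumptions, not code from the question.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Audio
{
    public string artist { get; set; }
    public string title { get; set; }
}

// Hypothetical comparer that delegates to a caller-supplied similarity predicate.
public class SimilarAudioComparer : IEqualityComparer<Audio>
{
    private readonly Func<Audio, Audio, bool> _similar;

    public SimilarAudioComparer(Func<Audio, Audio, bool> similar)
    {
        _similar = similar;
    }

    public bool Equals(Audio x, Audio y)
    {
        return _similar(x, y);
    }

    // Constant hash: every item lands in the same bucket, so Distinct()
    // has to call Equals against the items it has already kept.
    public int GetHashCode(Audio obj)
    {
        return 0;
    }
}
```

Used as audios.Distinct(new SimilarAudioComparer(...)), this turns the hash set inside Distinct() into a linear scan per element, which is where the slowness comes from.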

+5
2 answers

I would suggest, above all, not using Distinct or GetHashCode.

GetHashCode is too strict for your case (as @Gabe notes). What you can do:

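  • Accept that you have to compare all pairs of instances, a triangle of O(n^2) complexity, using your Levenshtein method
  • Then try to optimize: can a cheap Levenshtein-like score be calculated per Audio instance (for example, against a common, fixed reference)?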

If so, that score would be something like a darn good GetHashCode. You could not use it as GetHashCode, but you could use it like this:

// Cheap pre-check first; the expensive Levenshtein runs only when the pair might be similar.
bool AreSimilar(Audio me, Audio you) {
  int cheapLevenshtein = Math.Abs(me.AbsoluteQuasiLevenshtein - you.AbsoluteQuasiLevenshtein);

  if (cheapLevenshtein >= THRESHOLD)
    return false;

  int expensiveLevenshtein = Audio.LevenshteinBetween(me, you);
  return expensiveLevenshtein < LIMIT;
}

This can give you a faster algorithm, but again: no Distinct(). You would have to write the filtering yourself.

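AbsoluteQuasiLevenshtein should be a cheap per-instance value that behaves like this: for "ab" and "zy" the values are close, while for "ab" and "blahblahblahblah" they are far apart, so that only close values trigger the real comparison (think of it as a GetHashCode compared with a plus-or-minus tolerance instead of strict GetHashCode equality).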
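Putting the two bullets together, a hand-written replacement for Distinct() could look like the sketch below. The names SimilarityFilter, DistinctBySimilarity, cheapKey and distance are all assumptions; the metric and the cheap per-item score are passed in by the caller.

```csharp
using System;
using System.Collections.Generic;

public static class SimilarityFilter
{
    // Keeps the first item of every group of similar items (O(n^2) worst case).
    // cheapKey: inexpensive per-item number, computed once per item.
    // distance: the expensive metric, e.g. a Levenshtein implementation.
    public static List<T> DistinctBySimilarity<T>(
        IEnumerable<T> items,
        Func<T, int> cheapKey,
        Func<T, T, int> distance,
        int threshold,
        int limit)
    {
        var kept = new List<T>();
        var keptKeys = new List<int>();

        foreach (var item in items)
        {
            int key = cheapKey(item);
            bool duplicate = false;

            for (int i = 0; i < kept.Count; i++)
            {
                // Cheap pre-check: skip the expensive call when it rules the pair out.
                if (Math.Abs(key - keptKeys[i]) >= threshold) continue;

                if (distance(item, kept[i]) < limit)
                {
                    duplicate = true;
                    break;
                }
            }

            if (!duplicate)
            {
                kept.Add(item);
                keptKeys.Add(key);
            }
        }
        return kept;
    }
}
```

For strings, one candidate for cheapKey is s => s.Length: the difference in length never exceeds the Levenshtein distance, so with threshold equal to limit the pre-check cannot discard a pair that the expensive check would have accepted.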

+3

You could use a BKTree; here is an implementation that can be called from C#:

https://bitbucket.org/ptasz3k/bktree

The project is for VS 2012.

You pass it a key selector (for example x => x.Key.ToLowerInvariant()), and it compares those keys by Levenshtein distance.

In your case, something like:

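// Build the tree once, keyed by artist name.
var bk = BKTree.CSharp.CreateBK(x => x.artist, audios);
var allArtists = audios.Select(x => x.artist);
// For every artist, collect the entries within `threshold` edits of it.
var possibleDuplicates = allArtists.Select(x => new
    { Key = x, Similar = BKTree.CSharp.FindInBk(bk, x, threshold).ToList() });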
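To show why a BK-tree helps, here is a minimal self-contained sketch of the data structure. It is not the API of the linked library: the class BkTree and its Add and Find methods are hypothetical names, and the distance function is supplied by the caller and must be a true metric such as Levenshtein distance.

```csharp
using System;
using System.Collections.Generic;

// Minimal BK-tree over strings (illustrative only).
public class BkTree
{
    private class Node
    {
        public string Value;
        public Dictionary<int, Node> Children = new Dictionary<int, Node>();
        public Node(string value) { Value = value; }
    }

    private readonly Func<string, string, int> _distance;
    private Node _root;

    public BkTree(Func<string, string, int> distance) { _distance = distance; }

    public void Add(string value)
    {
        if (_root == null) { _root = new Node(value); return; }
        var node = _root;
        while (true)
        {
            int d = _distance(value, node.Value);
            if (d == 0) return;                      // already stored
            Node child;
            if (!node.Children.TryGetValue(d, out child))
            {
                node.Children[d] = new Node(value);  // new edge labelled with d
                return;
            }
            node = child;
        }
    }

    // Returns every stored value within maxDistance of the query.
    public List<string> Find(string query, int maxDistance)
    {
        var result = new List<string>();
        if (_root == null) return result;
        var stack = new Stack<Node>();
        stack.Push(_root);
        while (stack.Count > 0)
        {
            var node = stack.Pop();
            int d = _distance(query, node.Value);
            if (d <= maxDistance) result.Add(node.Value);
            // Triangle inequality: only edges in [d - max, d + max] can lead to matches.
            foreach (var edge in node.Children)
                if (edge.Key >= d - maxDistance && edge.Key <= d + maxDistance)
                    stack.Push(edge.Value);
        }
        return result;
    }
}
```

Each lookup visits only the subtrees whose edge labels fall inside the allowed range, so it typically makes far fewer distance calls than comparing the query against every item.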


+1
