How to find anomalous identifiers among a large number of identifiers

We are launching an affiliate program. Users who register can earn points when they successfully recruit other users. However, spammers abuse this program by automatically registering large numbers of accounts. We want to prevent this by explicitly vetting accounts. My idea is to write a program that identifies machine-generated names, or at least selects a subset for manual inspection.

So far, we have discovered that there are two types of anomalous identifiers:

  • The first is that some identifiers are very similar to others, for example:

    • wss12345
    • wss12346
    • wss12347
    • test1
    • test2
    • ...
  • Second, some identifiers look randomly generated, with no apparent pattern, for example:

    • MiDjiSxxiDekiE
    • NiMjKhJixLy
    • DAFDAB7643
    • ...

For the first type, I use the Levenshtein (edit) distance. This finds identifiers like those shown under type 1. (I have tried this and it performs well.)
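As a rough illustration of the edit-distance idea, here is a minimal sketch. The helper name `near_duplicates` and the threshold `max_distance=1` are assumptions made for the example, not part of my actual setup, and the pairwise loop is quadratic, so a real data set would need some form of bucketing first.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def near_duplicates(ids, max_distance=1):
    """Hypothetical helper: all pairs of IDs within max_distance edits.

    Compares every pair, so it is O(n^2) in the number of IDs.
    """
    pairs = []
    for i, x in enumerate(ids):
        for y in ids[i + 1:]:
            if levenshtein(x, y) <= max_distance:
                pairs.append((x, y))
    return pairs
```

Calling `near_duplicates(["wss12345", "wss12346", "wss12347", "MiDjiSxxiDekiE"])` would pair up the three `wss…` identifiers and leave the last one alone.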

For the second type, I could compute a probability for each identifier from character-to-character transition probabilities, like:

id = "DAFDAB7643"
p(id) = p(D)*p(A|D)*p(F|A)*p(D|F)*...*p(3|4)

I could then use this probability to filter out anomalous identifiers: the lower the probability, the more likely the name is random. (This is just an idea; I have not tried it.)
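To make the idea concrete, here is a minimal sketch of such a character bigram model. Several things in it are assumptions rather than anything I have tested: the training list of known-good names, the `^` start marker (so that p(D) becomes p(D|^)), add-one smoothing for unseen transitions, and averaging the log-probability over the name length so that long names are not penalized just for being long.

```python
import math
from collections import Counter


def train_bigram_model(names):
    """Count character transitions in a list of names assumed to be legitimate."""
    pair_counts = Counter()     # (previous char, current char) -> count
    context_counts = Counter()  # previous char -> count
    alphabet = set()
    for name in names:
        chars = ["^"] + list(name)  # '^' is an assumed start-of-name marker
        alphabet.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    return pair_counts, context_counts, len(alphabet)


def avg_log_prob(name, model):
    """Average log p(cur | prev) over the name; lower means more unusual."""
    pair_counts, context_counts, vocab_size = model
    chars = ["^"] + list(name)
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        # Add-one smoothing; the extra +1 reserves mass for unseen characters.
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size + 1)
        total += math.log(p)
    return total / max(len(chars) - 1, 1)
```

The model would be trained on names believed to be human-chosen, and identifiers whose score falls below some cutoff would go to manual inspection. The cutoff itself would have to be tuned on real data.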

Can anyone offer other suggestions on this topic? How else could I approach this problem? Do you see any flaws or omissions in my approach?

