We are launching an affiliate program. Users who register can earn points when they successfully recruit other users. However, spammers abuse this program by automatically signing up large numbers of accounts. We want to prevent this by explicitly classifying accounts. My idea is to write a program that identifies machine-generated names, or at least selects a subset for manual inspection.
So far, we have discovered that there are two types of anomalous identifiers:
First, some identifiers are very similar to others, for example:
- wss12345
- wss12346
- wss12347
- test1
- test2
- ...
Secondly, some IDs look randomly generated, with no apparent pattern, for example:
- MiDjiSxxiDekiE
- NiMjKhJixLy
- DAFDAB7643
- ...
For the first type, I use the Levenshtein (edit) distance. This finds identifiers like those shown under type 1. (I have tried this and it performs well.)
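The type 1 check can be sketched roughly as follows. This is only a minimal illustration, assuming a plain pairwise comparison with a fixed threshold; the names `levenshtein`, `near_duplicates`, and `max_distance` are made up for this example and are not from any particular library.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def near_duplicates(ids, max_distance=1):
    """Return all pairs of identifiers within max_distance edits of each other."""
    return [
        (x, y)
        for x, y in combinations(ids, 2)
        if levenshtein(x, y) <= max_distance
    ]


# Identifiers taken from the examples above.
ids = ["wss12345", "wss12346", "wss12347", "test1", "test2", "MiDjiSxxiDekiE"]
pairs = near_duplicates(ids)
for x, y in pairs:
    print(x, y)
```

Comparing every pair is quadratic in the number of accounts, so on a large user base this would likely need some form of blocking first, for example only comparing identifiers that share a prefix.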
For the second type, I can compute a probability for each identifier, for example:
id = "DAFDAB7643"
p(id) = p(D)*p(A|D)*p(F|A)*p(D|F)*...*p(3|4)
I can then use this probability to filter out anomalous identifiers. (This is just an idea; I have not tried it.)
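The probability idea can be sketched as a character bigram model. This is a rough illustration under several assumptions: the model is trained on a list of names believed to be genuine (the tiny `trusted` list here is a made-up placeholder), add-one smoothing handles unseen transitions, and a start marker `^` stands in for the unconditioned first term p(D). Log-probabilities are summed instead of multiplying raw probabilities to avoid underflow, and the score is divided by length so that long identifiers are not penalized merely for being long.

```python
import math
from collections import Counter

START = "^"  # assumed start-of-string marker, not expected inside identifiers


def train_bigram(names):
    """Count character bigrams and their left contexts over trusted names."""
    bigrams, contexts, alphabet = Counter(), Counter(), set()
    for name in names:
        chars = START + name
        alphabet.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, alphabet


def avg_log_prob(identifier, model):
    """Average log p(cur | prev) per character, with add-one smoothing."""
    bigrams, contexts, alphabet = model
    vocab = len(alphabet) + 1  # one extra slot for characters never seen
    chars = START + identifier
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab)
        total += math.log(p)
    return total / max(len(identifier), 1)


# Placeholder training data; a real model would need many genuine usernames.
trusted = ["john", "smith", "johnson", "anna", "maria", "martin", "thomas"]
model = train_bigram(trusted)

plausible = avg_log_prob("marian", model)
random_looking = avg_log_prob("xqzkvw", model)
print(plausible > random_looking)
```

Identifiers whose score falls below some threshold could then be queued for manual inspection. Where to set that threshold is not something this sketch answers; it would have to be tuned on labeled accounts.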
Can someone give me other suggestions on this topic? How else can I approach this problem? Do you see any flaws or omissions in my attempts?