How to find an abnormal identifier from a large number of identifiers

Question

We are launching an affiliate program. Users who register can earn points when they successfully recruit other users. However, spammers abuse this program by automatically registering large numbers of accounts. We want to prevent this by reviewing accounts manually. My idea is to write a program that identifies machine-generated names, or at least picks out a subset for manual inspection.

So far, we have discovered that there are two types of anomalous identifiers:

The first is that some identifiers are very similar to others, for example:
- wss12345
- wss12346
- wss12347
- test1
- test2
- ...
Second, some identifiers look randomly generated, with no apparent pattern, for example:
- MiDjiSxxiDekiE
- NiMjKhJixLy
- DAFDAB7643
- ...

For the first type, I use the Levenshtein (edit) distance. This finds identifiers like those shown under type 1. (I have tried this and it performs well.)
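As an illustration of the edit-distance approach, here is a minimal sketch in Python. The function names `levenshtein` and `near_duplicates`, the threshold `max_distance=1`, and the extra identifiers `alice` and `bob_smith` are assumptions made for the example, not part of the original setup.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def near_duplicates(ids, max_distance=1):
    """Return every id that lies within max_distance of some other id."""
    flagged = set()
    for a, b in combinations(ids, 2):
        # A length gap larger than the threshold already rules the pair out.
        if abs(len(a) - len(b)) > max_distance:
            continue
        if levenshtein(a, b) <= max_distance:
            flagged.update((a, b))
    return flagged


# Hypothetical sample: the ids from the question plus two ordinary names.
ids = ["wss12345", "wss12346", "wss12347", "test1", "test2", "alice", "bob_smith"]
suspicious = near_duplicates(ids)
```

Comparing every pair is quadratic in the number of identifiers, so for a large user base the comparison would probably need to be restricted first, for example to accounts registered within the same time window.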

For the second type, I can calculate a probability for each identifier from its character transitions, like:

id = "DAFDAB7643"
p(id) = p(D)*p(A|D)*p(F|A)*p(D|F)*...*p(3|4)

I can then use this probability to filter out anomalous identifiers. (Just an idea, I have not tried it.)
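The probability idea could be sketched as a character bigram model, shown here in Python. Several things in it are assumptions rather than part of the question: the tiny training list of "normal" names, the `^` start marker (which stands in for p(D), the first-character probability), lowercasing, add-one smoothing for unseen transitions, and averaging the log-probability per character so that long identifiers are not penalized just for their length.

```python
import math
from collections import Counter

START = "^"  # assumed start-of-string marker, not a valid id character


def train_bigram_model(names):
    """Count character transitions in a list of known-good identifiers."""
    pair_counts = Counter()
    context_counts = Counter()
    alphabet = {START}
    for name in names:
        chars = START + name.lower()
        alphabet.update(chars)
        for prev, cur in zip(chars, chars[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1
    # One extra slot is reserved for characters never seen in training.
    return pair_counts, context_counts, len(alphabet) + 1


def avg_log_prob(identifier, model):
    """Average log p(cur | prev) over the identifier, with add-one smoothing."""
    pair_counts, context_counts, vocab_size = model
    chars = START + identifier.lower()
    total = 0.0
    for prev, cur in zip(chars, chars[1:]):
        p = (pair_counts[(prev, cur)] + 1) / (context_counts[prev] + vocab_size)
        total += math.log(p)
    return total / max(len(chars) - 1, 1)


# Hypothetical training data; in practice this would be a large set of
# identifiers already believed to be legitimate.
normal_names = ["john", "johnson", "smith", "anna", "maria",
                "martin", "thomas", "nathan", "jonathan", "samantha"]
model = train_bigram_model(normal_names)

score_plausible = avg_log_prob("martha", model)
score_random = avg_log_prob("DAFDAB7643", model)
```

Identifiers with a low average log-probability would be the candidates for manual inspection. Where to put the cutoff is not something the sketch decides; it would have to be tuned on real data.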

Can someone give me other suggestions on this topic? How else can I approach this problem? Can you see the flaws or omissions in my attempts?

+5

machine-learning spam-prevention spam

Tim Aug 29 '12 at 6:08

2 answers

Dave · Answer 1 · 2012-08-29T20:43:09+0000

, , / , .
IP- , .
- , , , , .
1. - , , , . , , , @larsmans .

, ( 3).

tripleee · Answer 2 · 2012-08-30T03:42:26+0000

, , ; Qaru .

, , . ; , , , .
