What is the fastest way to check for duplicates in python?

Using a dictionary seems perfect.

eg:.

history = {}
for i in collection:
    if i not in history:
        history[i] = None
        # fancy computation here

Will using type set () be as fast; set () does not require me to add stupid None values ​​to hash keys.

+3
source share
3 answers

Yes, you must use the kit.


Will using type set () be as fast

No, it will not be so fast. It will be faster.


Update

Some people have published tests showing that typing is slower than dict. I think this is a bit surprising, as they basically have the same basic implementation, except that the set is simpler. I think I found the reason for the slowness:

def set_way():
    my_set = set()
    my_set_add = my_set.add   # remember the method
    for ele in x:
        if ele not in my_set:
            my_set_add(ele)   # call the method directly

Results:

dict time : 1.896939858077399
set time : 1.8587076107880456

, .

+6

, , .

import timeit
import random as rn

x  = [rn.choice(xrange(10000)) for i in xrange(1000)]

def set_way():
    my_set = set()
    for ele in x:
        if ele in my_set:
            return True
        else:
            my_set.add(ele)
    else:
        return False

def dict_way():
    dicto = {}
    for ele in x:
        if ele in dicto:
            return True
        else:
            dicto[ele] = None
    else:
        return False



num = 10000

set_time = timeit.timeit(set_way, number = num)
print 'set time :', set_time
dict_time = timeit.timeit(dict_way, number = num)
print 'dict time :', dict_time

:

set time : 0.619757678699
dict time : 0.466664548148
+3

, :

import timeit

setup = """
x = range(10000)
s = set(range(5000))
d = dict.fromkeys(range(5000))
"""

print '# set', timeit.timeit('for i in x: z = i in s', setup, number=1000)
print '# dic', timeit.timeit('for i in x: z = i in d', setup, number=1000)

# set 1.18897795677
# dic 1.1489379406

However, if performance is not absolutely critical, you should use kits for readability.

Of course, as your question suggests, we are talking about hashing types. Unpacked types, such as containers, will require other methods.

For completeness, here are the standards of various modification methods:

import timeit

setup = """
x = range(10000)
s = set(range(5000))
d = dict.fromkeys(range(5000))

add_method = s.add
"""

print '# set-add     ', timeit.timeit('for i in x: s.add(i)', setup, number=1000)
print '# set-closure ', timeit.timeit('for i in x: add_method(i)', setup, number=1000)
print '# dict []     ', timeit.timeit('for i in x: d[i]=None', setup, number=1000)
print '# d.setdefault', timeit.timeit('for i in x: d.setdefault(i)', setup, number=1000)

# set-add      1.96829080582
# set-closure  1.2261030674
# dict []      0.982795000076
# d.setdefault 2.27355480194

dict[i] is the fastest, but this time it is not surprising because no function call is involved.

+1
source

All Articles