It seems to me that this should be easy, but after many searches and attempts, I can not understand the answer. Basically, I have a very large number of elements that I want to try in random order without replacement. In this case, they are cells in a 2D array. The solution that I would use for a smaller array is not translated, because it requires shuffling the array in memory. If the number I had to try was small, I could just randomly sort the elements and save the list of values I tried. Unfortunately, I often have to try a very large fraction of all cells, like everyone else.
What I would like to create is an iterator using some combination of itertools, numpy and / or random, which gives the next random cell (indices x and y). Another possible solution would be to create an iterator that would give the next random number (without replacement) between 0 and (x_count * y_count), which I could map back to the cell. None of them seem easily accomplished.
Thanks for any suggestions!
Here is my current solution.
import numpy as np
import itertools as itr
import random as rdm
x_count = 10
y_count = 5
x_indices = np.arange(x_count)
y_indices = np.arange(y_count)
cell_indices = itr.product(x_indices, y_indices)
list_cell_indices = list(cell_indices)
rdm.shuffle(list_cell_indices)
for i in range(25):
print list_cell_indices[i]
So, based on the current answers and my attempt to translate perl, which I don't know anything about, I understand that the best I can do is the following:
import numpy as np
import itertools as itr
import random as rdm
x_count = 10000
y_count = 5000
sample_count = 10000
keep_probability = 0.01
tried_cells = set()
kept_cells = set()
while len(kept_cells) < sample_count:
x = rdm.randint(0, x_count)
y = rdm.randint(0, y_count)
if (x, y) in tried_cells:
pass
else:
tried_cells.add((x, y))
keep = rdm.random() < keep_probability
if keep:
kept_cells.add((x,y))
print "worked"
In most cases, processing time and memory usage are not so bad. Perhaps I could check the middle cell keep_probability and sample_count and throw an error for complex cases.