Efficient random number iterator with memory without replacement

It seems to me that this should be easy, but after many searches and attempts, I can not understand the answer. Basically, I have a very large number of elements that I want to try in random order without replacement. In this case, they are cells in a 2D array. The solution that I would use for a smaller array is not translated, because it requires shuffling the array in memory. If the number I had to try was small, I could just randomly sort the elements and save the list of values ​​I tried. Unfortunately, I often have to try a very large fraction of all cells, like everyone else.

What I would like to create is an iterator using some combination of itertools, numpy and / or random, which gives the next random cell (indices x and y). Another possible solution would be to create an iterator that would give the next random number (without replacement) between 0 and (x_count * y_count), which I could map back to the cell. None of them seem easily accomplished.

Thanks for any suggestions!

Here is my current solution.

import numpy as np
import itertools as itr
import random as rdm

#works great
x_count = 10
y_count = 5

#good luck!
#x_count = 10000
#y_count = 20000

x_indices = np.arange(x_count)
y_indices = np.arange(y_count)

cell_indices = itr.product(x_indices, y_indices)
list_cell_indices = list(cell_indices)
rdm.shuffle(list_cell_indices)

for i in range(25):
    print list_cell_indices[i]

So, based on the current answers and my attempt to translate perl, which I don't know anything about, I understand that the best I can do is the following:

import numpy as np
import itertools as itr
import random as rdm

x_count = 10000
y_count = 5000

sample_count = 10000
keep_probability = 0.01


tried_cells = set()
kept_cells = set()

while len(kept_cells) < sample_count:
    x = rdm.randint(0, x_count)
    y = rdm.randint(0, y_count)

    if (x, y) in tried_cells:
        pass
    else:
        tried_cells.add((x, y))
        keep = rdm.random() < keep_probability
        if keep:
            kept_cells.add((x,y))


print "worked"

In most cases, processing time and memory usage are not so bad. Perhaps I could check the middle cell keep_probability and sample_count and throw an error for complex cases.

+3
5

. x * y 2-D. , , , 0 (x * y).

import numpy

x_count = 10000
y_count = 20000

x_indices = numpy.arange(x_count)
y_indices = numpy.arange(y_count)

large_table = numpy.arange(y_count * x_count).reshape(y_count, x_count)
print large_table

def get_random_item(sample_size):
    from random import sample
    for i in sample(xrange(y_count * x_count), sample_size):
        y,x = divmod(i, y_count)
        yield (x,y)

for x,y in get_random_item(10):
    print '%12i   x: %5i y: %5i' % (large_table[x][y],  x,y)

:

( 2- , )

[[        0         1         2 ...,      9997      9998      9999]
 [    10000     10001     10002 ...,     19997     19998     19999]
 [    20000     20001     20002 ...,     29997     29998     29999]
 ..., 
 [199970000 199970001 199970002 ..., 199979997 199979998 199979999]
 [199980000 199980001 199980002 ..., 199989997 199989998 199989999]
 [199990000 199990001 199990002 ..., 199999997 199999998 199999999]]

2- , [x] [y]

   154080675   x: 15408 y:   675
   186978188   x: 18697 y:  8188
   157506087   x: 15750 y:  6087
   168859259   x: 16885 y:  9259
    29775768   x:  2977 y:  5768
    94167866   x:  9416 y:  7866
    15978144   x:  1597 y:  8144
    91964007   x:  9196 y:  4007
   163462830   x: 16346 y:  2830
    62613129   x:  6261 y:  3129

sample() , " ", " : (xrange (10000000), 60)". python random.

, , get_random_item() , () , - y * x + sample_size, .

+2

, , R * C. , , . random.sample ; , , 2- numpy . ( , ints , a la hexparrot - .)

>>> a = numpy.arange(25).reshape((5, 5))
>>> random.sample(a.ravel(), 5)
[0, 13, 8, 18, 4]

random.sample, , , perl- - , . , , , , , - Fisher-Yates shuffle, , sample_size (.. , , ).

, , , random.sample, , - c.

- , : numpy.random.choice. , , c. ? Numpy 1.7!

+3

O (N = R * C), O (N) . , . , , , , .

( : , "", . , , . , , " " ( , .))

+1

, , , . , , , .

Instead, you might be better off applying your test to the entire grid and picking a random element from the ones that pass it.

This function returns a random element that passes the test (or None if all of them fail). It uses very little memory.

def findRandomElementThatPasses(iterable, testFunc):
    value = None
    passed = 0

    for element in iterable:
        if testFunc(element):
            passed += 1
            if random.random() > 1.0/passed:
                value = element

    return value

You would call it something like:

e = findRandomElementThatPasses((x,y) for x in xrange(X_SIZE)
                                      for y in xrange(Y_SIZE),
                                someFunctionTakingAnXYTuple)

If you are using Python 3, use a range instead of xrange.

+1
source

in perl you can do something like this:

# how many you got and need
$xmax = 10000000;
$ymax = 10000000;
$sampleSize = 10000;

# build hash, dictionary in python
$cells = {};
$count = 0;
while($count < $sampleSize) {
  $x = rand($xmax);
  $y = rand($ymax);
  if(! $cells->{$x}->{$y}) {
    $cells->{$x}->{$y}++;
    $count++;
  }
}
# now grab the stuff
foreach ($x keys %{$cells}) {
  foreach ($y keys %{$cells->{$x}}) {
    getSample($x, $y);
  }
}

No duplicates, rather random, not too hard in memory.

0
source

All Articles