Python: select n items that are better distributed from multiple points

Question

I have a numpy array of points in the XY plane, for example:

[image: distribution]

I want to select the n points (say 100) that are most evenly distributed among all of these points. That is, I want the density of points to be constant everywhere.

Something like this:

[image]

Is there any pythonic way or any numpy / scipy function for this?

python numpy scipy

Josep Bosch Jan 28 '14 at 15:24

2 answers

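@EMS is quite right that you should think carefully about exactly what you want.

There are more sophisticated ways to do this (EMS's suggestions are very good!), but a brute-force approach is to bin the points onto a regular, rectangular grid and draw one random point from each bin.

The main downside is that you won't get the number of points you ask for. Instead, you'll get some smaller number.

A bit of creative indexing with pandas makes this "gridding" approach quite easy, though you can certainly do it with "pure" numpy as well.

As an example of the simplest possible, brute-force grid approach (there is a lot that could be done better here):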

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

total_num = 100000
x, y = np.random.normal(0, 1, (2, total_num))

# We'll always get fewer than this number for two reasons.
# 1) We're choosing a square grid, and "subset_num" may not be a perfect square
# 2) There won't be data in every cell of the grid
subset_num = 1000

# Bin points onto a rectangular grid with approximately "subset_num" cells
nbins = int(np.sqrt(subset_num))
xbins = np.linspace(x.min(), x.max(), nbins+1)
ybins = np.linspace(y.min(), y.max(), nbins+1)

# Make a dataframe indexed by the grid coordinates.
i, j = np.digitize(y, ybins), np.digitize(x, xbins)
df = pd.DataFrame(dict(x=x, y=y), index=[i, j])

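# Group by which cell the points fall into and choose a random point from each.
# GroupBy.sample (pandas >= 1.1) draws whole rows, so each selected x stays
# paired with its own y. Aggregating each column with its own random
# permutation could combine the x of one point with the y of another.
groups = df.groupby(level=[0, 1])
new = groups.sample(n=1)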

# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].plot(x, y, 'k.')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(new.x, new.y, 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(new)))
# 'box-forced' was removed from matplotlib; 'box' now works with shared axes.
plt.setp(axes, aspect=1, adjustable='box')
fig.tight_layout()
plt.show()
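
As a rough illustration of the "pure" numpy route mentioned above, the same gridding can be sketched without pandas. This is only one possible way to do it, not the answer's own code: the helper name `grid_subsample` and the seeded generator are assumptions made for the example. It shuffles the point indices and keeps the first point seen in each grid cell.

```python
import numpy as np

def grid_subsample(x, y, subset_num, rng=None):
    """Return indices of one randomly chosen point per occupied grid cell."""
    rng = np.random.default_rng() if rng is None else rng
    nbins = int(np.sqrt(subset_num))
    xbins = np.linspace(x.min(), x.max(), nbins + 1)
    ybins = np.linspace(y.min(), y.max(), nbins + 1)

    # Cell coordinates in 0..nbins-1. Clipping puts the maximum value into
    # the last cell instead of opening an extra one.
    i = np.clip(np.digitize(y, ybins) - 1, 0, nbins - 1)
    j = np.clip(np.digitize(x, xbins) - 1, 0, nbins - 1)
    cell = i * nbins + j

    # Shuffle the points, then keep the first occurrence of every cell id.
    # Because of the shuffle, that first occurrence is a random member.
    order = rng.permutation(len(x))
    _, first = np.unique(cell[order], return_index=True)
    return order[first]

rng = np.random.default_rng(0)
x, y = rng.normal(0, 1, (2, 100000))
idx = grid_subsample(x, y, 1000, rng)
subset_x, subset_y = x[idx], y[idx]
```

As with the pandas version, the result contains fewer points than requested, since empty cells contribute nothing.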

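Loosely based on @EMS's suggestion in a comment, here is another approach.

We can calculate the density of points using a kernel density estimate, and then use the inverse of that density as the probability that a given point is chosen.

scipy.stats.gaussian_kde is not optimized for this use case (or for large numbers of points in general). It is the bottleneck here. A faster version could be written for this specific case in several ways (approximations, exploiting the special case of pairwise distances, etc.). That is beyond the scope of this question. Just be aware that, for this example with 1e5 points, it will take a while to run.

The advantage of this method is that you get exactly the number of points you asked for. The disadvantage is that you are likely to get local clusters of selected points.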

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde

total_num = 100000
subset_num = 1000
x, y = np.random.normal(0, 1, (2, total_num))

# Let's approximate the PDF of the point distribution with a kernel density
# estimate. scipy.stats.gaussian_kde is slow for large numbers of points, so
# you might want to use another implementation in some cases.
xy = np.vstack([x, y])
dens = gaussian_kde(xy)(xy)

# Try playing around with this weight. Compare 1/dens,  1-dens, and (1-dens)**2
weight = 1 / dens
weight /= weight.sum()

# Draw a sample of point indices with np.random.choice, using the weights
# as selection probabilities. Sampling indices avoids having to pack the
# coordinates into a 1D structured array. Note that np.random.choice samples
# with replacement by default, so a point may be selected more than once.
idx = np.random.choice(total_num, subset_num, p=weight)
subset = dict(x=x[idx], y=y[idx])

# Plot the results
fig, axes = plt.subplots(ncols=2, sharex=True, sharey=True)
axes[0].scatter(x, y, c=dens, edgecolor='none')
axes[0].set_title('Original $(n={})$'.format(total_num))
axes[1].plot(subset['x'], subset['y'], 'k.')
axes[1].set_title('Subset $(n={})$'.format(len(idx)))
# 'box-forced' was removed from matplotlib; 'box' now works with shared axes.
plt.setp(axes, aspect=1, adjustable='box')
fig.tight_layout()
plt.show()
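
If the kernel density estimate is too slow, one of the approximations alluded to above could be a plain 2D histogram used as a crude density estimate. The sketch below is an assumption-laden stand-in, not the answer's method: the bin count `nbins = 50` is an arbitrary choice, and `replace=False` is added so that no point is drawn twice.

```python
import numpy as np

rng = np.random.default_rng(0)
total_num, subset_num = 100000, 1000
x, y = rng.normal(0, 1, (2, total_num))

# Crude density estimate: number of points sharing each point's grid cell.
nbins = 50  # arbitrary choice for this sketch
counts, xedges, yedges = np.histogram2d(x, y, bins=nbins)

# Find the cell of every point (the last bin is closed on the right).
ix = np.clip(np.searchsorted(xedges, x, side='right') - 1, 0, nbins - 1)
iy = np.clip(np.searchsorted(yedges, y, side='right') - 1, 0, nbins - 1)
dens = np.maximum(counts[ix, iy], 1)  # guard against division by zero

# Inverse-density weights, normalized to sum to one.
weight = 1.0 / dens
weight /= weight.sum()

# Draw distinct point indices with those probabilities.
idx = rng.choice(total_num, subset_num, replace=False, p=weight)
subset_x, subset_y = x[idx], y[idx]
```

The histogram is much coarser than a KDE, so the result depends on the bin count in the same way the gridding approach does.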

Joe Kington Jan 28 '14 at 18:40

ely · Accepted Answer · 2014-01-28T18:19:27+0000

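Unless you give a specific criterion for what "better distributed" means, we can't give a definite answer.

The phrase "constant density of points anywhere" is also misleading, because you have to specify the empirical method for calculating density. Are you approximating it on a grid? If so, the grid size will matter, and points near the boundary won't be represented correctly.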

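You might instead try something like the following:

1. Compute the Euclidean distance matrix for all pairs of points.
2. Treat the distance matrix as a weighted network, and compute some measure of centrality for each point in the data, such as eigenvalue centrality, betweenness centrality, or Bonacich centrality.
3. Order the points in descending order of the centrality measure and keep the first 100.
4. Repeat steps 1-3, possibly using a different notion of "distance" between points and different centrality measures.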
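
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions rather than a recommended recipe: it uses eigenvector centrality computed directly with numpy (the principal eigenvector of the distance matrix) instead of a NetworkX call, and it uses only 500 sample points because the full distance matrix grows quadratically with the number of points.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
pts = rng.normal(0, 1, (500, 2))  # small sample: the matrix below is N x N
n_keep = 100

# Step 1: Euclidean distance matrix for all pairs of points.
D = squareform(pdist(pts))

# Step 2: eigenvector centrality of the weighted network whose edge weights
# are the distances. For a symmetric non-negative matrix this is the
# eigenvector of the largest eigenvalue; eigh sorts eigenvalues ascending.
eigvals, eigvecs = np.linalg.eigh(D)
centrality = np.abs(eigvecs[:, -1])

# Step 3: sort by descending centrality and keep the first n_keep points.
order = np.argsort(centrality)[::-1]
keep = order[:n_keep]
selected = pts[keep]
```

With raw distances as edge weights, a high score tends to go to points that are far from the rest, so a different weighting (for example an inverse or kernel of the distance) changes which points are kept.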

Many of these functions are provided directly by SciPy, NetworkX, and scikits.learn, and will work directly on a NumPy array.

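If you are definitely committed to thinking of the problem in terms of regular spacing and grid density, you might look at quasi-Monte-Carlo methods. In particular, you could compute the convex hull of the set of points, and then apply a QMC technique to sample regularly from anywhere within that convex hull. But again, this privileges the exterior of the region, which should be sampled far less than the interior.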
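
One way to try the convex-hull idea is sketched below, assuming SciPy 1.7 or later for `scipy.stats.qmc`. The choice of a Halton sequence, the oversampling factor of 4, and the final step of snapping each quasi-random location to its nearest data point are all assumptions made for this example.

```python
import numpy as np
from scipy.spatial import Delaunay, cKDTree
from scipy.stats import qmc

rng = np.random.default_rng(0)
pts = rng.normal(0, 1, (5000, 2))
n_target = 100

# Low-discrepancy (Halton) candidates spread over the bounding box.
sampler = qmc.Halton(d=2, seed=0)
cand = qmc.scale(sampler.random(4 * n_target),
                 pts.min(axis=0), pts.max(axis=0))

# Keep the candidates inside the convex hull of the data. A location is
# inside the hull exactly when it falls in some Delaunay simplex.
inside = Delaunay(pts).find_simplex(cand) >= 0
cand = cand[inside][:n_target]

# Snap every candidate location to its nearest actual data point.
_, nearest = cKDTree(pts).query(cand)
keep = np.unique(nearest)  # two candidates may share a nearest point
selected = pts[keep]
```

In the sparse outer part of the hull, several candidates can snap to the same point or to a distant one, which is the edge effect the paragraph above warns about.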

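Yet another approach would be simply to run K-means on the data, with a fixed K = 100. That yields 100 cluster centers (centroids). These will generally not be points from the data set, so you could then take the data point nearest to each center as one of the 100 selected points.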
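
A minimal sketch of the K-means idea, assuming SciPy 1.7 or later (for the `seed` argument of `kmeans2`). The k-means++ initialization and the nearest-point lookup are choices made for this example, not something the answer prescribes.

```python
import numpy as np
from scipy.cluster.vq import kmeans2
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
pts = rng.normal(0, 1, (5000, 2))
k = 100

# Run K-means with a fixed K; centroids are means, not actual data points.
centroids, labels = kmeans2(pts, k, minit='++', seed=0)

# Replace each centroid by the data point closest to it.
_, nearest = cKDTree(pts).query(centroids)
keep = np.unique(nearest)  # guards against two centroids sharing a point
selected = pts[keep]
```

K-means places more centers where the data are dense, so the selected points follow the original density more closely than the grid or inverse-density approaches do.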
