For the project, I need to copy some results that currently exist in the Stata output files (.dta) and were computed from an older Stata script. The new version of the project should be written in Python.
The specific part I come across is comparing the quantile breakpoint calculations based on the weighted version of the Stata command . Note that the relationships between the data points will not matter for the weights, and the weights I use are derived from a continuous quantity, so the relationships are extremely unlikely (and there are no relationships in my test data set). So incorrect classification due to connections is not the case. xtile
I read an article in Wikipedia on the weighted percentiles, as well as the publication of a cross-check, describes an alternative algorithm that is to play the quantile of the R-type 7.
I implemented both weighted algorithms (the code below), but I'm still not very good at comparing the calculated quantiles in the Stata output.
Does anyone know the specific algorithm used by the Stata routine? The documents did not describe this clearly. It says something about taking the average value on the flat sections of the CDF to invert it, but it hardly describes the real algorithm and ambiguously says whether it performs any other interpolation.
Please note that numpy.percentilethey scipy.stats.mstats.mquantilesdo not accept weighting coefficients and can not perform weighted quantiles, only ordinary peers. The essence of my problem is the need to use weights.
Note: I debugged both methods below, but feel free to suggest an error in the comment if you see it. I tested both methods on smaller datasets and the results are good and also match the R output for cases where I can guarantee which method R uses. The code is not so elegant yet and it copies too much between the two types, but all this will be fixed later when I believe that the output is what I need.
, , xtile Stata xtile, Stata xtile .
, :
import numpy as np
def mark_weighted_percentiles(a, labels, weights, type):
if type == 1:
N = len(a)
sort_indx = np.argsort(a)
tmp_a = a[sort_indx].copy()
tmp_weights = weights[sort_indx].copy()
num_categories = len(labels)
breaks = np.linspace(0, 1, num_categories+1)
cu_weights = np.cumsum(tmp_weights)
p_vals = (1.0/cu_weights[-1])*(cu_weights - 0.5*tmp_weights)
ret = np.repeat(0, len(a))
if(len(a)<num_categories):
return ret
quantiles = []
for brk in breaks:
if brk <= p_vals[0]:
i_low = 0; i_high = 0;
elif brk >= p_vals[-1]:
i_low = N-1; i_high = N-1;
else:
for ii in range(N-1):
if (p_vals[ii] <= brk) and (brk < p_vals[ii+1]):
i_low = ii
i_high = ii + 1
if i_low == i_high:
v = tmp_a[i_low]
else:
v = tmp_a[i_low] + ((brk-p_vals[i_low])/(p_vals[i_high]-p_vals[i_low]))*(tmp_a[i_high]-tmp_a[i_low])
quantiles.append(v)
for i in range(0, len(quantiles)-1):
lower = quantiles[i]
upper = quantiles[i+1]
ret[ np.logical_and(a>=lower, a<upper) ] = labels[i]
ret[a<=quantiles[0]] = labels[0]
ret[a>=quantiles[-1]] = labels[-1]
return ret
elif type == 2:
N = len(a)
sort_indx = np.argsort(a)
tmp_a = a[sort_indx].copy()
tmp_weights = weights[sort_indx].copy()
num_categories = len(labels)
breaks = np.linspace(0, 1, num_categories+1)
cu_weights = np.cumsum(tmp_weights)
s_vals = [0.0];
for ii in range(1,N):
s_vals.append( ii*tmp_weights[ii] + (N-1)*cu_weights[ii-1])
s_vals = np.asarray(s_vals)
norm_s_vals = (1.0/s_vals[-1])*s_vals
ret = np.repeat(0, N)
if(N < num_categories):
return ret
quantiles = []
for brk in breaks:
if brk <= norm_s_vals[0]:
i_low = 0; i_high = 0;
elif brk >= norm_s_vals[-1]:
i_low = N-1; i_high = N-1;
else:
for ii in range(N-1):
if (norm_s_vals[ii] <= brk) and (brk < norm_s_vals[ii+1]):
i_low = ii
i_high = ii + 1
if i_low == i_high:
v = tmp_a[i_low]
else:
v = tmp_a[i_low] + (( (brk*s_vals[-1])-s_vals[i_low])/(s_vals[i_high]-s_vals[i_low]))*(tmp_a[i_high]-tmp_a[i_low])
quantiles.append(v)
for i in range(0, len(quantiles)-1):
lower = quantiles[i]
upper = quantiles[i+1]
ret[ np.logical_and( a >= lower, a < upper ) ] = labels[i]
ret[a<=quantiles[0]] = labels[0]
ret[a>=quantiles[-1]] = labels[-1]
return ret