Vectorized assignment in Numpy





Let's assume I have a large 2D NumPy array, e.g. 1000x1000 elements. I also have two 1D integer arrays of length L and a 1D float array of the same length. If I simply want to assign the floats to the positions in the original array given by the two integer arrays, I could write:


import numpy as np

mat = np.zeros((1000,1000))
int1 = np.random.randint(0,999,size=(50000,))
int2 = np.random.randint(0,999,size=(50000,))
f = np.random.rand(50000)
mat[int1,int2] = f



But if there were collisions, i.e. multiple floats corresponding to a single location, all but the last would be overwritten. Is there a way to aggregate all the collisions, e.g. take the mean or median of all the floats falling at the same location? I would like to take advantage of vectorization and, ideally, avoid interpreter loops.



Thanks!





Consider the ufunc .at method, e.g. np.add.at indexing with array
– hpaulj
2 days ago
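To illustrate the comment above, here is a minimal sketch of how np.add.at differs from ordinary fancy-index addition when an index repeats. The tiny arrays (rows, cols, vals) are made up for the illustration and are not from the question.

```python
import numpy as np

mat = np.zeros((2, 2))
rows = np.array([0, 0, 1])   # (0, 0) appears twice: a collision
cols = np.array([0, 0, 1])
vals = np.array([1.0, 2.0, 5.0])

# Buffered fancy-index addition: only one of the duplicates survives at (0, 0).
buffered = mat.copy()
buffered[rows, cols] += vals

# Unbuffered ufunc.at: every occurrence of (0, 0) is accumulated.
unbuffered = mat.copy()
np.add.at(unbuffered, (rows, cols), vals)

print(buffered[0, 0], unbuffered[0, 0])
```

Note that the index is passed as a tuple of arrays, which is the form NumPy expects for multi-dimensional fancy indexing.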







If you want the mean and there is no maximum number of times the entry could be updated, you'll need a 3D array to store all the values and then take the mean at the end.
– Kyle Roth
2 days ago




3 Answers



Building on hpaulj's suggestion, here's how to get the mean value in case of collisions:


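import numpy as np

mat = np.zeros((2,2))
int1 = np.zeros(2, dtype=int)
int2 = np.zeros(2, dtype=int)
f = np.array([0,1])

# Index with a tuple of arrays; np.add.at accumulates every duplicate.
np.add.at(mat, (int1, int2), f)
# Count how many values landed on each cell.
n = np.zeros((2,2))
np.add.at(n, (int1, int2), 1)
# Divide the sums by the counts to get the mean.
mat[int1, int2] /= n[int1, int2]
print(mat)

[[0.5 0. ]
 [0.  0. ]]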
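As a possible alternative for the mean (a different technique from the np.add.at approach in this answer), the same sums and counts can be obtained with np.bincount on flattened indices. This is only a sketch: the names rows, cols, vals and the seeded generator are assumptions chosen for the illustration.

```python
import numpy as np

shape = (1000, 1000)
rng = np.random.default_rng(0)           # seeded only for reproducibility
rows = rng.integers(0, shape[0], size=50_000)
cols = rng.integers(0, shape[1], size=50_000)
vals = rng.random(50_000)

# Map each (row, col) pair to a single index into the flattened matrix.
flat = np.ravel_multi_index((rows, cols), shape)
size = shape[0] * shape[1]

# Per-cell sum of values and per-cell number of hits.
sums = np.bincount(flat, weights=vals, minlength=size)
counts = np.bincount(flat, minlength=size)

# Divide only where at least one value landed; untouched cells stay 0.
means = np.zeros(size)
np.divide(sums, counts, out=means, where=counts > 0)
mean_mat = means.reshape(shape)
```

Which of the two variants is faster depends on the NumPy version and the data, so it is worth timing both on the real arrays.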





Very clever and efficient! There is no version for median, though?
– Cindy Almighty
2 days ago





I had a quick play with the median too but couldn't think of an easy way to get it (which doesn't mean one doesn't exist :)). The main reason is that you need to keep a list of all collisions to compute a median, which forces you (I believe) to use Python lists, which don't integrate well with NumPy vectorization...
– Julien
2 days ago
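One way a median might still be computed in pure NumPy, without Python lists, is to sort the values by cell and then pick the middle element(s) of each group. The helper name median_by_cell and its parameters are hypothetical, and this is a sketch rather than a tuned implementation.

```python
import numpy as np

def median_by_cell(rows, cols, vals, shape):
    """Per-cell median of vals scattered to (rows, cols); empty cells stay 0."""
    flat = np.ravel_multi_index((rows, cols), shape)
    # lexsort uses the LAST key as primary: sort by cell, then by value.
    order = np.lexsort((vals, flat))
    flat_sorted = flat[order]
    vals_sorted = vals[order]
    # Start position and size of each run of identical cell indices.
    cells, starts, counts = np.unique(
        flat_sorted, return_index=True, return_counts=True
    )
    # Lower and upper middle positions (identical for odd-sized groups).
    lo = starts + (counts - 1) // 2
    hi = starts + counts // 2
    medians = 0.5 * (vals_sorted[lo] + vals_sorted[hi])
    out = np.zeros(shape)
    out.flat[cells] = medians
    return out

rows = np.array([0, 0, 0, 1, 1])
cols = np.array([0, 0, 0, 1, 1])
vals = np.array([3.0, 1.0, 2.0, 1.0, 4.0])
result = median_by_cell(rows, cols, vals, (2, 2))
print(result)
```

The cost is dominated by the sort, so it scales as O(L log L) in the number of values rather than in the size of the matrix.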



You can manipulate your data in pandas and then assign.





Starting from


import numpy as np
import pandas as pd

mat = np.zeros((1000,1000))
a = np.random.randint(0,999,size=(50000,))
b = np.random.randint(0,999,size=(50000,))
c = np.random.rand(50000)



You can define a function


def get_aggregated_collisions(a,b,c):
    df = pd.DataFrame({'x':a, 'y':b, 'v':c})
    df['coord'] = df[['x','y']].apply(tuple,1)
    d = df.groupby('coord').agg({"v":'mean','x':'first', 'y':'first'}).to_dict('list')
    return d



and then


d = get_aggregated_collisions(a,b,c)
mat[d['x'], d['y']] = d['v']



The whole operation (including generating the matrices, the np.random calls, etc.) ran reasonably fast:


1.05 s ± 30.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



The idea behind making a tuple of coordinates was to have a hashable option to group values by their coordinates. Maybe there is even a smarter way to do this :) always open to suggestions.







This solution works for median just as well?
– Cindy Almighty
2 days ago





You don't need to create a tuple. Just group on both the x and y columns.
– Tai
2 days ago







@CindyAlmighty Yes, it does! :) Just change "mean" to "median" or whatever operation you might want.
– RafaelC
2 days ago







@Tai I hadn't slept in the past 2 days, haha. Thanks for pointing that out. I won't edit, so as not to make your answer redundant. Thanks ;}
– RafaelC
2 days ago





@RafaelC Sounds like you've had a rough few days. Feel free to edit your answer to give users better information. I wouldn't mind.
– Tai
2 days ago




My trial based on RafaelC's answer.



First do groupby on ["x", "y"], then take mean or median of each group, and finally reset the index with reset_index().




import numpy as np
import pandas as pd
# setup
mat = np.zeros((1000,1000))
a = np.random.randint(0,999,size=(50000,))
b = np.random.randint(0,999,size=(50000,))
c = np.random.rand(50000)
# Start here
df = pd.DataFrame({"x":a, "y":b, "val":c})
# One row per unique (x, y) pair, holding the mean of its values
v = df.groupby(["x", "y"]).mean().reset_index()
mat[v["x"].to_numpy(), v["y"].to_numpy()] += v["val"].to_numpy()



If medians are needed, modify v to be




v = df.groupby(["x", "y"]).median().reset_index()
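Since the mean and median variants differ only in the aggregation step, they could be wrapped in one small helper. The function name aggregate_collisions and its how parameter are hypothetical; this is a sketch that assumes pandas is available and that how is an aggregation name pandas accepts, such as "mean" or "median".

```python
import numpy as np
import pandas as pd

def aggregate_collisions(rows, cols, vals, shape, how="mean"):
    """Scatter vals into a zero matrix, combining collisions with `how`."""
    df = pd.DataFrame({"x": rows, "y": cols, "val": vals})
    # One row per unique (x, y) pair with the aggregated value.
    v = df.groupby(["x", "y"])["val"].agg(how).reset_index()
    out = np.zeros(shape)
    # After grouping, every (x, y) is unique, so plain assignment is safe.
    out[v["x"].to_numpy(), v["y"].to_numpy()] = v["val"].to_numpy()
    return out

rows = np.array([0, 0, 0, 1])
cols = np.array([0, 0, 0, 1])
vals = np.array([1.0, 2.0, 6.0, 4.0])
mean_mat = aggregate_collisions(rows, cols, vals, (2, 2), how="mean")
median_mat = aggregate_collisions(rows, cols, vals, (2, 2), how="median")
```

Cells that receive no value keep the initial 0, which matches the zero-initialised mat used throughout this thread.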





