Vectorized assignment in Numpy
Let's assume I have a large 2D NumPy array, e.g. 1000x1000 elements. I also have two 1D integer arrays of length L and a 1D float array of the same length. If I simply want to assign the floats to the positions in the original array given by the integer arrays, I could write:
import numpy as np
mat = np.zeros((1000,1000))
int1 = np.random.randint(0,999,size=(50000,))
int2 = np.random.randint(0,999,size=(50000,))
f = np.random.rand(50000)
mat[int1,int2] = f
But if there were collisions i.e. multiple floats corresponding to single location, all but the last would be overwritten. Is there a way to somehow aggregate all the collisions, e.g. mean or median of all the floats falling at the same location? I would like to take advantage of vectorization and hopefully avoid interpreter loops.
Thanks!
If you want the mean and there is no maximum number of times the entry could be updated, you'll need a 3D array to store all the values and then take the mean at the end.
– Kyle Roth
2 days ago
3 Answers
3
Building on hpaulj's suggestion, here's how to get the mean value in case of collisions:
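import numpy as np
mat = np.zeros((2,2))
int1 = np.zeros(2, dtype=int)
int2 = np.zeros(2, dtype=int)
f = np.array([0,1])
# sum all values landing on each cell (ufunc.at is unbuffered, so duplicates accumulate)
np.add.at(mat, (int1, int2), f)
# count how many values landed on each cell
n = np.zeros((2,2))
np.add.at(n, (int1, int2), 1)
# divide sums by counts, only at the cells that were written to
mat[int1, int2] /= n[int1, int2]
print(mat)
[[0.5 0. ]
 [0.  0. ]]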
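As a possible alternative to the np.add.at approach above, the same sums and counts can be obtained with np.bincount on flattened indices. This is only a minimal sketch: the helper name scatter_mean and the tiny test data are made up for illustration, and cells that receive no value are assumed to stay at 0.

```python
import numpy as np

def scatter_mean(shape, rows, cols, values):
    """Hypothetical helper: mean of all values landing on each (row, col) cell."""
    # Turn each (row, col) pair into a single index into the flattened array.
    flat = np.ravel_multi_index((rows, cols), shape)
    size = shape[0] * shape[1]
    # Per-cell sum of values and per-cell number of hits.
    sums = np.bincount(flat, weights=values, minlength=size)
    counts = np.bincount(flat, minlength=size)
    out = np.zeros(size)
    hit = counts > 0  # avoid dividing by zero for untouched cells
    out[hit] = sums[hit] / counts[hit]
    return out.reshape(shape)

rows = np.array([0, 0, 1])
cols = np.array([0, 0, 1])
vals = np.array([0.0, 1.0, 4.0])
result = scatter_mean((2, 2), rows, cols, vals)
print(result)
```

Here the two values colliding at (0, 0) are averaged, while the single value at (1, 1) is stored unchanged.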
Very clever and efficient! There is no version for median, though?
– Cindy Almighty
2 days ago
I had a small play with median too but couldn't think of an easy way to get it. (Doesn't mean it doesn't exist :). The main reason is you need to keep a list of all collisions to compute a median, which forces you (I believe) to use python lists which don't integrate well with numpy vectorization...
– Julien
2 days ago
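For what it's worth, a NumPy-only median seems possible without Python lists by sorting the values within each cell and picking the middle element(s) of every group. The following is a rough sketch under that assumption, not a tuned implementation; scatter_median is a hypothetical name, and untouched cells are left at 0.

```python
import numpy as np

def scatter_median(shape, rows, cols, values):
    """Hypothetical helper: median of all values landing on each (row, col) cell."""
    flat = np.ravel_multi_index((rows, cols), shape)
    # lexsort uses the LAST key as primary: sort by cell, then by value within a cell.
    order = np.lexsort((values, flat))
    flat_sorted = flat[order]
    vals_sorted = values[order]
    # Start offset and size of each run of identical cell indices.
    cells, starts, counts = np.unique(
        flat_sorted, return_index=True, return_counts=True
    )
    # Middle position(s) of each run; they coincide when the run length is odd.
    lo = starts + (counts - 1) // 2
    hi = starts + counts // 2
    medians = 0.5 * (vals_sorted[lo] + vals_sorted[hi])
    out = np.zeros(shape[0] * shape[1])
    out[cells] = medians
    return out.reshape(shape)

rows = np.array([0, 0, 0, 1, 1])
cols = np.array([0, 0, 0, 1, 1])
vals = np.array([5.0, 1.0, 3.0, 2.0, 8.0])
med = scatter_median((2, 2), rows, cols, vals)
print(med)
```

The cost is dominated by the sort, so this should scale as O(L log L) in the number of assigned values.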
You can manipulate your data in pandas and then assign.
Starting from
mat = np.zeros((1000,1000))
a = np.random.randint(0,999,size=(50000,))
b = np.random.randint(0,999,size=(50000,))
c = np.random.rand(50000)
You can define a function
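import pandas as pd

def get_aggregated_collisions(a,b,c):
    df = pd.DataFrame({'x':a, 'y':b, 'v':c})
    # build a hashable (x, y) key for each row
    df['coord'] = df[['x','y']].apply(tuple,1)
    # average the values per coordinate and keep one x and y per group
    d = df.groupby('coord').agg({"v":'mean','x':'first', 'y':'first'}).to_dict('list')
    return d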
and then
d = get_aggregated_collisions(a,b,c)
mat[d['x'], d['y']] = d['v']
The whole operation (including generating the matrices, the np.random calls, etc.) ran reasonably fast:
1.05 s ± 30.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The idea behind making a tuple of coordinates was to have a hashable option to group values by their coordinates. Maybe there is even a smarter way to do this :) always open to suggestions.
This solution works for median just as well?
– Cindy Almighty
2 days ago
You don't need to create tuple. Just do the grouping based on both x and y columns.
– Tai
2 days ago
@CindyAlmighty Yes, it does! :) Just change "mean" to "median" or whatever operation you might want.
– RafaelC
2 days ago
@Tai hadn't slept in the past 2 days haha: thanks for pointing that out. I won't edit, so as not to make your answer redundant. Thanks ;}
– RafaelC
2 days ago
@RafaelC sounds like you have rough days. Feel free to edit your answer to provide users with better information. I would not mind.
– Tai
2 days ago
My trial based on RafaelC's answer.
First do groupby on ["x", "y"], then take the mean or median of each group, and finally reset the index with reset_index().
import numpy as np
import pandas as pd
# setup
mat = np.zeros((1000,1000))
a = np.random.randint(0,999,size=(50000,))
b = np.random.randint(0,999,size=(50000,))
c = np.random.rand(50000)
# Start here
df = pd.DataFrame({"x":a, "y":b, "val":c})
v = df.groupby(["x", "y"]).mean().reset_index()
mat[v["x"], v["y"]] += v["val"]
If medians are needed, modify v to be
v = df.groupby(["x", "y"]).median().reset_index()
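To check the groupby approach on something small enough to verify by hand, here is a self-contained sketch with made-up data (three values colliding at (0, 0) and one value at (1, 1)). Converting the columns with to_numpy() before indexing is a choice made here for explicitness, not something the answer above requires.

```python
import numpy as np
import pandas as pd

# Toy data (made up): three values collide at (0, 0), one lands at (1, 1).
a = np.array([0, 0, 0, 1])
b = np.array([0, 0, 0, 1])
c = np.array([1.0, 2.0, 9.0, 4.0])

df = pd.DataFrame({"x": a, "y": b, "val": c})
# One row per distinct (x, y) pair, holding the median of its values.
v = df.groupby(["x", "y"])["val"].median().reset_index()

mat = np.zeros((2, 2))
# After grouping, each (x, y) pair is unique, so plain assignment is safe.
mat[v["x"].to_numpy(), v["y"].to_numpy()] = v["val"].to_numpy()
print(mat)
```

The median of 1, 2 and 9 is 2, so mat[0, 0] ends up as 2.0 rather than the mean of 4.0.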
Consider the ufunc .at method, e.g. np.add.at indexing with array
– hpaulj
2 days ago