Vectorized assignment in Numpy

Let's assume I have a large 2D NumPy array, e.g. 1000x1000 elements. I also have two 1D integer arrays of length L and a 1D float array of the same length. If I simply want to assign the floats to the positions in the original array given by the two integer arrays, I could write:

    import numpy as np

    mat = np.zeros((1000, 1000))
    int1 = np.random.randint(0, 999, size=(50000,))
    int2 = np.random.randint(0, 999, size=(50000,))
    f = np.random.rand(50000)
    mat[int1, int2] = f

But if there are collisions, i.e. multiple floats corresponding to a single location, all but the last will be overwritten. Is there a way to aggregate all the collisions, e.g. take the mean or median of all the floats falling at the same location? I would like to take advantage of vectorization and hopefully avoid interpreter loops.

Thanks!

Consider the ufunc .at method, e.g. np.add.at, for indexing with arrays.
– hpaulj
2 days ago
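To make the hint above concrete, here is a minimal sketch comparing plain fancy-index assignment with np.add.at when an index repeats. The tiny arrays and the variable names are made up for illustration and are not from the question.

```python
import numpy as np

# Hypothetical toy data: position (0, 0) is hit twice
rows = np.array([0, 0, 1])
cols = np.array([0, 0, 1])
vals = np.array([1.0, 2.0, 5.0])

# Plain assignment: only one of the colliding values survives at (0, 0)
assigned = np.zeros((2, 2))
assigned[rows, cols] = vals

# In-place add through fancy indexing is buffered, so duplicates do not accumulate
buffered = np.zeros((2, 2))
buffered[rows, cols] += vals

# ufunc.at is unbuffered: every occurrence of a repeated index is applied
accumulated = np.zeros((2, 2))
np.add.at(accumulated, (rows, cols), vals)

print(accumulated[0, 0])  # sum of both colliding values
```

Only the np.add.at variant sees both values at the repeated position, which is why it can serve as a building block for sums, counts and means.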

If you want the mean and there is no maximum number of times the entry could be updated, you'll need a 3D array to store all the values and then take the mean at the end.
– Kyle Roth
2 days ago

3 Answers

Building on hpaulj's suggestion, here's how to get the mean value in case of collisions:

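    import numpy as np

    mat = np.zeros((2, 2))
    int1 = np.zeros(2, dtype=int)
    int2 = np.zeros(2, dtype=int)
    f = np.array([0, 1])

    # Sum all values landing on each cell (indices passed as a tuple)
    np.add.at(mat, (int1, int2), f)

    # Count how many values landed on each cell
    n = np.zeros((2, 2))
    np.add.at(n, (int1, int2), 1)

    # Divide the sums by the counts at the touched cells
    mat[int1, int2] /= n[int1, int2]
    print(mat)
    # [[0.5 0. ]
    #  [0.  0. ]]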
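As an alternative sketch of the same sum-and-count idea (not part of the answer above), the 2D indices can be flattened with np.ravel_multi_index and aggregated with np.bincount. The function name scatter_mean and the toy inputs are hypothetical, chosen only for this example.

```python
import numpy as np

def scatter_mean(shape, rows, cols, vals):
    """Mean of vals per (row, col) cell; cells that are never hit stay 0."""
    # Map each (row, col) pair to a single index into the flattened array
    flat = np.ravel_multi_index((rows, cols), shape)
    size = shape[0] * shape[1]
    # Per-cell sum of values and per-cell number of hits
    sums = np.bincount(flat, weights=vals, minlength=size)
    counts = np.bincount(flat, minlength=size)
    out = np.zeros(size)
    hit = counts > 0
    # Divide only where at least one value landed, avoiding 0/0
    out[hit] = sums[hit] / counts[hit]
    return out.reshape(shape)

# Hypothetical toy data: (0, 0) receives 0.0 and 1.0, (1, 1) receives 4.0
rows = np.array([0, 0, 1])
cols = np.array([0, 0, 1])
vals = np.array([0.0, 1.0, 4.0])
result = scatter_mean((2, 2), rows, cols, vals)
print(result)
```

This returns a new array instead of modifying mat in place, which may or may not suit the use case.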

Very clever and efficient! There is no version for median, though?
– Cindy Almighty
2 days ago

I had a small play with median too but couldn't think of an easy way to get it. (Doesn't mean it doesn't exist :). The main reason is you need to keep a list of all collisions to compute a median, which forces you (I believe) to use python lists which don't integrate well with numpy vectorization...
– Julien
2 days ago
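For what it's worth, one possible way to get a median without Python lists is to sort the values by cell and then pick the middle element(s) of each group. This is only a sketch under the assumption that sorting all values is affordable; scatter_median and the toy inputs are hypothetical names and data, not from the thread.

```python
import numpy as np

def scatter_median(shape, rows, cols, vals):
    """Median of vals per (row, col) cell; cells that are never hit stay 0."""
    flat = np.ravel_multi_index((rows, cols), shape)
    # lexsort uses the LAST key as primary: sort by cell, then by value within a cell
    order = np.lexsort((vals, flat))
    flat_sorted = flat[order]
    vals_sorted = vals[order]
    # Start offset and size of each run of identical cell indices
    cells, start, counts = np.unique(
        flat_sorted, return_index=True, return_counts=True
    )
    # Middle position(s) of each run; they coincide when the count is odd
    lo = start + (counts - 1) // 2
    hi = start + counts // 2
    medians = 0.5 * (vals_sorted[lo] + vals_sorted[hi])
    out = np.zeros(shape[0] * shape[1])
    out[cells] = medians
    return out.reshape(shape)

# Hypothetical toy data: three values at (0, 0), two values at (1, 1)
rows = np.array([0, 0, 0, 1, 1])
cols = np.array([0, 0, 0, 1, 1])
vals = np.array([3.0, 1.0, 2.0, 10.0, 20.0])
result = scatter_median((2, 2), rows, cols, vals)
print(result)
```

The cost is dominated by the sort, so it stays fully vectorized at the price of O(N log N) work.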

You can manipulate your data in pandas and then assign.

Starting from

    mat = np.zeros((1000, 1000))
    a = np.random.randint(0, 999, size=(50000,))
    b = np.random.randint(0, 999, size=(50000,))
    c = np.random.rand(50000)

You can define a function

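    import pandas as pd

    def get_aggregated_collisions(a, b, c):
        df = pd.DataFrame({'x': a, 'y': b, 'v': c})
        # Build a hashable (x, y) key per row so values can be grouped by coordinate
        df['coord'] = df[['x', 'y']].apply(tuple, axis=1)
        # Aggregate the values per coordinate and keep one x and y per group
        d = df.groupby('coord').agg({'v': 'mean', 'x': 'first', 'y': 'first'}).to_dict('list')
        return d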

and then

    d = get_aggregated_collisions(a, b, c)
    mat[d['x'], d['y']] = d['v']

The whole operation (including generating the matrices, the np.random calls, etc.) ran reasonably fast:

1.05 s ± 30.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

The idea behind making a tuple of coordinates was to have a hashable option to group values by their coordinates. Maybe there is even a smarter way to do this :) always open to suggestions.

This solution works for median just as well?
– Cindy Almighty
2 days ago

You don't need to create tuple. Just do the grouping based on both x and y columns.
– Tai
2 days ago

@CindyAlmighty Yes, it does! :) Just change "mean" to "median" or whatever operation you might want.
– RafaelC
2 days ago

@Tai hadn't slept in the past 2 days haha, thanks for pointing it out. I won't edit, so as not to make your answer redundant. Thanks ;}
– RafaelC
2 days ago

@RafaelC sounds like you have rough days. Feel free to edit your answer to provide users with better information. I would not mind.
– Tai
2 days ago

My trial based on RafaelC's answer.

First do groupby on ["x", "y"], then take mean or median of each group, and finally reset the index with reset_index().

    import numpy as np
    import pandas as pd

    # setup
    mat = np.zeros((1000, 1000))
    a = np.random.randint(0, 999, size=(50000,))
    b = np.random.randint(0, 999, size=(50000,))
    c = np.random.rand(50000)

    # Start here
    df = pd.DataFrame({"x": a, "y": b, "val": c})
    v = df.groupby(["x", "y"]).mean().reset_index()
    # Convert the columns to plain arrays before indexing into mat
    mat[v["x"].to_numpy(), v["y"].to_numpy()] += v["val"].to_numpy()

If medians are needed, modify v to be

    v = df.groupby(["x", "y"]).median().reset_index()
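Since only the aggregation name changes between the two variants, the steps could be wrapped in a small helper. The following is a sketch only: the function name scatter_aggregate and its how parameter are hypothetical, and it assumes the aggregation is one that pandas accepts by name (such as "mean" or "median").

```python
import numpy as np
import pandas as pd

def scatter_aggregate(shape, rows, cols, vals, how="mean"):
    """Aggregate colliding values per (row, col) cell with a named pandas aggregation."""
    df = pd.DataFrame({"x": rows, "y": cols, "val": vals})
    # One row per distinct (x, y) pair holding the aggregated value
    agg = df.groupby(["x", "y"])["val"].agg(how).reset_index()
    out = np.zeros(shape)
    # After grouping the coordinates are unique, so plain assignment is safe
    out[agg["x"].to_numpy(), agg["y"].to_numpy()] = agg["val"].to_numpy()
    return out

# Hypothetical toy data: three values collide at (0, 0)
rows = np.array([0, 0, 0, 1])
cols = np.array([0, 0, 0, 1])
vals = np.array([1.0, 2.0, 6.0, 5.0])
result_mean = scatter_aggregate((2, 2), rows, cols, vals, how="mean")
result_median = scatter_aggregate((2, 2), rows, cols, vals, how="median")
print(result_mean)
print(result_median)
```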
