select index value from groupby on a pandas dataframe in python

I have the following dataframe:

df = pd.DataFrame({'place' : ['A', 'B', 'C', 'D', 'E', 'F'], 'population': [10 , 20, 30, 15, 25, 35], 'region': ['I', 'II', 'III', 'I', 'II', 'III']})

And it looks like this:

place population region 0 A 10 I 1 B 20 II 2 C 30 III 3 D 15 I 4 E 25 II 5 F 35 III

I would like to select the place with the smallest population from the region with the highest population.

df.groupby('region').population.sum()

Returns:

region I 25 II 45 III 65 Name: population, dtype: int64

But I have no clue how to proceed from here (using .groupby / .loc / .iloc)

Any suggestion?

2 Answers
2

First add a column for region population:

df['region_pop'] = df.groupby('region')['population'].transform(sum)

Then sort your dataframe and extract the first row:

res = df.sort_values(['region_pop', 'population'], ascending=[False, True]) .head(1)

Result:

place population region region_pop 2 C 30 III 65

I believe this would be faster than my solution. +1 :)
– Harv Ipan
Jun 20 at 15:10

Thanks, nice! Is there a way to do it in one line of code (with method chaining)?
– Rene
Jun 20 at 17:27

@Rene, Probably, but it'll be an unreadable mess I would have difficulty understanding.
– jpp
Jun 20 at 17:29

df.assign(region_population = df.groupby('region')['population'].transform(sum)).sort_values(['region_population', 'population'], ascending=[False, True]).iloc[0].place
– Rene
Jun 20 at 17:37

@Rene, Yep, that would be the one-liner. But don't let it make you believe it's more efficient. You are just moving an explicit series definition to pd.DataFrame.assign.
– jpp
Jun 20 at 17:37

pd.DataFrame.assign

You need to find the region with highest population. Then groupby place to the subset of data with that region and find the place with lowest population. (Assuming place would be repetitive in real data)

region

population

groupby

place

population

high_reg = df.groupby('region')['population'].sum().reset_index(name='count').sort_values('count').iloc[-1]['region'] df.loc[df['region']==high_reg].groupby('place')['population'].sum().reset_index(name='count').sort_values('count').iloc[0]['place']

Output:

'C'

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Search This Blog

Mgiyuk