select index value from groupby on a pandas dataframe in python


select index value from groupby on a pandas dataframe in python



I have the following dataframe:


df = pd.DataFrame({'place' : ['A', 'B', 'C', 'D', 'E', 'F'],
'population': [10 , 20, 30, 15, 25, 35],
'region': ['I', 'II', 'III', 'I', 'II', 'III']})



And it looks like this:


place population region
0 A 10 I
1 B 20 II
2 C 30 III
3 D 15 I
4 E 25 II
5 F 35 III



I would like to select the place with the smallest population from the region with the highest population.


df.groupby('region').population.sum()



Returns:


region
I 25
II 45
III 65
Name: population, dtype: int64



But I have no clue how to proceed from here (using .groupby / .loc / .iloc)



Any suggestion?




2 Answers
2



First add a column for region population:


df['region_pop'] = df.groupby('region')['population'].transform(sum)



Then sort your dataframe and extract the first row:


res = df.sort_values(['region_pop', 'population'], ascending=[False, True])
.head(1)



Result:


place population region region_pop
2 C 30 III 65





I believe this would be faster than my solution. +1 :)
– Harv Ipan
Jun 20 at 15:10





Thanks, nice! Is there a way to do it in one line of code (with method chaining)?
– Rene
Jun 20 at 17:27





@Rene, Probably, but it'll be an unreadable mess I would have difficulty understanding.
– jpp
Jun 20 at 17:29





df.assign(region_population = df.groupby('region')['population'].transform(sum)).sort_values(['region_population', 'population'], ascending=[False, True]).iloc[0].place
– Rene
Jun 20 at 17:37





@Rene, Yep, that would be the one-liner. But don't let it make you believe it's more efficient. You are just moving an explicit series definition to pd.DataFrame.assign.
– jpp
Jun 20 at 17:37



pd.DataFrame.assign



You need to find the region with highest population. Then groupby place to the subset of data with that region and find the place with lowest population. (Assuming place would be repetitive in real data)


region


population


groupby


place


population


high_reg = df.groupby('region')['population'].sum().reset_index(name='count').sort_values('count').iloc[-1]['region']
df.loc[df['region']==high_reg].groupby('place')['population'].sum().reset_index(name='count').sort_values('count').iloc[0]['place']



Output:


'C'






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Comments

Popular posts from this blog

paramiko-expect timeout is happening after executing the command

Opening a url is failing in Swift

Export result set on Dbeaver to CSV