select index value from groupby on a pandas dataframe in python
select index value from groupby on a pandas dataframe in python
I have the following dataframe:
df = pd.DataFrame({'place' : ['A', 'B', 'C', 'D', 'E', 'F'],
'population': [10 , 20, 30, 15, 25, 35],
'region': ['I', 'II', 'III', 'I', 'II', 'III']})
And it looks like this:
place population region
0 A 10 I
1 B 20 II
2 C 30 III
3 D 15 I
4 E 25 II
5 F 35 III
I would like to select the place with the smallest population from the region with the highest population.
df.groupby('region').population.sum()
Returns:
region
I 25
II 45
III 65
Name: population, dtype: int64
But I have no clue how to proceed from here (using .groupby / .loc / .iloc)
Any suggestion?
2 Answers
2
First add a column for region population:
df['region_pop'] = df.groupby('region')['population'].transform(sum)
Then sort your dataframe and extract the first row:
res = df.sort_values(['region_pop', 'population'], ascending=[False, True])
.head(1)
Result:
place population region region_pop
2 C 30 III 65
Thanks, nice! Is there a way to do it in one line of code (with method chaining)?
– Rene
Jun 20 at 17:27
@Rene, Probably, but it'll be an unreadable mess I would have difficulty understanding.
– jpp
Jun 20 at 17:29
df.assign(region_population = df.groupby('region')['population'].transform(sum)).sort_values(['region_population', 'population'], ascending=[False, True]).iloc[0].place
– Rene
Jun 20 at 17:37
@Rene, Yep, that would be the one-liner. But don't let it make you believe it's more efficient. You are just moving an explicit series definition to
pd.DataFrame.assign
.– jpp
Jun 20 at 17:37
pd.DataFrame.assign
You need to find the region
with highest population
. Then groupby
place
to the subset of data with that region and find the place with lowest population
. (Assuming place would be repetitive in real data)
region
population
groupby
place
population
high_reg = df.groupby('region')['population'].sum().reset_index(name='count').sort_values('count').iloc[-1]['region']
df.loc[df['region']==high_reg].groupby('place')['population'].sum().reset_index(name='count').sort_values('count').iloc[0]['place']
Output:
'C'
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
I believe this would be faster than my solution. +1 :)
– Harv Ipan
Jun 20 at 15:10