Pandas: How to group (and sum) by category and retain information from subcategory


Pandas: How to group (and sum) by category and retain information from subcategory



This is a follow-up question to Pandas: How to subset (and sum) top N observations within subcategories? There it was demonstrated how you could find the sum of the top 3 months for each year in this dataframe:



Example dataframe


year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
5 1949 June 135
.
.
.
137 1960 June 535
138 1960 July 622
139 1960 August 606
140 1960 September 508
141 1960 October 461
142 1960 November 390
143 1960 December 432



So that you'd end up with this:


year passengers
0 1949 432
1 1950 498
2 1951 582
3 1952 690
4 1953 779
5 1954 859
6 1955 1026
7 1956 1192
8 1957 1354
9 1958 1431
10 1959 1579
11 1960 176



The number 432 for 1949 is the sum of 148+148+136 for the months July, August and September.
My question now is this:


432 for 1949


148+148+136 for the months July, August and September.



Is it possible to do the same calculations, and at the same time keep the corresponding subcategories as a list in it's own column?



Desired output



(I've only checked the actual sum for 1949. 1950 is made up):


year passengers months
0 1949 432 July, August, September
1 1950 498 August, September, December
2 1951 582 .
3 1952 690 .
4 1953 779 .
5 1954 859 .
6 1955 1026 .
7 1956 1192 .
8 1957 1354 .
9 1958 1431 .
10 1959 1579 .
11 1960 176 .



Reproducible code and data:


import pandas as pd
import seaborn as sns
df = sns.load_dataset('flights')
print(df.head())

df2 = df.groupby('year')['passengers'].apply(lambda x: x.nlargest(3).sum()).reset_index()
print(df2.head())



df:


year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121



df2:


year passengers
0 1949 432
1 1950 498
2 1951 582
3 1952 690
4 1953 779



Thank you for any suggestions!




4 Answers
4



Use custom function with GroupBy.apply, idea is sorting first by sort_values and then call head for top values per groups:


GroupBy.apply


sort_values


head


def f(x):
x = x.head(3)
names = ['passengers','months']
return pd.Series([x['passengers'].sum(), ', '.join(x['month'])], index=names)

df2 = df.sort_values('passengers', ascending=False).groupby('year').apply(f).reset_index()
print(df2.head())
year passengers months
0 1949 432 July, August, September
1 1950 498 July, August, September
2 1951 582 July, August, September
3 1952 690 August, July, June
4 1953 779 August, July, June



You could


In [69]: df.groupby('year').apply(lambda x:
x.nlargest(3, 'passengers').agg(
{'passengers': 'sum', 'month': lambda x: ', '.join(x.values)}
)).reset_index()
Out[69]:
year passengers month
0 1949 432 July, August, September
1 1950 498 July, August, September
2 1951 582 July, August, September
3 1952 690 August, July, June
4 1953 779 August, July, June
5 1954 859 July, August, June
6 1955 1026 July, August, June
7 1956 1192 July, August, June
8 1957 1354 August, July, June
9 1958 1431 August, July, June
10 1959 1579 August, July, June
11 1960 1763 July, August, June



Here's one solution using nlargest.


nlargest


def largest(x, k):
vals = x.nlargest(n=k, columns=['passengers'])
return [vals['passengers'].sum(), vals['month'].tolist()]

g = df.groupby('year').apply(largest, k=3).reset_index()
joiner = pd.DataFrame(g[0].values.tolist(), columns=['passengers', 'months'])

res = g.drop(0, axis=1).join(joiner)

print(res)

year passengers months
0 1949 382 [March, April, May]



I intentionally keep months as a list, you can convert to a comma separated string if you wish.


months



Alternatively - group, then apply using pd.DataFrame.nlargest instead of a custom function/lambda then regroup on the index and apply suitable aggs, eg:


pd.DataFrame.nlargest


agg


new_df = (
df.groupby('year').apply(pd.DataFrame.nlargest, 3, 'passengers')
.groupby(level=0).agg({'passengers': 'sum', 'month': ', '.join})
# optionally reset index
# .reset_index()
)



That'll then give you:


passengers month
year
1949 432 July, August, September
1950 498 July, August, September
1951 582 July, August, September
1952 690 August, July, June
1953 779 August, July, June
1954 859 July, August, June
...



Seems that year as an index makes sense in the resultant frame, but apply a .reset_index() if not.


year


.reset_index()






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Comments

Popular posts from this blog

paramiko-expect timeout is happening after executing the command

Export result set on Dbeaver to CSV

The forked VM terminated without saying properly goodbye. VM crash or System.exit called