Pandas: How to group (and sum) by category and retain information from subcategory

This is a follow-up question to Pandas: How to subset (and sum) top N observations within subcategories? There it was demonstrated how you could find the sum of the top 3 months for each year in this dataframe:

Example dataframe

year month passengers 0 1949 January 112 1 1949 February 118 2 1949 March 132 3 1949 April 129 4 1949 May 121 5 1949 June 135 . . . 137 1960 June 535 138 1960 July 622 139 1960 August 606 140 1960 September 508 141 1960 October 461 142 1960 November 390 143 1960 December 432

So that you'd end up with this:

year passengers 0 1949 432 1 1950 498 2 1951 582 3 1952 690 4 1953 779 5 1954 859 6 1955 1026 7 1956 1192 8 1957 1354 9 1958 1431 10 1959 1579 11 1960 176

The number 432 for 1949 is the sum of 148+148+136 for the months July, August and September.
My question now is this:

432 for 1949

148+148+136 for the months July, August and September.

Is it possible to do the same calculations, and at the same time keep the corresponding subcategories as a list in it's own column?

Desired output

(I've only checked the actual sum for 1949. 1950 is made up):

year passengers months 0 1949 432 July, August, September 1 1950 498 August, September, December 2 1951 582 . 3 1952 690 . 4 1953 779 . 5 1954 859 . 6 1955 1026 . 7 1956 1192 . 8 1957 1354 . 9 1958 1431 . 10 1959 1579 . 11 1960 176 .

Reproducible code and data:

import pandas as pd import seaborn as sns df = sns.load_dataset('flights') print(df.head()) df2 = df.groupby('year')['passengers'].apply(lambda x: x.nlargest(3).sum()).reset_index() print(df2.head())

df:

year month passengers 0 1949 January 112 1 1949 February 118 2 1949 March 132 3 1949 April 129 4 1949 May 121

df2:

year passengers 0 1949 432 1 1950 498 2 1951 582 3 1952 690 4 1953 779

Thank you for any suggestions!

4 Answers
4

Use custom function with GroupBy.apply, idea is sorting first by sort_values and then call head for top values per groups:

GroupBy.apply

sort_values

head

def f(x): x = x.head(3) names = ['passengers','months'] return pd.Series([x['passengers'].sum(), ', '.join(x['month'])], index=names) df2 = df.sort_values('passengers', ascending=False).groupby('year').apply(f).reset_index() print(df2.head()) year passengers months 0 1949 432 July, August, September 1 1950 498 July, August, September 2 1951 582 July, August, September 3 1952 690 August, July, June 4 1953 779 August, July, June

You could

In [69]: df.groupby('year').apply(lambda x: x.nlargest(3, 'passengers').agg( {'passengers': 'sum', 'month': lambda x: ', '.join(x.values)} )).reset_index() Out[69]: year passengers month 0 1949 432 July, August, September 1 1950 498 July, August, September 2 1951 582 July, August, September 3 1952 690 August, July, June 4 1953 779 August, July, June 5 1954 859 July, August, June 6 1955 1026 July, August, June 7 1956 1192 July, August, June 8 1957 1354 August, July, June 9 1958 1431 August, July, June 10 1959 1579 August, July, June 11 1960 1763 July, August, June

Here's one solution using nlargest.

nlargest

def largest(x, k): vals = x.nlargest(n=k, columns=['passengers']) return [vals['passengers'].sum(), vals['month'].tolist()] g = df.groupby('year').apply(largest, k=3).reset_index() joiner = pd.DataFrame(g[0].values.tolist(), columns=['passengers', 'months']) res = g.drop(0, axis=1).join(joiner) print(res) year passengers months 0 1949 382 [March, April, May]

I intentionally keep months as a list, you can convert to a comma separated string if you wish.

months

Alternatively - group, then apply using pd.DataFrame.nlargest instead of a custom function/lambda then regroup on the index and apply suitable aggs, eg:

pd.DataFrame.nlargest

agg

new_df = ( df.groupby('year').apply(pd.DataFrame.nlargest, 3, 'passengers') .groupby(level=0).agg({'passengers': 'sum', 'month': ', '.join}) # optionally reset index # .reset_index() )

That'll then give you:

passengers month year 1949 432 July, August, September 1950 498 July, August, September 1951 582 July, August, September 1952 690 August, July, June 1953 779 August, July, June 1954 859 July, August, June ...

Seems that year as an index makes sense in the resultant frame, but apply a .reset_index() if not.

year

.reset_index()

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Search This Blog

Mgiyuk