Pandas: How to group (and sum) by category and retain information from subcategory
Pandas: How to group (and sum) by category and retain information from subcategory
This is a follow-up question to Pandas: How to subset (and sum) top N observations within subcategories? There it was demonstrated how you could find the sum of the top 3 months for each year in this dataframe:
Example dataframe
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
5 1949 June 135
.
.
.
137 1960 June 535
138 1960 July 622
139 1960 August 606
140 1960 September 508
141 1960 October 461
142 1960 November 390
143 1960 December 432
So that you'd end up with this:
year passengers
0 1949 432
1 1950 498
2 1951 582
3 1952 690
4 1953 779
5 1954 859
6 1955 1026
7 1956 1192
8 1957 1354
9 1958 1431
10 1959 1579
11 1960 176
The number 432 for 1949
is the sum of 148+148+136 for the months July, August and September.
My question now is this:
432 for 1949
148+148+136 for the months July, August and September.
Is it possible to do the same calculations, and at the same time keep the corresponding subcategories as a list in it's own column?
Desired output
(I've only checked the actual sum for 1949. 1950 is made up):
year passengers months
0 1949 432 July, August, September
1 1950 498 August, September, December
2 1951 582 .
3 1952 690 .
4 1953 779 .
5 1954 859 .
6 1955 1026 .
7 1956 1192 .
8 1957 1354 .
9 1958 1431 .
10 1959 1579 .
11 1960 176 .
Reproducible code and data:
import pandas as pd
import seaborn as sns
df = sns.load_dataset('flights')
print(df.head())
df2 = df.groupby('year')['passengers'].apply(lambda x: x.nlargest(3).sum()).reset_index()
print(df2.head())
df:
year month passengers
0 1949 January 112
1 1949 February 118
2 1949 March 132
3 1949 April 129
4 1949 May 121
df2:
year passengers
0 1949 432
1 1950 498
2 1951 582
3 1952 690
4 1953 779
Thank you for any suggestions!
4 Answers
4
Use custom function with GroupBy.apply
, idea is sorting first by sort_values
and then call head
for top values per groups:
GroupBy.apply
sort_values
head
def f(x):
x = x.head(3)
names = ['passengers','months']
return pd.Series([x['passengers'].sum(), ', '.join(x['month'])], index=names)
df2 = df.sort_values('passengers', ascending=False).groupby('year').apply(f).reset_index()
print(df2.head())
year passengers months
0 1949 432 July, August, September
1 1950 498 July, August, September
2 1951 582 July, August, September
3 1952 690 August, July, June
4 1953 779 August, July, June
You could
In [69]: df.groupby('year').apply(lambda x:
x.nlargest(3, 'passengers').agg(
{'passengers': 'sum', 'month': lambda x: ', '.join(x.values)}
)).reset_index()
Out[69]:
year passengers month
0 1949 432 July, August, September
1 1950 498 July, August, September
2 1951 582 July, August, September
3 1952 690 August, July, June
4 1953 779 August, July, June
5 1954 859 July, August, June
6 1955 1026 July, August, June
7 1956 1192 July, August, June
8 1957 1354 August, July, June
9 1958 1431 August, July, June
10 1959 1579 August, July, June
11 1960 1763 July, August, June
Here's one solution using nlargest
.
nlargest
def largest(x, k):
vals = x.nlargest(n=k, columns=['passengers'])
return [vals['passengers'].sum(), vals['month'].tolist()]
g = df.groupby('year').apply(largest, k=3).reset_index()
joiner = pd.DataFrame(g[0].values.tolist(), columns=['passengers', 'months'])
res = g.drop(0, axis=1).join(joiner)
print(res)
year passengers months
0 1949 382 [March, April, May]
I intentionally keep months
as a list, you can convert to a comma separated string if you wish.
months
Alternatively - group, then apply using pd.DataFrame.nlargest
instead of a custom function/lambda then regroup on the index and apply suitable agg
s, eg:
pd.DataFrame.nlargest
agg
new_df = (
df.groupby('year').apply(pd.DataFrame.nlargest, 3, 'passengers')
.groupby(level=0).agg({'passengers': 'sum', 'month': ', '.join})
# optionally reset index
# .reset_index()
)
That'll then give you:
passengers month
year
1949 432 July, August, September
1950 498 July, August, September
1951 582 July, August, September
1952 690 August, July, June
1953 779 August, July, June
1954 859 July, August, June
...
Seems that year
as an index makes sense in the resultant frame, but apply a .reset_index()
if not.
year
.reset_index()
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
Comments
Post a Comment