Python, Pandas, sum only taking uniques
I have an Excel spreadsheet like the one below (note: the ID column, column A, has duplicated values). I want to find the sum of Unit_Weight for each Contract_type, counting each ID only once (unique).
import numpy as np
import pandas as pd

data = {'ID': ["380689","380689","480562","480562","480562","14805","47089","56251","56251","56251","322624","322624","322624","85964","85964","85964","342225","342225","4589","23591","23591","235225"],
        'Contract_type': ["Other","Other","Type-I","Type-I","Type-I","Type-II","Type-II","Type-II","Type-II","Type-II","Type-II","Type-II","Type-II","Type-III","Type-III","Type-III","Part-time","Part-time","Part-time","Full-time","Full-time","Full-time"],
        'Unit_Weight': [335,335,119,119,119,119,52,452,452,452,19,19,19,165,165,165,165,165,165,724,724,16],
        'Test_time': ["16:26","07:39","18:48","22:32","03:54","03:30","09:57","18:52","19:03","18:06","18:52","03:51","04:00","22:02","13:35","13:43","10:29","06:30","12:20","12:52","17:30","13:10"],
        'Tested': [1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0],
        'Internal': [1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1]}
df = pd.DataFrame(data)
I tried:
print pd.pivot_table(df, index = ["Contract_type", "ID"]).Unit_Weight
It gives:
Contract_type ID
Full-time 23591 724
235225 16
Other 380689 335
....
But I only want it to show something like: Full-time 740, etc.
I also tried:
print pd.pivot_table(df, index = ["Contract_type"], values=["Unit_Weight"], aggfunc = np.sum)
It gives:
Full-time 1464 # this does not account for the duplicated IDs
What's the right way to write this line? Thank you.
Please, no data as images, as always...
– SpghttCd
2 days ago
What is expected output?
– jezrael
2 days ago
@jezrael, hello sir. Like Full-time 740, Other 335, Part-time 330, Type-I 119, Type-II 642, Type-III 165.
– Mark K
2 days ago
Why is other -> 335 and not 670?
– SpghttCd
2 days ago
Ah, that's my misunderstanding, thanks for clarifying
– SpghttCd
2 days ago
2 Answers
It looks like you want to take each (ID, Contract_type) pair into account only once, so I don't think df.groupby(['Contract_type', 'ID']).Unit_Weight.sum() would work.
Instead you can try:
df.drop_duplicates(['Contract_type', 'ID']).groupby('Contract_type').Unit_Weight.sum()
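For reference, on the sample data above this one-liner should print something like the following (a sketch of the expected output, assuming pandas' default alphabetical group ordering):
Contract_type
Full-time    740
Other        335
Part-time    330
Type-I       119
Type-II      642
Type-III     165
Name: Unit_Weight, dtype: int64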
Thank you for your help and the clarification. Very good skill!
– Mark K
2 days ago
I think you need:
df1 = (df.drop_duplicates(['Contract_type', 'ID'])
         .set_index('Contract_type')['Unit_Weight']
         .sum(level=0)
         .reset_index())
print(df1)
Contract_type Unit_Weight
0 Other 335
1 Type-I 119
2 Type-II 642
3 Type-III 165
4 Part-time 330
5 Full-time 740
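Note that Series.sum(level=0) has since been deprecated and removed in newer pandas versions; an equivalent sketch using an explicit groupby (assuming a reasonably recent pandas) would be:
df1 = (df.drop_duplicates(['Contract_type', 'ID'])            # keep each (Contract_type, ID) pair once
         .groupby('Contract_type', sort=False)['Unit_Weight'] # sort=False preserves order of first appearance
         .sum()
         .reset_index())
print(df1)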
Thank you. It's the perfect answer!
– Mark K
2 days ago
This answer won't work since x.unique() will apply on the Unit_Weight column, which means that two rows with different IDs but the same Contract_type and the same Unit_Weight will only be accounted for once.
– Zuma
2 days ago
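For illustration, the earlier (since-edited) version of the answer presumably did something like this sketch (a hypothetical reconstruction, not the exact original code):
print(df.groupby('Contract_type').Unit_Weight.apply(lambda x: x.unique().sum()))
On this sample data it yields Part-time 165 instead of 330, because the two distinct Part-time IDs (342225 and 4589) share the same weight 165, and unique() collapses them into one value.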
@jezrael, sir, the Part-time: 165 from your answer is wrong. Could you please have a look?
– Mark K
2 days ago
@jezrael, there are 2 unique IDs, which shall give 330.
– Mark K
2 days ago
@MarkK - Now I understand, we need to check dupes for both columns. Check the edited answer; it's similar to Alexandre Uzan's solution.
– jezrael
2 days ago