Python, Pandas, sum only taking uniques


Python, Pandas, sum only taking uniques



An Excel spreadsheet like below (note: ID the column A has duplicated values). I want to find out sum of each Contract_type, taking each ID is counted once only (unique).



enter image description here


data = {'ID': ["380689","380689","480562","480562","480562","14805","47089","56251","56251","56251","322624","322624","322624","85964","85964","85964","342225","342225","4589","23591","23591","235225"],
'Contract_type' : ["Other","Other","Type-I","Type-I","Type-I","Type-II","Type-II","Type-II","Type-II","Type-II","Type-II","Type-II","Type-II","Type-III","Type-III","Type-III","Part-time","Part-time","Part-time","Full-time","Full-time","Full-time"],
'Unit_Weight': [335,335,119,119,119,119,52,452,452,452,19,19,19,165,165,165,165,165,165,724,724,16],
'Test_time' : ["16:26","07:39","18:48","22:32","03:54","03:30","09:57","18:52","19:03","18:06","18:52","03:51","04:00","22:02","13:35","13:43","10:29","06:30","12:20","12:52","17:30","13:10"],
'Tested' : [1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0],
'Internal' : [1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1]}

df = pd.DataFrame(data)



I tried:


print pd.pivot_table(df, index = ["Contract_type", "ID"]).Unit_Weight



It gives:


Contract_type ID
Full-time 23591 724
235225 16
Other 380689 335
....



But I only want it to show something like: Full-time 740 etc.



I also tried:


print pd.pivot_table(df, index = ["Contract_type"], values=["Unit_Weight"], aggfunc = np.sum)



It gives:


Full-time 1464 # this is not considering the duplicated IDs



What's the right way to right the line? Thank you.





What is expected output?
– jezrael
2 days ago





Please, no data as images, as always...
– SpghttCd
2 days ago





@jezrael, hello sir. like Full-time 740, Other 335, Part-time 330, Type-I 119, Type-II 642, Type-III 165
– Mark K
2 days ago





Why is other -> 335 and not 670?
– SpghttCd
2 days ago





Ah, that's my misunderstanding, thanks for clarifying
– SpghttCd
2 days ago




2 Answers
2



It looks like you want to only take each (ID, Contract-Type) couple into account only once, so I don't think df.groupby(['Contract_type', 'ID]).Unit_Weight.sum() would work.


df.groupby(['Contract_type', 'ID]).Unit_Weight.sum()



Instead you can try:


df.drop_duplicates(['Contract_type', 'ID']).groupby('Contract_type').Unit_Weight.sum()





thank you for your help, and clarification. very good skill!
– Mark K
2 days ago



I think need:


df1 = (df.drop_duplicates(['Contract_type', 'ID'])
.set_index('Contract_type')['Unit_Weight']
.sum(level=0)
.reset_index())

print (df1)
Contract_type Unit_Weight
0 Other 335
1 Type-I 119
2 Type-II 642
3 Type-III 165
4 Part-time 330
5 Full-time 740





thank you. it's the perfect answer!
– Mark K
2 days ago





This answer won't work since x.unique() will apply on the Unit_weight column, which means that two rows with different IDs but the same contract_type and same unit_weight will only be accounted for once
– Zuma
2 days ago


x.unique()





@jezrael, sir, the Part-time: 165 from your answer is wrong. could you please have a look?
– Mark K
2 days ago





@jezrael, there are 2 unique IDs shall give 330.
– Mark K
2 days ago





@MarkK - Now understand, need check dupes for both columns. Check edited answer, similar by Alexandre Uzan solution.
– jezrael
2 days ago







By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Comments

Popular posts from this blog

paramiko-expect timeout is happening after executing the command

Export result set on Dbeaver to CSV

The forked VM terminated without saying properly goodbye. VM crash or System.exit called