Python, Pandas, sum only taking uniques

An Excel spreadsheet like below (note: ID the column A has duplicated values). I want to find out sum of each Contract_type, taking each ID is counted once only (unique).

enter image description here

data = {'ID': ["380689","380689","480562","480562","480562","14805","47089","56251","56251","56251","322624","322624","322624","85964","85964","85964","342225","342225","4589","23591","23591","235225"], 'Contract_type' : ["Other","Other","Type-I","Type-I","Type-I","Type-II","Type-II","Type-II","Type-II","Type-II","Type-II","Type-II","Type-II","Type-III","Type-III","Type-III","Part-time","Part-time","Part-time","Full-time","Full-time","Full-time"], 'Unit_Weight': [335,335,119,119,119,119,52,452,452,452,19,19,19,165,165,165,165,165,165,724,724,16], 'Test_time' : ["16:26","07:39","18:48","22:32","03:54","03:30","09:57","18:52","19:03","18:06","18:52","03:51","04:00","22:02","13:35","13:43","10:29","06:30","12:20","12:52","17:30","13:10"], 'Tested' : [1,1,1,1,1,0,0,0,0,0,0,0,0,1,1,1,1,1,1,0,0,0], 'Internal' : [1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,1]} df = pd.DataFrame(data)

I tried:

print pd.pivot_table(df, index = ["Contract_type", "ID"]).Unit_Weight

It gives:

Contract_type ID Full-time 23591 724 235225 16 Other 380689 335 ....

But I only want it to show something like: Full-time 740 etc.

I also tried:

print pd.pivot_table(df, index = ["Contract_type"], values=["Unit_Weight"], aggfunc = np.sum)

It gives:

Full-time 1464 # this is not considering the duplicated IDs

What's the right way to right the line? Thank you.

What is expected output?
– jezrael
2 days ago

Please, no data as images, as always...
– SpghttCd
2 days ago

@jezrael, hello sir. like Full-time 740, Other 335, Part-time 330, Type-I 119, Type-II 642, Type-III 165
– Mark K
2 days ago

Why is other -> 335 and not 670?
– SpghttCd
2 days ago

Ah, that's my misunderstanding, thanks for clarifying
– SpghttCd
2 days ago

2 Answers
2

It looks like you want to only take each (ID, Contract-Type) couple into account only once, so I don't think df.groupby(['Contract_type', 'ID]).Unit_Weight.sum() would work.

df.groupby(['Contract_type', 'ID]).Unit_Weight.sum()

Instead you can try:

df.drop_duplicates(['Contract_type', 'ID']).groupby('Contract_type').Unit_Weight.sum()

thank you for your help, and clarification. very good skill!
– Mark K
2 days ago

I think need:

df1 = (df.drop_duplicates(['Contract_type', 'ID']) .set_index('Contract_type')['Unit_Weight'] .sum(level=0) .reset_index()) print (df1) Contract_type Unit_Weight 0 Other 335 1 Type-I 119 2 Type-II 642 3 Type-III 165 4 Part-time 330 5 Full-time 740

thank you. it's the perfect answer!
– Mark K
2 days ago

This answer won't work since x.unique() will apply on the Unit_weight column, which means that two rows with different IDs but the same contract_type and same unit_weight will only be accounted for once
– Zuma
2 days ago

x.unique()

@jezrael, sir, the Part-time: 165 from your answer is wrong. could you please have a look?
– Mark K
2 days ago

@jezrael, there are 2 unique IDs shall give 330.
– Mark K
2 days ago

@MarkK - Now understand, need check dupes for both columns. Check edited answer, similar by Alexandre Uzan solution.
– jezrael
2 days ago

By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Search This Blog

Mgiyuk