Removing duplicates from list of dictionaries with multiple key/value pairs by comparing on some of the values [on hold]



I'm having some trouble getting this done. I could go for a fully functional approach with nested for loops, but that easily gets out of hand (O(n^2)). I have a list of dictionaries (think phonebook, even if it isn't), with each dict having keys and values of this form:


{'mat':'name_of_the_material', 'tex': [list_of_textures], 'geo':[list_of_geo]}



The goal is to reduce the list of dictionaries by matching on the first two keys ('mat', 'tex') and combining the third list when de-duplicating. So ('mat', 'tex') pairs wouldn't repeat, and each geo list would be the merge of the geo lists of the N matching items from the original list of dicts.


And you ultimately want to end up with one dictionary?
– doctorlove
Jun 29 at 8:43





Give your input and expected output.
– sachin dubey
Jun 29 at 8:43





Can you give more examples?
– Vikas Damodar
Jun 29 at 8:44





It's not clear what you want to do with tex if they aren't identical? Does that need merging too?
– doctorlove
Jun 29 at 8:49






3 Answers



The most efficient way is to build a new dictionary to act as an index, and then run through the list updating the dictionary as you go. This code will modify the original dictionaries, but you can easily avoid that by copying on insert if you wish. It's not fully functional.


def make_key(item):
    # sort the texture list so ordering doesn't affect the key
    tex = sorted(item['tex'])
    return (item['mat'], ','.join(tex))

index = {}
for item in list_of_dict:
    key = make_key(item)
    item2 = index.get(key)
    if item2:
        item2['geo'] += item['geo']  # merge geo lists of duplicates
    else:
        index[key] = item



This code assumes all the fields are always present.
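If you'd rather not modify the originals, a minimal sketch of the copy-on-insert variant mentioned above (assuming the same make_key helper and list_of_dict input) might look like this:

index = {}
for item in list_of_dict:
    key = make_key(item)
    if key in index:
        index[key]['geo'] += item['geo']
    else:
        # shallow-copy the dict and its geo list so the input items stay untouched
        copied = dict(item)
        copied['geo'] = list(item['geo'])
        index[key] = copied

deduped = list(index.values())  # merged, de-duplicated result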



You should just use ('mat', 'tex') as keys in a dictionary and aggregate the values like so:




from collections import defaultdict

l = [
    {'mat': '0', 'tex': ['a', 'b'], 'geo': ['A', 'B', 'C']},
    {'mat': '0', 'tex': ['a', 'b'], 'geo': ['C', 'D']},
    {'mat': '1', 'tex': ['b', 'c'], 'geo': ['A', 'C']}
]

d = defaultdict(list)
for item in l:
    mat = item['mat']
    tex = tuple(item['tex'])  # make it a tuple because a list is not a hashable type
    geo = item['geo']

    d[(mat, tex)] += geo

d = dict(d)
print(d)

>>> {('0', ('a', 'b')): ['A', 'B', 'C', 'C', 'D'], ('1', ('b', 'c')): ['A', 'C']}

final_list = []
for key, value in d.items():
    final_list.append({
        'mat': key[0],
        'tex': list(key[1]),
        'geo': value
    })
print(final_list)

>>> [{'mat': '0', 'tex': ['a', 'b'], 'geo': ['A', 'B', 'C', 'C', 'D']}, {'mat': '1', 'tex': ['b', 'c'], 'geo': ['A', 'C']}]



Also, if the order of the values in 'tex' doesn't matter, you can sort the list before making it a tuple.





You can also use sets instead of lists for geo if you don't want duplicate values; a sketch combining both tweaks follows.
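For instance, a minimal sketch applying both tweaks (an order-insensitive key plus set-valued geo), reusing the l list from the snippet above, might be:

from collections import defaultdict

d = defaultdict(set)
for item in l:
    key = (item['mat'], tuple(sorted(item['tex'])))  # sorted so tex order doesn't matter
    d[key].update(item['geo'])                       # a set drops duplicate geo values

# e.g. {('0', ('a', 'b')): {'A', 'B', 'C', 'D'}, ('1', ('b', 'c')): {'A', 'C'}}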





I'd use a dictionary with dict.setdefault:




L = [{'mat': 'Mat1', 'tex': [1, 2], 'geo': [3, 4]},
     {'mat': 'Mat2', 'tex': [1, 2], 'geo': [5, 6]},
     {'mat': 'Mat1', 'tex': [1, 2], 'geo': [7, 8]},
     {'mat': 'Mat2', 'tex': [3, 4], 'geo': [9, 10]},
     {'mat': 'Mat2', 'tex': [1, 2], 'geo': [11, 12]}]

reduced = {}

for entry in L:
    mat = reduced.setdefault(entry['mat'], {})     # nested dict per material
    geo = mat.setdefault(tuple(entry['tex']), [])  # one geo list per (mat, tex)
    geo.extend(entry['geo'])

from pprint import pprint
pprint(reduced)



Output:


{'Mat1': {(1, 2): [3, 4, 7, 8]},
'Mat2': {(1, 2): [5, 6, 11, 12], (3, 4): [9, 10]}}
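If you need the result back in the question's list-of-dicts shape, a small follow-up sketch (assuming the reduced dict built above) could be:

merged = [{'mat': mat, 'tex': list(tex), 'geo': geo}
          for mat, tex_map in reduced.items()
          for tex, geo in tex_map.items()]
# e.g. [{'mat': 'Mat1', 'tex': [1, 2], 'geo': [3, 4, 7, 8]}, ...]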
