How to assign a unique ID to detect repeated rows in a pandas dataframe?





I am working with a large pandas dataframe, with several columns pretty much like this:


A      B       C  D

John   Tom     0  1
Homer  Bart    2  3
Tom    Maggie  1  4
Lisa   John    5  0
Homer  Bart    2  3
Lisa   John    5  0
Homer  Bart    2  3
Homer  Bart    2  3
Tom    Maggie  1  4



How can I assign a unique ID to each repeated row? For example:


A      B       C  D    new_id

John   Tom     0  1.2  1
Homer  Bart    2  3.0  2
Tom    Maggie  1  4.2  3
Lisa   John    5  0    4
Homer  Bart    2  3    5
Lisa   John    5  0    4
Homer  Bart    2  3.0  2
Homer  Bart    2  3.0  2
Tom    Maggie  1  4.1  6



I know that I can use duplicated to detect the duplicated rows; however, I cannot visualize where those rows are repeating. I tried:




df.assign(id=(df.columns).astype('category').cat.codes)
df



However, it is not working. How can I get a unique ID for detecting groups of duplicated rows?





In your example, I believe new_id should be 1, 2, 3, 4, 2, etc ?
– jpp
Jun 29 at 22:46








Possible duplicate of How to label same pandas dataframe rows?
– RafaelC
Jun 29 at 23:14




2 Answers



For small dataframes, you can convert your rows to tuples, which can be hashed, and then use pd.factorize.




df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1



groupby is more efficient for larger dataframes:




df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
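Both one-liners above can be checked against the question's data. A minimal sketch (assuming the integer D values from the input table) shows they produce the same IDs, numbered in order of first appearance:

```python
import pandas as pd

# The question's data (assuming integer D values)
df = pd.DataFrame({
    'A': ['John', 'Homer', 'Tom', 'Lisa', 'Homer', 'Lisa', 'Homer', 'Homer', 'Tom'],
    'B': ['Tom', 'Bart', 'Maggie', 'John', 'Bart', 'John', 'Bart', 'Bart', 'Maggie'],
    'C': [0, 2, 1, 5, 2, 5, 2, 2, 1],
    'D': [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# Tuple + factorize: each row becomes a hashable tuple, and factorize
# assigns 0-based codes in order of first appearance, so +1 starts at 1.
ids_factorize = pd.factorize(df.apply(tuple, axis=1))[0] + 1

# groupby + ngroup with sort=False: same numbering, without building
# a tuple per row.
ids_groupby = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1

print(ids_factorize.tolist())  # [1, 2, 3, 4, 2, 4, 2, 2, 3]
print(ids_groupby.tolist())    # [1, 2, 3, 4, 2, 4, 2, 2, 3]
```

Note how the repeated rows (Homer/Bart and Lisa/John) all receive the ID of their first occurrence.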



Group by the columns you are trying to find duplicates over and use ngroup:




df['new_id'] = df.groupby(['A','B','C','D']).ngroup()
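Note that ngroup is 0-based here and, without sort=False, numbers the groups by sorted group key rather than by order of first appearance. A quick sketch on the question's data (assuming integer D values) shows the resulting labels:

```python
import pandas as pd

# The question's data (assuming integer D values)
df = pd.DataFrame({
    'A': ['John', 'Homer', 'Tom', 'Lisa', 'Homer', 'Lisa', 'Homer', 'Homer', 'Tom'],
    'B': ['Tom', 'Bart', 'Maggie', 'John', 'Bart', 'John', 'Bart', 'Bart', 'Maggie'],
    'C': [0, 2, 1, 5, 2, 5, 2, 2, 1],
    'D': [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# With the default sort=True, groups are numbered by sorted key:
# Homer→0, John→1, Lisa→2, Tom→3.
df['new_id'] = df.groupby(['A', 'B', 'C', 'D']).ngroup()
print(df['new_id'].tolist())  # [1, 0, 3, 2, 0, 2, 0, 0, 3]
```

Duplicated rows still share an ID; pass sort=False and add 1 if you want IDs that start at 1 in order of first appearance.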





