How to assign a unique ID to detect repeated rows in a pandas dataframe?





I am working with a large pandas dataframe, with several columns pretty much like this:


A      B       C  D

John   Tom     0  1
Homer  Bart    2  3
Tom    Maggie  1  4
Lisa   John    5  0
Homer  Bart    2  3
Lisa   John    5  0
Homer  Bart    2  3
Homer  Bart    2  3
Tom    Maggie  1  4



How can I assign a unique ID to each repeated row? For example:


A      B       C  D    new_id

John   Tom     0  1.2  1
Homer  Bart    2  3.0  2
Tom    Maggie  1  4.2  3
Lisa   John    5  0    4
Homer  Bart    2  3    5
Lisa   John    5  0    4
Homer  Bart    2  3.0  2
Homer  Bart    2  3.0  2
Tom    Maggie  1  4.1  6



I know that I can use duplicated to detect the duplicated rows; however, I cannot visualize where those rows are repeating. I tried:




df.assign(id=(df.columns).astype('category').cat.codes)
df



However, it is not working. How can I get a unique ID for detecting groups of duplicated rows?





In your example, I believe new_id should be 1, 2, 3, 4, 2, etc ?
– jpp
Jun 29 at 22:46








Possible duplicate of How to label same pandas dataframe rows?
– RafaelC
Jun 29 at 23:14




2 Answers



For small dataframes, you can convert your rows to tuples, which can be hashed, and then use pd.factorize.




df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1



groupby is more efficient for larger dataframes:




df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
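Both one-liners above can be checked against the question's data. A minimal sketch (assuming the integer D values from the input table) shows they produce the same IDs, numbered in order of first appearance:

```python
import pandas as pd

# The question's data (assuming integer D values)
df = pd.DataFrame({
    'A': ['John', 'Homer', 'Tom', 'Lisa', 'Homer', 'Lisa', 'Homer', 'Homer', 'Tom'],
    'B': ['Tom', 'Bart', 'Maggie', 'John', 'Bart', 'John', 'Bart', 'Bart', 'Maggie'],
    'C': [0, 2, 1, 5, 2, 5, 2, 2, 1],
    'D': [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# Tuple + factorize: each row becomes a hashable tuple, and factorize
# assigns 0-based codes in order of first appearance, so +1 starts at 1.
ids_factorize = pd.factorize(df.apply(tuple, axis=1))[0] + 1

# groupby + ngroup with sort=False: same numbering, without building
# a tuple per row.
ids_groupby = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1

print(ids_factorize.tolist())  # [1, 2, 3, 4, 2, 4, 2, 2, 3]
print(ids_groupby.tolist())    # [1, 2, 3, 4, 2, 4, 2, 2, 3]
```

Note how the repeated rows (Homer/Bart and Lisa/John) all receive the ID of their first occurrence.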



Group by the columns you are trying to find duplicates over and use ngroup:




df['new_id'] = df.groupby(['A','B','C','D']).ngroup()
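Note that ngroup is 0-based here and, without sort=False, numbers the groups by sorted group key rather than by order of first appearance. A quick sketch on the question's data (assuming integer D values) shows the resulting labels:

```python
import pandas as pd

# The question's data (assuming integer D values)
df = pd.DataFrame({
    'A': ['John', 'Homer', 'Tom', 'Lisa', 'Homer', 'Lisa', 'Homer', 'Homer', 'Tom'],
    'B': ['Tom', 'Bart', 'Maggie', 'John', 'Bart', 'John', 'Bart', 'Bart', 'Maggie'],
    'C': [0, 2, 1, 5, 2, 5, 2, 2, 1],
    'D': [1, 3, 4, 0, 3, 0, 3, 3, 4],
})

# With the default sort=True, groups are numbered by sorted key:
# Homer→0, John→1, Lisa→2, Tom→3.
df['new_id'] = df.groupby(['A', 'B', 'C', 'D']).ngroup()
print(df['new_id'].tolist())  # [1, 0, 3, 2, 0, 2, 0, 0, 3]
```

Duplicated rows still share an ID; pass sort=False and add 1 if you want IDs that start at 1 in order of first appearance.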





