How to assign a unique ID to detect repeated rows in a pandas dataframe?
I am working with a large pandas DataFrame with several columns, pretty much like this:
A      B       C  D
John   Tom     0  1
Homer  Bart    2  3
Tom    Maggie  1  4
Lisa   John    5  0
Homer  Bart    2  3
Lisa   John    5  0
Homer  Bart    2  3
Homer  Bart    2  3
Tom    Maggie  1  4
How can I assign a unique id to each repeated row, so that all identical rows share the same id? For example:
A      B       C  D  new_id
John   Tom     0  1  1
Homer  Bart    2  3  2
Tom    Maggie  1  4  3
Lisa   John    5  0  4
Homer  Bart    2  3  2
Lisa   John    5  0  4
Homer  Bart    2  3  2
Homer  Bart    2  3  2
Tom    Maggie  1  4  3
I know that I can use duplicated to detect the duplicated rows, but I cannot see which rows are repeats of each other. I tried:

df.assign(id=(df.columns).astype('category').cat.codes)

However, it is not working. How can I get a unique new_id for each group of duplicated rows?
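For reference, a minimal sketch that rebuilds the example frame (values copied from the table above) and shows why duplicated alone is not enough here: it flags repeats but gives no group id.

import pandas as pd

# Rebuild the example frame from the table above.
df = pd.DataFrame(
    [('John', 'Tom', 0, 1), ('Homer', 'Bart', 2, 3), ('Tom', 'Maggie', 1, 4),
     ('Lisa', 'John', 5, 0), ('Homer', 'Bart', 2, 3), ('Lisa', 'John', 5, 0),
     ('Homer', 'Bart', 2, 3), ('Homer', 'Bart', 2, 3), ('Tom', 'Maggie', 1, 4)],
    columns=['A', 'B', 'C', 'D'])

# duplicated() only flags repeated rows; it does not say which rows belong together.
print(df.duplicated(keep=False).tolist())
# [False, True, True, True, True, True, True, True, True]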
Possible duplicate of How to label same pandas dataframe rows? – RafaelC Jun 29 at 23:14
2 Answers
For small dataframes, you can convert your rows to tuples, which can be hashed, and then use pd.factorize:

df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1

groupby is more efficient for larger dataframes:

df['new_id'] = df.groupby(df.columns.tolist(), sort=False).ngroup() + 1
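As a rough check, both one-liners give the same labels on the question's data. This is a minimal sketch: the frame is rebuilt from the tables above, and the groupby key columns are spelled out so the freshly added new_id column is not included in the key.

import pandas as pd

df = pd.DataFrame(
    [('John', 'Tom', 0, 1), ('Homer', 'Bart', 2, 3), ('Tom', 'Maggie', 1, 4),
     ('Lisa', 'John', 5, 0), ('Homer', 'Bart', 2, 3), ('Lisa', 'John', 5, 0),
     ('Homer', 'Bart', 2, 3), ('Homer', 'Bart', 2, 3), ('Tom', 'Maggie', 1, 4)],
    columns=['A', 'B', 'C', 'D'])

# Row tuples are hashable, so factorize can label them; +1 makes the ids start at 1.
df['new_id'] = pd.factorize(df.apply(tuple, axis=1))[0] + 1
print(df['new_id'].tolist())  # [1, 2, 3, 4, 2, 4, 2, 2, 3]

# Same labels via groupby/ngroup; the key columns are listed explicitly here
# so the new_id column just added is not part of the grouping key.
ids = df.groupby(['A', 'B', 'C', 'D'], sort=False).ngroup() + 1
print(ids.tolist())           # [1, 2, 3, 4, 2, 4, 2, 2, 3]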
Group by the columns you are trying to find duplicates over and use ngroup:

df['new_id'] = df.groupby(['A','B','C','D']).ngroup()
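Note that ngroup() numbers groups from 0 and, with the default sort=True, in sorted key order rather than order of appearance; pass sort=False and add 1 if you want ids like the expected output in the question. A minimal sketch on a cut-down version of the example data:

import pandas as pd

df = pd.DataFrame({'A': ['John', 'Homer', 'Homer'],
                   'B': ['Tom', 'Bart', 'Bart'],
                   'C': [0, 2, 2],
                   'D': [1, 3, 3]})

# Default: groups numbered from 0 in sorted key order ('Homer' sorts before 'John').
print(df.groupby(['A', 'B', 'C', 'D']).ngroup().tolist())                      # [1, 0, 0]

# First-appearance order, starting at 1, as in the question's expected output.
print(df.groupby(['A', 'B', 'C', 'D'], sort=False).ngroup().add(1).tolist())   # [1, 2, 2]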
In your example, I believe new_id should be 1, 2, 3, 4, 2, etc? – jpp Jun 29 at 22:46