Using str_split and removing duplicate values from the same record in R


Using str_split and removing duplicate values from the same record in R



I have a data frame of book genres. It began as two columns, one for the title and one for a character string containing multiple genres, similar to this:


titles <- c("Harry Potter 1", "To Kill A Mockingbird", "The Hunger Games 1")
genres <- c("Fantasy, Young Adult, Fantasy, Magic", "Classics, Fiction, Historical, Historical Fiction, Academic", "Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction")
books <- tibble(
title = titles,
genre = genres)
books

# A tibble: 3 x 2
title genre
<chr> <chr>
1 Harry Potter 1 Fantasy, Young Adult, Fantasy, Magic
2 To Kill A Mockingbird Classics, Fiction, Historical, Historical Fiction, Academic
3 The Hunger Games 1 Young Adult, Fiction, Science Fiction, Dystopia, Science Fiction



Currently, the genres are in order of how many people categorized them as that genre. I'd like to split the string of genres into multiple columns indicating a primary genre, a secondary genre, etc., but remove duplicates. Splitting the genres into multiple columns is easy enough, and I'm sure there is some way to get a function like unique() to work row-wise and omit duplicates, but I'm stuck. The desired output would be like this:


# A tibble: 3 x 6
title genre1 genre2 genre3 genre4 genre5
<chr> <chr> <chr> <chr> <chr> <chr>
1 Harry Potter and the Sorcerors Stone Fantasy Young Adult Magic NA NA
2 To Kill A Mockingbird Classics Fiction Historical Historical Fiction Academic
3 The Hunger Games Young Adult Fiction Science Fiction Dystopia NA




5 Answers
5



Here a solution using data.table and base R.


data.table


library(data.table)
setDT(books)

books = unique(books[, strsplit(genre, ", "), by = title])
books[, genre:= paste0("genre_", seq_along(V1)), by = title]
dcast(books, title ~ genre, value.var = "V1")
# title genre_1 genre_2 genre_3 genre_4 genre_5
# 1: Harry Potter 1 Fantasy Young Adult Magic <NA> <NA>
# 2: The Hunger Games 1 Young Adult Fiction Science Fiction Dystopia <NA>
# 3: To Kill A Mockingbird Classics Fiction Historical Historical Fiction Academic





@Ryan Thanks for adding the missing line! I forgot to copy the part of setting up the data structure into SO example.
– Manuel Bickel
Jun 29 at 21:01



You can use separate with a step to remove non-unique genres before using separate.


separate


separate


library(tidyverse)
library(magrittr)

#remove non-unique genres
books %<>% mutate(genre = map(str_split(genre, ', '), ~ paste(unique(.x), collapse = ',')))

#separate into columns
books %>%
separate(col = 2, into = paste0('genre', seq(max(str_count(books$genre, ',')) + 1L))
, sep = ',')


# # A tibble: 3 x 6
# title genre1 genre2 genre3 genre4 genre5
# <chr> <chr> <chr> <chr> <chr> <chr>
# 1 Harry Potter 1 Fantasy Young Adult Magic NA NA
# 2 To Kill A Mockingbird Classics Fiction Historical Historical Fiction Academic
# 3 The Hunger Games 1 Young Adult Fiction Science Fiction Dystopia NA





The only problem with using separate is that you hard-code the maximum number of genres as 5, whereas you actually might need to scale either higher or lower
– camille
Jun 29 at 20:58


separate





@camille Good point. I edited my answer so the number of columns isn't hard-coded.
– Ryan
Jun 29 at 20:59






slight performance gain with max(str_count(books$genre, ','))+1 instead of max(str_count(books$genre, ',') + 1)
– useR
Jun 29 at 21:03


max(str_count(books$genre, ','))+1


max(str_count(books$genre, ',') + 1)



We can paste the columns together and use data.table::fread magic, then rename our fields.


data.table::fread


library(data.table)
dt <- fread(paste(books$title, books$genre, sep=", ",collapse="n"),header = FALSE,fill=TRUE,sep=",")
setNames(as.data.frame(dt),c("title",paste0("genre",seq(ncol(dt)-1))))
# title genre1 genre2 genre3 genre4 genre5
# 1 Harry Potter 1 Fantasy Young Adult Fantasy Magic
# 2 To Kill A Mockingbird Classics Fiction Historical Historical Fiction Academic
# 3 The Hunger Games 1 Young Adult Fiction Science Fiction Dystopia Science Fiction



You can do this with stringr::str_split to make a list-column of genres. genre will become a list of character vectors, which you can then unnest, then take distinct observations.


stringr::str_split


genre


library(tidyverse)

books %>%
mutate(genre = str_split(genre, ", ")) %>%
unnest(genre) %>%
distinct()
#> # A tibble: 12 x 2
#> title genre
#> <chr> <chr>
#> 1 Harry Potter 1 Fantasy
#> 2 Harry Potter 1 Young Adult
#> 3 Harry Potter 1 Magic
#> 4 To Kill A Mockingbird Classics
#> 5 To Kill A Mockingbird Fiction
#> 6 To Kill A Mockingbird Historical
#> 7 To Kill A Mockingbird Historical Fiction
#> 8 To Kill A Mockingbird Academic
#> 9 The Hunger Games 1 Young Adult
#> 10 The Hunger Games 1 Fiction
#> 11 The Hunger Games 1 Science Fiction
#> 12 The Hunger Games 1 Dystopia



A shortcut here that I always forget about is separate_rows, which does the splitting and unnesting in one step:


separate_rows


books %>%
separate_rows(genre, sep = ", ") %>%
distinct()



is equivalent to the previous block.



To get this to a wide format, you can use tidyr::spread. To make column names "genre1", "genre2", etc. dynamically, I grouped by title and then numbered the unique genres for each title. That way, you don't need to know how many genre columns you need, like you would if you used tidyr::separate to split the column instead.


tidyr::spread


"genre1"


"genre2"


tidyr::separate


books %>%
mutate(genre = str_split(genre, ", ")) %>%
unnest(genre) %>%
distinct() %>%
group_by(title) %>%
mutate(num = row_number() %>% paste0("genre", .)) %>%
spread(key = num, value = genre)
#> # A tibble: 3 x 6
#> # Groups: title [3]
#> title genre1 genre2 genre3 genre4 genre5
#> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 Harry Potter 1 Fantasy Young Adult Magic <NA> <NA>
#> 2 The Hunger Games 1 Young Adult Fiction Science … Dystopia <NA>
#> 3 To Kill A Mockingbird Classics Fiction Historic… Historic… Acade…



Created on 2018-06-29 by the reprex package (v0.2.0).



Its always better to work on data in long-format. Hence, one option is to change data in long format using tidyr::gather and then remove duplicate before converting data back to wide-format.


tidyr::gather


wide-format


library(tidyverse)
library(splitstackshape)

books %>% cSplit("genre") %>% mutate_if(is.factor, as.character) %>%
gather(key, value, - title) %>% distinct(title, value) %>%
group_by(title) %>%
mutate(key = paste0("genre",row_number())) %>%
spread(key, value) %>% as.data.frame()

# title genre1 genre2 genre3 genre4 genre5
# 1 Harry Potter 1 Fantasy Young Adult Magic <NA> <NA>
# 2 The Hunger Games 1 Young Adult Fiction Science Fiction Dystopia <NA>
# 3 To Kill A Mockingbird Classics Fiction Historical Historical Fiction Academic






By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.

Comments

Popular posts from this blog

paramiko-expect timeout is happening after executing the command

Opening a url is failing in Swift

Export result set on Dbeaver to CSV