Sum the total number of strings separated by comma [duplicate]







structure(list(Other = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_),
Years = c("2005, 2005, 2006, 2006, 2007", "2011, 2014",
"2007", "2011, 2011, 2011, 2012, 2012, 2012",
"2006, 2006, 2012, 2012, 2015")),
.Names = c("Other", "Years"), row.names = 1:4, class = "data.frame")



Given the above data frame, the second column holds years separated by commas. I'd like to create a new column containing the number of years in each element of that column, so that the final data frame looks like this:


structure(list(Other = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_),
Years = c("2005, 2005, 2006, 2006, 2007","2011, 2014", "2007",
"2011, 2011, 2011, 2012, 2012, 2012",
"2006, 2006, 2012, 2012, 2015"),
yearlength = c(5, 2, 1, 6, 5)),
.Names = c("Other", "Years", "yearlength"), row.names = 1:4, class = "data.frame")



I've tried a solution such as stack$yearlength <- count.fields(textConnection(stack), sep = ",") but I can't quite get it to work.









2 Answers



One approach is to count the commas and add 1:




df$yearlength <- stringr::str_count(df$Years, ",")+1
df
#output
  Other                              Years yearlength
1  <NA>       2005, 2005, 2006, 2006, 2007          5
2  <NA>                         2011, 2014          2
3  <NA>                               2007          1
4  <NA> 2011, 2011, 2011, 2012, 2012, 2012          6
5  <NA>       2006, 2006, 2012, 2012, 2015          5



Another would be to count the runs of digits:


df$yearlength <- stringr::str_count(df$Years, "\\d+")



A third option (thanks to Sotos's comment) would be to count the words:


stringi::stri_count_words(df$Years)



or


stringr::str_count(df$Years, "\\w+")



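A fourth option is to count the runs of non-space characters:


stringr::str_count(df$Years, "\\S+")

# all.equal() compares only two objects at a time,
# so check each result against the first one
counts <- list(stringr::str_count(df$Years, ",") + 1,
               stringr::str_count(df$Years, "\\d+"),
               stringi::stri_count_words(df$Years),
               stringr::str_count(df$Years, "\\w+"),
               stringr::str_count(df$Years, "\\S+"))
all(sapply(counts[-1], function(x) isTRUE(all.equal(counts[[1]], x))))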
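If adding stringr or stringi as a dependency is not an option, the first two counts can be sketched in base R as well. This is only an illustration, not part of the answer as originally posted; years is a hypothetical stand-in for df$Years, and the sketch assumes the vector contains no NA values.

```r
# Hypothetical stand-in for df$Years (assumed to contain no NA values)
years <- c("2005, 2005, 2006, 2006, 2007", "2011, 2014", "2007",
           "2011, 2011, 2011, 2012, 2012, 2012",
           "2006, 2006, 2012, 2012, 2015")

# Count commas: string length minus the length with commas removed, plus 1
by_commas <- nchar(years) - nchar(gsub(",", "", years, fixed = TRUE)) + 1

# Count runs of digits: gregexpr() locates the matches,
# regmatches() extracts them, lengths() counts them per element
by_digits <- lengths(regmatches(years, gregexpr("[0-9]+", years)))

by_commas
by_digits
```

Both vectors should agree for well-formed input; they would diverge on strings with a trailing comma or an empty field, which is worth keeping in mind when choosing between them.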



EDIT: when an NA is present in the data set:


df[3,2] <- NA



All of the above solutions produce:
#output
5 2 NA 6 5



To change NA to 0:


df$yearlength[is.na(df$yearlength)] <- 0
df
#output
  Other                              Years yearlength
1  <NA>       2005, 2005, 2006, 2006, 2007          5
2  <NA>                         2011, 2014          2
3  <NA>                               <NA>          0
4  <NA> 2011, 2011, 2011, 2012, 2012, 2012          6
5  <NA>       2006, 2006, 2012, 2012, 2015          5
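The count and the NA replacement can also be folded into one step. The sketch below is a base R illustration under stated assumptions: count_items is a hypothetical helper name, not a function from any package, and years stands in for df$Years.

```r
# Hypothetical helper: count comma-separated items, returning 0 for NA
count_items <- function(x) {
  # strsplit() returns a single NA for an NA input, so lengths() reports 1 there
  n <- lengths(strsplit(x, ",", fixed = TRUE))
  # Overwrite the counts that came from NA inputs
  n[is.na(x)] <- 0L
  n
}

# Hypothetical stand-in for df$Years, with one NA
years <- c("2005, 2005, 2006, 2006, 2007", "2011, 2014", NA,
           "2011, 2011, 2011, 2012, 2012, 2012")

count_items(years)
```

With a helper like this, the new column could be created with df$yearlength <- count_items(df$Years), with no separate replacement step.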



Data (since the data in the question is corrupt):


df <- structure(list(Other = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_), Years = c("2005, 2005, 2006, 2006, 2007",
"2011, 2014", "2007", "2011, 2011, 2011, 2012, 2012, 2012", "2006, 2006, 2012, 2012, 2015"
)), .Names = c("Other", "Years"), row.names = 1:5, class = "data.frame")





You could also use stringi and do stringi::stri_count_words
– Sotos
Jun 29 at 9:40







Thanks for the answer. My problem arises when I try to apply it to NA values. It seems to count those as 1 rather than 0.
– WoeIs
Jun 29 at 9:59





None of the proposed solutions count them; for example, stringr::str_count(df$Years, "\\w+") produces NA in their place. See the edit for how to replace NA with 0.
– missuse
Jun 29 at 10:06






You can split each string on the comma and then take the length of the resulting vector.


> sapply(strsplit(xy$Years, ","), length)
[1] 5 2 1 6 5



Added to account for an NA (example from @missuse):


xy <- structure(list(Other = c(NA_character_, NA_character_, NA_character_,
NA_character_, NA_character_), Years = c("2005, 2005, 2006, 2006, 2007",
"2011, 2014", "2007", "2011, 2011, 2011, 2012, 2012, 2012", "2006, 2006, 2012, 2012, 2015"
)), .Names = c("Other", "Years"), row.names = 1:5, class = "data.frame")

xy[3, 2] <- NA

sapply(strsplit(xy$Years, ","), FUN = function(x) {
length(na.omit(x))
})

[1] 5 2 0 6 5
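Because the anonymous function receives each split vector in turn, it can do more than count. As a purely illustrative sketch (counting distinct years is an assumption about what one might want, not something the question asks for), the version below uses vapply so the result is always an integer vector; years is a hypothetical stand-in for xy$Years.

```r
# Hypothetical stand-in for xy$Years, with one NA
years <- c("2005, 2005, 2006, 2006, 2007", "2011, 2014", NA,
           "2011, 2011, 2011, 2012, 2012, 2012")

# Split on commas, drop NA, trim surrounding spaces, count distinct values
n_distinct <- vapply(strsplit(years, ","), function(x) {
  x <- trimws(x[!is.na(x)])
  length(unique(x))
}, integer(1))

n_distinct
```

Replacing length(unique(x)) with length(x) gives the plain item count again, with NA rows counted as 0.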





or lengths(strsplit(xy$Years, ","))
– Jaap
Jun 29 at 9:32







Thanks for the answer. Is there any way to make it not count NA values?
– WoeIs
Jun 29 at 9:54





@WoeIs this is why I wrapped the result into an sapply. Instead of length you can specify an anonymous function where you can process each row/element however you please.
– Roman Luštrik
Jun 29 at 11:21

