Make matrix from two columns of a dataframe and populate it by third without nested for loop
Make matrix from two columns of a dataframe and populate it by third without nested for loop
Let's say I have a dataframe with three of its columns being
> df
A B C
1232 27.3 0.42
1232 27.3 0.36
1232 13.1 0.15
7564 13.1 0.09
7564 13.1 0.63
The required output is:
[1232] [7564]
[13.1] 0.15 0.36
[27.3] 0.39 0
I need to make a matrix with unique values in A and B as my rows and columns. The value for any cell in the matrix is to be calculated by subsetting the original dataframe for the particular value of A and B and calculating the mean of column C.
My code is:
mat <- matrix(rep(0), length(unique(df$A)), nrow = length(sort(unique(df$B))))
# sort is to avoid NA
colnames(mat) <- unique(df$A)
rownames(mat) <- unique(df$B)
for (row in rownames(mat)) {
for (col in colnames(mat)) {
x <- subset(df, A == col & B == row)
mat[row, col] = mean(df$C)
}
}
This is very slow, considering I have to deal with a matrix that has thousands of rows and columns. How can I make this run faster?
3 Answers
3
You can use a combination of aggregate()
and xtabs()
:
aggregate()
xtabs()
df <- read.table(header=TRUE, stringsAsFactors = FALSE, text=
"A B C
1232 27.3 0.42
1232 27.3 0.36
1232 13.1 0.15
7564 13.1 0.09
7564 13.1 0.63")
xtabs(C ~ B + A, data=aggregate(C ~ B + A, data=df, FUN=mean))
# > xtabs(C ~ B + A, data=aggregate(C ~ B + A, data=df, FUN=mean))
# A
# B 1232 7564
# 13.1 0.15 0.36
# 27.3 0.39 0.00
For other solutions read: How to reshape data from long to wide format?
Thank you. Worked like a charm!
– Sachin S Rawat
2 days ago
Tidyverse solution:
library(tidyverse)
df %>%
group_by(A, B) %>%
summarise(C = mean(C)) %>%
spread(A, C)
You probably want something like this: (using data.table)
n <- 1e3
v <- LETTERS[1:5]
set.seed(42)
df <- data.frame(A = sample(v, n, replace = T),
B = sample(v, n, replace = T),
C = sample.int(1e2, n, replace = T))
require(data.table)
dt <- as.data.table(df)
r <- dt[, .(v = mean(C)), keyby = .(A, B)] # calculate mean for each combination
r <- dcast(r, B ~ A, value.var = 'v') # transform to your structure
rmat <- as.matrix(r[, -1]) # to matrix
rownames(rmat) <- r[[1]] # add row names
rmat[1:5, 1:5]
# A B C D E
# A 53.00000 42.71739 53.11538 49.35000 53.14286
# B 50.62745 58.41379 60.43590 48.75000 56.56410
# C 43.75000 42.93548 55.45000 52.63415 44.27907
# D 50.00000 49.84314 57.48276 50.37143 53.16667
# E 43.95122 55.46667 55.38095 43.85366 53.22222
P.S. your posted code was not correct. the loops should be:
for (row in rownames(mat)) {
for (col in colnames(mat)) {
x <- subset(df, A == col & B == row)
mat[row, col] = mean(x$C)
}
}
P.S.S. the loop could be optimized like:
for (row in rownames(mat)) {
for (col in colnames(mat)) {
i <- (df$A == col & df$B == row)
mat[row, col] <- mean(df[i, 'C'])
}
}
By clicking "Post Your Answer", you acknowledge that you have read our updated terms of service, privacy policy and cookie policy, and that your continued use of the website is subject to these policies.
We need a reproducible example to solve your problem because as it is we could answer to a problem that is not what you are facing in real life. Also explicitly present what is the expected solution.
– Seymour
Jun 29 at 7:21