r - Normalizing a multivalued column in a large table -
i'm having issue converting vba code seen in this post r-script.
the problem follows, have column (from source database, not choice) contains multiple values attribute. i'd normalize table , retain order in each value occurs in each cell.
an example dataset:
dat <- data.frame( id = c(1:3), multi = c("val1 val2 val3","val2 val3","val3 val1") ,stringsasfactors=false) id multi 1 1 val1 val2 val3 2 2 val2 val3 3 3 val2 val3 val1
the pseudocode like:
- loop on each row
- split string multi space separator
- for each splitted string, append new row in separate data.frame id, order within total string , value.
the result like:
id order multi 1 1 1 val1 2 1 2 val2 3 1 3 val3 4 2 1 val2 5 2 2 val3 6 3 1 val2 7 3 2 val3 8 3 3 val1
i'm looking @ doing data.frame, i'm thinking data.table more appropriate table have approximately 400.000 of these rows.
i apologize not having code ready, i'm still contemplating whether need use apply family, data.table or simple loop. i'll keep post updated progress make.
here couple of ways...
in base r:
x <- setnames(strsplit(as.character(dat$multi), " "), dat$id) x1 <- stack(x) x1$order <- ave(x1$ind, x1$ind, fun = seq_along) x1 # values ind order # 1 val1 1 1 # 2 val2 1 2 # 3 val3 1 3 # 4 val2 2 1 # 5 val3 2 2 # 6 val2 3 1 # 7 val3 3 2 # 8 val1 3 3
or (better):
x <- strsplit(as.character(dat[, "multi"]), " ", fixed = true) len <- vapply(x, length, 1l) data.frame(id = rep(dat[, "id"], len), order = sequence(len), multi = unlist(x, use.names=false))
using concat.split.multiple
"splitstackshape" package (probably not efficient on 400,000 rows though).
library(splitstackshape) out <- concat.split.multiple(dat, "multi", " ", "long") out[order(out$id, out$time), ] # id time multi # 1 1 1 val1 # 4 1 2 val2 # 7 1 3 val3 # 2 2 1 val2 # 5 2 2 val3 # 8 2 3 <na> # 3 3 1 val2 # 6 3 2 val3 # 9 3 3 val1
and, since requested "data.table":
library(data.table) dt <- data.table(dat) dtl <- dt[, list(unlist(strsplit(as.character(multi), " "))), = id] dtl[, order := sequence(.n), = id] dtl # id v1 order # 1: 1 val1 1 # 2: 1 val2 2 # 3: 1 val3 3 # 4: 2 val2 1 # 5: 2 val3 2 # 6: 3 val2 1 # 7: 3 val3 2 # 8: 3 val1 3
update: timings
i didn't bother testing "splitstackshape" approach, because uses read.table
under hood, know won't stand demands of performance.
however, base r still seems win out!
sample "big" data:
dat_big <- do.call(rbind, replicate(floor(4000/3), dat, simplify = false)) dat_big <- do.call(rbind, replicate(100, dat_big, simplify = false)) dat_big$id <- make.unique(as.character(dat_big$id)) dt <- data.table(dat) dt_big <- data.table(dat_big)
functions test:
fun1 <- function(indf) { x <- strsplit(as.character(indf[, "multi"]), " ", fixed = true) len <- vapply(x, length, 1l) data.frame(id = rep(indf[, "id"], len), order = sequence(len), multi = unlist(x, use.names=false)) } fun2 <- function(indt) { dtl <- indt[, list(unlist(strsplit(as.character(multi), " ", fixed = true))), = id] dtl[, order := sequence(.n), = id] dtl }
the results base r:
system.time(outdf <- fun1(dat_big)) # user system elapsed # 6.418 0.000 6.454 dim(outdf) # [1] 1066400 3 head(outdf) # id order multi # 1 1 1 val1 # 2 1 2 val2 # 3 1 3 val3 # 4 2 1 val2 # 5 2 2 val3 # 6 3 1 val2 tail(outdf) # id order multi # 1066395 1.133299 3 val3 # 1066396 2.133299 1 val2 # 1066397 2.133299 2 val3 # 1066398 3.133299 1 val2 # 1066399 3.133299 2 val3 # 1066400 3.133299 3 val1
the results "data.table":
system.time(outdt <- fun2(dt_big)) # user system elapsed # 14.035 0.000 14.057 dim(outdt) # [1] 1066400 3 outdt # id v1 order # 1: 1 val1 1 # 2: 1 val2 2 # 3: 1 val3 3 # 4: 2 val2 1 # 5: 2 val3 2 # --- # 1066396: 2.133299 val2 1 # 1066397: 2.133299 val3 2 # 1066398: 3.133299 val2 1 # 1066399: 3.133299 val3 2 # 1066400: 3.133299 val1 3
Comments
Post a Comment