r - Normalizing a multivalued column in a large table -

- May 15, 2010

i'm having issue converting vba code seen in this post r-script.

the problem follows, have column (from source database, not choice) contains multiple values attribute. i'd normalize table , retain order in each value occurs in each cell.

an example dataset:

dat <- data.frame(   id = c(1:3),   multi = c("val1 val2 val3","val2 val3","val3 val1")   ,stringsasfactors=false)    id          multi 1  1 val1 val2 val3 2  2      val2 val3 3  3 val2 val3 val1

the pseudocode like:

loop on each row
split string multi space separator
for each splitted string, append new row in separate data.frame id, order within total string , value.

the result like:

      id   order    multi 1      1       1     val1 2      1       2     val2 3      1       3     val3 4      2       1     val2 5      2       2     val3 6      3       1     val2 7      3       2     val3 8      3       3     val1

i'm looking @ doing data.frame, i'm thinking data.table more appropriate table have approximately 400.000 of these rows.

i apologize not having code ready, i'm still contemplating whether need use apply family, data.table or simple loop. i'll keep post updated progress make.

here couple of ways...

in base r:

x <- setnames(strsplit(as.character(dat$multi), " "), dat$id) x1 <- stack(x) x1$order <- ave(x1$ind, x1$ind, fun = seq_along) x1 #   values ind order # 1   val1   1     1 # 2   val2   1     2 # 3   val3   1     3 # 4   val2   2     1 # 5   val3   2     2 # 6   val2   3     1 # 7   val3   3     2 # 8   val1   3     3

or (better):

x <- strsplit(as.character(dat[, "multi"]), " ", fixed = true) len <- vapply(x, length, 1l) data.frame(id = rep(dat[, "id"], len), order = sequence(len),             multi = unlist(x, use.names=false))

using concat.split.multiple "splitstackshape" package (probably not efficient on 400,000 rows though).

library(splitstackshape) out <- concat.split.multiple(dat, "multi", " ", "long") out[order(out$id, out$time), ] #   id time multi # 1  1    1  val1 # 4  1    2  val2 # 7  1    3  val3 # 2  2    1  val2 # 5  2    2  val3 # 8  2    3  <na> # 3  3    1  val2 # 6  3    2  val3 # 9  3    3  val1

and, since requested "data.table":

library(data.table) dt <- data.table(dat) dtl <- dt[, list(unlist(strsplit(as.character(multi), " "))), = id] dtl[, order := sequence(.n), = id] dtl #    id   v1 order # 1:  1 val1     1 # 2:  1 val2     2 # 3:  1 val3     3 # 4:  2 val2     1 # 5:  2 val3     2 # 6:  3 val2     1 # 7:  3 val3     2 # 8:  3 val1     3

update: timings

i didn't bother testing "splitstackshape" approach, because uses read.table under hood, know won't stand demands of performance.

however, base r still seems win out!

sample "big" data:

dat_big <- do.call(rbind, replicate(floor(4000/3), dat, simplify = false)) dat_big <- do.call(rbind, replicate(100, dat_big, simplify = false)) dat_big$id <- make.unique(as.character(dat_big$id))  dt <- data.table(dat) dt_big <- data.table(dat_big)

functions test:

fun1 <- function(indf) {   x <- strsplit(as.character(indf[, "multi"]), " ", fixed = true)   len <- vapply(x, length, 1l)   data.frame(id = rep(indf[, "id"], len), order = sequence(len),               multi = unlist(x, use.names=false)) }  fun2 <- function(indt) {   dtl <- indt[, list(unlist(strsplit(as.character(multi), " ", fixed = true))), = id]   dtl[, order := sequence(.n), = id]   dtl }

the results base r:

system.time(outdf <- fun1(dat_big)) #    user  system elapsed  #   6.418   0.000   6.454  dim(outdf) # [1] 1066400       3 head(outdf) #   id order multi # 1  1     1  val1 # 2  1     2  val2 # 3  1     3  val3 # 4  2     1  val2 # 5  2     2  val3 # 6  3     1  val2 tail(outdf) #               id order multi # 1066395 1.133299     3  val3 # 1066396 2.133299     1  val2 # 1066397 2.133299     2  val3 # 1066398 3.133299     1  val2 # 1066399 3.133299     2  val3 # 1066400 3.133299     3  val1

the results "data.table":

system.time(outdt <- fun2(dt_big)) #    user  system elapsed  #  14.035   0.000  14.057  dim(outdt) # [1] 1066400       3 outdt #                id   v1 order #       1:        1 val1     1 #       2:        1 val2     2 #       3:        1 val3     3 #       4:        2 val2     1 #       5:        2 val3     2 #      ---                     # 1066396: 2.133299 val2     1 # 1066397: 2.133299 val3     2 # 1066398: 3.133299 val2     1 # 1066399: 3.133299 val3     2 # 1066400: 3.133299 val1     3

Search This Blog

Sp

r - Normalizing a multivalued column in a large table -

update: timings

Comments

Post a Comment

Popular posts from this blog

java - WrongTypeOfReturnValue exception thrown when unit testing using mockito -

c++11 - Intel compiler and "cannot have an in-class initializer" when using constexpr -

symfony - imagine_filter() not generating the correct url in LiipImagineBundle -