performance - How do I parse a large data block into memory in Haskell?


On reflection, the entire question can be boiled down to something more concise. I'm looking for a Haskell data structure that

  • looks like a list
  • has O(1) lookup
  • has either O(1) element replacement or O(1) element append (or prepend... reverse index lookups if that's the case). I can write my later algorithms with one or the other in mind.
  • has little memory overhead
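For reference, here is a minimal sketch (my own illustration, not from the question) of how `Data.Vector.Unboxed` measures up against these requirements: it gives O(1) lookup and compact `Word8` storage, but its functional update operator `(//)` copies the whole vector, which is the trap discussed below:

```haskell
import qualified Data.Vector.Unboxed as V
import Data.Word (Word8)

main :: IO ()
main = do
  -- Unboxed vectors: flat storage, roughly 1 byte per Word8 element.
  let v = V.fromList [0 .. 255] :: V.Vector Word8
  -- (!) is O(1) indexed lookup.
  print (v V.! 10)
  -- (//) is a *functional* update: it allocates and copies the entire
  -- vector, so it is O(n) per call -- fine for one batched update,
  -- disastrous inside a per-element loop.
  let v' = v V.// [(10, 99)]
  print (v' V.! 10)
```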

I'm trying to build an image file parser. The file format is the basic 8-bit color PPM file, though I intend to support 16-bit color files and PNG and JPEG files. The existing netpbm library, despite a lot of unboxing annotations, consumes all available memory when trying to load the files I work with:

3-10 photographs, the smallest being 45 MB and the largest being 110 MB.

Now, I can't understand the optimizations put into the netpbm code, so I decided to have my own try at it. It's a simple file format...

I started by deciding that, no matter the file format, I'm going to store the final image uncompressed in this format:

    import Data.Vector.Unboxed (Vector)
    import Data.Word (Word8)

    data PixelMap = RGB8
        { width        :: Int
        , height       :: Int
        , redChannel   :: Vector Word8
        , greenChannel :: Vector Word8
        , blueChannel  :: Vector Word8
        }

Then I wrote a parser that works on the three vectors, like so:

    {-# LANGUAGE RecordWildCards, BangPatterns #-}

    import Data.Attoparsec.ByteString
    import qualified Data.Vector.Unboxed as V

    data Progress = Progress
        { addr   :: Int
        , size   :: Int
        , redC   :: Vector Word8
        , greenC :: Vector Word8
        , blueC  :: Vector Word8
        }

    parseColorBinary :: Progress -> Parser Progress
    parseColorBinary progress@Progress{..}
        | addr == size = return progress
        | addr < size = do
            !redV   <- anyWord8
            !greenV <- anyWord8
            !blueV  <- anyWord8
            parseColorBinary progress { addr   = addr + 1
                                      , redC   = redC   V.// [(addr, redV)]
                                      , greenC = greenC V.// [(addr, greenV)]
                                      , blueC  = blueC  V.// [(addr, blueV)] }

And at the end of the parser, I construct the RGB8 like so:

    Progress{..} <- parseColorBinary $ ...
    return $ RGB8 width height redC greenC blueC

Written like this, the program, loading a single one of these 45 MB images, will consume 5 GB or more of memory. If I change the definition of Progress so that redC, greenC, and blueC are all !(Vector Word8), the program stays within reasonable memory confines, but takes so long to load a single file that I haven't allowed it to finish. Finally, if I replace the vectors here with standard lists, my memory usage shoots back up to 5 GB per file (I assume... I actually run out of space before hitting that), and load time is on the order of 6 seconds. Ubuntu's preview application, once started, loads and renders the file nearly instantly.

On the theory that each call to V.// is copying the vector every single time, I tried switching to Data.Vector.Unboxed.Mutable, but... I can't even get it to typecheck. The documentation is nonexistent, and understanding what the data types are doing is going to require fighting with multiple other libraries as well. And I don't even know if it will solve the problem, so I'm very reluctant to even try.
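For what it's worth, the usual shape that makes mutable vectors typecheck is to do all the writes inside `runST` and freeze at the end; here is a minimal sketch (the function name `buildRamp` is my own, not from the netpbm code):

```haskell
import Control.Monad.ST (runST)
import qualified Data.Vector.Unboxed as V
import qualified Data.Vector.Unboxed.Mutable as MV
import Data.Word (Word8)

-- Build a vector by in-place writes. The "s" type parameter of ST
-- never escapes runST, so the frozen result is safely immutable.
buildRamp :: Int -> V.Vector Word8
buildRamp n = runST $ do
  mv <- MV.new n                                      -- one O(n) allocation
  mapM_ (\i -> MV.write mv i (fromIntegral i)) [0 .. n - 1]
  V.unsafeFreeze mv                                   -- O(1), no copy

main :: IO ()
main = print (buildRamp 5)
```

The key point is that each `MV.write` is O(1), so the whole construction is O(n) instead of the O(n²) you get from repeated `V.//`.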

The fundamental problem is actually pretty straightforward:

How do I quickly, and without using an obscene amount of memory, read, retain, and potentially manipulate a very large data structure? All of the examples I have found are about generating temporarily huge data and then getting rid of it as soon as possible.

In principle, I want the final representation to be immutable, but I don't much care if I have to use a mutable structure to get there.


Just for completeness, the complete code (BSD3-licensed) is on Bitbucket at https://bitbucket.org/savannidgerinel/photo-tools . The performance branch contains a strict version of the parser, which can be made unstrict with a quick change to the Progress data structure of Codec.Image.Netpbm.

To run the performance test:

    ulimit -Sv 6000000   # set a ulimit of 6 GB, or change to whatever makes sense for you
    cabal build
    dist/build/perf-test/perf-test +RTS -p -sstderr

I first thought that just reading the whole chunk as a ByteString and then unzipping the contents into unboxed vectors would be good enough. Indeed, the parsing code you posted would be rather bad even without the mysterious space leak: you copy the entirety of all three vectors on every single byte of the input! Talk about quadratic complexity.

So I wrote the following:

    chunksOf3 :: [a] -> [(a, a, a)]
    chunksOf3 (a:b:c:xs) = (a, b, c) : chunksOf3 xs
    chunksOf3 _          = []

    parseRGB :: Int -> Atto.Parser (Vector Word8, Vector Word8, Vector Word8)
    parseRGB size = do
        input <- Atto.take (size * 3)
        let (rs, gs, bs) = unzip3 $ chunksOf3 $ B.unpack input
        return (V.fromList rs, V.fromList gs, V.fromList bs)

And then I tested it with a 45 MB file of random bytes. I admit I was surprised that this code resulted in gigabytes of RAM usage. I'm curious as to what exactly leaks the space.
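One plausible culprit (my assumption, not verified against this exact code): `B.unpack` materializes a boxed `[Word8]`, and `unzip3` then keeps three more boxed lists alive until all three `fromList` calls finish, at several machine words per list cell. A list-free sketch that indexes straight into the ByteString avoids every intermediate list:

```haskell
import qualified Data.ByteString as B
import qualified Data.Vector.Unboxed as V
import Data.Word (Word8)

-- Split interleaved RGB bytes into three vectors without building any
-- intermediate lists: V.generate fills each vector directly, and
-- B.index reads straight out of the flat byte buffer.
splitRGB :: Int -> B.ByteString
         -> (V.Vector Word8, V.Vector Word8, V.Vector Word8)
splitRGB size input =
  ( V.generate size (\i -> B.index input (3 * i))
  , V.generate size (\i -> B.index input (3 * i + 1))
  , V.generate size (\i -> B.index input (3 * i + 2)) )

main :: IO ()
main = print (splitRGB 2 (B.pack [1, 2, 3, 4, 5, 6]))
```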

Mutable vectors work well, though. The following code uses 133 MB of RAM, and Criterion benchmarks it at 60 ms, file reading included. I included some explanations in the comments. There is also ample material on the ST monad and mutable vectors on Stack Overflow and elsewhere (I concur, though, that the library documentation is unfriendly to beginners).

    import Data.Vector.Unboxed (Vector)
    import Data.ByteString (ByteString)

    import qualified Data.Vector.Unboxed as V
    import qualified Data.ByteString as B
    import qualified Data.Vector.Unboxed.Mutable as MV

    import Control.Monad.ST.Strict

    import Data.Word
    import Control.Monad
    import Control.DeepSeq

    -- Benchmarking stuff
    import Criterion.Main (defaultMainWith, bench, whnfIO)
    import Criterion.Config (defaultConfig, Config(..), ljust)

    -- This part parses the three vectors for the colors.
    -- Of course, you can embed this into an Attoparsec computation by taking
    -- the current input, feeding it to parseRGB, or you can take the right
    -- sized chunk in the parser and omit the "Maybe" test from the code below.
    parseRGB :: Int -> ByteString -> Maybe (Vector Word8, Vector Word8, Vector Word8)
    parseRGB size input
        | 3 * size > B.length input = Nothing
        | otherwise = Just $ runST $ do

            -- We allocate three mutable vectors of size "size".
            -- This is usually a bit of a pain for new users, because we have to
            -- specify the correct type somewhere, and it's not an exactly simple type.
            -- In the ST monad there is always an "s" type parameter that labels the
            -- state of the action. A type of "ST s something" is a bit similar to
            -- "IO something", except that the inner type also contains the "s"
            -- parameter. The purpose of that "s" is to statically disallow mutable
            -- variables from escaping the ST action.
            [r, g, b] <- replicateM 3 $ MV.new size :: ST s [MV.MVector s Word8]

            -- forM_ = flip mapM_
            -- In ST code, forM_ is a nicer looking approximation of the usual
            -- imperative loop.
            forM_ [0..size - 1] $ \i -> do
                let i' = 3 * i
                MV.unsafeWrite r i (B.index input $ i'    )
                MV.unsafeWrite g i (B.index input $ i' + 1)
                MV.unsafeWrite b i (B.index input $ i' + 2)

            -- "freeze" converts a mutable vector living in the ST monad into
            -- a regular vector, which can then be returned from the action
            -- since its type no longer depends on the pesky "s".
            -- unsafeFreeze does the conversion in place without copying.
            -- This implies that the original mutable vector should not be used after
            -- unsafeFreezing.
            [r, g, b] <- mapM V.unsafeFreeze [r, g, b]
            return (r, g, b)

    -- I prepared a file with 3 * 15 million random bytes.
    inputSize = 15000000
    benchConf = defaultConfig {cfgSamples = ljust 10}

    main = do
        defaultMainWith benchConf (return ()) $ [
            bench "parseRGB test" $ whnfIO $ do
                input <- B.readFile "randomInp.dat"
                force (parseRGB inputSize input) `seq` putStrLn "done"
            ]
