performance - Best way to store, load and use an inverted index in C++ (~500 Mo) -


i'm developing tiny search engine using tf-idf , cosine similarity. when pages added, build inverted index keep words frequency in different pages. remove stopwords , more common words, , plural/verb/etc. stemmed.

my inverted index looks like:

map< string, map<int, float> > index  [     word_a => [ id_doc=>frequency, id_doc2=>frequency2, ... ],     word_b => [ id_doc->frequency, id_doc2=>frequency2, ... ],     ... ] 

with data structure, can idf weight word_a.size(). given query, program loops on keywords , scores documents.

i don't know data structures , questions are:

  1. how store 500 mo inverted index in order load @ search time? currently, use boost serialize index:

    ofstream ofs_index("index.sr", ios::binary); boost::archive::bynary_oarchive oa(ofs_index); oa << index; 

    and load @ search time:

    ifstream ifs_index("index.sr", ios::binary); boost::archive::bynary_iarchive ia(ifs_index); ia >> index; 

    but very slow, takes sometines 10 seconds load.

  2. i don't know if map efficient enough inverted index.

  3. in order cluster documents, keywords each document , loop on these keywords score similar documents, avoid reading again each document , use inverted index. think data structure costly.

thank in advance help!

the answer pretty depend on whether need support data comparable or larger machine's ram , whether in typical use case access of indexed data or rather small fraction of it.

if data fit machine's memory, can try optimize map-based structure using now. storing data in map should give pretty fast access, there initial overhead when load data disk memory. also, if use small fraction of index, approach may quite wasteful read , write whole index, , keep of in memory.

below list suggestions try out, before commit time of them, make sure measure improves load , run times , not. without profiling actual working code on actual data use, these guesses may wrong.

  • map implemented tree (usually black-red tree). in many cases, hash_map may give better performance better memory usage (fewer allocations , less fragmentation example).
  • try reducing size of data - less data means faster read disk, potentially less memory allocation, , better in-memory performance due better locality. may example consider use float store frequency, perhaps store number of occurrences unsigned short in map values , in separate map store number of words each document (also short). using 2 numbers, can re-calculate frequency, use less disk space when save data disk, result in faster load times.
  • your map has quite few entries, , using custom memory allocators helps improve performance in such case.

if data potentially grow beyond size of machine's ram, suggest use memory-mapped files storing data. such approach may require re-modelling data structures , either using custom stl allocators or using custom data structures instead of std::map may improve performance order of magnitude if done well. in particular, approach frees having load whole structure memory @ once, startup times improve dramatically @ cost of slight delays related disk accesses distributed on time touch different parts of structure first time. subject quite broad, , requires deeper changes code tuning map, if plan handling huge data, should have @ mmap , friends.


Comments

Popular posts from this blog

java - WrongTypeOfReturnValue exception thrown when unit testing using mockito -

php - Magento - Deleted Base url key -

android - How to disable Button if EditText is empty ? -