[InterMine Dev] Is it useful to use RAMDirectory for indexing lucene keywords?

Justin Clark-Casey justincc at intermine.org
Tue Nov 22 16:48:46 GMT 2016


At KeywordSearch.java:1078 there is the line

         writer.setRAMBufferSizeMB(64); //flush to disk when docs take up X MB

There's no clue in git blame or elsewhere as to why 64MB was chosen.  A quick poke around the web suggests that setting it is a matter of trial and error [1].
Personally, I doubt increasing it will make much difference, but this would be a fairly easy thing to try.

[1] http://stackoverflow.com/questions/6403606/lucene-java-opening-too-many-files-am-i-using-indexwriter-properly
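
Since the buffer size seems to be a trial-and-error setting, one way to try it would be to time a full index build once per candidate value.  What follows is only a minimal sketch of such a harness, not InterMine code: the BufferSizeSweep class and its sweep/fastest methods are hypothetical names, and the indexRun callback is assumed to wrap whatever code currently surrounds writer.setRAMBufferSizeMB(...) in KeywordSearch.java.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.IntConsumer;

/** Times one indexing run per candidate RAM buffer size (hypothetical helper). */
final class BufferSizeSweep {

    /**
     * Runs indexRun once for each candidate size and records wall-clock time.
     *
     * @param candidatesMb buffer sizes to try, in MB
     * @param indexRun     callback that builds the whole index using the given size
     *                     (assumed to call writer.setRAMBufferSizeMB(mb) internally)
     * @return elapsed milliseconds per size, in the order tried
     */
    static Map<Integer, Long> sweep(int[] candidatesMb, IntConsumer indexRun) {
        Map<Integer, Long> elapsedMs = new LinkedHashMap<>();
        for (int mb : candidatesMb) {
            long start = System.nanoTime();
            indexRun.accept(mb);
            elapsedMs.put(mb, (System.nanoTime() - start) / 1_000_000);
        }
        return elapsedMs;
    }

    /** Returns the size with the lowest elapsed time; ties go to the first one tried. */
    static int fastest(Map<Integer, Long> elapsedMs) {
        if (elapsedMs.isEmpty()) {
            throw new IllegalArgumentException("no measurements");
        }
        Integer best = null;
        long bestMs = 0;
        for (Map.Entry<Integer, Long> e : elapsedMs.entrySet()) {
            if (best == null || e.getValue() < bestMs) {
                best = e.getKey();
                bestMs = e.getValue();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // Stand-in workload so the sketch runs on its own; replace with a real index build.
        Map<Integer, Long> results = sweep(new int[] {16, 64, 256}, mb -> { });
        results.forEach((mb, ms) -> System.out.println(mb + " MB: " + ms + " ms"));
    }
}
```

A single run per value is noisy, so in practice one would probably repeat each size a few times on the same database before reading anything into the numbers.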

On 21/11/16 17:35, Colin wrote:
> Thanks for the comments Justin. I also think Solr/Elasticsearch is still interesting, and my branch has a little demo of using Solr.
>
> With the existing Lucene code, I am not sure that it makes sense to use RAMDirectory during loading/postprocessing, but I think trying to figure out the
> "batch size" for committing the index to disk might be important. http://stackoverflow.com/questions/11469131/batch-commit-for-lucene-index
>
> Not sure if that is already optimized or not!
>
> -Colin
>
> On Mon, Nov 21, 2016 at 8:26 AM, Justin Clark-Casey <justincc at intermine.org> wrote:
>
>     Hi Hongkee,
>
>     I believe (though I have not rigorously tested), that InterMine's Lucene indexing is CPU bound rather than IO bound.  Therefore, I don't expect that using a
>     RAMDirectory would help much, though I'd be very interested in seeing numbers if you do try it.
>
>     One could maybe more productively tackle the CPU bottleneck by spreading the indexing work over multiple cores.  At the moment, as you can see from
>     KeywordSearch.createIndex(), the indexing is done on a single thread via InterMineObjectFetcher.  One could have 8 fetchers instead, for instance,
>     though more significant code change is probably required to split all the indexable InterMine objects into 8 workloads.
>
>     But in any case, I should tell you that we're currently looking at updating the search approach, quite possibly by moving to Elasticsearch or Solr
>     (currently leaning towards Elasticsearch).  So indexing may be carried out differently and I wouldn't want you to waste time on an approach (embedded
>     Lucene) that may go away.  That said, we still need to consider how to keep providing a good out-of-the-box search experience.
>
>     You can see some work by Colin Diesh that gets InterMine working with Solr instead of embedded Lucene here [1].
>
>     [1] https://github.com/intermine/intermine/issues/517
>
>     --
>     Justin Clark-Casey, Synbiomine/InterMine Developer
>     http://synbiomine.org
>     http://twitter.com/justincc
>
>
>     On 18/11/16 11:12, HongKee Moon wrote:
>
>         Hi all,
>
>         I am quite curious about using RAMDirectory for indexing Lucene keywords, because “postprocess” normally takes quite a long time.
>         Do you think RAMDirectory would be a better/faster option for doing the “postprocess” task?
>
>         Presumably, with RAMDirectory it would also be faster to write/gunzip the index files restored from the database after the webapp starts.
>         Could you share your experience of using RAMDirectory instead of FSDirectory, if you are currently using it to improve the performance of InterMine tasks?
>
>         Cheers,
>         HongKee
>
>         --
>         HongKee Moon
>         Software Engineer
>         Scientific Computing Facility
>
>         Max Planck Institute of Molecular Cell Biology and Genetics
>         Pfotenhauerstr. 108
>         01307 Dresden
>         Germany
>
>         fon: +49 351 210 2740
>         fax: +49 351 210 1689
>         www.mpi-cbg.de
>
>
>
>         _______________________________________________
>         dev mailing list
>         dev at lists.intermine.org
>         https://lists.intermine.org/mailman/listinfo/dev
>
>
>

