[InterMine Dev] Is it useful to use RAMDirectory for indexing lucene keywords?

Justin Clark-Casey justincc at intermine.org
Tue Nov 22 18:40:00 GMT 2016


Eh, with SSDs and my very primitive benchmark of eyeballing CPU usage whilst indexing I'd still argue the toss :)  But as always with performance stuff, you have to try it and see :)

Yeah, we'll definitely want to look at similar facilities in elasticsearch/solr.

-- Justin

On 22/11/16 17:14, Colin wrote:
> I also haven't tested it but it could be that increasing it helps :)
>
> From the docs http://lucene.apache.org/core/3_2_0/api/core/org/apache/lucene/index/IndexWriterConfig.html#setRAMBufferSizeMB%28double%29
>
> " setRAMBufferSizeMB
>
> public IndexWriterConfig setRAMBufferSizeMB(double ramBufferSizeMB)
>
>     Determines the amount of RAM that may be used for buffering added documents and deletions before they are flushed to the Directory. Generally for faster
>     indexing performance it's best to flush by RAM usage instead of document count and use as large a RAM buffer as you can. "
>
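To make the quoted advice concrete, here is a toy model of flushing by buffered size rather than by document count. It is a sketch only: it uses no Lucene classes, and FlushBySizeSketch, countFlushes and the byte sizes are made up for illustration. It simply shows that, for the same input, a larger buffer means fewer flushes to disk.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is NOT Lucene code and does not reproduce IndexWriter internals.
class FlushBySizeSketch {

    /**
     * Counts how many flushes occur when documents are buffered until the
     * buffer holds at least maxBufferBytes, plus one final flush for leftovers.
     */
    static int countFlushes(List<String> docs, long maxBufferBytes) {
        long buffered = 0;
        int flushes = 0;
        for (String doc : docs) {
            buffered += doc.getBytes(StandardCharsets.UTF_8).length;
            if (buffered >= maxBufferBytes) {
                flushes++;      // stand-in for writing buffered documents to disk
                buffered = 0;
            }
        }
        if (buffered > 0) {
            flushes++;          // whatever is still buffered when indexing ends
        }
        return flushes;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            docs.add("0123456789"); // 10 bytes each, 10,000 bytes in total
        }
        System.out.println("small buffer flushes: " + countFlushes(docs, 100));
        System.out.println("large buffer flushes: " + countFlushes(docs, 5000));
    }
}
```

With 1,000 ten-byte documents, the 100-byte buffer flushes 100 times and the 5,000-byte buffer flushes twice. Whether fewer flushes turns into a measurable speed-up for InterMine is exactly the thing that would have to be benchmarked.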
>
> I also know that with elasticsearch or similar you can manually control a "bulk API", and this was said to be important for increasing performance:
> https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
>
> -Colin
>
>
> On Tue, Nov 22, 2016 at 10:48 AM, Justin Clark-Casey <justincc at intermine.org> wrote:
>
>     At KeywordSearch.java:1078 there is the line
>
>             writer.setRAMBufferSizeMB(64); //flush to disk when docs take up X MB
>
>     There's no clue in git blame or elsewhere as to why 64MB was chosen.  A quick poke around the web suggests setting it is a matter of trial and error [1].
>     Personally, I doubt increasing it will make much difference, but this would be a fairly easy thing to try.
>
>     [1] http://stackoverflow.com/questions/6403606/lucene-java-opening-too-many-files-am-i-using-indexwriter-properly
>
>     On 21/11/16 17:35, Colin wrote:
>
>         Thanks for the comments Justin. I also think the Solr/Elasticsearch route is still interesting, and my branch has a little demo of using Solr.
>
>         With the existing Lucene code, I am not sure that it makes sense to use RAMDirectory during loading/postprocessing, but I think trying to figure out
>         the "batch size" for committing the index to disk might be important: http://stackoverflow.com/questions/11469131/batch-commit-for-lucene-index
>
>         Not sure if that is already optimized or not!
>
>         -Colin
>
>         On Mon, Nov 21, 2016 at 8:26 AM, Justin Clark-Casey <justincc at intermine.org> wrote:
>
>             Hi Hongkee,
>
>             I believe (though I have not rigorously tested) that InterMine's Lucene indexing is CPU bound rather than IO bound.  Therefore, I don't expect that
>             using a RAMDirectory would help much, though I'd be very interested in seeing numbers if you do try it.
>
>             One could maybe more productively tackle the CPU bottleneck by spreading the indexing work over multiple cores.  As you can see from
>             KeywordSearch.createIndex(), the indexing is currently done on a single thread via InterMineObjectFetcher.  One could have 8 fetchers instead, for
>             instance, though a more significant code change is probably required to split all the indexable InterMine objects into 8 workloads.
>
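The "8 fetchers" idea could look roughly like the sketch below: split the object ids into slices and hand each slice to its own worker thread. Everything here is hypothetical (PartitionedIndexingSketch, partition, buildDocument, the integer ids); it does not use InterMineObjectFetcher or any Lucene class, and it assumes the per-object work is independent. Lucene documents IndexWriter as thread-safe, so workers could in principle share one writer, but that part is not shown.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: no InterMine or Lucene classes are used here.
class PartitionedIndexingSketch {

    /** Splits ids round-robin into `parts` slices of near-equal size. */
    static List<List<Integer>> partition(List<Integer> ids, int parts) {
        List<List<Integer>> slices = new ArrayList<>();
        for (int p = 0; p < parts; p++) {
            slices.add(new ArrayList<>());
        }
        for (int i = 0; i < ids.size(); i++) {
            slices.get(i % parts).add(ids.get(i));
        }
        return slices;
    }

    /** Hypothetical stand-in for the CPU-heavy step of building one index document. */
    static int buildDocument(int objectId) {
        return 1; // pretend exactly one document was produced
    }

    /** Processes each slice on its own thread and returns the total document count. */
    static int indexInParallel(List<Integer> ids, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (List<Integer> slice : partition(ids, workers)) {
                results.add(pool.submit(() -> {
                    int count = 0;
                    for (int id : slice) {
                        count += buildDocument(id);
                    }
                    return count;
                }));
            }
            int total = 0;
            for (Future<Integer> result : results) {
                total += result.get(); // blocks until that worker has finished
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("indexing worker failed", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

Round-robin slicing keeps the slices balanced by count, but not necessarily by cost: if some objects are much more expensive to index than others, a shared work queue would probably balance better.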
>             But in any case, I should tell you that we're currently looking at updating the search approach, quite possibly by moving to Elasticsearch or Solr
>             (currently leaning towards Elasticsearch).  So indexing may be carried out differently and I wouldn't want you to waste time on an approach (embedded
>             Lucene) that may go away.  That said, we still need to consider how to keep providing a good out-of-the-box search experience.
>
>             You can see some work by Colin Diesh that gets InterMine working with Solr instead of embedded Lucene here [1].
>
>             [1] https://github.com/intermine/intermine/issues/517
>
>             --
>             Justin Clark-Casey, Synbiomine/InterMine Developer
>             http://synbiomine.org
>             http://twitter.com/justincc
>
>
>             On 18/11/16 11:12, HongKee Moon wrote:
>
>                 Hi all,
>
>                 I am quite curious about using RAMDirectory for indexing Lucene keywords, because “postprocess” normally takes quite a long time.
>                 Do you think RAMDirectory would be a better/faster option for the “postprocess” task?
>
>                 Supposedly, with RAMDirectory it should also be faster to write/gunzip the index files restored from the database after the webapp starts.
>                 Could you share your experience of using RAMDirectory instead of FSDirectory, if you are currently using it to improve the performance of InterMine
>                 tasks?
>
>                 Cheers,
>                 HongKee
>
>                 --
>                 HongKee Moon
>                 Software Engineer
>                 Scientific Computing Facility
>
>                 Max Planck Institute of Molecular Cell Biology and Genetics
>                 Pfotenhauerstr. 108
>                 01307 Dresden
>                 Germany
>
>                 fon: +49 351 210 2740
>                 fax: +49 351 210 1689
>                 www.mpi-cbg.de
>
>
>
>                 _______________________________________________
>                 dev mailing list
>                 dev at lists.intermine.org
>                 https://lists.intermine.org/mailman/listinfo/dev

