[InterMine Dev] Is it useful to use RAMDirectory for indexing lucene keywords?

HongKee Moon moon at mpi-cbg.de
Tue Nov 29 09:40:57 GMT 2016


Hi Justin & Colin,

Thank you so much for your kind and helpful comments.
I am looking forward to seeing the new search framework running in InterMine soon.
If I manage to apply the tips you mentioned and get any results, I will keep you updated.

Have a lovely day!

Cheers,
HongKee

> On Nov 22, 2016, at 7:40 PM, Justin Clark-Casey <justincc at intermine.org> wrote:
> 
> Eh, with SSDs and my very primitive benchmark of eyeballing CPU usage whilst indexing, I'd still argue the toss :)  But as always with performance stuff, you have to try it and see :)
> 
> Yeah, we'll definitely want to look at similar facilities in Elasticsearch/Solr.
> 
> -- Justin
> 
> On 22/11/16 17:14, Colin wrote:
>> I also haven't tested it but it could be that increasing it helps :)
>> 
>> From the docs http://lucene.apache.org/core/3_2_0/api/core/org/apache/lucene/index/IndexWriterConfig.html#setRAMBufferSizeMB%28double%29
>> 
>> " setRAMBufferSizeMB
>> 
>> public IndexWriterConfig setRAMBufferSizeMB(double ramBufferSizeMB)
>> 
>>    Determines the amount of RAM that may be used for buffering added documents and deletions before they are flushed to the Directory. Generally for faster
>> indexing performance it's best to flush by RAM usage instead of document count and use as large a RAM buffer as you can. "
>> 
>> 
>> I also know that with Elasticsearch and similar engines you can manually control a "bulk API", and this is said to be important for indexing performance:
>> https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
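The flush-by-RAM behaviour described in the setRAMBufferSizeMB docs can be sketched without Lucene at all: buffer documents in memory and flush them out only when their estimated size crosses a threshold. Below is a minimal plain-Java sketch of that idea — `BufferedIndexer` is a hypothetical stand-in for IndexWriter (not InterMine or Lucene code), and the byte estimate is deliberately rough:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of flush-by-RAM buffering, analogous to Lucene's
// IndexWriterConfig.setRAMBufferSizeMB. BufferedIndexer is a hypothetical
// stand-in for IndexWriter; flush() here only counts flushes instead of
// actually writing to a Directory.
public class BufferedIndexer {
    private final long ramBufferBytes;            // flush threshold
    private final List<String> buffer = new ArrayList<>();
    private long bufferedBytes = 0;
    private int flushCount = 0;

    public BufferedIndexer(long ramBufferBytes) {
        this.ramBufferBytes = ramBufferBytes;
    }

    public void addDocument(String doc) {
        buffer.add(doc);
        bufferedBytes += doc.length() * 2L;       // rough UTF-16 size estimate
        if (bufferedBytes >= ramBufferBytes) {
            flush();
        }
    }

    private void flush() {
        // A real indexer would write the buffered docs to disk here.
        buffer.clear();
        bufferedBytes = 0;
        flushCount++;
    }

    public int getFlushCount() { return flushCount; }

    public static void main(String[] args) {
        // Larger buffers mean fewer (but bigger) flushes for the same docs.
        BufferedIndexer small = new BufferedIndexer(1024);
        BufferedIndexer large = new BufferedIndexer(64 * 1024);
        for (int i = 0; i < 1000; i++) {
            String doc = "document number " + i;
            small.addDocument(doc);
            large.addDocument(doc);
        }
        System.out.println(small.getFlushCount() > large.getFlushCount());
    }
}
```

The point of the sketch is only the trade-off: a bigger RAM buffer turns many small disk writes into a few large ones, which is exactly why the Lucene docs above recommend flushing by RAM usage rather than document count.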
>> 
>> -Colin
>> 
>> 
>> On Tue, Nov 22, 2016 at 10:48 AM, Justin Clark-Casey <justincc at intermine.org> wrote:
>> 
>>    At KeywordSearch.java:1078 there is the line
>> 
>>            writer.setRAMBufferSizeMB(64); //flush to disk when docs take up X MB
>> 
>>    There's no clue in git blame or elsewhere why 64MB was chosen.  A quick poke around the web suggests setting it is a matter of trial and error [1].  Personally, I
>>    doubt increasing it will make much difference, but it would be a fairly easy thing to try.
>> 
>>    [1] http://stackoverflow.com/questions/6403606/lucene-java-opening-too-many-files-am-i-using-indexwriter-properly
>> 
>>    On 21/11/16 17:35, Colin wrote:
>> 
>>        Thanks for the comments, Justin. I also think the Solr/Elasticsearch route is still interesting, and my branch has a little demo of using Solr.
>> 
>>        With the existing Lucene code, I am not sure that it makes sense to use RAMDirectory during loading/postprocessing, but I think trying to figure out the
>>        "batch size" for committing the index to disk might be important. http://stackoverflow.com/questions/11469131/batch-commit-for-lucene-index
>> 
>>        Not sure if that is already optimized or not!
>> 
>>        -Colin
>> 
>>        On Mon, Nov 21, 2016 at 8:26 AM, Justin Clark-Casey <justincc at intermine.org> wrote:
>> 
>>            Hi Hongkee,
>> 
>>            I believe (though I have not rigorously tested) that InterMine's Lucene indexing is CPU bound rather than IO bound.  Therefore, I don't expect that using a
>>            RAMDirectory would help much, though I'd be very interested in seeing numbers if you do try it.
>> 
>>            One could perhaps more productively tackle the CPU bottleneck by spreading the indexing work over multiple cores.  As you can see from
>>            KeywordSearch.createIndex(), the indexing is currently done on a single thread via InterMineObjectFetcher.  One could have 8 fetchers instead, for instance,
>>            though a more significant code change is probably required to split all the indexable InterMine objects into 8 workloads.
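The multi-fetcher idea can be sketched with a worker pool: split the objects into N workloads and index each on its own thread, all feeding one shared writer. A plain-Java sketch follows, under the assumption that the underlying writer is thread-safe (Lucene's IndexWriter is); `indexOne` and the shared counter are hypothetical stand-ins for the real per-object indexing work, not InterMine code:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of splitting indexing across N worker threads, as suggested for
// InterMineObjectFetcher. The shared atomic counter stands in for a
// thread-safe IndexWriter that all workers add documents to.
public class ParallelIndexer {
    static final AtomicInteger indexedDocs = new AtomicInteger();

    // Hypothetical per-object indexing work.
    static void indexOne(int objectId) {
        indexedDocs.incrementAndGet();
    }

    public static void main(String[] args) throws InterruptedException {
        final int workers = 8;              // e.g. 8 fetchers instead of 1
        final int totalObjects = 10_000;
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int w = 0; w < workers; w++) {
            final int shard = w;
            pool.submit(() -> {
                // Each worker takes every 8th object: a simple workload split.
                for (int id = shard; id < totalObjects; id += workers) {
                    indexOne(id);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        System.out.println(indexedDocs.get()); // every object indexed exactly once
    }
}
```

The striding split (`id += workers`) is only the simplest possible partitioning; in practice, splitting real InterMine objects into balanced workloads is exactly the "more significant code change" mentioned above.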
>> 
>>            But in any case, I should tell you that we're currently looking at updating the search approach, quite possibly by moving to Elasticsearch or Solr
>>            (currently leaning towards Elasticsearch).  So indexing may be carried out differently and I wouldn't want you to waste time on an approach (embedded
>>            Lucene) that may go away.  That said, we still need to consider how to keep providing a good out-of-the-box search experience.
>> 
>>            You can see some work by Colin Diesh that gets InterMine working with Solr instead of embedded Lucene here [1].
>> 
>>            [1] https://github.com/intermine/intermine/issues/517
>> 
>>            --
>>            Justin Clark-Casey, Synbiomine/InterMine Developer
>>            http://synbiomine.org
>>            http://twitter.com/justincc
>> 
>> 
>>            On 18/11/16 11:12, HongKee Moon wrote:
>> 
>>                Hi all,
>> 
>>                I am quite curious about using a RAMDirectory for indexing Lucene keywords, because the “postprocess” step normally takes quite a long time.
>>                Do you think a RAMDirectory would be a better/faster option for the “postprocess” task?
>> 
>>                Presumably, it should be faster to write/gunzip the index, and to restore the indexed files from the database after the webapp starts, using a RAMDirectory.
>>                Could you share your experience of using RAMDirectory instead of FSDirectory, if you are currently using it to improve the performance of InterMine tasks?
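For reference, building the index in a RAMDirectory and then persisting it to disk in one merge might look roughly like this under Lucene 3.x. This is an untested fragment, not the code InterMine actually uses; it assumes Lucene 3.x classes (RAMDirectory, FSDirectory, IndexWriterConfig, IndexWriter.addIndexes) are on the classpath, and the index path is made up:

```java
// Untested sketch, assuming Lucene 3.x on the classpath.
Directory ram = new RAMDirectory();
IndexWriterConfig ramConf =
    new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
IndexWriter ramWriter = new IndexWriter(ram, ramConf);
// ... ramWriter.addDocument(...) for each object to index ...
ramWriter.close();

// Persist the in-memory index with a single merge to disk.
IndexWriterConfig diskConf =
    new IndexWriterConfig(Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
IndexWriter diskWriter =
    new IndexWriter(FSDirectory.open(new File("lucene-index")), diskConf);
diskWriter.addIndexes(ram);
diskWriter.close();
```

Whether this beats writing through an FSDirectory directly depends on whether indexing is IO bound at all, which is exactly the question Justin raises above.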
>> 
>>                Cheers,
>>                HongKee
>> 
>>                --
>>                HongKee Moon
>>                Software Engineer
>>                Scientific Computing Facility
>> 
>>                Max Planck Institute of Molecular Cell Biology and Genetics
>>                Pfotenhauerstr. 108
>>                01307 Dresden
>>                Germany
>> 
>>                fon: +49 351 210 2740
>>                fax: +49 351 210 1689
>>                www.mpi-cbg.de
>> 
>> 
>> 
>>                _______________________________________________
>>                dev mailing list
>>                dev at lists.intermine.org
>>                https://lists.intermine.org/mailman/listinfo/dev
>> 
>> 
>> 
>> 


--
HongKee Moon
Software Engineer
Scientific Computing Facility

Max Planck Institute of Molecular Cell Biology and Genetics
Pfotenhauerstr. 108
01307 Dresden
Germany

fon: +49 351 210 2740
fax: +49 351 210 1689
www.mpi-cbg.de
