[InterMine Dev] Is it useful to use RAMDirectory for indexing lucene keywords?

Colin colin.diesh at gmail.com
Tue Nov 22 17:14:07 GMT 2016


I also haven't tested it but it could be that increasing it helps :)

From the docs:
http://lucene.apache.org/core/3_2_0/api/core/org/apache/lucene/index/IndexWriterConfig.html#setRAMBufferSizeMB%28double%29

" setRAMBufferSizeMB

public IndexWriterConfig setRAMBufferSizeMB(double ramBufferSizeMB)

    Determines the amount of RAM that may be used for buffering added
documents and deletions before they are flushed to the Directory. Generally
for faster indexing performance it's best to flush by RAM usage instead of
document count and use as large a RAM buffer as you can. "

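To make the "flush by RAM usage" idea concrete, here is a rough, untested-against-InterMine sketch in plain Java. It is not Lucene code: RamFlushSketch and flushesFor are hypothetical names I made up, and the byte accounting is much cruder than what IndexWriter really does. It only shows that a larger RAM budget means fewer flushes for the same set of documents.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an index writer's in-memory buffer (not Lucene code).
class RamFlushSketch {
    private final long ramBudgetBytes;
    private final List<String> buffered = new ArrayList<>();
    private long bufferedBytes = 0;
    private int flushCount = 0;

    RamFlushSketch(double ramBufferSizeMB) {
        this.ramBudgetBytes = (long) (ramBufferSizeMB * 1024 * 1024);
    }

    // Buffer a document and flush once the buffered bytes reach the budget.
    void addDocument(String doc) {
        buffered.add(doc);
        bufferedBytes += doc.getBytes(StandardCharsets.UTF_8).length;
        if (bufferedBytes >= ramBudgetBytes) {
            flush();
        }
    }

    // A real writer would write a segment to the Directory here.
    void flush() {
        if (buffered.isEmpty()) {
            return;
        }
        buffered.clear();
        bufferedBytes = 0;
        flushCount++;
    }

    int flushCount() {
        return flushCount;
    }

    // Count the flushes needed to index docCount documents of docBytes bytes each.
    static int flushesFor(double ramBufferSizeMB, int docCount, int docBytes) {
        char[] chars = new char[docBytes];
        Arrays.fill(chars, 'x'); // ASCII, so one byte per char in UTF-8
        String doc = new String(chars);
        RamFlushSketch writer = new RamFlushSketch(ramBufferSizeMB);
        for (int i = 0; i < docCount; i++) {
            writer.addDocument(doc);
        }
        writer.flush(); // final flush of whatever is still buffered
        return writer.flushCount();
    }

    public static void main(String[] args) {
        // 4096 documents of 1 KB each is 4 MB of buffered data in total.
        System.out.println("1 MB buffer:  " + flushesFor(1.0, 4096, 1024) + " flushes");
        System.out.println("64 MB buffer: " + flushesFor(64.0, 4096, 1024) + " flushes");
    }
}
```

Whether fewer flushes actually shortens postprocessing depends on whether indexing is IO bound at all, so this would still need measuring.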

I also know that with Elasticsearch or similar you can manually control a
"bulk API", and this was said to be important for performance:
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
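The client-side half of the bulk idea is just grouping documents into fixed-size batches so that each request carries many documents. A minimal sketch in plain Java, assuming a hypothetical helper called toBatches (it does not talk to Elasticsearch or Lucene, and the batch size of 4 is arbitrary):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits documents into batches for bulk submission.
class BulkBatchSketch {
    static <T> List<List<T>> toBatches(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < docs.size(); start += batchSize) {
            int end = Math.min(start + batchSize, docs.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(docs.subList(start, end)));
        }
        return batches;
    }

    public static void main(String[] args) {
        List<Integer> docIds = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
        for (List<Integer> batch : toBatches(docIds, 4)) {
            // A real client would send one bulk request per batch here.
            System.out.println("would send batch: " + batch);
        }
    }
}
```

The right batch size is another thing that would need trial and error.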

-Colin


On Tue, Nov 22, 2016 at 10:48 AM, Justin Clark-Casey <justincc at intermine.org
> wrote:

> At KeywordSearch.java:1078 there is the line
>
>         writer.setRAMBufferSizeMB(64); //flush to disk when docs take up X MB
>
> There's no clue in git blame or elsewhere why 64MB was chosen.  A quick
> poke around the web suggests setting it is trial and error [1].
> Personally, I doubt increasing it will make much difference but this would
> be a fairly easy thing to try.
>
> [1] http://stackoverflow.com/questions/6403606/lucene-java-opening-too-many-files-am-i-using-indexwriter-properly
>
> On 21/11/16 17:35, Colin wrote:
>
>> Thanks for the comments Justin. I also think the solr/elasticsearch is
>> still interesting and my branch has a little demo of using solr.
>>
>> With the existing code with lucene,  I am not sure that it makes sense to
>> use RAMDirectory during loading/postprocessing but I think trying to figure
>> out the
>> "batch size" for committing the index to disk might be important.
>> http://stackoverflow.com/questions/11469131/batch-commit-for-lucene-index
>>
>> Not sure if that is already optimized or not!
>>
>> -Colin
>>
>> On Mon, Nov 21, 2016 at 8:26 AM, Justin Clark-Casey <justincc at intermine.org> wrote:
>>
>>     Hi Hongkee,
>>
>>     I believe (though I have not rigorously tested), that InterMine's
>> Lucene indexing is CPU bound rather than IO bound.  Therefore, I don't
>> expect that using a
>>     RAMDirectory would help much, though I'd be very interested in seeing
>> numbers if you do try it.
>>
>>     One could maybe more productively tackle the CPU bound by doing
>> indexing work over multiple cores.  At the moment, as you can see from
>>     KeywordSearch.createIndex(), the indexing is currently done on a
>> single thread via InterMineObjectFetcher.  One could have 8 fetchers
>> instead, for instance,
>>     though more significant code change is probably required to split all
>> the indexable InterMine objects into 8 workloads.
>>
>>     But in any case, I should tell you that we're currently looking at
>> updating the search approach, quite possibly by moving to Elasticsearch or
>> Solr
>>     (currently leaning towards Elasticsearch).  So indexing may be
>> carried out differently and I wouldn't want you to waste time on an
>> approach (embedded
>>     Lucene) that may go away.  That said, we still need to consider how
>> to keep providing a good out-of-the-box search experience.
>>
>>     You can see some work by Colin Diesh that gets InterMine working with
>> Solr instead of embedded Lucene here [1].
>>
>>     [1] https://github.com/intermine/intermine/issues/517
>>
>>     --
>>     Justin Clark-Casey, Synbiomine/InterMine Developer
>>     http://synbiomine.org
>>     http://twitter.com/justincc
>>
>>
>>     On 18/11/16 11:12, HongKee Moon wrote:
>>
>>         Hi all,
>>
>>         I am quite curious about RAMDirectory for indexing lucene keywords
>> because normally “postprocess” takes quite a long time.
>>         Do you guys think RAMDirectory would be a better/faster option for
>> doing the “postprocess” task?
>>
>>         Supposedly, it must be faster to write/gunzip after restoring
>> indexed files from the database after the webapp starts with RAMDirectory.
>>         Could you share your experience of using RAMDirectory instead of
>> FSDirectory if you are currently using it for improving performance of
>> intermine tasks?
>>
>>         Cheers,
>>         HongKee
>>
>>         --
>>         HongKee Moon
>>         Software Engineer
>>         Scientific Computing Facility
>>
>>         Max Planck Institute of Molecular Cell Biology and Genetics
>>         Pfotenhauerstr. 108
>>         01307 Dresden
>>         Germany
>>
>>         fon: +49 351 210 2740
>>         fax: +49 351 210 1689
>>         www.mpi-cbg.de
>>
>>
>>
>>         _______________________________________________
>>         dev mailing list
>>         dev at lists.intermine.org
>>         https://lists.intermine.org/mailman/listinfo/dev
>>