[InterMine Dev] keyword_search_index size

Julie Sullivan julie at flymine.org
Fri Mar 7 13:38:31 GMT 2014


Hi Joe

1. I'd be interested to see how big the index is for your full
database. FlyMine's is 2 GB, but that database is only 75 GB. I do have
facets, which increase the search index size a bit.

Are there any types you can safely not index? Here's a list of FlyMine's:

https://github.com/intermine/intermine/blob/dev/flymine/dbmodel/resources/keyword_search.properties#L8

You can also ignore specific fields, but I'm not sure that would help
much. The tracker and intermineobject tables aren't indexed, if that
makes you feel better.
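
As a rough back-of-envelope for the extrapolation, here is a small Java
sketch that scales index size linearly with database size. Linear scaling
is purely an assumption (the real size depends on which classes and fields
get indexed), and the class and method names are made up for illustration.
The inputs are the figures from this thread: FlyMine's 2 GB index on a
75 GB database, and the 427M index on the 16G mini-mine (treated as
0.427 GB), both scaled to the ~600 GB quoted below.

```java
class IndexSizeEstimate {

    /**
     * Linear extrapolation: assumes the index grows in proportion to the
     * database. This is an assumption, not a measured property of the index.
     */
    static double estimateIndexGb(double sampleIndexGb, double sampleDbGb,
                                  double targetDbGb) {
        if (sampleDbGb <= 0) {
            throw new IllegalArgumentException("sample database size must be positive");
        }
        return sampleIndexGb / sampleDbGb * targetDbGb;
    }

    public static void main(String[] args) {
        double targetDbGb = 600.0; // full mine, from the quoted message
        // FlyMine: 2 GB index for a 75 GB database
        double fromFlyMine = estimateIndexGb(2.0, 75.0, targetDbGb);
        // mini-mine: 427M index for a 16G database
        double fromMiniMine = estimateIndexGb(0.427, 16.0, targetDbGb);
        System.out.printf("scaled from FlyMine:   %.1f GB%n", fromFlyMine);
        System.out.printf("scaled from mini-mine: %.1f GB%n", fromMiniMine);
    }
}
```

Both estimates land at roughly 16 GB, which is only a ballpark. Excluding
more types from the index would lower it.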

2. Alex just wrote a script to stress-test InterMine:

https://github.com/alexkalderimis/death-by-snoo-snoo

The script runs queries, does enrichment, a keyword search etc. On
FlyMine we saw ~150 requests/second and the site didn't slow down.

Another group ran a stress test using JMeter (I think) and had ~12,000
requests per minute without any slowdown of the site. I'll get exact
numbers and write up the results of both tests soon to share with everyone.

It would be interesting to compare performance with your large-ish mine.

I think InterMine will be just fine (I would say that, wouldn't I), but
what does your hardware situation look like? Are you safe?

3. Our release script does touch these URLs as the final step:

http://www.flymine.org/
http://www.flymine.org/query/genomicRegionSearch.do
http://www.flymine.org/query/keywordSearchResults.do?searchTerm=eve&searchSubmit=GO

Other mines also check a single public list, which should trigger the 
list upgrade process.
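
For a mine without such a step, the warm-up above could be sketched in
Java roughly as follows. This is not the actual FlyMine release script:
the class and method names are hypothetical, the 10-minute timeout is a
guess, and it needs Java 11+ for java.net.http. The fetcher is passed in
as a function so the loop can be tried without touching a live site.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

class WarmUp {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .followRedirects(HttpClient.Redirect.NORMAL)
            .build();

    /** Requests each URL once, in order, and records its status (-1 on failure). */
    static Map<String, Integer> touchAll(List<String> urls,
                                         ToIntFunction<String> fetchStatus) {
        Map<String, Integer> results = new LinkedHashMap<>();
        for (String url : urls) {
            int status;
            try {
                status = fetchStatus.applyAsInt(url);
            } catch (RuntimeException e) {
                status = -1; // keep going so one failure does not skip the rest
            }
            results.put(url, status);
        }
        return results;
    }

    /** Plain GET that discards the body; the long timeout allows for slow first hits. */
    static int httpGetStatus(String url) {
        HttpRequest request = HttpRequest.newBuilder(URI.create(url))
                .timeout(Duration.ofMinutes(10)) // assumed value, tune per mine
                .GET()
                .build();
        try {
            return CLIENT.send(request, HttpResponse.BodyHandlers.discarding())
                    .statusCode();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        List<String> urls = List.of(
                "http://www.flymine.org/",
                "http://www.flymine.org/query/genomicRegionSearch.do",
                "http://www.flymine.org/query/keywordSearchResults.do?searchTerm=eve&searchSubmit=GO");
        touchAll(urls, WarmUp::httpGetStatus)
                .forEach((url, status) -> System.out.println(status + " " + url));
    }
}
```

The requests run one after another on purpose, so the slow first-use
initialisations do not compete with each other.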

4. Your keyword search idea is brilliant! If you do make that change,
maybe you can push it into the InterMine repo?

I've made a ticket here:
https://github.com/intermine/intermine/issues/562
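
To make the signature idea concrete, here is a minimal sketch of the
check on restart, based on the description quoted below. It is not
InterMine code: the class, the method names and the index.signature file
name are all hypothetical, and reading the signature back from the
database is left to the caller.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class IndexSignatureCheck {

    // Hypothetical file name, kept next to the expanded index files.
    static final String SIGNATURE_FILE = "index.signature";

    /** True if the index on disk is missing or came from a different blob. */
    static boolean needsRestore(Path indexDir, String dbSignature) {
        Path sigFile = indexDir.resolve(SIGNATURE_FILE);
        if (dbSignature == null || !Files.isRegularFile(sigFile)) {
            return true; // nothing to compare against, so restore to be safe
        }
        try {
            String onDisk = new String(Files.readAllBytes(sigFile),
                    StandardCharsets.UTF_8);
            return !onDisk.trim().equals(dbSignature.trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Call after the blob has been expanded into indexDir. */
    static void recordRestored(Path indexDir, String dbSignature) {
        try {
            Files.createDirectories(indexDir);
            Files.write(indexDir.resolve(SIGNATURE_FILE),
                    dbSignature.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("keyword_search_index");
        System.out.println(needsRestore(dir, "abc123")); // true: no signature yet
        recordRestored(dir, "abc123");
        System.out.println(needsRestore(dir, "abc123")); // false: signatures match
        System.out.println(needsRestore(dir, "def456")); // true: index was rebuilt
    }
}
```

With a check like this, the slow expansion only runs after a rebuild of
the index, not on every restart of the webapp.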

On 06/03/14 18:39, Joe Carlson wrote:
> Julie Sullivan wrote:
>> Hi Joe
>>
>> FlyMine's is the same - about 2G. I haven't timed how long it takes to
>> load into memory but it does take a while. It is an issue, there may
>> be a better way to do things.
>>
>> How much bigger is your database going to get?
>
> Hi Julie,
>
> Thanks for the info.
>
> We're on the verge of going public and trying to tie up a few loose
> ends, and the Lucene indices were one of those.
>
> Right now, the db size of the mine is ~ 600 G. The biggest tables are
> tracker and intermineobject, then the usual suspects of sequencefeature
> and bioentity. Some of our SNP tables are in the top 10. And we expect
> these to get 10x or 20x bigger as time goes by. 'msa' is a table of
> clustal multiple sequence alignments.
>
>           relation            | total_size
> -------------------------------+------------
> public.tracker                | 211 GB
> public.intermineobject        | 135 GB
> public.snpdiversitysample     | 45 GB
> public.clob                   | 33 GB
> public.sequencefeature        | 21 GB
> public.location               | 15 GB
> public.bioentity              | 12 GB
> public.geneflankingregion     | 11 GB
> public.exon                   | 7291 MB
> public.snp                    | 7024 MB
> public.snplocation            | 5132 MB
> public.consequencessnps       | 3863 MB
> public.proteinanalysisfeature | 3331 MB
> public.msa                    | 3016 MB
> public.intermine_sequence     | 2481 MB
> public.utr                    | 1694 MB
> public.cds                    | 1578 MB
> public.proteinproteinfamily   | 1543 MB
>
> I admit that I have not completed building an index for the entire
> thing.  I also have a mini-mine that I'm using for testing (16G). The
> current size (with 'du') of the keyword_search_index is only 427M. It
> takes ~ 30 seconds to expand on first use. I'm just getting nervous
> about the extrapolation.
>
> My thinking is that as part of our web startup procedure, I need to
> run some queries that cause all of the initialize-this-on-first-run
> things to get executed. (The startup in region searching is another slow
> one.) For the keyword searching, I was thinking of doing an 'is this
> really needed' check.
>
> in KeywordSearch.saveIndexToDatabase:
> <code-not-checked>
>             LOG.info("Saving signature to database...");
>             writeObjectToDB(os, MetadataManager.SEARCH_INDEX_SIGNATURE,
>                     TextUtil.generateRandomUniqueString());
>
>             LOG.info("Saving search index information to database...");
>             writeObjectToDB(os, MetadataManager.SEARCH_INDEX, index);
> </code-not-checked>
>
> And when restoring, write the signature to the file system. On restart,
> compare the signature on the file system with the one in the database
> before restoring. You may quibble at keeping a 20-byte string in the
> blob. But you get the idea.
>
> We have a large number of chromosomes in our mine, 528K. Some of the
> plant genomes are very fragmented. (I'm looking at you, switchgrass.)
> Hitting the Regions tab on the web page is painful the first time, so
> I'd like to run that behind the scenes. Are you aware of other things
> that need to be run?
>
> Thanks,
>
> Joe
>
>>
>> Julie
>>
>> On 06/03/14 03:53, Joe Carlson wrote:
>>> Hi,
>>>
>>> Out of curiosity, what is the typical size of your Lucene keyword
>>> search index (for example, for FlyMine)? And how long does it take
>>> to expand it from the blob when a mine gets deployed?
>>>
>>> For a relatively small subset of our data, I’m seeing 3 minutes to
>>> expand. (du of the keyword_search_index is ~ 2G) I’m shuddering when
>>> I think about how big it will get and how long it will take with all
>>> the data.
>>>
>>> Thanks,
>>>
>>> joe


