[InterMine Dev] keyword_search_index size

Joe Carlson jwcarlson at lbl.gov
Thu Mar 6 18:39:08 GMT 2014


Julie Sullivan wrote:
> Hi Joe
>
> FlyMine's is the same - about 2G. I haven't timed how long it takes to 
> load into memory but it does take a while. It is an issue, there may 
> be a better way to do things.
>
> How much bigger is your database going to get?

Hi Julie,

Thanks for the info.

We're on the verge of going public and trying to tie up a few loose 
ends, and the Lucene index is one of those.

Right now, the db size of the mine is ~ 600 G. The biggest tables are 
tracker and intermineobject, then the usual suspects of sequencefeature 
and bioentity. Some of our SNP tables are in the top 10. And we expect 
these to get 10x or 20x bigger as time goes by. 'msa' is a table of 
clustal multiple sequence alignments.

           relation            | total_size
-------------------------------+------------
 public.tracker                | 211 GB
 public.intermineobject        | 135 GB
 public.snpdiversitysample     | 45 GB
 public.clob                   | 33 GB
 public.sequencefeature        | 21 GB
 public.location               | 15 GB
 public.bioentity              | 12 GB
 public.geneflankingregion     | 11 GB
 public.exon                   | 7291 MB
 public.snp                    | 7024 MB
 public.snplocation            | 5132 MB
 public.consequencessnps       | 3863 MB
 public.proteinanalysisfeature | 3331 MB
 public.msa                    | 3016 MB
 public.intermine_sequence     | 2481 MB
 public.utr                    | 1694 MB
 public.cds                    | 1578 MB
 public.proteinproteinfamily   | 1543 MB

I admit that I have not completed building an index for the entire 
thing.  I also have a mini-mine that I'm using for testing (16G). The 
current size (with 'du') of the keyword_search_index is only 427M. It 
takes ~ 30 seconds to expand on first use. I'm just getting nervous 
about the extrapolation.
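
A rough sketch of that extrapolation, using only the figures quoted above. It assumes that the 16G of the mini-mine is its database size and that both index size and expansion time grow linearly with database size. Neither assumption has been measured, and large tables that are not indexed (tracker, for example) would make it an overestimate. The class and method names are made up for illustration.

```java
/** Back-of-envelope estimate; linear scaling with database size is an assumption. */
final class IndexExtrapolation {

    /** Scales a measured value by the ratio of full to sample database size. */
    static double scaleLinearly(double measured, double sampleDbGb, double fullDbGb) {
        return measured * (fullDbGb / sampleDbGb);
    }

    public static void main(String[] args) {
        double sampleDbGb = 16.0;     // mini-mine size quoted above (assumed to be db size)
        double fullDbGb = 600.0;      // full mine db size quoted above
        double indexMb = 427.0;       // keyword_search_index size reported by 'du'
        double expandSeconds = 30.0;  // first-use expansion time

        double fullIndexGb = scaleLinearly(indexMb, sampleDbGb, fullDbGb) / 1024.0;
        double fullExpandMin = scaleLinearly(expandSeconds, sampleDbGb, fullDbGb) / 60.0;

        System.out.printf("estimated index size: %.1f GB%n", fullIndexGb);
        System.out.printf("estimated expand time: %.1f min%n", fullExpandMin);
    }
}
```

Under those assumptions the full mine lands at roughly 16 GB of index and about 19 minutes to expand, before any 10x or 20x growth in the SNP tables.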

My thinking is that as part of our web startup procedure, I need to 
run some queries that cause all of the initialize-this-on-first-run 
things to get executed. (The startup in region searching is another slow 
one.) For the keyword searching, I was thinking of doing an 'is this 
really needed' check.

in KeywordSearch.saveIndexToDatabase:
<code-not-checked>
            LOG.info("Saving signature to database...");
            writeObjectToDB(os, MetadataManager.SEARCH_INDEX_SIGNATURE,
                    TextUtil.generateRandomUniqueString());
            LOG.info("Saving search index information to database...");
            writeObjectToDB(os, MetadataManager.SEARCH_INDEX, index);
</code-not-checked>

Then, when restoring, write the signature to the file system. On 
restart, compare that signature with the one in the database and only 
restore if they differ. You may quibble at keeping a 20-byte string in 
the blob. But you get the idea.
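
The restore-side check might look something like the sketch below. This is not InterMine code: the class, the method names and the signature file name are hypothetical, and the signature read from the database is represented by a plain string.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: skips the index restore when the local copy is current. */
final class IndexSignatureCheck {

    /** True when the signature file is missing or differs from the database signature. */
    static boolean restoreNeeded(Path signatureFile, String dbSignature) throws IOException {
        if (!Files.exists(signatureFile)) {
            return true;
        }
        String local = new String(Files.readAllBytes(signatureFile),
                StandardCharsets.UTF_8).trim();
        return !local.equals(dbSignature);
    }

    /** Records the signature next to the expanded index after a successful restore. */
    static void recordSignature(Path signatureFile, String dbSignature) throws IOException {
        Files.write(signatureFile, dbSignature.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("keyword_search_index");
        Path sig = dir.resolve("signature.txt"); // hypothetical file name
        String fromDb = "abc123";                 // stand-in for the value read from the database

        System.out.println(restoreNeeded(sig, fromDb)); // no local signature yet
        recordSignature(sig, fromDb);                   // done after expanding the blob
        System.out.println(restoreNeeded(sig, fromDb)); // signatures now match
    }
}
```

The point of the design is that only the short signature has to be read from the database on restart; the large blob is fetched and expanded only when the two signatures disagree.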

We have a large number of chromosomes in our mine, 528K. Some of the 
plant genomes are very fragmented. (I'm looking at you, switchgrass.) 
Hitting the Regions tab on the web page is painful the first time, so 
I'd like to run that behind the scenes. Are you aware of other things 
that need to be run?
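
The warm-up idea (trigger the first-use initialisation for keyword search and region search right after the webapp starts, rather than on the first visitor) could be sketched roughly as follows. Everything here is hypothetical: the class name, the task names and the empty placeholder tasks stand in for whatever calls actually force the initialisation.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical start-up hook that triggers lazy initialisation in the background. */
final class StartupWarmup {

    /** Runs each task once on one background thread; a failing task does not stop the rest. */
    static ExecutorService warmUp(Map<String, Runnable> tasks) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        for (Map.Entry<String, Runnable> task : tasks.entrySet()) {
            pool.submit(() -> {
                long start = System.nanoTime();
                try {
                    task.getValue().run();
                } catch (RuntimeException e) {
                    System.err.println("warm-up failed: " + task.getKey() + ": " + e);
                }
                long ms = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
                System.out.println("warm-up " + task.getKey() + " took " + ms + " ms");
            });
        }
        pool.shutdown(); // accept no new tasks; the queued ones still run
        return pool;
    }

    public static void main(String[] args) throws InterruptedException {
        Map<String, Runnable> tasks = new LinkedHashMap<>();
        // Placeholders: a real deployment would run a keyword search and a region search here.
        tasks.put("keyword search", () -> { });
        tasks.put("region search", () -> { });
        warmUp(tasks).awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

A single background thread keeps the warm-up from competing with itself for database connections, and the webapp can start serving requests while it runs.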

Thanks,

Joe

>
> Julie
>
> On 06/03/14 03:53, Joe Carlson wrote:
>> Hi,
>>
>> out of curiosity, what is the typical size of your lucene keyword 
>> search index (for example, for flymine)? And how long does it take 
>> to expand it from the blob when a mine gets deployed?
>>
>> For a relatively small subset of our data, I’m seeing 3 minutes to 
>> expand. (du of the keyword_search_index is ~ 2G) I’m shuddering when 
>> I think about how big it will get and how long it will take with all 
>> the data.
>>
>> Thanks,
>>
>> joe



