[InterMine Dev] keyword_search_index size
jwcarlson at lbl.gov
Thu Mar 6 18:39:08 GMT 2014
Julie Sullivan wrote:
> Hi Joe
> FlyMine's is the same - about 2G. I haven't timed how long it takes to
> load into memory but it does take a while. It is an issue, there may
> be a better way to do things.
> How much bigger is your database going to get?
Thanks for the info.
We're on the verge of going public and trying to tie up a few loose
ends, and the Lucene indices were one of those.
Right now, the db size of the mine is ~ 600 G. The biggest tables are
tracker and intermineobject, then the usual suspects of sequencefeature
and bioentity. Some of our SNP tables are in the top 10. And we expect
these to get 10x or 20x bigger as time goes by. 'msa' is a table of
clustal multiple sequence alignments.
 relation                       | total_size
--------------------------------+------------
 public.tracker                 | 211 GB
 public.intermineobject         | 135 GB
 public.snpdiversitysample      | 45 GB
 public.clob                    | 33 GB
 public.sequencefeature         | 21 GB
 public.location                | 15 GB
 public.bioentity               | 12 GB
 public.geneflankingregion      | 11 GB
 public.exon                    | 7291 MB
 public.snp                     | 7024 MB
 public.snplocation             | 5132 MB
 public.consequencessnps        | 3863 MB
 public.proteinanalysisfeature  | 3331 MB
 public.msa                     | 3016 MB
 public.intermine_sequence      | 2481 MB
 public.utr                     | 1694 MB
 public.cds                     | 1578 MB
 public.proteinproteinfamily    | 1543 MB
I admit that I have not completed building an index for the entire
thing. I also have a mini-mine that I'm using for testing (16G). The
current size (with 'du') of the keyword_search_index is only 427M. It
takes ~ 30 seconds to expand on first use. I'm just getting nervous
about the extrapolation.
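As a rough illustration of that extrapolation, the sketch below scales the
mini-mine numbers (16G database, 427M index, ~30 seconds to expand) up to
the ~600G mine. Linear scaling with database size is purely an assumption
here, not something that has been measured, and the class and method names
are made up for the example.

```java
public class IndexExtrapolation {

    /**
     * Scale a measurement taken on the test database up to the full
     * database, ASSUMING the quantity grows linearly with database size.
     */
    static double scaleLinearly(double measured, double testDbGb, double fullDbGb) {
        return measured * (fullDbGb / testDbGb);
    }

    public static void main(String[] args) {
        double testDbGb = 16.0;   // mini-mine size quoted in the post
        double fullDbGb = 600.0;  // approximate size of the full mine
        double indexMb = 427.0;   // 'du' of keyword_search_index on the mini-mine
        double expandSec = 30.0;  // time to expand on first use on the mini-mine

        double estIndexGb = scaleLinearly(indexMb, testDbGb, fullDbGb) / 1024.0;
        double estExpandMin = scaleLinearly(expandSec, testDbGb, fullDbGb) / 60.0;

        System.out.printf("estimated index size: %.1f GB%n", estIndexGb);
        System.out.printf("estimated expand time: %.1f min%n", estExpandMin);
    }
}
```

If the linear assumption held, the index would be on the order of 16G and
take somewhere near 19 minutes to expand, which is why the extrapolation is
worrying. The real behaviour may well differ, since the index size depends
on which classes and fields are indexed rather than on raw table size.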
My thinking is that as part of our webapp startup procedure, I need to
run some queries that cause all of the initialize-this-on-first-run
things to get executed. (The startup in region searching is another slow
one.) For the keyword searching, I was thinking of doing an 'is this
really needed' check.
LOG.info("Saving signature to database...");
LOG.info("Saving search index information to database...");
writeObjectToDB(os, MetadataManager.SEARCH_INDEX, index);
And when restoring, write the signature to the file system. On restart,
look and check the signature before restoring. You may quibble at
keeping a 20 byte string in the blob. But you get the idea.
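The check described above could be sketched roughly as follows. This is
only an illustration under assumptions: the 20-byte signature is taken to
be a SHA-1 digest of the index blob, the signature is kept on disk in a
small file next to the expanded index, and the class, method and file names
are hypothetical rather than anything in InterMine.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class IndexSignatureCheck {

    /** Hex-encoded SHA-1 of the index blob (20 bytes, 40 hex characters). */
    static String signatureOf(byte[] blob) throws NoSuchAlgorithmException {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        StringBuilder hex = new StringBuilder();
        for (byte b : sha1.digest(blob)) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }

    /**
     * True if the index on disk is missing or stale, i.e. there is no
     * signature file or it differs from the signature stored in the database.
     */
    static boolean needsRestore(Path signatureFile, String dbSignature) throws IOException {
        if (!Files.exists(signatureFile)) {
            return true;
        }
        String onDisk = new String(Files.readAllBytes(signatureFile),
                StandardCharsets.UTF_8).trim();
        return !onDisk.equals(dbSignature);
    }

    /** Call after a successful restore so the next restart can skip it. */
    static void recordRestore(Path signatureFile, String dbSignature) throws IOException {
        Files.write(signatureFile, dbSignature.getBytes(StandardCharsets.UTF_8));
    }
}
```

On restart the webapp would fetch only the small signature from the
database, call needsRestore(), and expand the blob (followed by
recordRestore()) only when it returns true.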
We have a large number of chromosomes in our mine, 528K. Some of the
plant genomes are very fragmented. (I'm looking at you, switchgrass.)
Hitting the Regions tab on the web page is painful the first time, so I'd
like to run that behind the scenes. Are you aware of other things that
need to be run?
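One way to run such first-use initializations behind the scenes is to queue
them on a background thread when the webapp starts. The sketch below shows
only the general pattern with the standard java.util.concurrent classes;
StartupWarmup and warmUpInBackground are hypothetical names, and the tasks
passed in (expanding the keyword index, priming the region search) would
have to be supplied by the caller.

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StartupWarmup {

    /**
     * Run the given warm-up tasks one after another on a single background
     * thread. A task that throws is logged and skipped so it does not stop
     * the remaining tasks. The returned executor has already been shut
     * down, so callers can use awaitTermination() to wait for completion.
     */
    static ExecutorService warmUpInBackground(List<Runnable> tasks) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        for (Runnable task : tasks) {
            pool.submit(() -> {
                try {
                    task.run();
                } catch (RuntimeException e) {
                    System.err.println("Warm-up task failed: " + e);
                }
            });
        }
        pool.shutdown(); // accept no new tasks; queued ones still run
        return pool;
    }
}
```

Running the tasks sequentially on one thread keeps the warm-up from
competing with itself for database connections; whether that is the right
trade-off depends on how the slow initializations behave in practice.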
> On 06/03/14 03:53, Joe Carlson wrote:
>> out of curiosity, what is the typical size of your lucene keyword
>> search index (for example, for flymine)? And how long does it take to
>> expand it from the blob when a mine gets deployed?
>> For a relatively small subset of our data, I’m seeing 3 minutes to
>> expand. (du of the keyword_search_index is ~ 2G) I’m shuddering when
>> I think about how big it will get and how long it will take with all
>> the data.