[InterMine Dev] keyword_search_index size

Julie Sullivan julie at flymine.org
Wed Mar 12 13:03:30 GMT 2014

Hi Joe!

Wow, really great work, and thank you for sharing what you've learned.

IDs should definitely not be indexed and can be safely ignored. That 
should be true for all classes though, so I've made a ticket:


I'm not sure about the zip library; it's been a while since I've looked 
at it. Let me know what you find out!
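One thing that might explain the "(-1 MB)" sizes in Joe's log below: `ZipInputStream` reads entries in streaming mode, and when a zip was written by `ZipOutputStream` without pre-set sizes, the entry size only arrives in a data descriptor after the entry data, so `getSize()` returns -1 until the entry has been fully read. That would be normal behaviour rather than a broken library. A small probe to check (the class and method names here are mine, not InterMine code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

public class ZipSizeProbe {

    /** Builds a one-entry zip in memory, the way the index blob is built. */
    static byte[] makeZip(byte[] payload) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            ZipOutputStream zos = new ZipOutputStream(bos);
            zos.putNextEntry(new ZipEntry("segment.cfs"));
            zos.write(payload);
            zos.closeEntry();
            zos.close();
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Returns {sizeBeforeRead, sizeAfterRead} as seen by ZipInputStream. */
    static long[] probe(byte[] zip) {
        try {
            ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zip));
            ZipEntry entry = zis.getNextEntry();
            long before = entry.getSize();   // size not yet known in streaming mode
            byte[] buf = new byte[4096];
            while (zis.read(buf, 0, buf.length) != -1) {
                // drain the entry; the trailing data descriptor is
                // parsed only once the entry is fully consumed
            }
            long after = entry.getSize();    // now the real size
            zis.close();
            return new long[] { before, after };
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] sizes = probe(makeZip(new byte[100000]));
        System.out.println("before=" + sizes[0] + " after=" + sizes[1]);
    }
}
```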


On 11/03/14 18:31, Joe Carlson wrote:
> Hi Julie,
> I'm still wrestling with the index issue. Getting search to work on
> small mines is great. Our big mine is a problem. I've learned a few
> things that you may find useful in the future if you ever have to deal
> with large indices.
> After trying a search with a large mine, I typically hit an error page.
> This is typical with a big database (~600 GB). On the file
> system where the search_index directory tree is made, 'du' tells me the
> search_index is 44.8 GB. When it's zipped and inserted into the blob,
> the index occupies 2,097,153 pages in pg_largeobject with a total size
> of 4.3 GB.
> I have read that java.util.zip may optionally support the ZIP64
> extension that gets around the 4 GB size limit of the original zip
> format. The fact that I can store it without the code complaining makes
> me think this extension is in there. But I need to write some test
> programs to verify this.
> When I try to extract the search directory in the web app at run time,
> I'm getting an exception thrown almost immediately in
> loadIndexFromDatabase, in the count = zis.read(data, 0, bufferSize) loop:
> java.util.zip.ZipException: invalid entry size (expected 2444315913 but
> got 20 bytes)
> I did not give up. I had kept a copy of the search directory when it was
> created and I manually transferred it to the web server. Then, using the
> keyword_signature hack, I was able to deploy the web app without trying to
> restore the search directory from the db. The first search took a while,
> but after that it was all good.
> I looked a bit more at what I was indexing (using Luke, a handy tool)
> and saw that I had been indexing too many large and unneeded tables.
> (I'm a Lucene newbie here.) So after adding some of these tables to the
> exclude list, I rebuilt the index. This got it down to 11 GB. The blob is
> now 1.2 GB in 594,577 pages.
> I don't know if it's a good sign or bad sign, but I get an exception
> thrown after about 30 minutes:
> [http-8089-1] INFO org.intermine.web.search.KeywordSearch  - Extracting:
> _fw.cfs (-1 MB)
> [http-8089-1] ERROR org.intermine.web.search.KeywordSearch  - Could not
> load search index
> java.util.zip.ZipException: invalid entry size (expected 0 but got
> 1092216169 bytes)
> The exception was thrown at the last write of _fw.cfs as far as I can tell.
> I'm going to try again after making some more exclusions to the indexing.
> The 30 minutes it takes to restore the search index is way too long. And
> I need to implement the search_signature hack so that I do not have to
> re-unzip with every web app restart.
> A couple of things I was wondering:
> 1) Is my version of java.util.zip.ZipEntry just old? When I'm stepping
> through the code with Eclipse it tells me I have v1.42, dated 1/2/2008.
> It bothers me that I see (-1 MB) in the log messages for the
> entry.getSize(). Do you see the correct segment size in your log files?
> 2) Luke tells me that I'm indexing the id fields of all (non-excluded)
> tables. Is this necessary? The id field of InterMineObject itself is 1/4
> of the index size. I sure would like to drop these.
> Anyway, thanks for whatever comments you have,
> Joe
> Julie Sullivan wrote:
>> Hi Joe
>> 1. I'd be interested to see what the index is like for your full
>> database. FlyMine's is 2 GB but that database is only 75 GB. I do have
>> facets, which increase the search index size a bit.
>> Are there any types you can safely not index? Here's a list of FlyMine's:
>> https://github.com/intermine/intermine/blob/dev/flymine/dbmodel/resources/keyword_search.properties#L8
>> You can also ignore specific fields, but I'm not sure whether that
>> would help much. The tracker and intermine object tables aren't
>> indexed, if that makes you feel better.
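For illustration, the exclusions live in the mine's keyword_search.properties file. Something like the following, where index.ignore is the key used in the FlyMine file linked above, but the class names and the field-level key shown here are made-up examples for Joe's schema rather than a checked configuration:

```properties
# classes to leave out of the search index entirely
index.ignore = SNPDiversitySample Consequence MSA

# individual fields to skip on otherwise-indexed classes (illustrative)
index.ignore.fields = BioEntity.score Protein.md5checksum
```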
>> 2. Alex just wrote a script to stress-test InterMine:
>> https://github.com/alexkalderimis/death-by-snoo-snoo
>> The script runs queries, does enrichment, runs a keyword search, etc.
>> On FlyMine we saw about 150 requests/second and the site didn't slow
>> down. Another group ran a stress test using JMeter (I think) and had
>> ~12,000 requests per minute without any slowdown of the site. I'll
>> write up the results of these soon, with exact numbers, to share with
>> everyone.
>> It would be interesting to compare performance with your large-ish mine.
>> I think InterMine will be just fine (I would say that wouldn't I) but
>> what does your hardware situation look like? Are you safe?
>> 3. Our release script does touch these URLs as the final step:
>> http://www.flymine.org/
>> http://www.flymine.org/query/genomicRegionSearch.do
>> http://www.flymine.org/query/keywordSearchResults.do?searchTerm=eve&searchSubmit=GO
>> Other mines also check a single public list, which should trigger the
>> list upgrade process.
>> 4. Your keyword search idea is brilliant! If you do make that change
>> maybe you can push it into the InterMine repo?
>> I've made a ticket here:
>> https://github.com/intermine/intermine/issues/562
>> On 06/03/14 18:39, Joe Carlson wrote:
>>> Julie Sullivan wrote:
>>>> Hi Joe
>>>> FlyMine's is the same - about 2G. I haven't timed how long it takes to
>>>> load into memory but it does take a while. It is an issue, there may
>>>> be a better way to do things.
>>>> How much bigger is your database going to get?
>>> Hi Julie,
>>> Thanks for the info.
>>> We're on the verge of going public and trying to tie up a few loose
>>> ends, and the Lucene indices were one of those.
>>> Right now, the db size of the mine is ~ 600 G. The biggest tables are
>>> tracker and intermineobject, then the usual suspects of sequencefeature
>>> and bioentity. Some of our SNP tables are in the top 10. And we expect
>>> these to get 10x or 20x bigger as time goes by. 'msa' is a table of
>>> Clustal multiple sequence alignments.
>>>           relation            | total_size
>>> -------------------------------+------------
>>> public.tracker                | 211 GB
>>> public.intermineobject        | 135 GB
>>> public.snpdiversitysample     | 45 GB
>>> public.clob                   | 33 GB
>>> public.sequencefeature        | 21 GB
>>> public.location               | 15 GB
>>> public.bioentity              | 12 GB
>>> public.geneflankingregion     | 11 GB
>>> public.exon                   | 7291 MB
>>> public.snp                    | 7024 MB
>>> public.snplocation            | 5132 MB
>>> public.consequencessnps       | 3863 MB
>>> public.proteinanalysisfeature | 3331 MB
>>> public.msa                    | 3016 MB
>>> public.intermine_sequence     | 2481 MB
>>> public.utr                    | 1694 MB
>>> public.cds                    | 1578 MB
>>> public.proteinproteinfamily   | 1543 MB
>>> I admit that I have not completed building an index for the entire
>>> thing. I also have a mini-mine that I'm using for testing (16 GB). The
>>> current size (with 'du') of the keyword_search_index is only 427 MB. It
>>> takes ~30 seconds to expand on first use. I'm just getting nervous
>>> about the extrapolation.
>>> My thinking is that as part of our web startup procedure, I need to
>>> run some queries that cause all of the initialize-this-on-first-run
>>> things to get executed. (The startup in region searching is another slow
>>> one.) For the keyword searching, I was thinking of doing an 'is this
>>> really needed' check
>>> in KeywordSearch.saveIndexToDatabase:
>>> <code-not-checked>
>>> LOG.info("Saving signature to database...");
>>> writeObjectToDB(os, MetadataManager.SEARCH_INDEX_SIGNATURE,
>>>         TextUtil.generateRandomUniqueString());
>>> LOG.info("Saving search index information to database...");
>>> writeObjectToDB(os, MetadataManager.SEARCH_INDEX, index);
>>> </code-not-checked>
>>> And when restoring, write the signature to the file system. On restart,
>>> check the signature before restoring. You may quibble at keeping a
>>> 20-byte string in the blob. But you get the idea.
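The restore-side check Joe describes might look something like this sketch: keep the signature in a small file next to the extracted index and compare it with the database copy before deciding to re-extract. The file name and helper names here are mine, not InterMine API:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class IndexSignature {

    /** Records the signature next to the extracted index directory. */
    static void writeSignature(Path indexDir, String signature) {
        try {
            Files.createDirectories(indexDir);
            Files.write(indexDir.resolve("search_index.signature"),
                        signature.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the on-disk index already matches the signature stored in the db. */
    static boolean indexIsCurrent(Path indexDir, String dbSignature) {
        try {
            Path sigFile = indexDir.resolve("search_index.signature");
            if (!Files.exists(sigFile)) {
                return false; // never extracted, or a fresh deployment
            }
            String onDisk = new String(Files.readAllBytes(sigFile), StandardCharsets.UTF_8);
            return onDisk.equals(dbSignature);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Tiny end-to-end check in a temp directory. */
    static boolean selfCheck() {
        try {
            Path dir = Files.createTempDirectory("idx-sig-demo");
            writeSignature(dir, "abc123");
            return indexIsCurrent(dir, "abc123") && !indexIsCurrent(dir, "stale");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(selfCheck());
    }
}
```

Only re-extracting when the signatures differ would turn the 30-minute unzip into a one-time cost per index build rather than per web-app restart.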
>>> We have a large number of chromosomes in our mine, 528K. Some of the
>>> plant genomes are very fragmented (I'm looking at you, switchgrass), and
>>> hitting the Regions tab on the web page is painful the first time so I'd
>>> like to run that behind the scenes. Are you aware of other things that
>>> need to be run?
>>> Thanks,
>>> Joe
>>>> Julie
>>>> On 06/03/14 03:53, Joe Carlson wrote:
>>>>> Hi,
>>>>> out of curiosity, what is the typical size of your Lucene keyword
>>>>> search index (for example, for FlyMine)? And how long does it take to
>>>>> expand it from the blob when a mine gets deployed?
>>>>> For a relatively small subset of our data, I'm seeing 3 minutes to
>>>>> expand. (du of the keyword_search_index is ~2 GB.) I'm shuddering when
>>>>> I think about how big it will get and how long it will take with all
>>>>> the data.
>>>>> Thanks,
>>>>> joe
>>>>> _______________________________________________
>>>>> dev mailing list
>>>>> dev at intermine.org
>>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
