[InterMine Dev] keyword_search_index size

Joe Carlson jwcarlson at lbl.gov
Tue Mar 11 18:31:22 GMT 2014

Hi Julie,

I'm still wrestling with the index issue. Getting search to work on small mines is going great; our big mine is the problem. I've learned a few things that you may find useful in the future if you ever have to deal with large indices.

When I try a search on a large mine (a database of ~600 GB), I typically hit an error page. On the file system where the search_index directory tree is made, 'du' tells me the search_index is 44.8 GB. When it's zipped and inserted into the blob, the index occupies 2,097,153 pages in pg_largeobject with a total size of 4.3 GB.
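
As a sanity check on the page count, pages can be converted to bytes as sketched below. This assumes PostgreSQL's default large-object chunk size of 2048 bytes per pg_largeobject row (a quarter of the default 8 kB block size); a server built with a different block size would give different numbers. The class and method names are made up for illustration.

```java
// Rough size check for a blob stored in pg_largeobject.
// Assumption: default chunk size of 2048 bytes per page (BLCKSZ / 4).
final class LargeObjectSize {
    static final long ASSUMED_PAGE_BYTES = 2048L;

    // Upper bound on the blob size: the last page may be only partly filled.
    static long pagesToBytes(long pages) {
        return pages * ASSUMED_PAGE_BYTES;
    }

    public static void main(String[] args) {
        long[] pageCounts = {2097153L, 594577L};
        for (long pages : pageCounts) {
            long bytes = pagesToBytes(pages);
            System.out.printf("%,d pages -> %,d bytes (%.2f GB)%n", pages, bytes, bytes / 1e9);
        }
    }
}
```

Under that assumption, 2,097,153 pages is at most 4,294,969,344 bytes, which agrees with the 4.3 gigabyte figure and is exactly one page more than 2^32 bytes.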

I have read that java.util.zip may optionally support the ZIP64 extension, which gets around the 4 GB size limit of the original zip format. The fact that I can store the index without the code complaining makes me think this extension is in there. But I need to write some test programs to verify this.
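
One possible shape for such a test program is sketched below: it writes a single entry of a chosen size through ZipOutputStream, streams it back through ZipInputStream, and reports how many bytes came back. ZIP64 support in java.util.zip is generally documented as arriving with Java 7, so an older JVM may fail here. The class name, entry name and default size are illustrative only.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Round-trips one zip entry of a chosen size to see whether this JVM's
// java.util.zip can write and stream back entries past the 4 GB limit.
final class ZipRoundTrip {

    // Writes payloadBytes zero bytes as one entry, reads them back, returns the count read.
    static long roundTrip(long payloadBytes) throws IOException {
        File zip = File.createTempFile("zip-roundtrip", ".zip");
        byte[] buffer = new byte[1 << 16];
        try {
            ZipOutputStream zos = new ZipOutputStream(
                    new BufferedOutputStream(new FileOutputStream(zip)));
            try {
                zos.putNextEntry(new ZipEntry("payload.bin"));
                for (long left = payloadBytes; left > 0; ) {
                    int n = (int) Math.min(buffer.length, left);
                    zos.write(buffer, 0, n);
                    left -= n;
                }
                zos.closeEntry();
            } finally {
                zos.close();
            }

            long total = 0;
            ZipInputStream zis = new ZipInputStream(
                    new BufferedInputStream(new FileInputStream(zip)));
            try {
                if (zis.getNextEntry() == null) {
                    throw new IOException("no entry found");
                }
                int count;
                while ((count = zis.read(buffer, 0, buffer.length)) != -1) {
                    total += count;
                }
            } finally {
                zis.close();
            }
            return total;
        } finally {
            zip.delete();
        }
    }

    public static void main(String[] args) throws IOException {
        // Default is 5 GiB, just past the classic limit; pass a byte count to override.
        long size = args.length > 0 ? Long.parseLong(args[0]) : 5L * 1024 * 1024 * 1024;
        System.out.println("wrote " + size + " bytes, read back " + roundTrip(size));
    }
}
```

Because the payload is all zeros it compresses to a few megabytes, so this only exercises an uncompressed entry size above 4 GB. Testing a compressed size above 4 GB would need incompressible data and correspondingly more disk.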

When I try to extract the search_directory in the web app at run time, I get an exception thrown almost immediately in loadIndexFromDatabase, in the count = zis.read(data, 0, bufferSize) loop:

java.util.zip.ZipException: invalid entry size (expected 2444315913 but got 20 bytes)

I did not give up. I had kept a copy of the search directory when it was created and I manually transferred it to the web server. Then, using the keyword_signature hack, I was able to deploy the web app without trying to restore the search directory from the db. The first search took a while, but after that it was all good.

I looked a bit more at what I was indexing (using Luke, a handy tool), and saw that I had been indexing too many large and unneeded tables. (I'm a Lucene newbie here.) So after adding some of these tables to the exclude list, I rebuilt the index. This got it down to 11 GB. The blob is now 1.2 GB in 594,577 pages.

I don't know if it's a good sign or bad sign, but I get an exception thrown after about 30 minutes:

[http-8089-1] INFO org.intermine.web.search.KeywordSearch  - Extracting: _fw.cfs (-1 MB)
[http-8089-1] ERROR org.intermine.web.search.KeywordSearch  - Could not load search index
java.util.zip.ZipException: invalid entry size (expected 0 but got 1092216169 bytes)

The exception was thrown at the last write of _fw.cfs as far as I can tell.

I'm going to try again after making some more exclusions to the indexing. The 30 minutes it takes to restore the search index is way too long. And I need to implement the search_signature hack so that I do not have to re-unzip with every web app restart.
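
The signature check could look roughly like the sketch below: keep a small marker file next to the unpacked index, and only unzip again when the marker is missing or differs from the signature stored in the database. This is an illustration only, not InterMine code; the class name, the method names and the marker file name are all hypothetical, and how the signature is read from the database is left out.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of the signature check: skip the unzip when the index on disk
// was restored from the same blob that is currently in the database.
final class IndexSignatureCheck {
    // Hypothetical marker file name, kept inside the unpacked index directory.
    static final String SIGNATURE_FILE = "index.signature";

    // True when no marker exists or it differs from the signature in the database.
    static boolean needsRestore(Path indexDir, String dbSignature) throws IOException {
        Path marker = indexDir.resolve(SIGNATURE_FILE);
        if (!Files.isRegularFile(marker)) {
            return true;
        }
        String onDisk = new String(Files.readAllBytes(marker), StandardCharsets.UTF_8).trim();
        return !onDisk.equals(dbSignature);
    }

    // Call only after the index has been unpacked completely.
    static void recordSignature(Path indexDir, String dbSignature) throws IOException {
        Files.createDirectories(indexDir);
        Files.write(indexDir.resolve(SIGNATURE_FILE),
                dbSignature.getBytes(StandardCharsets.UTF_8));
    }
}
```

Writing the marker last means an interrupted restore leaves no marker behind, so the next restart unpacks again instead of trusting a partial index.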

A couple things I was wondering are:

1) Is my version of java.util.zip.ZipEntry just old? When I'm stepping through the code with Eclipse it tells me I have v1.42 dated 1/2/2008. It bothers me that I see (-1 MB) in the log messages for entry.getSize(). Do you see the correct segment size in your log files?

2) Luke tells me that I'm indexing the id fields of all (non-excluded) tables. Is this necessary? The id field of InterMineObject itself is 1/4 of the index size. I sure would like to drop these.
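
On the (-1 MB) in question 1: the ZipEntry.getSize() javadoc says it returns -1 when the size is not known. When an archive is written by ZipOutputStream without the sizes set up front and then streamed back through ZipInputStream, the size is typically not available until the entry has been read through. The sketch below is one way to check this on a given JVM; the class, method and entry names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Shows what getSize() reports for a streamed entry before and after reading it.
final class StreamedEntrySize {

    // Returns {size right after getNextEntry(), size once the entry is fully read}.
    static long[] sizesBeforeAndAfterRead(byte[] payload) throws IOException {
        ByteArrayOutputStream zipped = new ByteArrayOutputStream();
        ZipOutputStream zos = new ZipOutputStream(zipped);
        zos.putNextEntry(new ZipEntry("segment.bin")); // size is not set up front
        zos.write(payload);
        zos.closeEntry();
        zos.close();

        ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipped.toByteArray()));
        try {
            ZipEntry entry = zis.getNextEntry();
            if (entry == null) {
                throw new IOException("no entry found");
            }
            long before = entry.getSize();
            byte[] buffer = new byte[4096];
            while (zis.read(buffer, 0, buffer.length) != -1) {
                // drain the entry; the contents are not needed here
            }
            long after = entry.getSize();
            return new long[] {before, after};
        } finally {
            zis.close();
        }
    }

    public static void main(String[] args) throws IOException {
        long[] sizes = sizesBeforeAndAfterRead(new byte[10000]);
        System.out.println("before read: " + sizes[0] + ", after read: " + sizes[1]);
    }
}
```

If the first value is -1 here as well, a log line printed before the extraction loop would show a negative size no matter how old the ZipEntry class is.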

Anyway, thanks for whatever comments you have,


Julie Sullivan wrote:
> Hi Joe 
> 1. I'd be interested to see what the index is like for your full database. FlyMine's is 2 GB but that database is only 75 GB. I do have facets, which increase the search index size a bit.
> Are there any types you can safely not index? Here's a list of FlyMine's: 
> https://github.com/intermine/intermine/blob/dev/flymine/dbmodel/resources/keyword_search.properties#L8 
> You can also ignore specific fields, but not sure if that would help matters any. The tracker and intermine object tables aren't indexed if that makes you feel better. 
> 2. Alex just wrote a script to stress-test InterMine: 
> https://github.com/alexkalderimis/death-by-snoo-snoo 
> The script runs queries, does enrichment, a keyword search etc. On FlyMine we saw about 150 requests/second and the site didn't slow down.
> Another group ran a stress test using JMeter (I think) and had ~12,000 requests per minute without any slow down of the site. I'll write up the results of these soon to share with everyone. And get exact numbers. 
> It would be interesting to compare performance with your large-ish mine. 
> I think InterMine will be just fine (I would say that wouldn't I) but what does your hardware situation look like? Are you safe? 
> 3. Our release script does touch these URLs as the final step: 
> http://www.flymine.org/ 
> http://www.flymine.org/query/genomicRegionSearch.do 
> http://www.flymine.org/query/keywordSearchResults.do?searchTerm=eve&searchSubmit=GO 
> Other mines also check a single public list, which should trigger the list upgrade process. 
> 4. Your keyword search idea is brilliant! If you do make that change maybe you can push it into the InterMine repo? 
> I've made a ticket here: 
> https://github.com/intermine/intermine/issues/562 
> On 06/03/14 18:39, Joe Carlson wrote: 
>> Julie Sullivan wrote: 
>>> Hi Joe 
>>> FlyMine's is the same - about 2G. I haven't timed how long it takes to 
>>> load into memory but it does take a while. It is an issue, there may 
>>> be a better way to do things. 
>>> How much bigger is your database going to get? 
>> Hi Julie, 
>> Thanks for the info. 
>> We're on the verge of going public and trying to tie up a few loose 
>> ends, and the Lucene indices were one of those.
>> Right now, the db size of the mine is ~ 600 G. The biggest tables are 
>> tracker and intermineobject, then the usual suspects of sequencefeature 
>> and bioentity. Some of our SNP tables are in the top 10. And we expect 
>> these to get 10x or 20x bigger as time goes by. 'msa' is a table of 
>> clustal multiple sequence alignments. 
>>           relation            | total_size 
>> -------------------------------+------------ 
>> public.tracker                | 211 GB 
>> public.intermineobject        | 135 GB 
>> public.snpdiversitysample     | 45 GB 
>> public.clob                   | 33 GB 
>> public.sequencefeature        | 21 GB 
>> public.location               | 15 GB 
>> public.bioentity              | 12 GB 
>> public.geneflankingregion     | 11 GB 
>> public.exon                   | 7291 MB 
>> public.snp                    | 7024 MB 
>> public.snplocation            | 5132 MB 
>> public.consequencessnps       | 3863 MB 
>> public.proteinanalysisfeature | 3331 MB 
>> public.msa                    | 3016 MB 
>> public.intermine_sequence     | 2481 MB 
>> public.utr                    | 1694 MB 
>> public.cds                    | 1578 MB 
>> public.proteinproteinfamily   | 1543 MB 
>> I admit that I have not completed building an index for the entire 
>> thing.  I also have a mini-mine that I'm using for testing (16G). The 
>> current size (with 'du') of the keyword_search_index is only 427M. It 
>> takes ~ 30 seconds to expand on first use. I'm just getting nervous 
>> about the extrapolation. 
>> My thinking is that as part of our web starting procedures, I need to 
>> run some queries that cause all of the initialize-this-on-first-run 
>> things to get executed. (The startup in region searching is another slow 
>> one.) For the keyword searching, I was thinking of doing an 'is this
>> really needed' check. 
>> in KeywordSearch.saveIndexToDatabase: 
>> <code-not-checked> 
>>             LOG.info("Saving signature to database..."); 
>>             writeObjectToDB(os, MetadataManager.SEARCH_INDEX_SIGNATURE, TextUtil.generateRandomUniqueString());
>>             LOG.info("Saving search index information to database..."); 
>>             writeObjectToDB(os, MetadataManager.SEARCH_INDEX, index); 
>> </code-not-checked> 
>> And when restoring, write the signature to the file system.  On restart, 
>> look and check the signature before restoring. You may quibble at 
>> keeping a 20 byte string in the blob. But you get the idea. 
>> We have a large number of chromosomes in our mine, 528K. Some of the 
>> plant genomes are very fragmented. (I'm looking at you, switchgrass) And 
>> hitting the Regions tab on the web page is painful the first time so I'd 
>> like to run that behind the scenes. Are you aware of other things that 
>> need to be run? 
>> Thanks, 
>> Joe 
>>> Julie 
>>> On 06/03/14 03:53, Joe Carlson wrote: 
>>>> Hi, 
>>>> Out of curiosity, what is the typical size of your Lucene keyword
>>>> search index (for example, for FlyMine)? And how long does it take
>>>> to expand it from the blob when a mine gets deployed?
>>>> For a relatively small subset of our data, I’m seeing 3 minutes to 
>>>> expand. (du of the keyword_search_index is ~ 2G) I’m shuddering when 
>>>> I think about how big it will get and how long it will take with all 
>>>> the data. 
>>>> Thanks, 
>>>> joe 
>>>> _______________________________________________ 
>>>> dev mailing list 
>>>> dev at intermine.org 
>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev 
