[InterMine Dev] keyword_search_index size

Joe Carlson jwcarlson at lbl.gov
Tue Mar 11 22:25:27 GMT 2014


Hi again Julie,

As a final note, further trimming of the indexed fields gives me an index size of 66M in the database and 3.5G on disk. The search directory expands properly at run time. It still takes a little over 7 minutes to expand, so I'm going to keep using the signature hack, and I can see a few more indexed fields to cut.

Thanks,

Joe

 
On Mar 11, 2014, at 11:31 AM, Joe Carlson wrote:

> Hi Julie,
> 
> I'm still wrestling with the index issue. Getting searching to work on small mines is great; our big mine is a problem. I've learned a few things that you may find useful in the future if you ever have to deal with large indices.
> 
> After trying a search on the large mine I typically hit an error page. This is with a big database (~600 GB). On the file system where the search_index directory tree is built, 'du' reports the search_index at 44.8 GB. When it's zipped and inserted into the blob, the index occupies 2,097,153 pages in pg_largeobject, with a total size of 4.3 GB.
> 
> I have read that java.util.zip may optionally support the ZIP64 extension, which gets around the 4 GB size limit of the original zip format. The fact that I can store the blob without the code complaining makes me think the extension is there, but I need to write some test programs to verify it.
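One cheap probe: before ZIP64 support landed in java.util.zip (JDK 7), ZipEntry.setSize() rejected anything over the 4 GB ZIP32 limit with an IllegalArgumentException, so a one-liner tells you which world your runtime is in. A sketch only; the class and method names here are mine, not InterMine's:

```java
import java.util.zip.ZipEntry;

// Probe whether this runtime's java.util.zip accepts ZIP64-sized entries.
// Pre-ZIP64 JDKs threw IllegalArgumentException for sizes above 0xFFFFFFFF.
public class Zip64Probe {
    public static boolean supportsLargeEntries() {
        try {
            new ZipEntry("probe").setSize(5L * 1024 * 1024 * 1024); // 5 GB
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("ZIP64-sized entries accepted: " + supportsLargeEntries());
    }
}
```

This only tests whether entry sizes above 4 GB are representable; writing an actual >4 GB archive is still the definitive test.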
> 
> When I try to extract the search directory in the web app at run time, an exception is thrown almost immediately in loadIndexFromDatabase, in the count = zis.read(data, 0, bufferSize) loop:
> 
> java.util.zip.ZipException: invalid entry size (expected 2444315913 but got 20 bytes)
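For context, that loop has roughly the following shape (a self-contained sketch, not InterMine's actual code). The "invalid entry size" ZipException is raised by ZipInputStream itself when the number of bytes it inflates disagrees with the size recorded in the archive, which points at the stored blob being corrupt or truncated rather than at the loop:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Minimal zip/unzip round trip using the same buffered read loop the email
// refers to. Names are illustrative, not InterMine's.
public class ZipRoundTrip {

    static byte[] zipSingleEntry(String name, byte[] data) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(bos)) {
            zos.putNextEntry(new ZipEntry(name));
            zos.write(data);
            zos.closeEntry();
        }
        return bos.toByteArray();
    }

    static byte[] unzipFirstEntry(byte[] zipped) throws IOException {
        try (ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(zipped))) {
            zis.getNextEntry();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] data = new byte[4096];
            int count;
            // the count = zis.read(data, 0, bufferSize) loop:
            while ((count = zis.read(data, 0, data.length)) != -1) {
                out.write(data, 0, count);
            }
            return out.toByteArray();
        }
    }

    /** Round-trips a small payload; corruption would surface as a ZipException. */
    public static boolean selfTest() {
        try {
            byte[] original = "lucene segment bytes".getBytes(StandardCharsets.UTF_8);
            return Arrays.equals(original, unzipFirstEntry(zipSingleEntry("_fw.cfs", original)));
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("round trip ok: " + selfTest());
    }
}
```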
> 
> I did not give up. I had kept a copy of the search directory from when it was created, and I manually transferred it to the web server. Then, using the keyword_signature hack, I was able to deploy the web app without trying to restore the search directory from the db. The first search took a while, but after that it was all good.
> 
> I looked a bit more at what I was indexing (using Luke, a handy tool) and saw that I had been indexing too many large and unneeded tables. (I'm a Lucene newbie here.) After adding some of these tables to the exclude list, I rebuilt the index. That got it down to 11G; the blob is now 1.2G in 594,577 pages.
> 
> I don't know if it's a good sign or a bad one, but I get an exception thrown after about 30 minutes:
> 
> [http-8089-1] INFO org.intermine.web.search.KeywordSearch  - Extracting: _fw.cfs (-1 MB)
> [http-8089-1] ERROR org.intermine.web.search.KeywordSearch  - Could not load search index
> java.util.zip.ZipException: invalid entry size (expected 0 but got 1092216169 bytes)
> 
> The exception was thrown at the last write of _fw.cfs as far as I can tell.
> 
> I'm going to try again after making some more exclusions to the indexing. The 30 minutes it takes to restore the search index is far too long, and I need to implement the search_signature hack so that I do not have to re-unzip on every web app restart.
> 
> A couple of things I was wondering:
> 
> 1) Is my version of java.util.zip.ZipEntry just old? When I'm stepping through the code with Eclipse, it tells me I have v1.42, dated 1/2/2008. It bothers me that I see (-1 MB) in the log messages for entry.getSize(). Do you see the correct segment size in your log files?
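For what it's worth, the (-1 MB) may not indicate an old ZipEntry at all: when ZipOutputStream deflates an entry it does not know the sizes up front and records them after the data, in the data descriptor, so ZipInputStream returns -1 from getSize() at the point the entry header is read. A minimal demonstration (class name is mine):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Shows that a deflated, streamed zip entry reports size -1 when its header
// is read: the real size only becomes known after the entry is fully consumed.
public class EntrySizeDemo {
    public static long sizeAtHeader() {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ZipOutputStream zos = new ZipOutputStream(bos)) {
                zos.putNextEntry(new ZipEntry("_fw.cfs"));
                zos.write(new byte[1024]);
                zos.closeEntry();
            }
            try (ZipInputStream zis = new ZipInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                ZipEntry entry = zis.getNextEntry();
                return entry.getSize(); // -1: size still unknown while streaming
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("size at header: " + sizeAtHeader());
    }
}
```

So "(-1 MB)" is expected for a streamed archive and is not by itself evidence of a stale library.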
> 
> 2) Luke tells me that I'm indexing the id fields of all (non-excluded) tables. Is this necessary? The id field of InterMineObject itself is 1/4 of the index size. I sure would like to drop these.
> 
> Anyway, thanks for whatever comments you have,
> 
> Joe
> 
> Julie Sullivan wrote:
>> 
>> Hi Joe 
>> 
>> 1. I'd be interested to see what the index is like for your full database. FlyMine's is 2 GB, but that database is only 75 GB. I do have facets, which increase the search index size a bit. 
>> 
>> Are there any types you can safely not index? Here's a list of FlyMine's: 
>> 
>> https://github.com/intermine/intermine/blob/dev/flymine/dbmodel/resources/keyword_search.properties#L8 
>> 
>> You can also ignore specific fields, but I'm not sure whether that would help matters much. The tracker and intermine object tables aren't indexed, if that makes you feel better. 
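For the archive, the exclusions live in keyword_search.properties. The property names and values below are from memory of the FlyMine file linked above and should be verified against it before use:

```properties
# Classes to leave out of the Lucene index entirely (class list illustrative)
index.ignore = Sequence Synonym Location

# Individual fields to skip (property name from memory; verify against the
# FlyMine keyword_search.properties before relying on it)
index.ignore.fields = Gene.scoreType Protein.md5checksum
```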
>> 
>> 2. Alex just wrote a script to stress-test InterMine: 
>> 
>> https://github.com/alexkalderimis/death-by-snoo-snoo 
>> 
>> The script runs queries, does enrichment, runs a keyword search, etc. On FlyMine we saw ~150 requests/second and the site didn't slow down. 
>> 
>> Another group ran a stress test using JMeter (I think) and had ~12,000 requests per minute without any slowdown of the site. I'll write up the results of these soon, with exact numbers, to share with everyone. 
>> 
>> It would be interesting to compare performance with your large-ish mine. 
>> 
>> I think InterMine will be just fine (I would say that wouldn't I) but what does your hardware situation look like? Are you safe? 
>> 
>> 3. Our release script does touch these URLs as the final step: 
>> 
>> http://www.flymine.org/ 
>> http://www.flymine.org/query/genomicRegionSearch.do 
>> http://www.flymine.org/query/keywordSearchResults.do?searchTerm=eve&searchSubmit=GO 
>> 
>> Other mines also check a single public list, which should trigger the list upgrade process. 
>> 
>> 4. Your keyword search idea is brilliant! If you do make that change maybe you can push it into the InterMine repo? 
>> 
>> I've made a ticket here: 
>> https://github.com/intermine/intermine/issues/562 
>> 
>> On 06/03/14 18:39, Joe Carlson wrote: 
>>> Julie Sullivan wrote: 
>>>> Hi Joe 
>>>> 
>>>> FlyMine's is the same - about 2G. I haven't timed how long it takes to 
>>>> load into memory but it does take a while. It is an issue, there may 
>>>> be a better way to do things. 
>>>> 
>>>> How much bigger is your database going to get? 
>>> 
>>> Hi Julie, 
>>> 
>>> Thanks for the info. 
>>> 
>>> We're on the verge of going public and trying to tie up a few loose 
>>> ends; the Lucene indices were one of those. 
>>> 
>>> Right now, the db size of the mine is ~ 600 G. The biggest tables are 
>>> tracker and intermineobject, then the usual suspects of sequencefeature 
>>> and bioentity. Some of our SNP tables are in the top 10. And we expect 
>>> these to get 10x or 20x bigger as time goes by. 'msa' is a table of 
>>> clustal multiple sequence alignments. 
>>> 
>>>           relation            | total_size 
>>> -------------------------------+------------ 
>>> public.tracker                | 211 GB 
>>> public.intermineobject        | 135 GB 
>>> public.snpdiversitysample     | 45 GB 
>>> public.clob                   | 33 GB 
>>> public.sequencefeature        | 21 GB 
>>> public.location               | 15 GB 
>>> public.bioentity              | 12 GB 
>>> public.geneflankingregion     | 11 GB 
>>> public.exon                   | 7291 MB 
>>> public.snp                    | 7024 MB 
>>> public.snplocation            | 5132 MB 
>>> public.consequencessnps       | 3863 MB 
>>> public.proteinanalysisfeature | 3331 MB 
>>> public.msa                    | 3016 MB 
>>> public.intermine_sequence     | 2481 MB 
>>> public.utr                    | 1694 MB 
>>> public.cds                    | 1578 MB 
>>> public.proteinproteinfamily   | 1543 MB 
>>> 
>>> I admit that I have not completed building an index for the entire 
>>> thing.  I also have a mini-mine that I'm using for testing (16G). The 
>>> current size (with 'du') of the keyword_search_index is only 427M. It 
>>> takes ~ 30 seconds to expand on first use. I'm just getting nervous 
>>> about the extrapolation. 
>>> 
>>> My thinking is that as part of our web starting procedures, I need to 
>>> run some queries that cause all of the initialize-this-on-first-run 
>>> things to get executed. (The startup in region searching is another slow 
>>> one.) For the keyword searching, I was thinking of doing an 'is this 
>>> really needed' check. 
>>> 
>>> In KeywordSearch.saveIndexToDatabase: 
>>> <code-not-checked> 
>>>     // store a random signature alongside the index blob itself
>>>     LOG.info("Saving signature to database..."); 
>>>     writeObjectToDB(os, MetadataManager.SEARCH_INDEX_SIGNATURE, 
>>>             TextUtil.generateRandomUniqueString()); 
>>> 
>>>     LOG.info("Saving search index information to database..."); 
>>>     writeObjectToDB(os, MetadataManager.SEARCH_INDEX, index); 
>>> </code-not-checked> 
>>> 
>>> And when restoring, write the signature to the file system. On restart, 
>>> check the on-disk signature against the database before re-extracting. You 
>>> may quibble at keeping a 20-byte string in the blob, but you get the idea. 
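The restore side of that idea can be sketched like this; the class, method, and marker-file names here are hypothetical, not InterMine's:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// After expanding the index, drop the database's signature into a marker file
// in the index directory; on the next startup, skip the slow unzip whenever
// the marker still matches what the database holds.
public class SearchIndexSignature {
    private static final String MARKER = "search_index.signature";

    public static void writeMarker(Path indexDir, String dbSignature) {
        try {
            Files.write(indexDir.resolve(MARKER),
                    dbSignature.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True when the on-disk index was expanded from the blob the db currently holds. */
    public static boolean isCurrent(Path indexDir, String dbSignature) {
        Path marker = indexDir.resolve(MARKER);
        try {
            return Files.exists(marker)
                && dbSignature.equals(new String(Files.readAllBytes(marker),
                        StandardCharsets.UTF_8));
        } catch (IOException e) {
            return false; // unreadable marker: play safe and re-expand
        }
    }

    /** Quick demonstration against a temp directory. */
    public static boolean selfTest() {
        try {
            Path dir = Files.createTempDirectory("idx-sig-demo");
            boolean staleBefore = !isCurrent(dir, "abc123");
            writeMarker(dir, "abc123");
            return staleBefore && isCurrent(dir, "abc123") && !isCurrent(dir, "other");
        } catch (IOException e) {
            return false;
        }
    }
}
```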
>>> 
>>> We have a large number of chromosomes in our mine, 528K. Some of the 
>>> plant genomes are very fragmented. (I'm looking at you, switchgrass) And 
>>> hitting the Regions tab on the web page is painful the first time so I'd 
>>> like to run that behind the scenes. Are you aware of other things that 
>>> need to be run? 
>>> 
>>> Thanks, 
>>> 
>>> Joe 
>>> 
>>>> 
>>>> Julie 
>>>> 
>>>> On 06/03/14 03:53, Joe Carlson wrote: 
>>>>> Hi, 
>>>>> 
>>>>> Out of curiosity, what is the typical size of your Lucene keyword 
>>>>> search index (for example, for FlyMine)? And how long does it take to 
>>>>> expand it from the blob when a mine gets deployed? 
>>>>> 
>>>>> For a relatively small subset of our data, I'm seeing 3 minutes to 
>>>>> expand (du of the keyword_search_index is ~2G). I shudder when I 
>>>>> think about how big it will get and how long it will take with all 
>>>>> the data. 
>>>>> 
>>>>> Thanks, 
>>>>> 
>>>>> joe 
>>>>> 
>>>>> 
>>>>> _______________________________________________ 
>>>>> dev mailing list 
>>>>> dev at intermine.org 
>>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev 
>>>>> 
>>> 
>>> 
> 


