[InterMine Dev] keyword_search_index size
jwcarlson at lbl.gov
Tue Mar 11 18:31:22 GMT 2014
I'm still wrestling with the index issue. Getting search to work on small mines is fine; our big mine is the problem. I've learned a few things that you may find useful in the future if you ever have to deal with large indices.
When I try a search on the large mine (the database is ~600 GB), I typically hit an error page. On the file system where the search_index directory tree is built, 'du' tells me the search_index is 44.8 GB. When it's zipped and inserted into the blob, the index occupies 2,097,153 pages in pg_largeobject with a total size of 4.3 GB.
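As a sanity check on the page count: assuming PostgreSQL's default large-object chunk size of 2048 bytes per pg_largeobject row (LOBLKSIZE, a quarter of the default 8 kB block), the arithmetic can be sketched as below. The class and method names are made up for illustration, and a custom PostgreSQL build could use a different chunk size.

```java
// Sketch: convert a pg_largeobject page count into an upper bound on blob size.
// Assumes the default LOBLKSIZE of 2048 bytes; this is an assumption, not
// something read from the server.
class LargeObjectSize {
    static final long LOBLKSIZE = 2048L;

    // Upper bound in bytes: the last page may be only partly filled.
    static long maxBytes(long pages) {
        return pages * LOBLKSIZE;
    }

    public static void main(String[] args) {
        long bytes = maxBytes(2_097_153L);
        System.out.println(bytes + " bytes");                        // 4294969344 bytes
        System.out.println((bytes - (1L << 32)) + " bytes past 2^32"); // 2048 bytes past 2^32
    }
}
```

Under that assumption the 4.3 GB blob ends at most one page beyond 2^32 bytes, which is exactly where the 32-bit size fields of the original zip format run out. That is an observation, not a diagnosis.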
I have read that java.util.zip may optionally support the ZIP64 extension, which gets around the 4 GB size limit of the original zip format. The fact that I can store the index without the code complaining makes me think this extension is in there. But I need to write some test programs to verify this.
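A test program along these lines might settle it. This is only a sketch: it streams one entry of hard-to-compress bytes through ZipOutputStream into a temp file, reads it back with ZipInputStream, and counts what comes out. The class name ZipRoundTrip and the method roundTrip are invented here. ZIP64 support was added to java.util.zip in Java 7, so the JDK version in use matters.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class ZipRoundTrip {
    // Writes `size` pseudo-random bytes as one zip entry, reads them back,
    // and returns the number of bytes recovered.
    static long roundTrip(long size) {
        byte[] buf = new byte[1 << 16];
        new Random(42).nextBytes(buf); // random data, so deflate cannot shrink it much
        try {
            Path zip = Files.createTempFile("roundtrip", ".zip");
            try {
                try (ZipOutputStream zos = new ZipOutputStream(
                        new BufferedOutputStream(Files.newOutputStream(zip)))) {
                    zos.putNextEntry(new ZipEntry("segment.bin"));
                    for (long left = size; left > 0; ) {
                        int n = (int) Math.min(buf.length, left);
                        zos.write(buf, 0, n);
                        left -= n;
                    }
                    zos.closeEntry();
                }
                long total = 0;
                try (ZipInputStream zis = new ZipInputStream(
                        new BufferedInputStream(Files.newInputStream(zip)))) {
                    if (zis.getNextEntry() == null) {
                        throw new IOException("archive has no entry");
                    }
                    int count;
                    // Same shape as the read loop in loadIndexFromDatabase.
                    while ((count = zis.read(buf, 0, buf.length)) != -1) {
                        total += count;
                    }
                }
                return total;
            } finally {
                Files.deleteIfExists(zip);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Default is 1 MiB; pass a value above 4294967296 to cross the 4 GiB line.
        long size = args.length > 0 ? Long.parseLong(args[0]) : 1L << 20;
        System.out.println("wrote " + size + ", read back " + roundTrip(size));
    }
}
```

Running it with a size just above 4 GiB needs that much scratch disk and some patience. A ZipException or a byte count that differs from the requested size would point at the library rather than at the blob storage.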
When I try to extract the search_directory in the web app at run time, I get an exception thrown almost immediately in loadIndexFromDatabase, in the count=zis.read(data,0,bufferSize) loop:
java.util.zip.ZipException: invalid entry size (expected 2444315913 but got 20 bytes)
I did not give up. I had kept a copy of the search directory when it was created, and I manually transferred it to the web server. Then, using the keyword_signature hack, I was able to deploy the web app without trying to restore the search directory from the db. The first search took a while, but after that it was all good.
I looked a bit more at what I was indexing (using Luke, a handy tool) and saw that I had been indexing too many large and unneeded tables. (I'm a Lucene newbie here.) So after adding some of these tables to the exclude list, I rebuilt the index. This got it down to 11G. The blob is now 1.2G in 594,577 pages.
I don't know if it's a good sign or bad sign, but I get an exception thrown after about 30 minutes:
[http-8089-1] INFO org.intermine.web.search.KeywordSearch - Extracting: _fw.cfs (-1 MB)
[http-8089-1] ERROR org.intermine.web.search.KeywordSearch - Could not load search index
java.util.zip.ZipException: invalid entry size (expected 0 but got 1092216169 bytes)
The exception was thrown at the last write of _fw.cfs as far as I can tell.
I'm going to try again after making some more exclusions to the indexing. The 30 minutes it takes to restore the search index is way too long. And I need to implement the search_signature hack so that I do not have to re-unzip with every web app restart.
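The signature check could be as simple as comparing a small marker file on disk with the signature stored in the database, and skipping the unzip when they match. The sketch below assumes the signature is a short string; the class IndexSignatureCheck, its methods, and the marker file name index.signature are all hypothetical and not part of InterMine.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class IndexSignatureCheck {
    // Hypothetical marker file kept inside the extracted index directory.
    static final String SIGNATURE_FILE = "index.signature";

    // True when indexDir holds an index that was extracted for dbSignature.
    static boolean isUpToDate(Path indexDir, String dbSignature) {
        Path marker = indexDir.resolve(SIGNATURE_FILE);
        if (!Files.isRegularFile(marker)) {
            return false; // nothing extracted yet, or marker missing
        }
        try {
            String onDisk = new String(Files.readAllBytes(marker), StandardCharsets.UTF_8);
            return onDisk.trim().equals(dbSignature);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Call only after the index has been extracted successfully.
    static void recordSignature(Path indexDir, String dbSignature) {
        try {
            Files.createDirectories(indexDir);
            Files.write(indexDir.resolve(SIGNATURE_FILE),
                    dbSignature.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

At startup the web app would call isUpToDate first and fall through to the full restore only when it returns false. Writing the marker last means a restore that dies halfway is retried on the next start.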
A couple things I was wondering are:
1) Is my version of java.util.zip.ZipEntry just old? When I'm stepping through the code with Eclipse it tells me I have v1.42 dated 1/2/2008. It bothers me that I see (-1 MB) in the log messages for the entry.getSize(). Do you see the correct segment size in your log files?
2) Luke tells me that I'm indexing the id fields of all (non-excluded) tables. Is this necessary? The id field of InterMineObject itself is 1/4 of the index size. I sure would like to drop these.
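On question 1, one thing that can be checked in isolation: ZipEntry.getSize() is documented to return -1 when the size is not known. That is the normal case when an entry was streamed out through ZipOutputStream without its size set up front and is then read back through ZipInputStream, because the sizes are stored after the entry data. The sketch below shows this with a small in-memory archive. Whether the InterMine blob is written this way is an assumption, and the class name EntrySizeDemo is made up.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class EntrySizeDemo {
    // Returns {size reported before reading, bytes actually read}.
    static long[] sizes(byte[] payload) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ZipOutputStream zos = new ZipOutputStream(bos)) {
                zos.putNextEntry(new ZipEntry("_fw.cfs")); // size not set up front
                zos.write(payload);
                zos.closeEntry();
            }
            try (ZipInputStream zis = new ZipInputStream(
                    new ByteArrayInputStream(bos.toByteArray()))) {
                ZipEntry entry = zis.getNextEntry();
                long reported = entry.getSize(); // queried before any data is read
                long read = 0;
                byte[] data = new byte[8192];
                int count;
                while ((count = zis.read(data, 0, data.length)) != -1) {
                    read += count;
                }
                return new long[] {reported, read};
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        long[] s = sizes(new byte[5000]);
        System.out.println("reported " + s[0] + ", read " + s[1]);
    }
}
```

If this is what is happening, the -1 in the log is not by itself a sign of an old library; counting bytes in the read loop gives the real segment size for logging.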
Anyway, thanks for whatever comments you have,
Julie Sullivan wrote:
> Hi Joe
> 1. I'd be interested to see what the index is like for your full database. FlyMine's is 2 GB but that database is only 75 GB. I do have facets, which increase the search index size a bit.
> Are there any types you can safely not index? Here's a list of FlyMine's:
> You can also ignore specific fields, but not sure if that would help matters any. The tracker and intermine object tables aren't indexed if that makes you feel better.
> 2. Alex just wrote a script to stress-test InterMine:
> The script runs queries, does enrichment, a keyword search etc. On FlyMine we saw about 150 requests/second and the site didn't slow down.
> Another group ran a stress test using JMeter (I think) and had ~12,000 requests per minute without any slow down of the site. I'll write up the results of these soon to share with everyone. And get exact numbers.
> It would be interesting to compare performance with your large-ish mine.
> I think InterMine will be just fine (I would say that wouldn't I) but what does your hardware situation look like? Are you safe?
> 3. Our release script does touch these URLs as the final step:
> Other mines also check a single public list, which should trigger the list upgrade process.
> 4. Your keyword search idea is brilliant! If you do make that change maybe you can push it into the InterMine repo?
> I've made a ticket here:
> On 06/03/14 18:39, Joe Carlson wrote:
>> Julie Sullivan wrote:
>>> Hi Joe
>>> FlyMine's is the same - about 2G. I haven't timed how long it takes to
>>> load into memory but it does take a while. It is an issue, there may
>>> be a better way to do things.
>>> How much bigger is your database going to get?
>> Hi Julie,
>> Thanks for the info.
>> We're on the verge of going public and trying to tie up a few loose
>> ends, and the lucene indices were one of those.
>> Right now, the db size of the mine is ~ 600 G. The biggest tables are
>> tracker and intermineobject, then the usual suspects of sequencefeature
>> and bioentity. Some of our SNP tables are in the top 10. And we expect
>> these to get 10x or 20x bigger as time goes by. 'msa' is a table of
>> clustal multiple sequence alignments.
>> relation                      | total_size
>> ------------------------------+-----------
>> public.tracker                | 211 GB
>> public.intermineobject        | 135 GB
>> public.snpdiversitysample     | 45 GB
>> public.clob                   | 33 GB
>> public.sequencefeature        | 21 GB
>> public.location               | 15 GB
>> public.bioentity              | 12 GB
>> public.geneflankingregion     | 11 GB
>> public.exon                   | 7291 MB
>> public.snp                    | 7024 MB
>> public.snplocation            | 5132 MB
>> public.consequencessnps       | 3863 MB
>> public.proteinanalysisfeature | 3331 MB
>> public.msa                    | 3016 MB
>> public.intermine_sequence     | 2481 MB
>> public.utr                    | 1694 MB
>> public.cds                    | 1578 MB
>> public.proteinproteinfamily   | 1543 MB
>> I admit that I have not completed building an index for the entire
>> thing. I also have a mini-mine that I'm using for testing (16G). The
>> current size (with 'du') of the keyword_search_index is only 427M. It
>> takes ~ 30 seconds to expand on first use. I'm just getting nervous
>> about the extrapolation.
>> My thinking is that as part of our web starting procedures, I need to
>> run some queries that cause all of the initialize-this-on-first-run
>> things to get executed. (The startup in region searching is another slow
>> one.) For the keyword searching, I was thinking of doing a 'is this
>> really needed' check.
>> in KeywordSearch.saveIndexToDatabase:
>> LOG.info("Saving signature to database...");
>> LOG.info("Saving search index information to database...");
>> writeObjectToDB(os, MetadataManager.SEARCH_INDEX, index);
>> And when restoring, write the signature to the file system. On restart,
>> look and check the signature before restoring. You may quibble at
>> keeping a 20 byte string in the blob. But you get the idea.
>> We have a large number of chromosomes in our mine, 528K. Some of the
>> plant genomes are very fragmented. (I'm looking at you, switchgrass) And
>> hitting the Regions tab on the web page is painful the first time so I'd
>> like to run that behind the scenes. Are you aware of other things that
>> need to be run?
>>> On 06/03/14 03:53, Joe Carlson wrote:
>>>> out of curiosity, what is the typical size of your lucene keyword
>>>> search index (for example, for flymine)? And how long does it take
>>>> to expand it from the blob when a mine gets deployed?
>>>> For a relatively small subset of our data, I’m seeing 3 minutes to
>>>> expand. (du of the keyword_search_index is ~ 2G) I’m shuddering when
>>>> I think about how big it will get and how long it will take with all
>>>> the data.