[InterMine Dev] speeding up data loading

Sierra Moxon staylor at cs.uoregon.edu
Tue Aug 25 16:17:40 BST 2009


Thanks Julie!

I don't see why a direct DB connection to Informix wouldn't work. We 
already connect to Informix via JDBC in the rest of our Java code.

Do you think you could put the link you sent somewhere on the wiki? I 
don't think Raven will let me authenticate.

Thanks,
Sierra

On Tue, 25 Aug 2009, Julie Sullivan wrote:

>> Does this seem right to you?  Anything in particular I can do to speed it 
>> up?  It takes 33 minutes to load all 3 files.
>
> Sierra,
>
> How many items does your converter store?  I think InterMine averages 
> about 100,000 objects/minute, though this varies by source.
>
> There are a few things you can do to speed things up.
>
> 1. Don't keep items in memory.
>
> We need to keep maps in our converters so when we come across an anatomy 
> term, for example, we can look in the map and make sure that we haven't 
> created this term before.
>
> However, we don't need to keep the entire item, only its identifier. 
> You can call the getIdentifier() method on the item to get the item's 
> unique id, e.g. 2_1.  So your maps would look like this instead:
>
> 	anatomies.put(primaryIdentifier, item.getIdentifier());
>
> This will save on memory and speed up your converter.
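
A minimal sketch of the identifier-only map described above. The Item class here is a hypothetical stand-in, not InterMine's own class; only getIdentifier() and the anatomies.put(...) line come from the message. The id counter, the createItem() helper and the "term-1" style keys are assumptions made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class AnatomyIdCache {
    // Hypothetical stand-in for an InterMine item; only getIdentifier()
    // is taken from the message above.
    public static class Item {
        private final String identifier;
        Item(String identifier) { this.identifier = identifier; }
        public String getIdentifier() { return identifier; }
    }

    // Maps a term's primary identifier to the item's id (e.g. "2_1"),
    // so the full item does not have to stay in memory.
    private final Map<String, String> anatomies = new HashMap<>();
    private int nextId = 1;

    // Assumption: a real converter gets new items from its own factory.
    private Item createItem() {
        return new Item("2_" + nextId++);
    }

    // Returns the id of the item for this term, creating the item only
    // the first time the term is seen.
    public String getAnatomyRefId(String primaryIdentifier) {
        String refId = anatomies.get(primaryIdentifier);
        if (refId == null) {
            Item item = createItem();
            // A real converter would set attributes and store the item here.
            refId = item.getIdentifier();
            anatomies.put(primaryIdentifier, refId);
        }
        return refId;
    }

    public int size() {
        return anatomies.size();
    }

    public static void main(String[] args) {
        AnatomyIdCache cache = new AnatomyIdCache();
        String first = cache.getAnatomyRefId("term-1");
        String again = cache.getAnatomyRefId("term-1");
        String other = cache.getAnatomyRefId("term-2");
        System.out.println(first.equals(again)); // same term, same id
        System.out.println(first.equals(other)); // different term, different id
    }
}
```

Once the id is in the map, the item itself can be garbage collected after it has been stored; other items only need the id string to reference it.
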
>
> 2. Check store order.
>
> I think you are okay on this, but sometimes changing the store order can 
> speed things up:
>
> 	http://www.intermine.org/wiki/DataLoadingPerformance
>
> 3. Connect directly to an Informix database.
>
> Processing text files is just fine, but you should know this option is 
> available.  For SGD's converter, we don't process text files but instead 
> connect directly to an Oracle database.
>
> The converter for SGD loads 100,000 objects in about two minutes:
>
> http://intrac.flymine.org/browser/trunk/bio/sources/sgd/main/src/org/intermine/bio/dataconversion/SgdConverter.java 
>
> It may look a little bit scary at first, but it's a fairly simple converter. 
> Instead of processing text files, I query SGD's Oracle database - see the 
> getGeneResults() method.  Then I process the results just like you process 
> the lines of the text file in your converter.  For each line of results, I 
> create a Gene object and set the attributes/references - see the 
> processGenes() method.
>
> To connect to SGD's database, I added Oracle's JDBC driver and updated 
> yeastmine.properties with the name and location of the Oracle database.
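
A rough sketch of the query-then-process pattern described above, using plain JDBC. This is not the code from SgdConverter: the method name processGenes() is borrowed from the message, but its body, the loadGenes() helper, the Gene class, the SQL text and the column names gene_id and symbol are all assumptions for illustration. The JDBC URL is passed on the command line, and the matching driver (Oracle, Informix, ...) has to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class DirectDbSketch {
    // Hypothetical value class standing in for the Gene item a converter creates.
    public static final class Gene {
        public final String primaryIdentifier;
        public final String symbol;
        Gene(String primaryIdentifier, String symbol) {
            this.primaryIdentifier = primaryIdentifier;
            this.symbol = symbol;
        }
    }

    // Assumed table and column names; a real schema will differ.
    static final String GENE_QUERY = "SELECT gene_id, symbol FROM gene";

    // Each result row is handled like one line of a text file:
    // read the columns, create an object, set its attributes.
    public static List<Gene> processGenes(ResultSet results) throws SQLException {
        List<Gene> genes = new ArrayList<>();
        while (results.next()) {
            genes.add(new Gene(results.getString("gene_id"),
                               results.getString("symbol")));
        }
        return genes;
    }

    // Runs the query and closes the statement and result set afterwards.
    public static List<Gene> loadGenes(Connection connection) throws SQLException {
        try (Statement stmt = connection.createStatement();
             ResultSet results = stmt.executeQuery(GENE_QUERY)) {
            return processGenes(results);
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 1) {
            System.out.println("usage: java DirectDbSketch <jdbc-url>");
            return;
        }
        try (Connection connection = DriverManager.getConnection(args[0])) {
            System.out.println(loadGenes(connection).size() + " genes read");
        }
    }
}
```

Keeping the row handling in its own method means it can be exercised without a live database, and the query can change without touching the code that builds the objects.
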
>
> Something to keep in mind!
>


