[InterMine Dev] speeding up data loading

Julie Sullivan julie at flymine.org
Tue Aug 25 16:28:16 BST 2009


Whoops, sorry!  Here it is:

http://intermine.org/browser/trunk/bio/sources/sgd/main/src/org/intermine/bio/dataconversion/SgdConverter.java

It will also be in your checkout once you upgrade to the InterMine 0.91 
branch.


Sierra Moxon wrote:
> Thanks Julie!
> 
> I don't see why a direct DB connection to Informix wouldn't work; we 
> connect to Informix via JDBC in the rest of our Java code.
> 
> Do you think you could put the link you sent somewhere on the wiki? 
> Raven won't let me authenticate, I don't think.
> 
> Thanks,
> Sierra
> 
> On Tue, 25 Aug 2009, Julie Sullivan wrote:
> 
>>> Does this seem right to you?  Anything in particular I can do to 
>>> speed it up?  It takes 33 minutes to load all 3 files.
>>
>> Sierra,
>>
>> How many items does your converter store?  (I think) InterMine 
>> averages 100,000 objects/minute.  This varies by source though.
>>
>> There are a few things you can do to speed things up.
>>
>> 1. Don't keep items in memory.
>>
>> We need to keep maps in our converters so when we come across an 
>> anatomy term, for example, we can look in the map and make sure that 
>> we haven't created this term before.
>>
>> However, we don't need to keep the entire item, only its 
>> identifier. Calling getIdentifier() on the item returns the item's 
>> unique id, e.g. 2_1.  So your maps would look like this instead:
>>
>>     anatomies.put(primaryIdentifier, item.getIdentifier());
>>
>> This will save on memory and speed up your converter.
>>
>> 2. Check store order.
>>
>> I think you are okay on this, but sometimes changing the store order 
>> can speed things up:
>>
>>     http://www.intermine.org/wiki/DataLoadingPerformance
>>
>> 3. Connect directly to an Informix database.
>>
>> Processing text files is just fine, but it's good to know this 
>> option is available.  For SGD's converter, we don't process text 
>> files but instead connect directly to an Oracle database.
>>
>> The converter for SGD loads 100,000 objects in about two minutes:
>>
>> http://intrac.flymine.org/browser/trunk/bio/sources/sgd/main/src/org/intermine/bio/dataconversion/SgdConverter.java 
>>
>> It may look a little bit scary at first, but it's a fairly simple 
>> converter. Instead of processing text files, I query SGD's Oracle 
>> database - see the getGeneResults() method.  Then I process the 
>> results just like you process the lines of the text file in your 
>> converter.  For each line of results, I create a Gene object and set 
>> the attributes/references - see the processGenes() method.
>>
>> To connect to SGD's database, I added Oracle's JDBC driver and updated 
>> yeastmine.properties with the name and location of the Oracle database.
>>
>> Something to keep in mind!
>>
>>
>>
>>
>>
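
Tip 1 in the quoted message (keep only identifiers in converter maps) might look roughly like the sketch below. It is a minimal, self-contained illustration, not InterMine code: the Item and Converter classes, the store() method, the "AnatomyTerm" class name and the "term-A"/"term-B" keys are hypothetical stand-ins. Only getIdentifier() and ids of the form 2_1 come from the message.

```java
import java.util.HashMap;
import java.util.Map;

class IdentifierMapSketch {

    /** Hypothetical stand-in for a converter item; holds only what the sketch needs. */
    static class Item {
        private final String identifier;
        private final String className;
        private final Map<String, String> attributes = new HashMap<>();

        Item(String identifier, String className) {
            this.identifier = identifier;
            this.className = className;
        }

        String getIdentifier() {
            return identifier;
        }

        void setAttribute(String name, String value) {
            attributes.put(name, value);
        }
    }

    /** Hypothetical converter that remembers item identifiers, not whole items. */
    static class Converter {
        // primaryIdentifier -> item identifier (e.g. "2_1")
        private final Map<String, String> anatomies = new HashMap<>();
        private int nextId = 1;
        int storedCount = 0;

        /** Returns the identifier for a term, creating and storing the item only once. */
        String getAnatomyRefId(String primaryIdentifier) {
            String refId = anatomies.get(primaryIdentifier);
            if (refId == null) {
                Item item = new Item("2_" + nextId++, "AnatomyTerm");
                item.setAttribute("primaryIdentifier", primaryIdentifier);
                store(item);
                refId = item.getIdentifier();
                // Keep only the short id; the item itself is not referenced again.
                anatomies.put(primaryIdentifier, refId);
            }
            return refId;
        }

        /** Stand-in for writing the item out; here it only counts calls. */
        void store(Item item) {
            storedCount++;
        }
    }

    public static void main(String[] args) {
        Converter converter = new Converter();
        String first = converter.getAnatomyRefId("term-A");
        String again = converter.getAnatomyRefId("term-A");
        String other = converter.getAnatomyRefId("term-B");
        System.out.println(first + " " + again + " " + other); // prints "2_1 2_1 2_2"
    }
}
```

In this sketch the map holds one short string per term, so the second lookup of "term-A" reuses the existing id without creating or storing another item.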
> 
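
Tip 3 in the quoted message (query the database directly instead of parsing text files) could be sketched with plain JDBC as below. This is not the SgdConverter code: the Gene class, the query, and the table and column names are hypothetical placeholders, and the method name processGenes() only echoes the one mentioned in the message. The JDBC URL, user and password are passed in as arguments, and the vendor's JDBC driver (Oracle, Informix, etc.) is assumed to be on the classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

class DirectDbSketch {

    /** Hypothetical holder for one gene row; a real converter would create an item. */
    static class Gene {
        final String primaryIdentifier;
        final String symbol;

        Gene(String primaryIdentifier, String symbol) {
            this.primaryIdentifier = primaryIdentifier;
            this.symbol = symbol;
        }
    }

    // Hypothetical query; table and column names are placeholders.
    static final String GENE_QUERY = "SELECT gene_id, symbol FROM gene";

    /** Handles each result row the way a file-based converter handles each line. */
    static List<Gene> processGenes(ResultSet rows) throws SQLException {
        List<Gene> genes = new ArrayList<>();
        while (rows.next()) {
            genes.add(new Gene(rows.getString("gene_id"), rows.getString("symbol")));
        }
        return genes;
    }

    /** Opens a connection, runs the query and processes the rows. */
    static List<Gene> loadGenes(String jdbcUrl, String user, String password)
            throws SQLException {
        // try-with-resources closes the result set, statement and connection.
        try (Connection connection = DriverManager.getConnection(jdbcUrl, user, password);
             Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(GENE_QUERY)) {
            return processGenes(rows);
        }
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 3) {
            System.err.println("usage: DirectDbSketch <jdbcUrl> <user> <password>");
            return;
        }
        List<Gene> genes = loadGenes(args[0], args[1], args[2]);
        System.out.println("Loaded " + genes.size() + " genes");
    }
}
```

Keeping the row handling in its own method means the per-row logic stays the same whether the rows come from a database or from a parsed file.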


