[InterMine Dev] speeding up data loading

Sierra Moxon staylor at cs.uoregon.edu
Tue Aug 25 23:19:48 BST 2009


Also, I moved my instance from my MacBook to my more powerful Mac Pro, 
and the load time went down from 33 minutes to under 2 minutes... hardware matters, as usual! :)

Sierra

On Tue, 25 Aug 2009, Julie Sullivan wrote:

> Whoops, sorry!  Here it is:
>
> http://intermine.org/browser/trunk/bio/sources/sgd/main/src/org/intermine/bio/dataconversion/SgdConverter.java
>
> It should also be in your checkout once you upgrade to the InterMine 0.91 
> branch.
>
>
> Sierra Moxon wrote:
>> Thanks Julie!
>> 
>> I don't see why a direct DB connection to Informix wouldn't work... we 
>> connect to Informix via JDBC in the rest of our Java code.
>> 
>> Do you think you could put the link you sent somewhere on the wiki? Raven 
>> won't let me authenticate, I don't think.
>> 
>> Thanks,
>> Sierra
>> 
>> On Tue, 25 Aug 2009, Julie Sullivan wrote:
>> 
>>>> Does this seem right to you?  Anything in particular I can do to speed it 
>>>> up?  It takes 33 minutes to load all 3 files.
>>> 
>>> Sierra,
>>> 
>>> How many items does your converter store?  I think InterMine averages 
>>> about 100,000 objects per minute, though this varies by source.
>>> 
>>> There are a few things you can do to speed things up.
>>> 
>>> 1. Don't keep items in memory.
>>> 
>>> We need to keep maps in our converters so when we come across an anatomy 
>>> term, for example, we can look in the map and make sure that we haven't 
>>> created this term before.
>>> 
>>> However, we don't need to keep the entire item, only its identifier. 
>>> You can call the getIdentifier() method on the item to get the item's 
>>> unique id, e.g. 2_1.  So your maps would look like this instead:
>>>
>>>     anatomies.put(primaryIdentifier, item.getIdentifier());
>>> 
>>> This will save on memory and speed up your converter.
>>> 
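A minimal sketch of point 1 above, for illustration only. The Item class here is a small stand-in, not InterMine's own class, and createAndStore(), getAnatomyRef(), and the TERM:... identifiers are hypothetical names. The map holds only the string returned by getIdentifier(), so the item itself does not have to stay in memory after it has been created and stored.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for an InterMine item, reduced to the one method this sketch needs.
class Item {
    private final String identifier;

    Item(String identifier) {
        this.identifier = identifier;
    }

    String getIdentifier() {
        return identifier;
    }
}

class AnatomyRefDemo {
    // primaryIdentifier -> item identifier (e.g. "2_1"); the Item is not retained.
    private final Map<String, String> anatomies = new HashMap<>();
    private int nextId = 1;
    private int stored = 0;

    // Hypothetical stand-in for the converter's create-and-store step.
    private Item createAndStore() {
        Item item = new Item("2_" + nextId++);
        stored++;
        return item;
    }

    // Returns the identifier for a term, creating the item only on first sight.
    String getAnatomyRef(String primaryIdentifier) {
        String refId = anatomies.get(primaryIdentifier);
        if (refId == null) {
            Item item = createAndStore();
            refId = item.getIdentifier();
            anatomies.put(primaryIdentifier, refId);
        }
        return refId;
    }

    int storedCount() {
        return stored;
    }

    public static void main(String[] args) {
        AnatomyRefDemo demo = new AnatomyRefDemo();
        String first = demo.getAnatomyRef("TERM:0001");
        String again = demo.getAnatomyRef("TERM:0001"); // served from the map
        String other = demo.getAnatomyRef("TERM:0002");
        System.out.println(first + " " + again + " " + other
                + " stored=" + demo.storedCount());
    }
}
```

Looking up the same term twice returns the same identifier and creates only one item, which is the duplicate check the map exists for.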
>>> 2. Check store order.
>>> 
>>> I think you are okay on this, but sometimes changing the store order can 
>>> speed things up:
>>>
>>>     http://www.intermine.org/wiki/DataLoadingPerformance
>>> 
>>> 3. Connect directly to an Informix database.
>>> 
>>> Processing text files is just fine, but it's good to know that this 
>>> option is available.  For SGD's converter, we don't process text files 
>>> but instead connect directly to an Oracle database.
>>> 
>>> The converter for SGD loads 100,000 objects in about two minutes:
>>> 
>>> http://intrac.flymine.org/browser/trunk/bio/sources/sgd/main/src/org/intermine/bio/dataconversion/SgdConverter.java
>>> 
>>> It may look a little bit scary at first, but it's a fairly simple 
>>> converter. Instead of processing text files, I query SGD's Oracle database 
>>> - see the getGeneResults() method.  Then I process the results just like 
>>> you process the lines of the text file in your converter.  For each line 
>>> of results, I create a Gene object and set the attributes/references - see 
>>> the processGenes() method.
>>> 
>>> To connect to SGD's database, I added Oracle's JDBC driver and updated 
>>> yeastmine.properties with the name and location of the Oracle database.
>>> 
>>> Something to keep in mind!
>>> 
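A rough sketch of the query-then-process pattern described in point 3, using only the standard java.sql API. It is not taken from SgdConverter.java: the table and column names (gene, primary_id, symbol) and the method names loadGenes() and processGene() are made up for illustration, and the JDBC URL, user, and password are expected on the command line. A real converter would create and store an item in processGene() rather than build a string.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

class DirectDbSketch {
    // Hypothetical table and column names; substitute the real schema.
    static final String GENE_QUERY = "SELECT primary_id, symbol FROM gene";

    // Per-row work, equivalent to handling one line of a text file.
    // Falls back to the primary id when the symbol is missing or blank.
    static String processGene(String primaryId, String symbol) {
        String name = (symbol == null || symbol.trim().isEmpty())
                ? primaryId : symbol.trim();
        return "Gene[" + primaryId + ", " + name + "]";
    }

    // Runs the query and hands each result row to processGene().
    static List<String> loadGenes(Connection conn) throws SQLException {
        List<String> genes = new ArrayList<>();
        try (Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(GENE_QUERY)) {
            while (rs.next()) {
                genes.add(processGene(rs.getString("primary_id"),
                                      rs.getString("symbol")));
            }
        }
        return genes;
    }

    public static void main(String[] args) throws SQLException {
        if (args.length < 3) {
            System.out.println("usage: DirectDbSketch <jdbc-url> <user> <password>");
            return;
        }
        // The matching JDBC driver jar must be on the classpath.
        try (Connection conn =
                 DriverManager.getConnection(args[0], args[1], args[2])) {
            System.out.println(loadGenes(conn).size() + " genes read");
        }
    }
}
```

Keeping the per-row logic in its own method means the same processing code can be fed from a text file or a result set.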
>>> 
>>> 
>>> 
>>> 
>> 
>


