[InterMine Dev] speeding up data loading

Julie Sullivan julie at flymine.org
Tue Aug 25 11:17:08 BST 2009


> Does this seem right to you?  Anything in particular I can do to speed 
> it up?  It takes 33 minutes to load all 3 files.

Sierra,

How many items does your converter store?  I think InterMine averages about 
100,000 objects/minute, though this varies by source.

There are a few things you can do to speed things up.

1. Don't keep items in memory.

We need to keep maps in our converters so when we come across an anatomy term, 
for example, we can look in the map and make sure that we haven't created this 
term before.

However, we don't need to keep the entire item; we only need its identifier. 
You can use the getIdentifier() method on the item to get that, and it'll return 
the item's unique id, e.g. 2_1.  So your maps would look like this instead:

	anatomies.put(primaryIdentifier, item.getIdentifier());

This will save on memory and speed up your converter.
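
To make that concrete, here is a rough sketch of the lookup pattern, not code 
from any real converter.  The Item class below is only a stand-in so the example 
compiles on its own (the real one comes from InterMine), and the class name, the 
getAnatomy() helper, the "2_" id prefix and the counter are all made up for 
illustration.  In a real converter you would create and store the item where the 
comments say so.

```java
import java.util.HashMap;
import java.util.Map;

class AnatomyIdCache {

    /** Hypothetical stand-in for InterMine's Item; only getIdentifier() is used. */
    static class Item {
        private final String identifier;

        Item(String identifier) {
            this.identifier = identifier;
        }

        String getIdentifier() {
            return identifier;
        }
    }

    // Maps primaryIdentifier -> item identifier (e.g. "2_1"), not the whole Item.
    private final Map<String, String> anatomies = new HashMap<>();
    private int nextId = 1;   // assumption: fake id counter for this sketch
    private int stored = 0;   // counts how many items we would have stored

    /** Returns the item identifier for a term, creating the item only the first time. */
    public String getAnatomy(String primaryIdentifier) {
        String refId = anatomies.get(primaryIdentifier);
        if (refId == null) {
            // A real converter would create the item and set its attributes here.
            Item item = new Item("2_" + nextId++);
            // A real converter would store the item here, then let it go.
            stored++;
            refId = item.getIdentifier();
            anatomies.put(primaryIdentifier, refId);
        }
        return refId;
    }

    public int storedCount() {
        return stored;
    }
}
```

The map only ever holds two short strings per term, so the Item itself can be 
garbage collected as soon as it has been stored.  Anything that later needs to 
reference the term can use the identifier returned by getAnatomy().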

2. Check store order.

I think you are okay on this, but sometimes changing the store order can speed 
things up:

	http://www.intermine.org/wiki/DataLoadingPerformance

3. Connect directly to an Informix database.

Processing text files is just fine, but it's worth knowing that this option is 
available.  For SGD's converter, we don't process text files but instead connect 
directly to an Oracle database.

The converter for SGD loads 100,000 objects in about two minutes:

http://intrac.flymine.org/browser/trunk/bio/sources/sgd/main/src/org/intermine/bio/dataconversion/SgdConverter.java 


It may look a little bit scary at first, but it's a fairly simple converter. 
Instead of processing text files, I query SGD's Oracle database - see the 
getGeneResults() method.  Then I process the results just like you process the 
lines of the text file in your converter.  For each line of results, I create a 
Gene object and set the attributes/references - see the processGenes() method.

To connect to SGD's database, I added Oracle's JDBC driver and updated 
yeastmine.properties with the name and location of the Oracle database.

Something to keep in mind!