[InterMine Dev] build performance question

Joel Richardson jer at informatics.jax.org
Tue Sep 6 18:46:01 BST 2011


In your email, you talk about collections specifically,
while the wiki page talks about references (in general?).
Anyway, my question is: when objects refer to each other,
who should be stored first? For example, an OntologyTerm has
a collection of OntologyRelations, and each OntologyRelation
references two OntologyTerms. Which should be stored first?

Thanks,
Joel

On 9/5/11 8:54 AM, Richard Smith wrote:
> I expect what's happening is that items of one type are being loaded
> where each item has a collection of some other type. e.g. Genes have
> a collection of GOAnnotation items.
>
> This fires a second query in the items database for the contents of
> the collection IF those items haven't already been processed (the
> queries are batched together but will still slow down the process).
>
> Items are read in the order they were saved into the items database,
> so you can prevent this extra query by changing parser code to ensure
> e.g. the GOAnnotation items are stored before the Genes.
>
> See some documentation:
>
> http://intermine.org/wiki/DataLoadingPerformance
>
> This page includes a query you can run to see if the store order can
> be optimised.
>
>
> There should be messages in minename/integrate/intermine.log that
> report the loading speed every 10000 objects. What sort of numbers
> are you getting? A good speed is a few hundred thousand per minute,
> though it depends a lot on the size & complexity of the objects.
>
> Cheers,
> Richard.
>
>
>
> On 02/09/2011 16:22, Joel Richardson wrote:
>>
>> Hi,
>>
>> I have a question about build performance. This particular source
>> is reading ontologies from the MGI database. Everything goes
>> pretty quickly until:
>>
>> ==============================
>> -pre-load:
>>
>> load:
>> [echo]
>> [echo] Loading mgi-ontologies (mgi-ontology-sql) tgt items into
>> production DB
>> [echo]
>> [integrate] Creating index: CREATE INDEX Ontology__key ON Ontology (name)
>> [integrate] Creating index: CREATE INDEX DataSet__key ON DataSet (name)
>> [integrate] Creating index: CREATE INDEX DataSource__key ON DataSource
>> (name)
>> ==============================
>>
>> This phase takes a long (long) time. Looking at pg_stat_activity shows
>> a bunch of connections that are idle and one connection that is
>> doing (apparently) all the work. Monitoring this over a long
>> period shows the same connection (31249) executing a series of
>> queries such as:
>>
>> ==============================
>> 72236 | items-mousemine_097 | 31249 | 10 | postgres | | 127.0.0.1 |
>> 64609 | 2011-09-02 10:08:08.382315-04 | 2011-09-02 10:58:39.60711-04 |
>> 2011-09-02 10:58:39.610947-04 | f | SELECT a1_.id AS a3_, a2_.id AS
>> a2_id, a2_.itemId AS a2_itemId, a2_.name AS a2_name, a2_.refIds AS
>> a2_refIds FROM Item AS a1_, ReferenceList AS a2_ WHERE a1_.id =
>> a2_.itemId AND a1_.id IN (41308215, 41308330, 41308462, 41308570,
>> 41316732, 41316818, 41316874, 41316969, 41316988, 41316998, 41317241,
>> 41317251, 41317270, 41317561, 41317618, 41317647, 41317674, 41317711,
>> 41317748, 41317758, 41317825, 41317864, 41317874, 41318267, 41326600,
>> 41326640, 41326660, 41326690, 41326709, 41326718, 41327105, 41327135,
>> 41327203, 41327243, 41327435, 41327455, 41327815, 41327895, 41327945,
>> 41328034, 41328064, 41328194, 41328204, 41328362, 41328392, 41328520,
>> 41334884, 41335142, 41335152, 41335172, 41335192, 41335386, 41335406,
>> 41335511, 41335541, 41335824, 41335834, 41335844, 41335863, 41335968,
>> 41335997, 41336026, 41336104, 41336133, 41336203, 41336213, 41336233,
>> 41336243, 41336273, 41336311, 41336321, 41336369, 41336398, 41336416,
>> 41336455, 41336465, 41336474, 41336484, 41336494, 41336643, 41336683,
>> 41336703, 41336762, 4133
>> ==============================
>>
>> My question: can this load be distributed among the connections?
>> (Or maybe I've just got something configured wrong?)
>>
>> Thanks,
>> Joel
>>
>>
>> ===============================================================
>> Joel Richardson, Ph.D.
>> Sr. Research Scientist
>> Mouse Genome Informatics
>> The Jackson Laboratory Phone: (207) 288-6435
>> 600 Main Street Fax: (207) 288-6132
>> Bar Harbor, Maine 04609 URL: www.informatics.jax.org
>> ===============================================================
>>
>> _______________________________________________
>> dev mailing list
>> dev at intermine.org
>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>
>
>
> _______________________________________________
> dev mailing list
> dev at intermine.org
> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev

-- 

===============================================================
Joel Richardson, Ph.D.
Sr. Research Scientist
Mouse Genome Informatics
The Jackson Laboratory   Phone: (207) 288-6435
600 Main Street          Fax:   (207) 288-6132
Bar Harbor, Maine 04609  URL:   www.informatics.jax.org
===============================================================



More information about the dev mailing list