[InterMine Dev] build performance question

Richard Smith richard at flymine.org
Mon Sep 5 13:54:50 BST 2011


I expect what's happening is that items of one type are being loaded
where each item has a collection of some other type.  e.g. Genes have
a collection of GOAnnotation items.

This fires a second query in the items database for the contents of
the collection IF those items haven't already been processed (the
queries are batched together but will still slow down the process).

Items are read in the order they were saved into the items database,
so you can prevent this extra query by changing parser code to ensure
e.g. the GOAnnotation items are stored before the Genes.

See some documentation:

	http://intermine.org/wiki/DataLoadingPerformance

This page includes a query you can run to see if the store order can
be optimised.


There should be messages in minename/integrate/intermine.log that
report the loading speed every 10000 objects.  What sort of numbers
are you getting?  A good speed is a few hundred thousand per minute,
though it depends a lot on the size & complexity of the objects.

Cheers,
Richard.



On 02/09/2011 16:22, Joel Richardson wrote:
>
> Hi,
>
> I have a question about build performance. This particular source
> is reading ontologies from the MGI database. Everything goes
> pretty quickly until:
>
> ==============================
> -pre-load:
>
> load:
> [echo]
> [echo] Loading mgi-ontologies (mgi-ontology-sql) tgt items into
> production DB
> [echo]
> [integrate] Creating index: CREATE INDEX Ontology__key ON Ontology (name)
> [integrate] Creating index: CREATE INDEX DataSet__key ON DataSet (name)
> [integrate] Creating index: CREATE INDEX DataSource__key ON DataSource
> (name)
> ==============================
>
> This phase takes a long (long) time. Looking at pg_stat_activity shows
> a bunch of connections that are idle and one connection that is
> doing (apparently) all the work. Monitoring this over a long
> period shows the same connection (31249) executing a series of
> queries such as:
>
> ==============================
> 72236 | items-mousemine_097 | 31249 | 10 | postgres | | 127.0.0.1 |
> 64609 | 2011-09-02 10:08:08.382315-04 | 2011-09-02 10:58:39.60711-04 |
> 2011-09-02 10:58:39.610947-04 | f | SELECT a1_.id AS a3_, a2_.id AS
> a2_id, a2_.itemId AS a2_itemId, a2_.name AS a2_name, a2_.refIds AS
> a2_refIds FROM Item AS a1_, ReferenceList AS a2_ WHERE a1_.id =
> a2_.itemId AND a1_.id IN (41308215, 41308330, 41308462, 41308570,
> 41316732, 41316818, 41316874, 41316969, 41316988, 41316998, 41317241,
> 41317251, 41317270, 41317561, 41317618, 41317647, 41317674, 41317711,
> 41317748, 41317758, 41317825, 41317864, 41317874, 41318267, 41326600,
> 41326640, 41326660, 41326690, 41326709, 41326718, 41327105, 41327135,
> 41327203, 41327243, 41327435, 41327455, 41327815, 41327895, 41327945,
> 41328034, 41328064, 41328194, 41328204, 41328362, 41328392, 41328520,
> 41334884, 41335142, 41335152, 41335172, 41335192, 41335386, 41335406,
> 41335511, 41335541, 41335824, 41335834, 41335844, 41335863, 41335968,
> 41335997, 41336026, 41336104, 41336133, 41336203, 41336213, 41336233,
> 41336243, 41336273, 41336311, 41336321, 41336369, 41336398, 41336416,
> 41336455, 41336465, 41336474, 41336484, 41336494, 41336643, 41336683,
> 41336703, 41336762, 4133
> ==============================
>
> My question: can this load be distributed among the connections?
> (Or maybe I've just got something configured wrong?)
>
> Thanks,
> Joel
>
>
> ===============================================================
> Joel Richardson, Ph.D.
> Sr. Research Scientist
> Mouse Genome Informatics
> The Jackson Laboratory Phone: (207) 288-6435
> 600 Main Street Fax: (207) 288-6132
> Bar Harbor, Maine 04609 URL: www.informatics.jax.org
> ===============================================================
>
> _______________________________________________
> dev mailing list
> dev at intermine.org
> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>




More information about the dev mailing list