[InterMine Dev] build performance question
richard at flymine.org
Wed Sep 7 14:36:15 BST 2011
The store order rules apply to both references and collections.
In fact there are specific rules for how to handle items that
refer to one another, usually you should only fill in one side
according to the rules:
i.e. one end is a reference, the other a collection
In this case you should always fill in the reference and leave the
collection empty (it will be ignored)
e.g. Gene has a collection Transcripts and Transcript references one
Gene, fill in Transcript.gene only.
i.e. both are collections, there is an indirection table in the database
You can fill in either collection and must leave the other collection
blank. In practice if one side is very large and the other smaller it
is faster to populate the smaller collection.
e.g. Gene has a collection of Pathways and Pathway has a collection of
Genes, fill in either Gene.pathways or Pathway.gene but not both. If
Pathway.genes typically contains e.g. 20,000 items and Gene.pathways
typically 100 items then it is faster populate Gene.pathways.
i.e. both ends are references (this is rare)
I think you must fill in both ends but will check this out.
This is all documented somewhere but I can't find it on the wiki - I'll
try and make it more obvious.
On 06/09/2011 18:46, Joel Richardson wrote:
> In your email, you talk about collections specifically,
> while the wiki page talks about references (in general?).
> Anyway, my question is: when objects refer to each other,
> who should be stored first? For example, an OntologyTerm has
> a collection of OntologyRelations, and each OntologyRelation
> references two OntologyTerms. Which should be stored first?
> On 9/5/11 8:54 AM, Richard Smith wrote:
>> I expect what's happening is that items of one type are being loaded
>> where each item has a collection of some other type. e.g. Genes have
>> a collection of GOAnnotation items.
>> This fires a second query in the items database for the contents of
>> the collection IF those items haven't already been processed (the
>> queries are batched together but will still slow down the process).
>> Items are read in the order they were saved into the items database,
>> so you can prevent this extra query by changing parser code to ensure
>> e.g. the GOAnnotation items are stored before the Genes.
>> See some documentation:
>> This page includes a query you can run to see if the store order can
>> be optimised.
>> There should be messages in minename/integrate/intermine.log that
>> report the loading speed every 10000 objects. What sort of numbers
>> are you getting? A good speed is a few hundred thousand per minute,
>> though it depends a lot on the size & complexity of the objects.
>> On 02/09/2011 16:22, Joel Richardson wrote:
>>> I have a question about build performance. This particular source
>>> is reading ontologies from the MGI database. Everything goes
>>> pretty quickly until:
>>> [echo] Loading mgi-ontologies (mgi-ontology-sql) tgt items into
>>> production DB
>>> [integrate] Creating index: CREATE INDEX Ontology__key ON Ontology
>>> [integrate] Creating index: CREATE INDEX DataSet__key ON DataSet (name)
>>> [integrate] Creating index: CREATE INDEX DataSource__key ON DataSource
>>> This phase takes a long (long) time. Looking at pg_stat_activity shows
>>> a bunch of connections that are idle and one connection that is
>>> doing (apparently) all the work. Monitoring this over a long
>>> period shows the same connection (31249) executing a series of
>>> queries such as:
>>> 72236 | items-mousemine_097 | 31249 | 10 | postgres | | 127.0.0.1 |
>>> 64609 | 2011-09-02 10:08:08.382315-04 | 2011-09-02 10:58:39.60711-04 |
>>> 2011-09-02 10:58:39.610947-04 | f | SELECT a1_.id AS a3_, a2_.id AS
>>> a2_id, a2_.itemId AS a2_itemId, a2_.name AS a2_name, a2_.refIds AS
>>> a2_refIds FROM Item AS a1_, ReferenceList AS a2_ WHERE a1_.id =
>>> a2_.itemId AND a1_.id IN (41308215, 41308330, 41308462, 41308570,
>>> 41316732, 41316818, 41316874, 41316969, 41316988, 41316998, 41317241,
>>> 41317251, 41317270, 41317561, 41317618, 41317647, 41317674, 41317711,
>>> 41317748, 41317758, 41317825, 41317864, 41317874, 41318267, 41326600,
>>> 41326640, 41326660, 41326690, 41326709, 41326718, 41327105, 41327135,
>>> 41327203, 41327243, 41327435, 41327455, 41327815, 41327895, 41327945,
>>> 41328034, 41328064, 41328194, 41328204, 41328362, 41328392, 41328520,
>>> 41334884, 41335142, 41335152, 41335172, 41335192, 41335386, 41335406,
>>> 41335511, 41335541, 41335824, 41335834, 41335844, 41335863, 41335968,
>>> 41335997, 41336026, 41336104, 41336133, 41336203, 41336213, 41336233,
>>> 41336243, 41336273, 41336311, 41336321, 41336369, 41336398, 41336416,
>>> 41336455, 41336465, 41336474, 41336484, 41336494, 41336643, 41336683,
>>> 41336703, 41336762, 4133
>>> My question: can this load be distributed among the connections?
>>> (Or maybe I've just got something configured wrong?)
>>> Joel Richardson, Ph.D.
>>> Sr. Research Scientist
>>> Mouse Genome Informatics
>>> The Jackson Laboratory Phone: (207) 288-6435
>>> 600 Main Street Fax: (207) 288-6132
>>> Bar Harbor, Maine 04609 URL: www.informatics.jax.org
>>> dev mailing list
>>> dev at intermine.org
>> dev mailing list
>> dev at intermine.org
More information about the dev