[InterMine Dev] build performance question

Joel Richardson jer at informatics.jax.org
Thu Sep 8 16:39:59 BST 2011


Julie (and Richard),

Thanks, this helps. Of course, this simply leads me to another
question! ;-)

I assume that to automatically fill in collections for 1:n and n:m
relationships, the reverse-reference must be declared. (Yes?)

In the core model file, OntologyTerm and OntologyRelation
refer to each other, but do not declare reverse-references.
Similarly, OntologyTerm has a "parents" collection of ancestor
OntologyTerms, but there is no inverse. Will the system know how
to backfill references in this case?

>   <class name="OntologyTerm" is-interface="true">
>     <attribute name="identifier" type="java.lang.String"/>
>     <attribute name="name" type="java.lang.String"/>
>     <attribute name="description" type="java.lang.String"/>
>     <attribute name="namespace" type="java.lang.String"/>
>     <attribute name="obsolete" type="java.lang.Boolean"/>
>     <reference name="ontology" referenced-type="Ontology"/>
>     <collection name="relations" referenced-type="OntologyRelation" />
>     <collection name="synonyms" referenced-type="OntologyTermSynonym" />
>     <collection name="ontologyAnnotations" referenced-type="OntologyAnnotation" reverse-reference="ontologyTerm" />
>     <collection name="parents" referenced-type="OntologyTerm" />
>     <collection name="dataSets" referenced-type="DataSet" />
>   </class>
>
>   <class name="OntologyRelation" is-interface="true">
>     <reference name="parentTerm" referenced-type="OntologyTerm" />
>     <reference name="childTerm" referenced-type="OntologyTerm" />
>     <attribute name="relationship" type="java.lang.String"/>
>     <attribute name="direct" type="java.lang.Boolean"/>
>     <attribute name="redundant" type="java.lang.Boolean"/>
>   </class>

Thanks,
Joel


On 9/8/11 9:10 AM, Julie Sullivan wrote:
> Hi Joel
>
> In response to your original question, I think you would store the
> Ontology term first.
>
> If you stored the Relation object first, the relation object would be
> referencing two ontology objects that don't exist in the database (yet).
> The build system would create "skeleton" objects that work as temporary
> holders for the ontology objects; these skeletons are discarded when the
> "real" ontology objects are stored later.
>
> Creating these temporary holder objects takes time, it's better if you
> can avoid it. Storing the ontology term first will make creating the
> "skeleton" object unnecessary, thus saving a bit of resources and
> speeding up your build.
>
> I think Richard's point was that you shouldn't set both sides of the
> relationship, eg. the Ontology object doesn't need to know about the
> Relation object, the reverse references will be populated automatically
> by the build system. This way when you store the ontology object, no
> skeletons need be created.
>
> Does that help at all? If this doesn't improve performance, we'd be
> happy to take a look at your data converter if you want to send it to us.
>
> Cheers
> Julie
>
> On 08/09/11 12:36, Joel Richardson wrote:
>>
>> Hi Richard,
>>
>> Sorry, I don't mean to be dense, but...
>> Are we talking about two things, then? (1) the order in which
>> items are store()'ed, and (2) which ends of relationships
>> get initialized? (At what point does the other side get
>> filled in?) Sorry if I've completely missed the boat.
>>
>> Joel
>>
>>
>> On 9/7/11 9:36 AM, Richard Smith wrote:
>>> The store order rules apply to both references and collections.
>>>
>>> In fact there are specific rules for how to handle items that
>>> refer to one another, usually you should only fill in one side
>>> according to the rules:
>>>
>>> 1:many
>>> ------
>>> i.e. one end is a reference, the other a collection
>>>
>>> In this case you should always fill in the reference and leave the
>>> collection empty (it will be ignored)
>>>
>>> e.g. Gene has a collection Transcripts and Transcript references one
>>> Gene, fill in Transcript.gene only.
>>>
>>> many:many
>>> ---------
>>> i.e. both are collections, there is an indirection table in the database
>>>
>>> You can fill in either collection and must leave the other collection
>>> blank. In practice if one side is very large and the other smaller it
>>> is faster to populate the smaller collection.
>>>
>>> e.g. Gene has a collection of Pathways and Pathway has a collection of
>>> Genes, fill in either Gene.pathways or Pathway.gene but not both. If
>>> Pathway.genes typically contains e.g. 20,000 items and Gene.pathways
>>> typically 100 items then it is faster populate Gene.pathways.
>>>
>>> 1:1
>>> ---
>>> i.e. both ends are references (this is rare)
>>>
>>> I think you must fill in both ends but will check this out.
>>>
>>>
>>> This is all documented somewhere but I can't find it on the wiki - I'll
>>> try and make it more obvious.
>>>
>>> Cheers,
>>> Richard.
>>>
>>>
>>>
>>> On 06/09/2011 18:46, Joel Richardson wrote:
>>>>
>>>> In your email, you talk about collections specifically,
>>>> while the wiki page talks about references (in general?).
>>>> Anyway, my question is: when objects refer to each other,
>>>> who should be stored first? For example, an OntologyTerm has
>>>> a collection of OntologyRelations, and each OntologyRelation
>>>> references two OntologyTerms. Which should be stored first?
>>>>
>>>> Thanks,
>>>> Joel
>>>>
>>>> On 9/5/11 8:54 AM, Richard Smith wrote:
>>>>> I expect what's happening is that items of one type are being loaded
>>>>> where each item has a collection of some other type. e.g. Genes have
>>>>> a collection of GOAnnotation items.
>>>>>
>>>>> This fires a second query in the items database for the contents of
>>>>> the collection IF those items haven't already been processed (the
>>>>> queries are batched together but will still slow down the process).
>>>>>
>>>>> Items are read in the order they were saved into the items database,
>>>>> so you can prevent this extra query by changing parser code to ensure
>>>>> e.g. the GOAnnotation items are stored before the Genes.
>>>>>
>>>>> See some documentation:
>>>>>
>>>>> http://intermine.org/wiki/DataLoadingPerformance
>>>>>
>>>>> This page includes a query you can run to see if the store order can
>>>>> be optimised.
>>>>>
>>>>>
>>>>> There should be messages in minename/integrate/intermine.log that
>>>>> report the loading speed every 10000 objects. What sort of numbers
>>>>> are you getting? A good speed is a few hundred thousand per minute,
>>>>> though it depends a lot on the size & complexity of the objects.
>>>>>
>>>>> Cheers,
>>>>> Richard.
>>>>>
>>>>>
>>>>>
>>>>> On 02/09/2011 16:22, Joel Richardson wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a question about build performance. This particular source
>>>>>> is reading ontologies from the MGI database. Everything goes
>>>>>> pretty quickly until:
>>>>>>
>>>>>> ==============================
>>>>>> -pre-load:
>>>>>>
>>>>>> load:
>>>>>> [echo]
>>>>>> [echo] Loading mgi-ontologies (mgi-ontology-sql) tgt items into
>>>>>> production DB
>>>>>> [echo]
>>>>>> [integrate] Creating index: CREATE INDEX Ontology__key ON Ontology
>>>>>> (name)
>>>>>> [integrate] Creating index: CREATE INDEX DataSet__key ON DataSet
>>>>>> (name)
>>>>>> [integrate] Creating index: CREATE INDEX DataSource__key ON
>>>>>> DataSource
>>>>>> (name)
>>>>>> ==============================
>>>>>>
>>>>>> This phase takes a long (long) time. Looking at pg_stat_activity
>>>>>> shows
>>>>>> a bunch of connections that are idle and one connection that is
>>>>>> doing (apparently) all the work. Monitoring this over a long
>>>>>> period shows the same connection (31249) executing a series of
>>>>>> queries such as:
>>>>>>
>>>>>> ==============================
>>>>>> 72236 | items-mousemine_097 | 31249 | 10 | postgres | | 127.0.0.1 |
>>>>>> 64609 | 2011-09-02 10:08:08.382315-04 | 2011-09-02
>>>>>> 10:58:39.60711-04 |
>>>>>> 2011-09-02 10:58:39.610947-04 | f | SELECT a1_.id AS a3_, a2_.id AS
>>>>>> a2_id, a2_.itemId AS a2_itemId, a2_.name AS a2_name, a2_.refIds AS
>>>>>> a2_refIds FROM Item AS a1_, ReferenceList AS a2_ WHERE a1_.id =
>>>>>> a2_.itemId AND a1_.id IN (41308215, 41308330, 41308462, 41308570,
>>>>>> 41316732, 41316818, 41316874, 41316969, 41316988, 41316998, 41317241,
>>>>>> 41317251, 41317270, 41317561, 41317618, 41317647, 41317674, 41317711,
>>>>>> 41317748, 41317758, 41317825, 41317864, 41317874, 41318267, 41326600,
>>>>>> 41326640, 41326660, 41326690, 41326709, 41326718, 41327105, 41327135,
>>>>>> 41327203, 41327243, 41327435, 41327455, 41327815, 41327895, 41327945,
>>>>>> 41328034, 41328064, 41328194, 41328204, 41328362, 41328392, 41328520,
>>>>>> 41334884, 41335142, 41335152, 41335172, 41335192, 41335386, 41335406,
>>>>>> 41335511, 41335541, 41335824, 41335834, 41335844, 41335863, 41335968,
>>>>>> 41335997, 41336026, 41336104, 41336133, 41336203, 41336213, 41336233,
>>>>>> 41336243, 41336273, 41336311, 41336321, 41336369, 41336398, 41336416,
>>>>>> 41336455, 41336465, 41336474, 41336484, 41336494, 41336643, 41336683,
>>>>>> 41336703, 41336762, 4133
>>>>>> ==============================
>>>>>>
>>>>>> My question: can this load be distributed among the connections?
>>>>>> (Or maybe I've just got something configured wrong?)
>>>>>>
>>>>>> Thanks,
>>>>>> Joel
>>>>>>
>>>>>>
>>>>>> ===============================================================
>>>>>> Joel Richardson, Ph.D.
>>>>>> Sr. Research Scientist
>>>>>> Mouse Genome Informatics
>>>>>> The Jackson Laboratory Phone: (207) 288-6435
>>>>>> 600 Main Street Fax: (207) 288-6132
>>>>>> Bar Harbor, Maine 04609 URL: www.informatics.jax.org
>>>>>> ===============================================================
>>>>>>
>>>>>> _______________________________________________
>>>>>> dev mailing list
>>>>>> dev at intermine.org
>>>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> dev mailing list
>>>>> dev at intermine.org
>>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>>>
>>>
>>>
>>> _______________________________________________
>>> dev mailing list
>>> dev at intermine.org
>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>
>
> _______________________________________________
> dev mailing list
> dev at intermine.org
> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev

-- 

===============================================================
Joel Richardson, Ph.D.
Sr. Research Scientist
Mouse Genome Informatics
The Jackson Laboratory   Phone: (207) 288-6435
600 Main Street          Fax:   (207) 288-6132
Bar Harbor, Maine 04609  URL:   www.informatics.jax.org
===============================================================



More information about the dev mailing list