[InterMine Dev] build performance question

Richard Smith richard at flymine.org
Fri Sep 9 12:50:21 BST 2011


You're absolutely right that a reverse-reference is required in the
model to automatically populate collections.  I'd need to look into
why exactly that isn't the case for OntologyRelations.

Richard.




On 08/09/2011 16:39, Joel Richardson wrote:
>
> Julie (and Richard),
>
> Thanks, this helps. Of course, this simply leads me to another
> question! ;-)
>
> I assume that to automatically fill in collections for 1:n and n:m
> relationships, the reverse-reference must be declared. (Yes?)
>
> In the core model file, OntologyTerm and OntologyRelation
> refer to each other, but do not declare reverse-references.
> Similarly, OntologyTerm has a "parents" collection of ancestor
> OntologyTerms, but there is no inverse. Will the system know how
> to backfill references in this case?
>
>> <class name="OntologyTerm" is-interface="true">
>> <attribute name="identifier" type="java.lang.String"/>
>> <attribute name="name" type="java.lang.String"/>
>> <attribute name="description" type="java.lang.String"/>
>> <attribute name="namespace" type="java.lang.String"/>
>> <attribute name="obsolete" type="java.lang.Boolean"/>
>> <reference name="ontology" referenced-type="Ontology"/>
>> <collection name="relations" referenced-type="OntologyRelation" />
>> <collection name="synonyms" referenced-type="OntologyTermSynonym" />
>> <collection name="ontologyAnnotations"
>> referenced-type="OntologyAnnotation" reverse-reference="ontologyTerm" />
>> <collection name="parents" referenced-type="OntologyTerm" />
>> <collection name="dataSets" referenced-type="DataSet" />
>> </class>
>>
>> <class name="OntologyRelation" is-interface="true">
>> <reference name="parentTerm" referenced-type="OntologyTerm" />
>> <reference name="childTerm" referenced-type="OntologyTerm" />
>> <attribute name="relationship" type="java.lang.String"/>
>> <attribute name="direct" type="java.lang.Boolean"/>
>> <attribute name="redundant" type="java.lang.Boolean"/>
>> </class>
>
> Thanks,
> Joel
>
>
> On 9/8/11 9:10 AM, Julie Sullivan wrote:
>> Hi Joel
>>
>> In response to your original question, I think you would store the
>> Ontology term first.
>>
>> If you stored the Relation object first, the relation object would be
>> referencing two ontology objects that don't exist in the database (yet).
>> The build system would create "skeleton" objects that work as temporary
>> holders for the ontology objects; these skeletons are discarded when the
>> "real" ontology objects are stored later.
>>
>> Creating these temporary holder objects takes time, it's better if you
>> can avoid it. Storing the ontology term first will make creating the
>> "skeleton" object unnecessary, thus saving a bit of resources and
>> speeding up your build.
>>
>> I think Richard's point was that you shouldn't set both sides of the
>> relationship, eg. the Ontology object doesn't need to know about the
>> Relation object, the reverse references will be populated automatically
>> by the build system. This way when you store the ontology object, no
>> skeletons need be created.
>>
>> Does that help at all? If this doesn't improve performance, we'd be
>> happy to take a look at your data converter if you want to send it to us.
>>
>> Cheers
>> Julie
>>
>> On 08/09/11 12:36, Joel Richardson wrote:
>>>
>>> Hi Richard,
>>>
>>> Sorry, I don't mean to be dense, but...
>>> Are we talking about two things, then? (1) the order in which
>>> items are store()'ed, and (2) which ends of relationships
>>> get initialized? (At what point does the other side get
>>> filled in?) Sorry if I've completely missed the boat.
>>>
>>> Joel
>>>
>>>
>>> On 9/7/11 9:36 AM, Richard Smith wrote:
>>>> The store order rules apply to both references and collections.
>>>>
>>>> In fact there are specific rules for how to handle items that
>>>> refer to one another, usually you should only fill in one side
>>>> according to the rules:
>>>>
>>>> 1:many
>>>> ------
>>>> i.e. one end is a reference, the other a collection
>>>>
>>>> In this case you should always fill in the reference and leave the
>>>> collection empty (it will be ignored)
>>>>
>>>> e.g. Gene has a collection Transcripts and Transcript references one
>>>> Gene, fill in Transcript.gene only.
>>>>
>>>> many:many
>>>> ---------
>>>> i.e. both are collections, there is an indirection table in the
>>>> database
>>>>
>>>> You can fill in either collection and must leave the other collection
>>>> blank. In practice if one side is very large and the other smaller it
>>>> is faster to populate the smaller collection.
>>>>
>>>> e.g. Gene has a collection of Pathways and Pathway has a collection of
>>>> Genes, fill in either Gene.pathways or Pathway.gene but not both. If
>>>> Pathway.genes typically contains e.g. 20,000 items and Gene.pathways
>>>> typically 100 items then it is faster populate Gene.pathways.
>>>>
>>>> 1:1
>>>> ---
>>>> i.e. both ends are references (this is rare)
>>>>
>>>> I think you must fill in both ends but will check this out.
>>>>
>>>>
>>>> This is all documented somewhere but I can't find it on the wiki - I'll
>>>> try and make it more obvious.
>>>>
>>>> Cheers,
>>>> Richard.
>>>>
>>>>
>>>>
>>>> On 06/09/2011 18:46, Joel Richardson wrote:
>>>>>
>>>>> In your email, you talk about collections specifically,
>>>>> while the wiki page talks about references (in general?).
>>>>> Anyway, my question is: when objects refer to each other,
>>>>> who should be stored first? For example, an OntologyTerm has
>>>>> a collection of OntologyRelations, and each OntologyRelation
>>>>> references two OntologyTerms. Which should be stored first?
>>>>>
>>>>> Thanks,
>>>>> Joel
>>>>>
>>>>> On 9/5/11 8:54 AM, Richard Smith wrote:
>>>>>> I expect what's happening is that items of one type are being loaded
>>>>>> where each item has a collection of some other type. e.g. Genes have
>>>>>> a collection of GOAnnotation items.
>>>>>>
>>>>>> This fires a second query in the items database for the contents of
>>>>>> the collection IF those items haven't already been processed (the
>>>>>> queries are batched together but will still slow down the process).
>>>>>>
>>>>>> Items are read in the order they were saved into the items database,
>>>>>> so you can prevent this extra query by changing parser code to ensure
>>>>>> e.g. the GOAnnotation items are stored before the Genes.
>>>>>>
>>>>>> See some documentation:
>>>>>>
>>>>>> http://intermine.org/wiki/DataLoadingPerformance
>>>>>>
>>>>>> This page includes a query you can run to see if the store order can
>>>>>> be optimised.
>>>>>>
>>>>>>
>>>>>> There should be messages in minename/integrate/intermine.log that
>>>>>> report the loading speed every 10000 objects. What sort of numbers
>>>>>> are you getting? A good speed is a few hundred thousand per minute,
>>>>>> though it depends a lot on the size & complexity of the objects.
>>>>>>
>>>>>> Cheers,
>>>>>> Richard.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 02/09/2011 16:22, Joel Richardson wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I have a question about build performance. This particular source
>>>>>>> is reading ontologies from the MGI database. Everything goes
>>>>>>> pretty quickly until:
>>>>>>>
>>>>>>> ==============================
>>>>>>> -pre-load:
>>>>>>>
>>>>>>> load:
>>>>>>> [echo]
>>>>>>> [echo] Loading mgi-ontologies (mgi-ontology-sql) tgt items into
>>>>>>> production DB
>>>>>>> [echo]
>>>>>>> [integrate] Creating index: CREATE INDEX Ontology__key ON Ontology
>>>>>>> (name)
>>>>>>> [integrate] Creating index: CREATE INDEX DataSet__key ON DataSet
>>>>>>> (name)
>>>>>>> [integrate] Creating index: CREATE INDEX DataSource__key ON
>>>>>>> DataSource
>>>>>>> (name)
>>>>>>> ==============================
>>>>>>>
>>>>>>> This phase takes a long (long) time. Looking at pg_stat_activity
>>>>>>> shows
>>>>>>> a bunch of connections that are idle and one connection that is
>>>>>>> doing (apparently) all the work. Monitoring this over a long
>>>>>>> period shows the same connection (31249) executing a series of
>>>>>>> queries such as:
>>>>>>>
>>>>>>> ==============================
>>>>>>> 72236 | items-mousemine_097 | 31249 | 10 | postgres | | 127.0.0.1 |
>>>>>>> 64609 | 2011-09-02 10:08:08.382315-04 | 2011-09-02
>>>>>>> 10:58:39.60711-04 |
>>>>>>> 2011-09-02 10:58:39.610947-04 | f | SELECT a1_.id AS a3_, a2_.id AS
>>>>>>> a2_id, a2_.itemId AS a2_itemId, a2_.name AS a2_name, a2_.refIds AS
>>>>>>> a2_refIds FROM Item AS a1_, ReferenceList AS a2_ WHERE a1_.id =
>>>>>>> a2_.itemId AND a1_.id IN (41308215, 41308330, 41308462, 41308570,
>>>>>>> 41316732, 41316818, 41316874, 41316969, 41316988, 41316998,
>>>>>>> 41317241,
>>>>>>> 41317251, 41317270, 41317561, 41317618, 41317647, 41317674,
>>>>>>> 41317711,
>>>>>>> 41317748, 41317758, 41317825, 41317864, 41317874, 41318267,
>>>>>>> 41326600,
>>>>>>> 41326640, 41326660, 41326690, 41326709, 41326718, 41327105,
>>>>>>> 41327135,
>>>>>>> 41327203, 41327243, 41327435, 41327455, 41327815, 41327895,
>>>>>>> 41327945,
>>>>>>> 41328034, 41328064, 41328194, 41328204, 41328362, 41328392,
>>>>>>> 41328520,
>>>>>>> 41334884, 41335142, 41335152, 41335172, 41335192, 41335386,
>>>>>>> 41335406,
>>>>>>> 41335511, 41335541, 41335824, 41335834, 41335844, 41335863,
>>>>>>> 41335968,
>>>>>>> 41335997, 41336026, 41336104, 41336133, 41336203, 41336213,
>>>>>>> 41336233,
>>>>>>> 41336243, 41336273, 41336311, 41336321, 41336369, 41336398,
>>>>>>> 41336416,
>>>>>>> 41336455, 41336465, 41336474, 41336484, 41336494, 41336643,
>>>>>>> 41336683,
>>>>>>> 41336703, 41336762, 4133
>>>>>>> ==============================
>>>>>>>
>>>>>>> My question: can this load be distributed among the connections?
>>>>>>> (Or maybe I've just got something configured wrong?)
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Joel
>>>>>>>
>>>>>>>
>>>>>>> ===============================================================
>>>>>>> Joel Richardson, Ph.D.
>>>>>>> Sr. Research Scientist
>>>>>>> Mouse Genome Informatics
>>>>>>> The Jackson Laboratory Phone: (207) 288-6435
>>>>>>> 600 Main Street Fax: (207) 288-6132
>>>>>>> Bar Harbor, Maine 04609 URL: www.informatics.jax.org
>>>>>>> ===============================================================
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> dev mailing list
>>>>>>> dev at intermine.org
>>>>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>>>>>>
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> dev mailing list
>>>>>> dev at intermine.org
>>>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> dev mailing list
>>>> dev at intermine.org
>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>>
>>
>> _______________________________________________
>> dev mailing list
>> dev at intermine.org
>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>




More information about the dev mailing list