[InterMine Dev] build performance question

Julie Sullivan julie at flymine.org
Thu Sep 8 14:10:18 BST 2011


Hi Joel

In response to your original question, I think you would store the Ontology term 
first.

If you stored the Relation object first, the relation object would be 
referencing two ontology objects that don't exist in the database (yet).  The 
build system would create "skeleton" objects that work as temporary holders for 
the ontology objects; these skeletons are discarded when the "real" ontology 
objects are stored later.

Creating these temporary holder objects takes time, it's better if you can avoid 
it.  Storing the ontology term first will make creating the "skeleton" object 
unnecessary, thus saving a bit of resources and speeding up your build.

I think Richard's point was that you shouldn't set both sides of the 
relationship, eg. the Ontology object doesn't need to know about the Relation 
object, the reverse references will be populated automatically by the build 
system.  This way when you store the ontology object, no skeletons need be created.

Does that help at all?  If this doesn't improve performance, we'd be happy to 
take a look at your data converter if you want to send it to us.

Cheers
Julie

On 08/09/11 12:36, Joel Richardson wrote:
>
> Hi Richard,
>
> Sorry, I don't mean to be dense, but...
> Are we talking about two things, then? (1) the order in which
> items are store()'ed, and (2) which ends of relationships
> get initialized? (At what point does the other side get
> filled in?) Sorry if I've completely missed the boat.
>
> Joel
>
>
> On 9/7/11 9:36 AM, Richard Smith wrote:
>> The store order rules apply to both references and collections.
>>
>> In fact there are specific rules for how to handle items that
>> refer to one another, usually you should only fill in one side
>> according to the rules:
>>
>> 1:many
>> ------
>> i.e. one end is a reference, the other a collection
>>
>> In this case you should always fill in the reference and leave the
>> collection empty (it will be ignored)
>>
>> e.g. Gene has a collection Transcripts and Transcript references one
>> Gene, fill in Transcript.gene only.
>>
>> many:many
>> ---------
>> i.e. both are collections, there is an indirection table in the database
>>
>> You can fill in either collection and must leave the other collection
>> blank. In practice if one side is very large and the other smaller it
>> is faster to populate the smaller collection.
>>
>> e.g. Gene has a collection of Pathways and Pathway has a collection of
>> Genes, fill in either Gene.pathways or Pathway.gene but not both. If
>> Pathway.genes typically contains e.g. 20,000 items and Gene.pathways
>> typically 100 items then it is faster populate Gene.pathways.
>>
>> 1:1
>> ---
>> i.e. both ends are references (this is rare)
>>
>> I think you must fill in both ends but will check this out.
>>
>>
>> This is all documented somewhere but I can't find it on the wiki - I'll
>> try and make it more obvious.
>>
>> Cheers,
>> Richard.
>>
>>
>>
>> On 06/09/2011 18:46, Joel Richardson wrote:
>>>
>>> In your email, you talk about collections specifically,
>>> while the wiki page talks about references (in general?).
>>> Anyway, my question is: when objects refer to each other,
>>> who should be stored first? For example, an OntologyTerm has
>>> a collection of OntologyRelations, and each OntologyRelation
>>> references two OntologyTerms. Which should be stored first?
>>>
>>> Thanks,
>>> Joel
>>>
>>> On 9/5/11 8:54 AM, Richard Smith wrote:
>>>> I expect what's happening is that items of one type are being loaded
>>>> where each item has a collection of some other type. e.g. Genes have
>>>> a collection of GOAnnotation items.
>>>>
>>>> This fires a second query in the items database for the contents of
>>>> the collection IF those items haven't already been processed (the
>>>> queries are batched together but will still slow down the process).
>>>>
>>>> Items are read in the order they were saved into the items database,
>>>> so you can prevent this extra query by changing parser code to ensure
>>>> e.g. the GOAnnotation items are stored before the Genes.
>>>>
>>>> See some documentation:
>>>>
>>>> http://intermine.org/wiki/DataLoadingPerformance
>>>>
>>>> This page includes a query you can run to see if the store order can
>>>> be optimised.
>>>>
>>>>
>>>> There should be messages in minename/integrate/intermine.log that
>>>> report the loading speed every 10000 objects. What sort of numbers
>>>> are you getting? A good speed is a few hundred thousand per minute,
>>>> though it depends a lot on the size & complexity of the objects.
>>>>
>>>> Cheers,
>>>> Richard.
>>>>
>>>>
>>>>
>>>> On 02/09/2011 16:22, Joel Richardson wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> I have a question about build performance. This particular source
>>>>> is reading ontologies from the MGI database. Everything goes
>>>>> pretty quickly until:
>>>>>
>>>>> ==============================
>>>>> -pre-load:
>>>>>
>>>>> load:
>>>>> [echo]
>>>>> [echo] Loading mgi-ontologies (mgi-ontology-sql) tgt items into
>>>>> production DB
>>>>> [echo]
>>>>> [integrate] Creating index: CREATE INDEX Ontology__key ON Ontology
>>>>> (name)
>>>>> [integrate] Creating index: CREATE INDEX DataSet__key ON DataSet (name)
>>>>> [integrate] Creating index: CREATE INDEX DataSource__key ON DataSource
>>>>> (name)
>>>>> ==============================
>>>>>
>>>>> This phase takes a long (long) time. Looking at pg_stat_activity shows
>>>>> a bunch of connections that are idle and one connection that is
>>>>> doing (apparently) all the work. Monitoring this over a long
>>>>> period shows the same connection (31249) executing a series of
>>>>> queries such as:
>>>>>
>>>>> ==============================
>>>>> 72236 | items-mousemine_097 | 31249 | 10 | postgres | | 127.0.0.1 |
>>>>> 64609 | 2011-09-02 10:08:08.382315-04 | 2011-09-02 10:58:39.60711-04 |
>>>>> 2011-09-02 10:58:39.610947-04 | f | SELECT a1_.id AS a3_, a2_.id AS
>>>>> a2_id, a2_.itemId AS a2_itemId, a2_.name AS a2_name, a2_.refIds AS
>>>>> a2_refIds FROM Item AS a1_, ReferenceList AS a2_ WHERE a1_.id =
>>>>> a2_.itemId AND a1_.id IN (41308215, 41308330, 41308462, 41308570,
>>>>> 41316732, 41316818, 41316874, 41316969, 41316988, 41316998, 41317241,
>>>>> 41317251, 41317270, 41317561, 41317618, 41317647, 41317674, 41317711,
>>>>> 41317748, 41317758, 41317825, 41317864, 41317874, 41318267, 41326600,
>>>>> 41326640, 41326660, 41326690, 41326709, 41326718, 41327105, 41327135,
>>>>> 41327203, 41327243, 41327435, 41327455, 41327815, 41327895, 41327945,
>>>>> 41328034, 41328064, 41328194, 41328204, 41328362, 41328392, 41328520,
>>>>> 41334884, 41335142, 41335152, 41335172, 41335192, 41335386, 41335406,
>>>>> 41335511, 41335541, 41335824, 41335834, 41335844, 41335863, 41335968,
>>>>> 41335997, 41336026, 41336104, 41336133, 41336203, 41336213, 41336233,
>>>>> 41336243, 41336273, 41336311, 41336321, 41336369, 41336398, 41336416,
>>>>> 41336455, 41336465, 41336474, 41336484, 41336494, 41336643, 41336683,
>>>>> 41336703, 41336762, 4133
>>>>> ==============================
>>>>>
>>>>> My question: can this load be distributed among the connections?
>>>>> (Or maybe I've just got something configured wrong?)
>>>>>
>>>>> Thanks,
>>>>> Joel
>>>>>
>>>>>
>>>>> ===============================================================
>>>>> Joel Richardson, Ph.D.
>>>>> Sr. Research Scientist
>>>>> Mouse Genome Informatics
>>>>> The Jackson Laboratory Phone: (207) 288-6435
>>>>> 600 Main Street Fax: (207) 288-6132
>>>>> Bar Harbor, Maine 04609 URL: www.informatics.jax.org
>>>>> ===============================================================
>>>>>
>>>>> _______________________________________________
>>>>> dev mailing list
>>>>> dev at intermine.org
>>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> dev mailing list
>>>> dev at intermine.org
>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>>
>>
>>
>> _______________________________________________
>> dev mailing list
>> dev at intermine.org
>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>



More information about the dev mailing list