[InterMine Dev] question about an error message

Joe Carlson jwcarlson at lbl.gov
Wed Apr 22 19:55:39 BST 2015


Hi Richard,

I played around with entering the ids for objects into idMap. This dramatically speeds up the processing: from 48 minutes to 8 minutes in a trial run loading a set of our protein families into a sparsely filled mine. This is a good thing.

I created a method markAsStored(Integer) in IntegrationWriterAbstractImpl to set the idMap entry. It seems that this would be a good method to put in the higher-level interface as well. I’ll leave that decision to you.
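
To illustrate the idea, here is a toy sketch. It is not the InterMine code: the class IdMapSketch, resolveReference(), the query counter and the made-up id allocation are all hypothetical, and only the names idMap and markAsStored(Integer) come from the discussion above. It just shows why mapping a known id to itself lets the writer skip the lookup for that object.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy stand-in for an integration writer; hypothetical, not InterMine code. */
class IdMapSketch {
    // Maps an id seen in the source data to the id used in the target database.
    private final Map<Integer, Integer> idMap = new HashMap<>();
    // Counts simulated lookups against the target database.
    private int queriesRun = 0;
    // Made-up starting point for ids assigned to objects not seen before.
    private int nextId = 1_000_000;

    /** Record that this id is already in the target database by mapping it to itself. */
    void markAsStored(Integer id) {
        idMap.put(id, id);
    }

    /** Return the target id for a referenced object, "querying" only if the id is unknown. */
    int resolveReference(Integer sourceId) {
        Integer known = idMap.get(sourceId);
        if (known != null) {
            return known; // already mapped, so no query is needed
        }
        queriesRun++; // stands in for an integration query
        int assigned = nextId++;
        idMap.put(sourceId, assigned);
        return assigned;
    }

    int getQueriesRun() {
        return queriesRun;
    }

    public static void main(String[] args) {
        IdMapSketch writer = new IdMapSketch();
        writer.markAsStored(101); // id fetched up front
        System.out.println("pre-marked id resolves to " + writer.resolveReference(101));
        System.out.println("unknown id resolves to " + writer.resolveReference(202));
        System.out.println("queries run: " + writer.getQueriesRun());
    }
}
```

In this sketch only the reference that was not marked in advance costs a lookup.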

Joe
> On Apr 22, 2015, at 6:59 AM, Richard Smith <richard at flymine.org> wrote:
> 
> Hi Joe,
> I think you might be able to achieve this using ProxyReferences and
> pre-populating the IntegrationWriterAbstractImpl idMap with the known ids
> mapping to themselves. Then the integration writer should think it has
> seen these objects before and just store the foreign keys.
> IntegrationWriterAbstractImpl has a public assignMapping() method.
> 
> Subclassing ProxyReference may be a little trickier, the code to look at
> is in IntegrationWriterAbstractImpl.copyField() which is where store is
> called on referenced objects.
> 
> Good luck!
> 
> Richard.
> 
> 
>> Hi Richard,
>> 
>> This is starting to make more sense. I’m thinking that the error message
>> is really what I want to get. Not creating real objects from proxy
>> references is what I’d like to do.
>> 
>> So I was thinking of subclassing ProxyReference to make a non-storable
>> ProxyReference. Something that the store method would just ignore. Do you
>> think there is a problem with this approach?
>> 
>> Joe
>> 
>>> On Apr 21, 2015, at 7:44 AM, Richard Smith <richard at flymine.org> wrote:
>>> 
>>> Hi Joe,
>>> When an object is stored any objects it references are stored first so
>>> the
>>> correct ids can be inserted in foreign key columns. If the referenced
>>> object has already been stored then the new target id is known; if it
>>> hasn't then a skeleton object is stored.
>>> 
>>> The skeleton fills in enough fields to store the object but the loader
>>> expects the full object to be stored later in the load. By storing the
>>> skeleton and waiting for the real object we don't have to pause the load
>>> to go looking for referenced objects in the source data every time one
>>> is
>>> seen.
>>> 
>>> In the case where you created and referenced actual objects (not
>>> ProxyReferences) these were stored but without any integration keys you
>>> ended up with duplicate objects.
>>> 
>>> In both cases you get the un-replaced skeleton object error as a
>>> referenced object or ProxyReference has been stored without storing the
>>> actual object. Hopefully that makes it a little clearer what is
>>> happening,
>>> even if it doesn't actually solve your problem.
>>> 
>>> Tomorrow I hope to finish a change to make the DirectDataLoader use a
>>> ParallelBatchingFetcher - to group the integration queries into
>>> configurable batch sizes as is done in standard data loading. I think
>>> that
>>> will achieve a similar result to the code you've been working on.
>>> 
>>> The alternative would be to add an "I'm doing something weird but let me
>>> get on with it" flag to allow you to store ProxyReferences to fill in
>>> foreign keys without getting the skeletons error.
>>> 
>>> All the best,
>>> Richard.
>>> 
>>> 
>>>> Hi Richard (and gang),
>>>> 
>>>> I have a question about an error message I’m seeing. A little
>>>> background. As you know, I’m trying to speed up the loading. What
>>>> I’m
>>>> trying to do is to use a DirectDataLoader to load our protein families
>>>> as
>>>> fast as possible. I was thinking that if I could do all my queries for
>>>> existing gene and protein records up front, then when I do my loading,
>>>> I
>>>> can create a ProteinFamily object with the references to the genes and
>>>> proteins in the production database pre-filled.
>>>> 
>>>> I would have thought that removing the keys to the genes and proteins
>>>> from
>>>> the data loader’s primary keys file would prevent any query to the
>>>> database during the loading. I’ve seen that when I remove all the
>>>> keys
>>>> to genes and proteins then the integration step does not query the
>>>> production db during the insertions. This is what I want. And as far as
>>>> I
>>>> can tell, there are no unnecessary queries happening.
>>>> 
>>>> I’ve run the integration step and I get an ObjectStoreException:
>>>> Some
>>>> skeletons were not replaced by real objects: 2671330
>>>> 
>>>> There are a couple of things I’m not clear on; one of them is the
>>>> notion of pure objects versus skeleton objects. There is a cryptic
>>>> comment
>>>> in IntegrationWriterDataTrackingImpl.close() about this error message
>>>> which I don’t quite understand.
>>>> 
>>>> I’ve tried this two ways: by creating ProxyReferences, and the more
>>>> memory-heavy way of querying, then keeping the gene and protein
>>>> objects in a hash. In both cases, I get the message.
>>>> 
>>>> When I run this with the memory-heavy method, I see that I have
>>>> duplicated
>>>> genes and proteins in the production db, even though I never call store
>>>> on
>>>> the genes or proteins.
>>>> 
>>>> So what I was wondering is 1) what does this error message mean? and 2)
>>>> If
>>>> I query for all the objects in advance that my new data objects will
>>>> point
>>>> to, how can I avoid having to do other queries during load time?
>>>> 
>>>> Thanks. I appreciate all your help,
>>>> 
>>>> Joe Carlson
>>>> _______________________________________________
>>>> dev mailing list
>>>> dev at intermine.org
>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>>> 
>>> 
>> 
>> 
>> 
> 
