[InterMine Dev] question about an error message

Richard Smith richard at flymine.org
Tue Apr 21 15:44:15 BST 2015


Hi Joe,
When an object is stored, any objects it references are stored first so the
correct ids can be inserted in foreign key columns. If the referenced
object has already been stored then its id is known; if it hasn't, a
skeleton object is stored in its place.

The skeleton fills in enough fields to store the object but the loader
expects the full object to be stored later in the load. By storing the
skeleton and waiting for the real object we don't have to pause the load
to go looking for referenced objects in the source data every time one is
seen.
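
As a rough illustration of the bookkeeping described above, here is a toy
model in Java. It is not InterMine's actual code: the class SkeletonSketch,
its methods, the string keys and the use of IllegalStateException are all
hypothetical, and the number in the real error message may mean something
different from the count used here.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of skeleton tracking; hypothetical, not an InterMine class. */
public class SkeletonSketch {
    private final Map<String, Integer> idsByKey = new HashMap<>();
    private final Set<Integer> skeletonIds = new HashSet<>();
    private int nextId = 1;

    /** Store a full object, resolving the ids of everything it references first. */
    public int store(String key, String... referencedKeys) {
        for (String ref : referencedKeys) {
            referenceId(ref); // id that would go into the foreign key column
        }
        Integer id = idsByKey.get(key);
        if (id == null) {
            id = nextId++;
            idsByKey.put(key, id);
        }
        // The real object has arrived, so any skeleton with this id is replaced.
        skeletonIds.remove(id);
        return id;
    }

    /** Return the id of a referenced object, storing a skeleton if it is not yet known. */
    private int referenceId(String key) {
        Integer id = idsByKey.get(key);
        if (id == null) {
            id = nextId++;
            idsByKey.put(key, id);
            skeletonIds.add(id); // placeholder: the full object is expected later
        }
        return id;
    }

    /** At the end of the load, every skeleton must have been replaced. */
    public void close() {
        if (!skeletonIds.isEmpty()) {
            throw new IllegalStateException(
                "Some skeletons were not replaced by real objects: " + skeletonIds.size());
        }
    }
}
```

In this model, storing "family1" with a reference to "gene1" and then
calling close() fails, because "gene1" only ever existed as a skeleton.
Storing "gene1" itself before close() clears the error.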

In the case where you created and referenced actual objects (not
ProxyReferences), these were stored, but without any integration keys
there was nothing to match them against the existing objects, so you
ended up with duplicates.

In both cases you get the un-replaced skeleton object error because a
referenced object or ProxyReference has been stored without the actual
object ever being stored. Hopefully that makes it a little clearer what is
happening, even if it doesn't actually solve your problem.

Tomorrow I hope to finish a change to make the DirectDataLoader use a
ParallelBatchingFetcher - to group the integration queries into
configurable batch sizes as is done in standard data loading. I think that
will achieve a similar result to the code you've been working on.
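
To show what grouping lookups into configurable batch sizes means in general
terms, here is a minimal sketch. It does not use or reproduce
ParallelBatchingFetcher; the class BatchSketch and its batches method are
hypothetical names, and the sketch only covers splitting a list of keys
into fixed-size groups, each of which could then be resolved with a single
query.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: split items into batches of at most batchSize. */
public class BatchSketch {
    public static <T> List<List<T>> batches(List<T> items, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> result = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so each batch is independent of the input list.
            result.add(new ArrayList<>(items.subList(start, end)));
        }
        return result;
    }
}
```

A larger batch size means fewer round trips to the database, at the cost
of holding more objects in memory at once.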

The alternative would be to add an "I'm doing something weird but let me
get on with it" flag to allow you to store ProxyReferences to fill in
foreign keys without getting the skeletons error.

All the best,
Richard.

> Hi Richard (and gang),
>
> I have a question about an error message I'm seeing. A little
> background. As you know, I'm trying to speed up the loading. What I'm
> trying to do is to use a DirectDataLoader to load our protein families as
> fast as possible. I was thinking that if I could do all my queries for
> existing gene and protein records up front, then when I do my loading, I
> can create a ProteinFamily object with the references to the genes and
> proteins in the production database pre-filled.
>
> I would have thought that removing the keys to the genes and proteins from
> the data loader's primary keys file would prevent any query to the
> database during the loading. I've seen that when I remove all the keys
> to genes and proteins then the integration step does not query the
> production db during the insertions. This is what I want. And as far as I
> can tell, there are no unnecessary queries happening.
>
> I've run the integration step and I get an ObjectStoreException:  Some
> skeletons were not replaced by real objects: 2671330
>
> There are a couple of things I'm not clear on; one of them is the
> notion of pure objects versus skeleton objects. There is a cryptic comment
> in IntegrationWriterDataTrackingImpl.close() about this error message
> which I don't quite understand.
>
> I've tried this a couple of ways: by creating ProxyReferences, and the
> more memory-heavy way of querying and then keeping the gene and protein
> objects in a hash. In both cases, I get the message.
>
> When I run this with the memory-heavy method, I see that I have duplicated
> genes and proteins in the production db, even though I never call store on
> the genes or proteins.
>
> So what I was wondering is 1) what does this error message mean? and 2) If
> I query for all the objects in advance that my new data objects will point
> to, how can I avoid having to do other queries during load time?
>
> Thanks. I appreciate all your help,
>
> Joe Carlson



