[InterMine Dev] question about an error message

Richard Smith richard at flymine.org
Wed Apr 22 14:59:26 BST 2015

Hi Joe,
I think you might be able to achieve this using ProxyReferences and
pre-populating the IntegrationWriterAbstractImpl idMap with the known ids
mapping to themselves. Then the integration writer should think it has
seen these objects before and just store the foreign keys.
IntegrationWriterAbstractImpl has a public assignMapping() method.
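To make that concrete, here is a minimal, self-contained sketch of the idea. The class and method names below are illustrative only, a toy model of the idMap bookkeeping rather than the real IntegrationWriterAbstractImpl:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the integration writer's idMap; not the real InterMine class.
class IdMapSketch {
    private final Map<Integer, Integer> idMap = new HashMap<>();

    // Pre-populate a known production id, mapping it to itself,
    // as assignMapping() on the real writer would.
    void assignMapping(int sourceId, int targetId) {
        idMap.put(sourceId, targetId);
    }

    // A hit in the idMap means the writer treats the object as already
    // stored and just writes the foreign key; a miss would make it
    // store a skeleton object instead.
    boolean alreadySeen(int sourceId) {
        return idMap.containsKey(sourceId);
    }
}
```

With the known gene and protein ids pre-seeded this way, the writer should never need to store skeletons for them.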

Subclassing ProxyReference may be a little trickier, the code to look at
is in IntegrationWriterAbstractImpl.copyField() which is where store is
called on referenced objects.
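For reference, here is a rough model of the skeleton bookkeeping that produces the "Some skeletons were not replaced" error at close() time, described in the quoted message below. Again, the names are illustrative, not the actual implementation:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model of skeleton tracking during a load; not the real InterMine code.
class SkeletonTracker {
    private final Set<String> pendingSkeletons = new HashSet<>();
    private final Map<String, Integer> idMap = new HashMap<>();
    private int nextId = 1;

    // A referenced object seen before the real object arrives is stored
    // as a skeleton: just enough to get an id for foreign key columns.
    int storeSkeleton(String sourceId) {
        pendingSkeletons.add(sourceId);
        return idMap.computeIfAbsent(sourceId, k -> nextId++);
    }

    // When the full object is stored later in the load, its skeleton
    // counts as replaced.
    int storeReal(String sourceId) {
        pendingSkeletons.remove(sourceId);
        return idMap.computeIfAbsent(sourceId, k -> nextId++);
    }

    // Any skeleton never replaced by close() time triggers the
    // "Some skeletons were not replaced by real objects" error.
    int unreplacedCount() {
        return pendingSkeletons.size();
    }
}
```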

Good luck!


> Hi Richard,
> This is starting to make more sense. I’m thinking that the error message
> is really what I want to get. Not creating real objects from proxy
> references is what I’d like to do.
> So I was thinking of subclassing ProxyReference to make a non-storable
> ProxyReference. Something that the store method would just ignore. Do you
> think there is a problem with this approach?
> Joe
>> On Apr 21, 2015, at 7:44 AM, Richard Smith <richard at flymine.org> wrote:
>> Hi Joe,
>> When an object is stored, any objects it references are stored first
>> so the correct ids can be inserted into foreign key columns. If the
>> referenced object has already been stored, the new target id is known;
>> if it hasn't, a skeleton object is stored.
>> The skeleton fills in enough fields to store the object, but the loader
>> expects the full object to be stored later in the load. By storing the
>> skeleton and waiting for the real object, we don't have to pause the
>> load to go looking for referenced objects in the source data every time
>> one is seen.
>> In the case where you created and referenced actual objects (not
>> ProxyReferences), these were stored, but without any integration keys
>> you ended up with duplicate objects.
>> In both cases you get the un-replaced skeleton object error because a
>> referenced object or ProxyReference has been stored without the actual
>> object ever being stored. Hopefully that makes it a little clearer what
>> is happening, even if it doesn't actually solve your problem.
>> Tomorrow I hope to finish a change to make the DirectDataLoader use a
>> ParallelBatchingFetcher - to group the integration queries into
>> configurable batch sizes as is done in standard data loading. I think
>> that
>> will achieve a similar result to the code you've been working on.
>> The alternative would be to add an "I'm doing something weird but let me
>> get on with it" flag to allow you to store ProxyReferences to fill in
>> foreign keys without getting the skeletons error.
>> All the best,
>> Richard.
>>> Hi Richard (and gang),
>>> I have a question about an error message I’m seeing. A little
>>> background. As you know, I’m trying to speed up the loading. What
>>> I’m
>>> trying to do is to use a DirectDataLoader to load our protein families
>>> as
>>> fast as possible. I was thinking that if I could do all my queries for
>>> existing gene and protein records up front, then when I do my loading,
>>> I
>>> can create a ProteinFamily object with the references to the genes and
>>> proteins in the production database pre-filled.
>>> I would have thought that removing the keys to the genes and proteins
>>> from
>>> the data loader’s primary keys file would prevent any query to the
>>> database during the loading. I’ve seen that when I remove all the
>>> keys
>>> to genes and proteins then the integration step does not query the
>>> production db during the insertions. This is what I want. And as far as
>>> I
>>> can tell, there are no unnecessary queries happening.
>>> I’ve run the integration step and I get an ObjectStoreException:
>>> "Some skeletons were not replaced by real objects: 2671330"
>>> There are a couple of things I’m not clear on; one of them is the
>>> notion of pure objects versus skeleton objects. There is a cryptic
>>> comment in IntegrationWriterDataTrackingImpl.close() about this error
>>> message which I don’t quite understand.
>>> I’ve tried this two ways: by creating ProxyReferences, and with the
>>> more memory-heavy way of querying and then keeping the gene and
>>> protein objects in a hash. In both cases, I get the message.
>>> When I run this with the memory-heavy method, I see that I have
>>> duplicated
>>> genes and proteins in the production db, even though I never call store
>>> on
>>> the genes or proteins.
>>> So what I was wondering is 1) what does this error message mean? and 2)
>>> If
>>> I query for all the objects in advance that my new data objects will
>>> point
>>> to, how can I avoid having to do other queries during load time?
>>> Thanks. I appreciate all your help,
>>> Joe Carlson
>>> dev mailing list
>>> dev at intermine.org
>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
