[InterMine Dev] question about an error message

Richard Smith richard at flymine.org
Thu Apr 23 09:44:17 BST 2015


Hi Joe,
So do you do queries at the start to get all protein and organism ids you
need?  How long does that query take?

Fetching ids at the start is probably a more efficient way to load some
sources; I've written a different integration system that was really fast
using this method.

It puts a bit more work into writing individual parsers and would work
better for references than for merging objects, but as you have found, it
saves creating a large number of objects that are only needed for
references. It also requires a specific load order for data sources, but
that doesn't matter too much.
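
To illustrate the idea in isolation, here is a minimal toy sketch in Java.
It is not InterMine code: the class PrefetchedIdSketch and the methods
resolveReference and unreplacedSkeletons are made up for illustration,
markAsStored only borrows the name of the method Joe describes below, and
the real IntegrationWriterAbstractImpl is considerably more involved.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model only; every name here is hypothetical, not the InterMine API. */
class PrefetchedIdSketch {
    // Source id -> id in the target database, for objects already seen.
    private final Map<Integer, Integer> idMap = new HashMap<>();
    // Ids of placeholder rows written for references with an unknown target.
    private final Set<Integer> skeletons = new HashSet<>();
    // Arbitrary starting point for ids handed out to new placeholder rows.
    private int nextId = 1_000_000;

    /** Record an id fetched up front as already stored (it maps to itself). */
    void markAsStored(Integer id) {
        idMap.put(id, id);
    }

    /** Return the foreign-key value for a reference, writing a skeleton if the id is unknown. */
    Integer resolveReference(Integer id) {
        Integer known = idMap.get(id);
        if (known != null) {
            return known; // already seen: no query, no new object
        }
        Integer skeletonId = nextId++;
        idMap.put(id, skeletonId);
        skeletons.add(skeletonId);
        return skeletonId;
    }

    /** Number of placeholder rows that were never replaced by a real object. */
    int unreplacedSkeletons() {
        return skeletons.size();
    }

    public static void main(String[] args) {
        PrefetchedIdSketch writer = new PrefetchedIdSketch();
        writer.markAsStored(42); // id fetched from the database before loading
        System.out.println(writer.resolveReference(42)); // prefetched id, used as-is
        System.out.println(writer.resolveReference(7));  // unknown id, skeleton created
        System.out.println(writer.unreplacedSkeletons());
    }
}
```

In this model a prefetched id resolves from memory without creating
anything, while an unknown id leaves a skeleton behind. That is also why
load order matters: the referenced ids have to exist in the database
before the source that points at them is loaded.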

Cheers,
Richard.





> Hi Richard,
>
> I played around with entering the ids for objects into idMap. This
> dramatically speeds up the processing: from 48 minutes to 8 minutes in a
> trial run loading a set of our protein families into a sparsely filled
> mine. This is a good thing.
>
> I created a method markAsStored(Integer) in IntegrationWriterAbstractImpl
> to set the idMap entry. It seems that this would be a good method to put in
> the higher level interface as well. I’ll leave that decision to you.
>
> Joe
>> On Apr 22, 2015, at 6:59 AM, Richard Smith <richard at flymine.org> wrote:
>>
>> Hi Joe,
>> I think you might be able to achieve this using ProxyReferences and
>> pre-populating the IntegrationWriterAbstractImpl idMap with the known ids
>> mapping to themselves. Then the integration writer should think it has
>> seen these objects before and just store the foreign keys.
>> IntegrationWriterAbstractImpl has a public assignMapping() method.
>>
>> Subclassing ProxyReference may be a little trickier; the code to look at
>> is in IntegrationWriterAbstractImpl.copyField(), which is where store is
>> called on referenced objects.
>>
>> Good luck!
>>
>> Richard.
>>
>>
>>> Hi Richard,
>>>
>>> This is starting to make more sense. I’m thinking that the error message
>>> is really what I want to get. Not creating real objects from proxy
>>> references is what I’d like to do.
>>>
>>> So I was thinking of subclassing ProxyReference to make a non-storable
>>> ProxyReference. Something that the store method would just ignore. Do you
>>> think there is a problem with this approach?
>>>
>>> Joe
>>>
>>>> On Apr 21, 2015, at 7:44 AM, Richard Smith <richard at flymine.org>
>>>> wrote:
>>>>
>>>> Hi Joe,
>>>> When an object is stored, any objects it references are stored first so
>>>> the correct ids can be inserted in foreign key columns. If the
>>>> referenced object has already been stored then the new target id is
>>>> known; if it hasn't, then a skeleton object is stored.
>>>>
>>>> The skeleton fills in enough fields to store the object but the loader
>>>> expects the full object to be stored later in the load. By storing the
>>>> skeleton and waiting for the real object we don't have to pause the load
>>>> to go looking for referenced objects in the source data every time one
>>>> is seen.
>>>>
>>>> In the case where you created and referenced actual objects (not
>>>> ProxyReferences), these were stored, but because there were no
>>>> integration keys you ended up with duplicate objects.
>>>>
>>>> In both cases you get the un-replaced skeleton object error because a
>>>> referenced object or ProxyReference has been stored without storing the
>>>> actual object. Hopefully that makes it a little clearer what is
>>>> happening, even if it doesn't actually solve your problem.
>>>>
>>>> Tomorrow I hope to finish a change to make the DirectDataLoader use a
>>>> ParallelBatchingFetcher - to group the integration queries into
>>>> configurable batch sizes as is done in standard data loading. I think
>>>> that will achieve a similar result to the code you've been working on.
>>>>
>>>> The alternative would be to add an "I'm doing something weird but let me
>>>> get on with it" flag to allow you to store ProxyReferences to fill in
>>>> foreign keys without getting the skeletons error.
>>>>
>>>> All the best,
>>>> Richard.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>> Hi Richard (and gang),
>>>>>
>>>>> I have a question about an error message I’m seeing. A little
>>>>> background. As you know, I’m trying to speed up the loading. What I’m
>>>>> trying to do is to use a DirectDataLoader to load our protein families
>>>>> as fast as possible. I was thinking that if I could do all my queries
>>>>> for existing gene and protein records up front, then when I do my
>>>>> loading, I can create a ProteinFamily object with the references to the
>>>>> genes and proteins in the production database pre-filled.
>>>>>
>>>>> I would have thought that removing the keys to the genes and proteins
>>>>> from the data loader’s primary keys file would prevent any query to the
>>>>> database during the loading. I’ve seen that when I remove all the keys
>>>>> to genes and proteins then the integration step does not query the
>>>>> production db during the insertions. This is what I want. And as far as
>>>>> I can tell, there are no unnecessary queries happening.
>>>>>
>>>>> I’ve run the integration step and I get an ObjectStoreException: Some
>>>>> skeletons were not replaced by real objects: 2671330
>>>>>
>>>>> There are a couple of things I’m not clear on; one of them is the
>>>>> notion of pure objects versus skeleton objects. There is a cryptic
>>>>> comment in IntegrationWriterDataTrackingImpl.close() about this error
>>>>> message which I don’t quite understand.
>>>>>
>>>>> I’ve tried this two ways: by creating ProxyReferences and with the more
>>>>> memory-heavy way by querying, then keeping the gene and protein objects
>>>>> in a hash. In both cases, I get the message.
>>>>>
>>>>> When I run this with the memory-heavy method, I see that I have
>>>>> duplicated genes and proteins in the production db, even though I never
>>>>> call store on the genes or proteins.
>>>>>
>>>>> So what I was wondering is 1) what does this error message mean? and
>>>>> 2) if I query for all the objects in advance that my new data objects
>>>>> will point to, how can I avoid having to do other queries during load
>>>>> time?
>>>>>
>>>>> Thanks. I appreciate all your help,
>>>>>
>>>>> Joe Carlson
>>>>>
>>>>
>>>
>>>
>>>
>>
>
>
>



