[InterMine Dev] question about an error message

Joe Carlson jwcarlson at lbl.gov
Thu Apr 23 17:24:32 BST 2015

On 04/23/2015 01:44 AM, Richard Smith wrote:
> Hi Joe,
> So do you do queries at the start to get all protein and organism ids you
> need?  How long does that query take?

Yes. I'm querying and hashing the ids for all genes and proteins. It is 
pretty quick: the 173K genes take under a minute, and the query for the 
234K proteins likewise takes under a minute. I was not playing with VM 
size, and (as proxy references) everything ran in 32G. When I stored 
things as objects I needed to bump it up to 50G.
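To make the idea concrete, here is a minimal sketch of the query-and-hash step. The actual InterMine query is elided; the rows, ids, and method names (loadGeneIds, lookup) are made up for illustration only:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the pre-fetch idea: run one query per class up front and keep
// identifier -> production id in a hash, so the parser never has to hit
// the production db per record. Rows here are faked stand-ins for a query
// like "SELECT primaryIdentifier, id FROM Gene" over the objectstore.
public class PrefetchIds {
    private final Map<String, Integer> geneIdMap = new HashMap<>();

    // Stand-in for the single up-front query over Gene.
    void loadGeneIds() {
        geneIdMap.put("AT1G01010", 5001);
        geneIdMap.put("AT1G01020", 5002);
    }

    // Returns the production id if the gene was pre-fetched, null otherwise.
    Integer lookup(String primaryIdentifier) {
        return geneIdMap.get(primaryIdentifier);
    }

    public static void main(String[] args) {
        PrefetchIds p = new PrefetchIds();
        p.loadGeneIds();
        System.out.println(p.lookup("AT1G01010")); // prints 5001
    }
}
```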
> Fetching ids at the start is probably a more efficient way to load some
> sources, I've written a different integration system that was really fast
> using this method.
> It puts a bit more work into writing individual parsers and would work
> better for references than for merging objects but as you have found it
> saves creation of a large number of objects only needed for references. It
> also requires a specific load order for data sources but that doesn't
> matter too much.

You're right about the load order. I've written the loader so that it 
will create and store an object if it was not prefetched, so other than 
a huge performance hit, it should work. (One of my quibbles about the 
InterMine architecture is that it implicitly depends on the order of 
elements in project.xml to determine load order. I realize that putting 
ant-style dependencies into the schema would complicate matters a lot, 
but I had always thought XML should be processable in any order. Once, 
when I was scripting the generation of project.xml, the postprocessing 
element came before the sources and InterMine complained.)
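The fallback I mean is just a get-or-create wrapper around the prefetched map; a rough sketch, where store() and all the ids are hypothetical stand-ins for the real IntegrationWriter call:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the load-order fallback: use the prefetched id if present,
// otherwise create and store the object on the spot (the slow path).
// store() is a fake stand-in for writing through the IntegrationWriter.
public class GetOrCreate {
    private final Map<String, Integer> prefetched = new HashMap<>();
    private int nextId = 9000; // fake id sequence for newly stored objects

    GetOrCreate() {
        prefetched.put("P12345", 7001); // pretend this protein was prefetched
    }

    // Fake store: remember the new id so we only store each object once.
    private int store(String identifier) {
        int id = nextId++;
        prefetched.put(identifier, id);
        return id;
    }

    int idFor(String identifier) {
        Integer id = prefetched.get(identifier);
        return (id != null) ? id : store(identifier);
    }

    public static void main(String[] args) {
        GetOrCreate g = new GetOrCreate();
        System.out.println(g.idFor("P12345")); // prints 7001 (prefetched)
        System.out.println(g.idFor("P99999")); // prints 9000 (stored on the fly)
    }
}
```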

And the loader is very specifically tied to our data model. If someone 
else had a different notion of protein families and represented things 
slightly differently, they'd probably find my loader useless. But that's 
a trade-off we need to make, since loading time is our biggest issue 
now. I'll push my loader up to GitHub later today.

I'm also thinking about a different loading technique. I should warn 
you, it will make you angry.


> Cheers,
> Richard.
>> Hi Richard,
>> I played around with entering the ids for objects into idMap. This
>> dramatically speeds up the processing: 48 minutes down to 8 minutes in a
>> trial run loading a set of our protein families into a sparsely filled
>> mine. This is a good thing.
>> I created a method markAsStored(Integer) in IntegrationWriterAbstractImpl
>> to set the idMap entry. It seems that this would be a good method to put
>> in the higher-level interface as well. I’ll leave that decision to you.
>> Joe
>>> On Apr 22, 2015, at 6:59 AM, Richard Smith <richard at flymine.org> wrote:
>>> Hi Joe,
>>> I think you might be able to achieve this using ProxyReferences and
>>> pre-populating the IntegrationWriterAbstractImpl idMap with the known
>>> ids mapping to themselves. Then the integration writer should think it
>>> has seen these objects before and just store the foreign keys.
>>> IntegrationWriterAbstractImpl has a public assignMapping() method.
>>> Subclassing ProxyReference may be a little trickier; the code to look at
>>> is in IntegrationWriterAbstractImpl.copyField(), which is where store is
>>> called on referenced objects.
>>> Good luck!
>>> Richard.
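A toy model of the trick Richard describes: assignMapping() mirrors the real method name from the thread, but the rest is a simulation of the writer's behavior, not InterMine code:

```java
import java.util.HashMap;
import java.util.Map;

// Toy model: the integration writer keeps an idMap from source ids to
// production ids; anything already in the map is considered stored, and
// only its foreign key is written. Pre-populating known production ids as
// mapping to themselves therefore skips the store entirely.
public class IdMapTrick {
    private final Map<Integer, Integer> idMap = new HashMap<>();

    void assignMapping(int sourceId, int targetId) {
        idMap.put(sourceId, targetId);
    }

    // What the writer conceptually does when it meets a reference: if the
    // id is mapped, return the target id for the foreign key; if not, it
    // would have to store a skeleton (represented here by -1).
    int resolveReference(int sourceId) {
        Integer target = idMap.get(sourceId);
        return (target != null) ? target : -1;
    }

    public static void main(String[] args) {
        IdMapTrick w = new IdMapTrick();
        w.assignMapping(2671330, 2671330); // known id maps to itself
        System.out.println(w.resolveReference(2671330)); // prints 2671330
        System.out.println(w.resolveReference(42));      // prints -1
    }
}
```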
>>>> Hi Richard,
>>>> This is starting to make more sense. I’m thinking that the error
>>>> message is really what I want to get. Not creating real objects from
>>>> proxy references is what I’d like to do.
>>>> So I was thinking of subclassing ProxyReference to make a non-storable
>>>> ProxyReference, something that the store method would just ignore. Do
>>>> you think there is a problem with this approach?
>>>> Joe
>>>>> On Apr 21, 2015, at 7:44 AM, Richard Smith <richard at flymine.org>
>>>>> wrote:
>>>>> Hi Joe,
>>>>> When an object is stored, any objects it references are stored first
>>>>> so the correct ids can be inserted in foreign key columns. If the
>>>>> referenced object has already been stored then the new target id is
>>>>> known; if it hasn't, then a skeleton object is stored.
>>>>> The skeleton fills in enough fields to store the object, but the
>>>>> loader expects the full object to be stored later in the load. By
>>>>> storing the skeleton and waiting for the real object we don't have to
>>>>> pause the load to go looking for referenced objects in the source data
>>>>> every time one is seen.
>>>>> In the case where you created and referenced actual objects (not
>>>>> ProxyReferences), these were stored, but without any integration keys,
>>>>> so you ended up with duplicate objects.
>>>>> In both cases you get the un-replaced skeleton object error because a
>>>>> referenced object or ProxyReference has been stored without the actual
>>>>> object being stored. Hopefully that makes it a little clearer what is
>>>>> happening, even if it doesn't actually solve your problem.
>>>>> Tomorrow I hope to finish a change to make the DirectDataLoader use a
>>>>> ParallelBatchingFetcher - to group the integration queries into
>>>>> configurable batch sizes, as is done in standard data loading. I think
>>>>> that will achieve a similar result to the code you've been working on.
>>>>> The alternative would be to add an "I'm doing something weird but let
>>>>> me get on with it" flag to allow you to store ProxyReferences to fill
>>>>> in foreign keys without getting the skeletons error.
>>>>> All the best,
>>>>> Richard.
>>>>>> Hi Richard (and gang),
>>>>>> I have a question about an error message I’m seeing. A little
>>>>>> background. As you know, I’m trying to speed up the loading. What I’m
>>>>>> trying to do is to use a DirectDataLoader to load our protein
>>>>>> families as fast as possible. I was thinking that if I could do all
>>>>>> my queries for existing gene and protein records up front, then when
>>>>>> I do my loading, I can create a ProteinFamily object with the
>>>>>> references to the genes and proteins in the production database
>>>>>> pre-filled.
>>>>>> I would have thought that removing the keys to the genes and proteins
>>>>>> from the data loader’s primary keys file would prevent any query to
>>>>>> the database during the loading. I’ve seen that when I remove all the
>>>>>> keys to genes and proteins, the integration step does not query the
>>>>>> production db during the insertions. This is what I want. And as far
>>>>>> as I can tell, there are no unnecessary queries happening.
>>>>>> I’ve run the integration step and I get an ObjectStoreException: Some
>>>>>> skeletons were not replaced by real objects: 2671330
>>>>>> There are a couple of things I’m not clear on; one of them is the
>>>>>> notion of pure objects versus skeleton objects. There is a cryptic
>>>>>> comment in IntegrationWriterDataTrackingImpl.close() about this error
>>>>>> message which I don’t quite understand.
>>>>>> I’ve tried this two ways: by creating ProxyReferences, and with the
>>>>>> more memory-heavy way of querying and then keeping the gene and
>>>>>> protein objects in a hash. In both cases, I get the message.
>>>>>> When I run this with the memory-heavy method, I see that I have
>>>>>> duplicated genes and proteins in the production db, even though I
>>>>>> never call store on the genes or proteins.
>>>>>> So what I was wondering is 1) what does this error message mean? and
>>>>>> 2) if I query in advance for all the objects that my new data objects
>>>>>> will point to, how can I avoid having to do other queries during load
>>>>>> time?
>>>>>> Thanks. I appreciate all your help,
>>>>>> Joe Carlson _______________________________________________
>>>>>> dev mailing list
>>>>>> dev at intermine.org
>>>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
