[InterMine Dev] strange database record corruption

Richard Smith richard at flymine.org
Wed Jul 1 16:29:27 BST 2015


Hi Joe,
Object ids in the target database are assigned when an object is stored.
The ids are fetched (in batches) from an auto-incrementing sequence in the
database, so the same id will never be assigned to two different new
objects.
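
As a rough illustration of batched id allocation, here is a minimal sketch.
It is not the InterMine code: the class name BatchedIdAllocator is made up,
and an AtomicInteger stands in for the database sequence. It only shows why
two writers drawing batches from one sequence cannot hand out the same id.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of batched id allocation; not InterMine's implementation. */
final class BatchedIdAllocator {
    private final AtomicInteger sequence; // stands in for the database sequence (assumption)
    private final int batchSize;
    private int next = 0;      // next id to hand out from the current batch
    private int batchEnd = 0;  // exclusive end of the current batch; 0 forces a first fetch

    BatchedIdAllocator(AtomicInteger sequence, int batchSize) {
        this.sequence = sequence;
        this.batchSize = batchSize;
    }

    /** Returns a fresh id, reserving a new batch from the sequence when the current one is used up. */
    synchronized int nextId() {
        if (next >= batchEnd) {
            // Atomically reserve batchSize consecutive ids for this allocator only.
            next = sequence.getAndAdd(batchSize);
            batchEnd = next + batchSize;
        }
        return next++;
    }
}
```

Any ids left unused in a batch are simply never handed out, which is the
sense in which fetching from the sequence can waste ids.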

I think I see what's happening with the DirectDataLoader and the pro-user
idMap manipulation though.

When objects are loaded from an items database they have ids in the source
items database; when stored, they are assigned a new id in the production
database. The idMap maps between the id the item had and the id assigned
in the target database. This means references from the original items are
preserved: if organism item 101 is stored and gets an object id of 201, we
know a gene item referencing organism 101 should reference organism object
201 in the production db.
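
The organism 101 -> 201 example can be sketched like this. The class and
method names (IdMapSketch, recordStored, resolveReference) are hypothetical
and only illustrate the mapping idea, not the real IntegrationWriter code.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a source-id to production-id map. */
final class IdMapSketch {
    // key: id the item had in the source; value: id assigned in the production database
    private final Map<Integer, Integer> idMap = new HashMap<>();

    /** Remember which production id a stored source item received. */
    void recordStored(int sourceId, int productionId) {
        idMap.put(sourceId, productionId);
    }

    /** Translate a reference to a source id; returns null if that item has not been stored yet. */
    Integer resolveReference(int sourceId) {
        return idMap.get(sourceId);
    }
}
```

After recordStored(101, 201), a gene item that references organism 101
resolves to production object 201 via resolveReference(101).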

With the DirectDataLoader there aren't any source ids because there's no
items database. That doesn't matter; we just assign ids sequentially as
objects are created. These aren't the ids stored in the production
database.

I forget exactly what you're doing to manipulate the idMap, but presumably
you're pre-populating it with known ids from the production database. At
some point these ids are colliding with the throwaway source ids generated
in DirectDataLoader.
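
To make the collision concrete, here is a small sketch under assumptions:
the idMap has been pre-populated with a production id as its key (1000003 is
borrowed from the report below purely as an example), and throwaway ids are
handed out sequentially from some start value. IdCollisionDemo and
firstCollision are invented names for this illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical demo: sequential throwaway ids eventually hit a pre-populated idMap key. */
final class IdCollisionDemo {
    /**
     * Walks sequential ids from start (inclusive) to limit (exclusive) and returns
     * the first one that is already a key in idMap, or -1 if none collides.
     */
    static int firstCollision(Map<Integer, Integer> idMap, int start, int limit) {
        for (int id = start; id < limit; id++) {
            if (idMap.containsKey(id)) {
                return id; // a new object with this id would look like one already stored
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> idMap = new HashMap<>();
        idMap.put(1000003, 1000003); // pre-populated with a known production id (assumption)
        System.out.println(firstCollision(idMap, 0, 2_000_000));
    }
}
```

Once the counter reaches a pre-populated key, the new object is looked up in
the idMap as if it were the existing one.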

As for solutions - I don't think decrementing new ids in DirectDataLoader
is a good idea, as valid ids can be negative. You're right that calling
IntegrationWriter.getSerial() will throw away ids, potentially quite a
lot. A better fix might be to provide the source ids you've put in the
idMap to DirectDataLoader and tell it not to assign any of those.
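
Something along these lines, as a sketch only. ReservedIdCounter is a
made-up class, not an existing DirectDataLoader feature; it just shows a
sequential counter that steps over a set of ids it has been told to avoid.
It ignores integer overflow.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sequential id counter that never returns a reserved id. */
final class ReservedIdCounter {
    private final Set<Integer> reserved; // ids already used as idMap keys (assumption)
    private int next;

    ReservedIdCounter(Set<Integer> reserved, int start) {
        this.reserved = new HashSet<>(reserved); // defensive copy
        this.next = start;
    }

    /** Returns the next sequential id, skipping any that are reserved. */
    int nextId() {
        while (reserved.contains(next)) {
            next++;
        }
        return next++;
    }
}
```

With ids 2 and 3 reserved and a start value of 1, successive calls to
nextId() return 1, 4, 5.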

Oh, and the mangled object that was created is a 'feature' - it's possible
to store dynamic objects that combine multiple classes in ways not defined
in the model. We don't use this (on purpose) and will hopefully remove it
soon.

Hope this helps,
Richard.


> Hello again,
>
> Sorry if that email was confusing. I tend to write semi-coherent email
> late at night when I’m about to call it a night. And I was hoping to
> catch you folks before the weekend.
>
> The issue appears to be a collision in the id fields between the one
> generated by DirectDataLoader.createObject and an id for an object already
> in the database. I’m loading ~ 1M records (100K families, 700K members,
> plus another 100K centroid sequences and 100K sequence alignment records.)
> and once I used an id for an object created by
> DirectDataLoader.createObject that collided with one in the db (the first
> organism record, as it turned out), then I got this weird object merger.
> What appears to have been an important factor was the fact that I was
> trying to minimize the querying of the database during the loading process
> by telling the IntegrationWriter what elements that I had retrieved from
> the database - including organisms - that do not need to be queried by
> inserting into IntegrationWriter’s idMap. I think Richard had suggested
> this; I’m not totally sure. So I made a markElementAsStored routine in
> IntegrationWriter to do this.
>
> But when IntegrationWriter.getEquivalentObject is called, if an id of an
> object created with DirectDataLoader.createObject coincides with an id in
> idMap, then the two things will be called equivalent and some sort of mess
> gets created.
>
> Now, I see that manipulating idMap is a dangerous thing and I’ll stop. Or
> at least be more careful in how I do it. But I’m curious about this
> approach. It seems to me that there will always be a possibility that an
> id generated by createObject will collide with the id of something that
> has already been retrieved - and possibly updated - by the
> IntegrationWriter. So there is a slight chance of a collision.
>
> I’m trying a couple of workarounds: one is to decrement the idCounter
> from 0 in DirectDataLoader.createObject rather than incrementing it. So
> long as it is unique, this should be OK, right? The other is to call setId
> with getIntegrationWriter().getSerial(). Both appear to work. The first
> method may give problems if I have to worry about integer wrap around. The
> second wastes some serial numbers. Both methods seem to work at first
> blush. I’m tempted to go with the first: chances are, if I have integer
> wrap around I’m going to have other problems.
>
> Thanks for all your work and help on this!
>
> Joe Carlson
>
> On Jun 25, 2015, at 11:19 PM, Joe Carlson <jwcarlson at lbl.gov> wrote:
>
>> Hello,
>>
>> So I’m seeing something very, very weird. Somehow I’m managing to get
>> items in different tables with the same id and a corrupted record in the
>> intermineobject table.
>>
>> I’m loading clusters of proteins. The relevant tables are proteinfamily
>> and proteinfamilymember. Each collection of proteinfamilies is based on
>> the set of organisms used in the family building (some collections have
>> only a few organisms, others have many).
>>
>> In 3 out of 12 of the collections, the data loading fails with a very
>> cryptic error message. Here is an example of the message:
>>> /global/u1/j/jcarlson/src/intermine/bio/sources/phytozome-clusters/build.xml:23:
>>> java.lang.IllegalArgumentException: Conflicting values for field
>>> Gene,ProteinFamilyMember.organism between phytozome-chado-A.coerulea
>>> (value "Organism,ProteinFamilyMember [annotationVersion="v1.1",
>>> assemblyVersion="v1", commonName="Colorado blue columbine", count="10",
>>> genus="Aquilegia", id="1000003", membershipDetail="HMM pledge -
>>> complete", name="Aquilegia coerulea", organism=89000000,
>>> protein=91700834, proteinFamily=378112305, proteomeId="195",
>>> shortName="A. coerulea", species="coerulea", taxonId="218851",
>>> version="current"]" in database with ID 1973680) and
>>> phytozome-cluster-node-4956 (value "Organism [annotationVersion="v1.1",
>>> assemblyVersion="v1.0", commonName="switchgrass", genus="Panicum",
>>> id=142000000, name="Panicum virgatum", proteomeId=273, shortName="P.
>>> virgatum", species="virgatum", taxonId=38727, version="current"]" being
>>> stored). This field needs configuring in the
>>> genomic_priorities.properties file
>>>         at
>>> org.intermine.dataloader.SourcePriorityComparator.compare(SourcePriorityComparator.java:276)
>>>         at
>>> org.intermine.dataloader.SourcePriorityComparator.compare(SourcePriorityComparator.java:34)
>>>         at java.util.TreeMap.put(TreeMap.java:545)
>>>         at java.util.TreeSet.add(TreeSet.java:255)
>>>         at
>>> org.intermine.dataloader.IntegrationWriterDataTrackingImpl.store(IntegrationWriterDataTrackingImpl.java:385)
>>
>>
>> It is strange since the data that I’m loading has no previous objects to
>> compare. This confused me for a long time; I thought I just had
>> problems with my keys. Then I saw that there was a corruption in the
>> intermineobject table:
>>
>>> select * from intermineobject where id=1000003;
>>>  object | id | class
>>> --------+----+-------
>>>  $_^org.intermine.model.bio.Organism
>>> org.intermine.model.bio.ProteinFamilyMember$_^aannotationVersion$_^v1.1$_^aassemblyVersion$_^v1$_^acommonName$_^Colorado
>>> blue
>>> columbine$_^acount$_^10$_^agenus$_^Aquilegia$_^aid$_^1000003$_^amembershipDetail$_^HMM
>>> pledge - complete$_^aname$_^Aquilegia
>>> coerulea$_^rorganism$_^89000000$_^rprotein$_^91700834$_^rproteinFamily$_^378112305$_^aproteomeId$_^195$_^ashortName$_^A.
>>> coerulea$_^aspecies$_^coerulea$_^ataxonId$_^218851$_^aversion$_^current
>>> | 1000003 | org.intermine.model.bio.Organism
>>> org.intermine.model.bio.ProteinFamilyMember
>>> (1 row)
>>
>>
>> This is some sort of unholy union of an organism and a
>> proteinfamilymember. There is an entry in both the organism table and
>> proteinfamilymember table with this id. The fields of these two records
>> are OK, other than the fact that the class field is the concatenation of
>> the 2 class names.
>>
>> The behavior is reproducible; after inserting ~ 100K families and ~ 700K
>> members, the loading fails on the same exact record if I load in the
>> same order. If I change the loading, there is a similar error on a
>> different entry. 1000003 is my first ‘non-trivial’ intermine object (the
>> others being sequence ontology, a data source and a data set record.)
>>
>> Have you seen this type of behavior before? I just found out about this
>> record corruption tonight. The fact that it is so reproducible makes me
>> think there is some sort of counter rollover that I’m running into.
>>
>> In the interests of full disclosure, I should say I’m using a direct
>> data loader. The code is in my github repo in
>> bio/sources/phytozome-clusters/.
>>
>> Thanks,
>>
>> Joe
>>
>
>



