[InterMine Dev] strange database record corruption

Joe Carlson jwcarlson at lbl.gov
Wed Jul 1 18:45:19 BST 2015


Hi Richard,

Thanks for the email. I did a few more experiments after sending that last email and, while I still don’t quite understand it all, at least I have something that is working and just wanted to get the loading done before poking at it again.

I took out all of my idMap manipulation and still saw a problem. I had only been using it to keep a record of objects that I had retrieved from the database and kept around as proxy references. I was trying to save the time of having to do a query again when it came time to save the objects that referenced those things. So, other than taking a lot longer, the loading failed in the same manner.

I had been wondering about the fact that the class for the new thing had both classes listed; I was thinking you were going to have some sort of multiple inheritance going on. Might have seemed like a good idea at the time, but I imagine it would be a pain to get the serialization to work out. 

The good news is that we’ve upgraded some hardware here and the slowdown that we’d seen in the past has gone away. We’re seeing a performance dip comparable to what you had: a slight dip but nothing horrible. And the entire loading process is down to 48 hours or so. Getting better. (Though I still need to do the transfer-sequence post-processing step to see how long that takes; that was another bottleneck in the loading.)

If I have any more experience and thoughts about the id collision thing, I’ll let you know.

joe

> On Jul 1, 2015, at 8:29 AM, Richard Smith <richard at flymine.org> wrote:
> 
> Hi Joe,
> Object ids in the target database are assigned when an object is stored;
> the ids are fetched (in batches) from a sequence in the database which
> autoincrements. The same id will never be assigned to two different new
> objects.
> 
> I think I see what's happening with the DirectDataLoader and the pro-user
> idMap manipulation though.
> 
> When objects are loaded from an items database they have ids in the source
> items database; when stored they are assigned a new id in the production
> database. The idMap maps between the id the item had and the id assigned
> in the target database. This means references from the original items are
> preserved: if organism item 101 is stored and gets an object id of 201 we
> know a gene item referencing organism 101 should reference organism object
> 201 in the production db.
> 
> With the DirectDataLoader there aren't any source ids because there's no
> items database. That doesn't matter; we just assign ids sequentially as
> objects are created. These aren't the ids stored in the production
> database.
> 
> I forget exactly what you're doing to manipulate the idMap but presumably
> you're pre-populating it with known ids from the production database. At
> some point these ids are colliding with the throwaway source ids generated
> in DirectDataLoader.
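
The collision described above can be sketched in isolation. This is a minimal illustration, not InterMine code: the class IdCollisionSketch and its methods markAsStored, createObjectId and looksAlreadyStored are hypothetical names, and the assumption that the throwaway counter starts at 0 and increments by one is taken from this thread rather than checked against the DirectDataLoader source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the loader's id bookkeeping; not InterMine code.
final class IdCollisionSketch {
    // Maps a source id to the id assigned in the production database.
    private final Map<Integer, Integer> idMap = new HashMap<>();
    // Throwaway counter for objects created without an items database
    // (assumed to start at 0 and count upwards).
    private int nextSourceId = 0;

    // Pre-populate the map with an object already fetched from production,
    // keyed by its production id (the risky shortcut discussed in the thread).
    void markAsStored(int productionId) {
        idMap.put(productionId, productionId);
    }

    // Hand out the next throwaway source id for a newly created object.
    int createObjectId() {
        nextSourceId++;
        return nextSourceId;
    }

    // True if the map claims an object with this source id is already stored.
    boolean looksAlreadyStored(int sourceId) {
        return idMap.containsKey(sourceId);
    }
}
```

If production id 3 is pre-populated and three new objects are then created, the third one receives throwaway id 3, so the lookup reports a brand-new object as already stored. Both kinds of id share one key space in the map, which is all it takes for the two objects to be treated as equivalent.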
> 
> As for solutions - I don't think decrementing new ids in DirectDataLoader
> is a good idea as valid ids can be negative. You're right that calling
> IntegrationWriter.getSerial() will throw away ids, potentially quite a
> lot. A better fix might be to provide the source ids you've put in the
> idMap to DirectDataLoader and tell it not to assign any of those.
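
A rough sketch of that last suggestion, assuming the ids placed in the idMap can be handed to whatever generates the throwaway ids. SkippingIdAllocator, reserve and next are hypothetical names for illustration only; nothing here is the actual DirectDataLoader API.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical allocator that never hands out an id already present in the idMap.
final class SkippingIdAllocator {
    // Ids that were put into the idMap by hand and must not be reused.
    private final Set<Integer> reserved = new HashSet<>();
    // Last id handed out; assumed to start at 0 and count upwards.
    private int counter = 0;

    // Register an id that new objects must never receive.
    void reserve(int id) {
        reserved.add(id);
    }

    // Return the next id that is not reserved.
    int next() {
        do {
            // incrementExact throws ArithmeticException instead of silently
            // wrapping around at Integer.MAX_VALUE.
            counter = Math.incrementExact(counter);
        } while (reserved.contains(counter));
        return counter;
    }
}
```

Unlike calling getSerial() for every new object, this does not consume ids from the database sequence; it only skips the handful of values that were reserved.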
> 
> Oh, and the mangled object that was created is a 'feature' - it's possible
> to store dynamic objects that combine multiple classes in ways not defined
> in the model. We don't use this (on purpose) and will hopefully remove it
> soon.
> 
> Hope this helps,
> Richard.
> 
> 
>> Hello again,
>> 
>> Sorry if that email was confusing. I tend to write semi-coherent email
>> late at night when I’m about to call it a night. And I was hoping to catch
>> you folks before the weekend.
>> 
>> The issue appears to be a collision in the id fields between the one
>> generated by DirectDataLoader.createObject and an id for an object already
>> in the database. I’m loading ~1M records (100K families, 700K members,
>> plus another 100K centroid sequences and 100K sequence alignment records),
>> and once I used an id for an object created by
>> DirectDataLoader.createObject that collided with one in the db (the first
>> organism record, as it turned out), then I got this weird object merger.
>> What appears to have been an important factor was the fact that I was
>> trying to minimize the querying of the database during the loading process
>> by telling the IntegrationWriter which elements I had retrieved from
>> the database - including organisms - do not need to be queried, by
>> inserting them into IntegrationWriter’s idMap. I think Richard had suggested
>> this; I’m not totally sure. So I made a markElementAsStored routine in
>> IntegrationWriter to do this.
>> 
>> But when IntegrationWriter.getEquivalentObject is called, if an id of an
>> object created with DirectDataLoader.createObject coincides with an id in
>> idMap, then the two things will be called equivalent and some sort of mess
>> gets created.
>> 
>> Now, I see that manipulating idMap is a dangerous thing and I’ll stop. Or
>> at least be more careful in how I do it. But I’m curious about this
>> approach. It seems to me that there will always be a possibility that an
>> id generated by createObject will collide with the id of something that
>> has already been retrieved - and possibly updated - by the
>> IntegrationWriter. So there is a slight chance of a collision.
>> 
>> I’m trying a couple of workarounds: one is to decrement the idCounter
>> from 0 in DirectDataLoader.createObject rather than incrementing it. So
>> long as it is unique, this should be OK, right? The other is to call setId
>> with getIntegrationWriter().getSerial(). The first method may give
>> problems if I have to worry about integer wrap around. The second wastes
>> some serial numbers. Both methods seem to work at first blush. I’m
>> tempted to go with the first: chances are, if I have integer wrap around
>> I’m going to have other problems.
>> 
>> Thanks for all your work and help on this!
>> 
>> Joe Carlson
>> 
>> On Jun 25, 2015, at 11:19 PM, Joe Carlson <jwcarlson at lbl.gov> wrote:
>> 
>>> Hello,
>>> 
>>> So I’m seeing something very, very weird. Somehow I’m managing to get
>>> items in different tables with the same id and a corrupted record in the
>>> intermineobject table.
>>> 
>>> I’m loading clusters of proteins. The relevant tables are proteinfamily
>>> and proteinfamilymember. Each collection of proteinfamilies is based on
>>> the set of organisms in the family building (some collections have only a
>>> few organisms, others have many).
>>> 
>>> In 3 out of 12 of the collections, the data loading fails with a very
>>> cryptic error message. Here is an example of the message:
>>>> /global/u1/j/jcarlson/src/intermine/bio/sources/phytozome-clusters/build.xml:23:
>>>> java.lang.IllegalArgumentException: Conflicting values for field
>>>> Gene,ProteinFamilyMember.organism between phytozome-chado-A.coerulea
>>>> (value "Organism,ProteinFamilyMember [annotationVersion="v1.1",
>>>> assemblyVersion="v1", commonName="Colorado blue columbine", count="10",
>>>> genus="Aquilegia", id="1000003", membershipDetail="HMM pledge -
>>>> complete", name="Aquilegia coerulea", organism=89000000,
>>>> protein=91700834, proteinFamily=378112305, proteomeId="195",
>>>> shortName="A. coerulea", species="coerulea", taxonId="218851",
>>>> version="current"]" in database with ID 1973680) and
>>>> phytozome-cluster-node-4956 (value "Organism [annotationVersion="v1.1",
>>>> assemblyVersion="v1.0", commonName="switchgrass", genus="Panicum",
>>>> id=142000000, name="Panicum virgatum", proteomeId=273, shortName="P.
>>>> virgatum", species="virgatum", taxonId=38727, version="current"]" being
>>>> stored). This field needs configuring in the
>>>> genomic_priorities.properties file
>>>>        at
>>>> org.intermine.dataloader.SourcePriorityComparator.compare(SourcePriorityComparator.java:276)
>>>>        at
>>>> org.intermine.dataloader.SourcePriorityComparator.compare(SourcePriorityComparator.java:34)
>>>>        at java.util.TreeMap.put(TreeMap.java:545)
>>>>        at java.util.TreeSet.add(TreeSet.java:255)
>>>>        at
>>>> org.intermine.dataloader.IntegrationWriterDataTrackingImpl.store(IntegrationWriterDataTrackingImpl.java:385)
>>> 
>>> 
>>> It is strange since the data that I’m loading has no previous objects to
>>> compare. This confused me for a long time; I thought I just had
>>> problems with my keys. Then I saw that there was a
>>> corruption in the intermineobject table:
>>> 
>>>> select * from intermineobject where id=1000003;
>>>> object | id | class
>>>> $_^org.intermine.model.bio.Organism
>>>> org.intermine.model.bio.ProteinFamilyMember$_^aannotationVersion$_^v1.1$_^aassemblyVersion$_^v1$_^acommonName$_^Colorado
>>>> blue
>>>> columbine$_^acount$_^10$_^agenus$_^Aquilegia$_^aid$_^1000003$_^amembershipDetail$_^HMM
>>>> pledge - complete$_^aname$_^Aquilegia
>>>> coerulea$_^rorganism$_^89000000$_^rprotein$_^91700834$_^rproteinFamily$_^378112305$_^aproteomeId$_^195$_^ashortName$_^A.
>>>> coerulea$_^aspecies$_^coerulea$_^ataxonId$_^218851$_^aversion$_^current
>>>> | 1000003 | org.intermine.model.bio.Organism
>>>> org.intermine.model.bio.ProteinFamilyMember
>>>> (1 row)
>>> 
>>> 
>>> This is some sort of unholy union of an organism and a
>>> proteinfamilymember. There is an entry in both the organism table and
>>> proteinfamilymember table with this id. The fields of these two records
>>> are OK, other than the fact that the class field is the concatenation of
>>> the 2 class names.
>>> 
>>> The behavior is reproducible; after inserting ~100K families and ~700K
>>> members, the loading fails on the same exact record if I load in the
>>> same order. If I change the loading, there is a similar error on a
>>> different entry. 1000003 is my first “non-trivial” intermine object (the
>>> others being sequence ontology, a data source and a data set record.)
>>> 
>>> Have you seen this type of behavior before? I just found out about this
>>> record corruption tonight. The fact that it is so reproducible makes me
>>> think there is some sort of counter rollover that I’m running into.
>>> 
>>> In the interests of full disclosure, I should say I’m using a direct
>>> data loader. The code is in my github repo in
>>> bio/sources/phytozome-clusters/.
>>> 
>>> Thanks,
>>> 
>>> Joe
>>> 
>> 
>> _______________________________________________
>> dev mailing list
>> dev at intermine.org
>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>> 
> 