[InterMine Dev] strange database record corruption

Richard Smith richard at flymine.org
Thu Jul 2 17:41:28 BST 2015


Hi Joe,
Yes, there was an on-demand multiple inheritance feature. I'm not
convinced it actually works properly which is why you see the strange
object serialisation. Not all of us thought it was a good idea at the time
and it still isn't :)

Glad to hear you've had an improvement in loading speed on the new
hardware. I hope in the 1.6 release we'll pull together several small
performance improvements and will be able to see if that helps some more.

I'm still surprised that you see an id collision, let us know if you work
out why.

Cheers,
Richard.



> Hi Richard,
>
> Thanks for the email. I did a few more experiments after sending that last
> email and, while I still don’t quite understand it all, at least I have
> something that is working and just wanted to get the loading done before
> poking at it again.
>
> I took out all of my idMap manipulation and still saw a problem. I had
> only been using it to keep a record of objects that I had retrieved from
> the database and kept around as proxy references. I was trying to save the
> time of having to do a query again when it came time to save the objects
> that referenced those things. So, other than taking a lot longer, the
> loading failed in the same manner.
>
> I had been wondering about the fact that the class for the new thing had
> both classes listed; I was thinking you were going to have some sort of
> multiple inheritance going on. Might have seemed like a good idea at the
> time, but I imagine it would be a pain to get the serialization to work
> out.
>
> The good news is that we’ve upgraded some hardware here and the slowdown
> that we’d seen in the past has gone away. We’re seeing a performance
> dip comparable to what you had: a slight dip but nothing horrible. And the
> entire loading process is down to 48 hours or so. Getting better. (Though
> I still need to do the transfer-sequence post processing step to see how
> long that takes; that was another bottleneck in the loading.)
>
> If I have any more experience and thoughts about the id collision thing,
> I’ll let you know.
>
> joe
>
>> On Jul 1, 2015, at 8:29 AM, Richard Smith <richard at flymine.org> wrote:
>>
>> Hi Joe,
>> Object ids in the target database are assigned when an object is stored;
>> the ids are fetched (in batches) from an auto-incrementing sequence in the
>> database. The same id will never be assigned to two different new
>> objects.
>>
>> I think I see what's happening with the DirectDataLoader and the pro-user
>> idMap manipulation though.
>>
>> When objects are loaded from an items database they have ids in the source
>> items database; when stored they are assigned a new id in the production
>> database. The idMap maps between the id the item had and the id assigned
>> in the target database. This means references from the original items are
>> preserved: if organism item 101 is stored and gets an object id of 201, we
>> know a gene item referencing organism 101 should reference organism object
>> 201 in the production db.
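
The source-id to target-id mapping described above can be sketched roughly as follows. This is a minimal illustration only: the class IdMapSketch and its methods are hypothetical names and are not InterMine's actual IntegrationWriter code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not InterMine code: maps ids from a source items
// database to the ids assigned in the production (target) database.
class IdMapSketch {
    private final Map<Integer, Integer> idMap = new HashMap<>();
    private int nextTargetId; // stands in for the database sequence

    IdMapSketch(int firstTargetId) {
        this.nextTargetId = firstTargetId;
    }

    /** Store a source item: assign a new target id once and remember it. */
    int store(int sourceId) {
        Integer existing = idMap.get(sourceId);
        if (existing != null) {
            return existing;
        }
        int targetId = nextTargetId++;
        idMap.put(sourceId, targetId);
        return targetId;
    }

    /** Translate a reference to a source id; null if not stored yet. */
    Integer resolve(int sourceId) {
        return idMap.get(sourceId);
    }
}
```

With this sketch, storing organism item 101 when the next free target id is 201 means a later resolve(101) for a gene's organism reference yields 201.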
>>
>> With the DirectDataLoader there aren't any source ids because there's no
>> items database. That doesn't matter, we just assign ids sequentially as
>> objects are created. These aren't the ids stored in the production
>> database.
>>
>> I forget exactly what you're doing to manipulate the idMap but presumably
>> you're pre-populating it with known ids from the production database. At
>> some point these ids are colliding with the throwaway source ids generated
>> in DirectDataLoader.
>>
>> As for solutions - I don't think decrementing new ids in DirectDataLoader
>> is a good idea as valid ids can be negative. You're right that calling
>> IntegrationWriter.getSerial() will throw away ids, potentially quite a
>> lot. A better fix might be to provide the source ids you've put in the
>> idMap to DirectDataLoader and tell it not to assign any of those.
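
One way the last suggestion could look is sketched below. It is a sketch under assumptions, not a patch against DirectDataLoader: the class SourceIdAllocator, its methods, and the counter starting at 0 are all hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not InterMine code: hands out sequential throwaway
// source ids while skipping any id already placed in the idMap by hand.
class SourceIdAllocator {
    private final Set<Integer> reserved = new HashSet<>();
    private int counter = 0; // assumed starting value

    /** Mark an id as already in use, so it is never handed out. */
    void reserve(int id) {
        reserved.add(id);
    }

    /** Return the next sequential id that is not reserved. */
    int next() {
        int candidate = counter++;
        while (reserved.contains(candidate)) {
            candidate = counter++;
        }
        return candidate;
    }
}
```

Unlike calling getSerial() for every new object, this only skips the ids that were actually reserved, so no database serial numbers are consumed.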
>>
>> Oh, and the mangled object that was created is a 'feature' - it's possible
>> to store dynamic objects that combine multiple classes in ways not defined
>> in the model. We don't use this (on purpose) and will hopefully remove it
>> soon.
>>
>> Hope this helps,
>> Richard.
>>
>>
>>> Hello again,
>>>
>>> Sorry if that email was confusing. I tend to write semi-coherent email
>>> late at night when I’m about to call it a night. And I was hoping to catch
>>> you folks before the weekend.
>>>
>>> The issue appears to be a collision in the id fields between the one
>>> generated by DirectDataLoader.createObject and an id for an object already
>>> in the database. I’m loading ~1M records (100K families, 700K members,
>>> plus another 100K centroid sequences and 100K sequence alignment records),
>>> and once I used an id for an object created by
>>> DirectDataLoader.createObject that collided with one in the db (the first
>>> organism record, as it turned out), I got this weird object merger.
>>> What appears to have been an important factor is that I was trying to
>>> minimize the querying of the database during the loading process by
>>> telling the IntegrationWriter which elements I had already retrieved from
>>> the database (including organisms) and so do not need to be queried, by
>>> inserting them into IntegrationWriter’s idMap. I think Richard had
>>> suggested this; I’m not totally sure. So I made a markElementAsStored
>>> routine in IntegrationWriter to do this.
>>>
>>> But when IntegrationWriter.getEquivalentObject is called, if an id of an
>>> object created with DirectDataLoader.createObject coincides with an id in
>>> idMap, then the two things will be called equivalent and some sort of mess
>>> gets created.
>>>
>>> Now, I see that manipulating idMap is a dangerous thing and I’ll stop. Or
>>> at least be more careful in how I do it. But I’m curious about this
>>> approach. It seems to me that there will always be a possibility that an
>>> id generated by createObject will collide with the id of something that
>>> has already been retrieved - and possibly updated - by the
>>> IntegrationWriter. So there is a slight chance of a collision.
>>>
>>> I’m trying a couple of workarounds: one is to decrement the idCounter
>>> from 0 in DirectDataLoader.createObject rather than incrementing it. So
>>> long as it is unique, this should be OK, right? The other is to call setId
>>> with getIntegrationWriter().getSerial(). The first method may give
>>> problems if I have to worry about integer wraparound. The second wastes
>>> some serial numbers. Both methods seem to work at first blush. I’m tempted
>>> to go with the first: chances are, if I have integer wraparound I’m going
>>> to have other problems.
>>>
>>> Thanks for all your work and help on this!
>>>
>>> Joe Carlson
>>>
>>> On Jun 25, 2015, at 11:19 PM, Joe Carlson <jwcarlson at lbl.gov> wrote:
>>>
>>>> Hello,
>>>>
>>>> So I’m seeing something very, very weird. Somehow I’m managing to get
>>>> items in different tables with the same id and a corrupted record in the
>>>> intermineobject table.
>>>>
>>>> I’m loading clusters of proteins. The relevant tables are proteinfamily
>>>> and proteinfamilymember. Each collection of proteinfamilies is based on
>>>> the set of organisms in the family building (some collections have only a
>>>> few organisms, others have many).
>>>>
>>>> In 3 out of 12 of the collections, the data loading fails with a very
>>>> cryptic error message. Here is an example of the message:
>>>>> /global/u1/j/jcarlson/src/intermine/bio/sources/phytozome-clusters/build.xml:23:
>>>>> java.lang.IllegalArgumentException: Conflicting values for field
>>>>> Gene,ProteinFamilyMember.organism between phytozome-chado-A.coerulea
>>>>> (value "Organism,ProteinFamilyMember [annotationVersion="v1.1",
>>>>> assemblyVersion="v1", commonName="Colorado blue columbine",
>>>>> count="10",
>>>>> genus="Aquilegia", id="1000003", membershipDetail="HMM pledge -
>>>>> complete", name="Aquilegia coerulea", organism=89000000,
>>>>> protein=91700834, proteinFamily=378112305, proteomeId="195",
>>>>> shortName="A. coerulea", species="coerulea", taxonId="218851",
>>>>> version="current"]" in database with ID 1973680) and
>>>>> phytozome-cluster-node-4956 (value "Organism
>>>>> [annotationVersion="v1.1",
>>>>> assemblyVersion="v1.0", commonName="switchgrass", genus="Panicum",
>>>>> id=142000000, name="Panicum virgatum", proteomeId=273, shortName="P.
>>>>> virgatum", species="virgatum", taxonId=38727, version="current"]"
>>>>> being
>>>>> stored). This field needs configuring in the
>>>>> genomic_priorities.properties file
>>>>>        at org.intermine.dataloader.SourcePriorityComparator.compare(SourcePriorityComparator.java:276)
>>>>>        at org.intermine.dataloader.SourcePriorityComparator.compare(SourcePriorityComparator.java:34)
>>>>>        at java.util.TreeMap.put(TreeMap.java:545)
>>>>>        at java.util.TreeSet.add(TreeSet.java:255)
>>>>>        at org.intermine.dataloader.IntegrationWriterDataTrackingImpl.store(IntegrationWriterDataTrackingImpl.java:385)
>>>>
>>>>
>>>> It is strange since the data that I’m loading has no previous objects to
>>>> compare against. This confused me for a long time; I thought I just had
>>>> problems with my keys. Then I saw that there was a corruption in the
>>>> intermineobject table:
>>>>
>>>>> select * from intermineobject where id=1000003;
>>>>> object | id | class
>>>>> $_^org.intermine.model.bio.Organism
>>>>> org.intermine.model.bio.ProteinFamilyMember$_^aannotationVersion$_^v1.1$_^aassemblyVersion$_^v1$_^acommonName$_^Colorado
>>>>> blue
>>>>> columbine$_^acount$_^10$_^agenus$_^Aquilegia$_^aid$_^1000003$_^amembershipDetail$_^HMM
>>>>> pledge - complete$_^aname$_^Aquilegia
>>>>> coerulea$_^rorganism$_^89000000$_^rprotein$_^91700834$_^rproteinFamily$_^378112305$_^aproteomeId$_^195$_^ashortName$_^A.
>>>>> coerulea$_^aspecies$_^coerulea$_^ataxonId$_^218851$_^aversion$_^current
>>>>> | 1000003 | org.intermine.model.bio.Organism
>>>>> org.intermine.model.bio.ProteinFamilyMember
>>>>> (1 row)
>>>>
>>>>
>>>> This is some sort of unholy union of an organism and a
>>>> proteinfamilymember. There is an entry in both the organism table and
>>>> proteinfamilymember table with this id. The fields of these two records
>>>> are OK, other than the fact that the class field is the concatenation of
>>>> the 2 class names.
>>>>
>>>> The behavior is reproducible; after inserting ~100K families and ~700K
>>>> members, the loading fails on the same exact record if I load in the
>>>> same order. If I change the loading order, there is a similar error on a
>>>> different entry. 1000003 is my first “non-trivial” intermine object (the
>>>> others being sequence ontology, a data source and a data set record).
>>>>
>>>> Have you seen this type of behavior before? I just found out about this
>>>> record corruption tonight. The fact that it is so reproducible makes me
>>>> think there is some sort of counter rollover that I’m running into.
>>>>
>>>> In the interests of full disclosure, I should say I’m using a direct
>>>> data loader. The code is in my github repo in
>>>> bio/sources/phytozome-clusters/.
>>>>
>>>> Thanks,
>>>>
>>>> Joe
>>>>
>>>
>>> _______________________________________________
>>> dev mailing list
>>> dev at intermine.org
>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>>
>>
>
>
>



