[InterMine Dev] Shouldn't unique keys be enforced during data source load into production database? (They're not.)

Julie Sullivan julie at flymine.org
Thu Apr 14 10:29:31 BST 2016


Sam

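Yes, you are right. The keys are more "integration" keys than true
unique keys: they are used to merge objects between sources, not
enforced as constraints when a source is loaded. It's up to each source
to make sure its own data is unique. This is not ideal or obvious and
can cause problems. I've made a ticket: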

	https://github.com/intermine/intermine/issues/1347
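
Until keys are enforced at load time, a source can check its own records
before loading them. The following is a minimal Python sketch of such a
check, not InterMine code. The KEYS dictionary, the find_key_violations
function, and the flat record layout are all hypothetical; the key
definitions and sample values are copied from the message quoted below.
Skipping records that lack a key field is an assumption of this sketch,
not a statement about how InterMine treats nulls.

```python
from collections import defaultdict

# Hypothetical in-memory mirror of the two QTL key definitions quoted below.
# "organism" stands in for the referenced organism's id (a flattening
# made up for this sketch).
KEYS = {
    "key_primaryidentifier": ("primaryIdentifier",),
    "key_secondaryidentifier": ("secondaryIdentifier", "organism"),
}


def find_key_violations(records, keys=KEYS):
    """Return {key_name: {key_value: [records]}} for key values shared by 2+ records."""
    violations = {}
    for key_name, fields in keys.items():
        groups = defaultdict(list)
        for rec in records:
            value = tuple(rec.get(field) for field in fields)
            # Assumption: a record missing any field of a key is not checked
            # against that key.
            if any(v is None for v in value):
                continue
            groups[value].append(rec)
        dupes = {v: recs for v, recs in groups.items() if len(recs) > 1}
        if dupes:
            violations[key_name] = dupes
    return violations


# The two QTL records from the quoted psql output, reduced to their key fields.
qtls = [
    {"primaryIdentifier": "GmComposite2003_H_Pod maturity 22-3",
     "secondaryIdentifier": "Pod maturity 22-3", "organism": 10000001},
    {"primaryIdentifier": "GmComposite2003_C2_Pod maturity 22-3",
     "secondaryIdentifier": "Pod maturity 22-3", "organism": 10000001},
]

violations = find_key_violations(qtls)
for key_name, dupes in violations.items():
    for value, recs in dupes.items():
        print(f"{key_name} violated by {len(recs)} records: {value}")
```

Run against the two records above, this flags key_secondaryidentifier
only, since the primary identifiers differ. A source could run a check
like this and fail with a clear message before anything reaches the
production database.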

Thanks
Julie

On 12/04/16 20:35, Sam Hokin wrote:
> I've finally resolved some extreme confusion about enforcement of unique
> keys defined in DATASOURCE_keys.properties. What I've found is that the
> production database can be loaded with two records that violate those
> key definitions without throwing an error; however, if you then run
> _another_ datasource that attempts to merge with those records, you get
> an error thrown as follows:
>
> Problem while loading item identifier 2_1100 because
> Duplicate objects from the same data source; o1 = "QTL:10012630" (in
> database), o2 = "QTL:10011241" (in database), source1 = "<Source:
> name="file-cmap", type="null", skeleton=false>", source2 = "<Source:
> name="file-cmap", type="null", skeleton=false>"
>
> Fair enough, but file-cmap was run two sources ago! And, yes, there ARE
> two records that violate the key definitions IN the production database
> from file-cmap:
>
> soymine=> select * from qtl where secondaryidentifier='Pod maturity 22-3';
> -[ RECORD 1 ]--------------+-------------------------------------
> id                         | 10011241
> primaryidentifier          | GmComposite2003_H_Pod maturity 22-3
> secondaryidentifier        | Pod maturity 22-3
> sequenceontologytermid     | 10012811
> organismid                 | 10000001
> class                      | org.intermine.model.bio.QTL
> -[ RECORD 2 ]--------------+-------------------------------------
> id                         | 10012630
> primaryidentifier          | GmComposite2003_C2_Pod maturity 22-3
> secondaryidentifier        | Pod maturity 22-3
> sequenceontologytermid     | 10012811
> organismid                 | 10000001
> class                      | org.intermine.model.bio.QTL
>
> Here are the key definitions that resulted in the error being thrown
> well after the culprit datasource was run:
>
> QTL.key_primaryidentifier=primaryIdentifier
> QTL.key_secondaryidentifier=secondaryIdentifier,organism
>
> The second of these is violated by these two records.
>
> This seems fundamentally wrong (and very confusing) to me. I should not
> be able to load duplicate objects into the production database which
> violate the unique keys that I've defined.
>
> Is there a reason why the uniqueness rules defined in
> DATASOURCE_keys.properties aren't applied when loading data from a
> datasource into the production database, but rather only throw an error
> during a merge later on? Is there something I can change so that my
> uniqueness rules are enforced during data loading, so I get an
> informative error when it happens, rather than down the line when I'm
> running a totally different data source?
>


