[InterMine Dev] Shouldn't unique keys be enforced during data source load into production database? (They're not.)

Sam Hokin shokin at ncgr.org
Tue Apr 12 20:35:07 BST 2016


I've finally resolved some extreme confusion about the enforcement of unique keys defined in DATASOURCE_keys.properties. What I've 
found is that the production database can be loaded with two records that violate those key definitions without any error being 
thrown; however, if you then run _another_ datasource that attempts to merge with those records, you get an error like the following:

Problem while loading item identifier 2_1100 because
Duplicate objects from the same data source; o1 = "QTL:10012630" (in database), o2 = "QTL:10011241" (in database), source1 = 
"<Source: name="file-cmap", type="null", skeleton=false>", source2 = "<Source: name="file-cmap", type="null", skeleton=false>"

Fair enough, but file-cmap was run two sources ago! And, yes, there ARE two records from file-cmap in the production database that 
violate the key definitions:

soymine=> select * from qtl where secondaryidentifier='Pod maturity 22-3';
-[ RECORD 1 ]--------------+-------------------------------------
id                         | 10011241
primaryidentifier          | GmComposite2003_H_Pod maturity 22-3
secondaryidentifier        | Pod maturity 22-3
sequenceontologytermid     | 10012811
organismid                 | 10000001
class                      | org.intermine.model.bio.QTL
-[ RECORD 2 ]--------------+-------------------------------------
id                         | 10012630
primaryidentifier          | GmComposite2003_C2_Pod maturity 22-3
secondaryidentifier        | Pod maturity 22-3
sequenceontologytermid     | 10012811
organismid                 | 10000001
class                      | org.intermine.model.bio.QTL

Here are the key definitions that resulted in the error being thrown well after the culprit datasource was run:

QTL.key_primaryidentifier=primaryIdentifier
QTL.key_secondaryidentifier=secondaryIdentifier,organism

The second of these is violated by these two records.
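To make the violation concrete, here is a minimal sketch (in Python, not InterMine code) of the uniqueness check I would have expected at load time, applied to the two qtl records shown above. The field names mirror the qtl table columns, and I'm treating organismid as the database form of the "organism" reference in the key definition:

```python
from collections import defaultdict

# The two records from the psql output above
records = [
    {"id": 10011241,
     "primaryidentifier": "GmComposite2003_H_Pod maturity 22-3",
     "secondaryidentifier": "Pod maturity 22-3",
     "organismid": 10000001},
    {"id": 10012630,
     "primaryidentifier": "GmComposite2003_C2_Pod maturity 22-3",
     "secondaryidentifier": "Pod maturity 22-3",
     "organismid": 10000001},
]

def key_violations(records, key_fields):
    """Group records on the key fields; any group holding more than
    one record violates that unique key."""
    groups = defaultdict(list)
    for r in records:
        groups[tuple(r[f] for f in key_fields)].append(r["id"])
    return {k: ids for k, ids in groups.items() if len(ids) > 1}

# QTL.key_primaryidentifier=primaryIdentifier -> no violation
print(key_violations(records, ["primaryidentifier"]))
# {}

# QTL.key_secondaryidentifier=secondaryIdentifier,organism -> violated
print(key_violations(records, ["secondaryidentifier", "organismid"]))
# {('Pod maturity 22-3', 10000001): [10011241, 10012630]}
```

Running this check as each datasource loads would have flagged the pair immediately, instead of leaving it for a later merge to trip over.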

This seems fundamentally wrong (and very confusing) to me. I should not be able to load objects into the production database that 
violate the unique keys I've defined.

Is there a reason why the uniqueness rules defined in DATASOURCE_keys.properties aren't applied when a datasource is loaded into the 
production database, and instead only throw an error during a later merge? Is there something I can change so that my uniqueness 
rules are enforced during data loading, giving me an informative error at the moment the duplicate is loaded rather than down the 
line, when I'm running a totally different data source?
