[InterMine Dev] Shouldn't unique keys be enforced during data source load into production database? (They're not.)

Joe Carlson jwcarlson at lbl.gov
Tue Apr 12 20:59:10 BST 2016


Sam,

I'm sure the InterMine folks will have an "official" response, but here's my 2 cents' worth.

Calling 2 things the "same" is a fluid concept. In some data loader contexts it means the same name, other times the same primaryIdentifier, other times the same primaryAccession, and so on. So your concept of 'sameness' must be specified in every data loader that touches the record. (This was my mental roadblock for a while: I was saying that 2 things were the same in one data loader and expecting them to be the same in another data loader.)
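Concretely, each source carries its own keys file, and "same" can mean something different in each one (file names here are illustrative, patterned on the `*_keys.properties` convention Sam quotes below):

```properties
# sourceA_keys.properties — here "same QTL" means same primaryIdentifier
QTL.key_primaryidentifier=primaryIdentifier

# sourceB_keys.properties — here "same QTL" means same secondaryIdentifier + organism
QTL.key_secondaryidentifier=secondaryIdentifier,organism
```

Nothing forces these two files to agree, which is exactly how the mismatch in this thread arises.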

What happens if you do not say 2 things are the same? You will get 2 records. This is exactly what you're asking for, so you should be happy! This looks like what you did when you loaded file-cmap.

Now what do you want to happen when you load more QTL information from a new data source? You want the integrator to merge that information into an existing record when the keys match. But you already have 2 records with that key. How can it do this? Do you want it to try to merge the 2 existing records into 1, then integrate the new information from this data source? It would be nice for this to happen, but I don't think the integrator can do that. (It would involve chasing down all the records that reference the QTL records, making sure those should not be merged in turn, and so forth.) It's better to be consistent in specifying the relevant primary keys at every step.
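Before rerunning a later source, you can at least check whether a key is already violated in the production database. Something like this (table and column names taken from Sam's psql output below) lists the colliding groups:

```sql
-- Find QTL records that collide on the (secondaryIdentifier, organism) key.
SELECT secondaryidentifier, organismid, count(*)
FROM qtl
GROUP BY secondaryidentifier, organismid
HAVING count(*) > 1;
```

That won't fix the duplicates, but it tells you which sources to go back and fix keys for.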

So, if you want 2 QTLs to be the same based on secondaryIdentifier and organism, specify this in every source where you're loading secondaryIdentifier and organism information.
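In practice that means repeating the same key line in the keys file of every source that touches QTLs (the second file name is made up for illustration):

```properties
# file-cmap_keys.properties
QTL.key_secondaryidentifier=secondaryIdentifier,organism

# file-qtl-markers_keys.properties — any later source that also loads QTLs
QTL.key_secondaryidentifier=secondaryIdentifier,organism
```

With the key present in both, the integrator would have merged the two "Pod maturity 22-3" records at file-cmap load time instead of erroring two sources later.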

You may ask why InterMine does not just impose database constraints for uniqueness, foreign keys, and so forth. It's a performance thing. When you're doing huge bulk inserts, it's much faster to enforce the constraints at the application layer than to have the database check an index on every row.
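A rough sketch of the idea (illustrative only, not InterMine's actual code): the loader keeps an in-memory index of the key values it has seen and merges records on collision, so the bulk insert into the database never needs a UNIQUE index check.

```python
# Sketch of application-layer key enforcement (hypothetical, not InterMine code).
# Records from a data source are merged in memory by their integration key,
# so the subsequent bulk insert needs no database-side UNIQUE constraint.

def load(records, key_fields, index=None):
    """Merge records that share a key; return the index of unique records."""
    if index is None:
        index = {}
    for rec in records:
        key = tuple(rec[f] for f in key_fields)
        if key in index:
            index[key].update(rec)   # merge into the record already seen
        else:
            index[key] = dict(rec)   # first occurrence of this key
    return index

source = [
    {"secondaryIdentifier": "Pod maturity 22-3", "organism": 1,
     "primaryIdentifier": "GmComposite2003_H_Pod maturity 22-3"},
    {"secondaryIdentifier": "Pod maturity 22-3", "organism": 1,
     "primaryIdentifier": "GmComposite2003_C2_Pod maturity 22-3"},
]
merged = load(source, ("secondaryIdentifier", "organism"))
print(len(merged))  # 1 — the two records collapsed onto one key
```

If the key is omitted for one source (as with file-cmap here), that source's records simply never enter the same index, and you get the duplicates Sam observed.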

Joe



On Apr 12, 2016, at 12:35 PM, Sam Hokin wrote:

> I've finally resolved some extreme confusion about enforcement of unique keys defined in DATASOURCE_keys.properties. What I've found is that the production database can be loaded with two records that violate those key definitions without throwing an error; however, if you then run _another_ datasource that attempts to merge with those records, you get an error thrown as follows:
> 
> Problem while loading item identifier 2_1100 because
> Duplicate objects from the same data source; o1 = "QTL:10012630" (in database), o2 = "QTL:10011241" (in database), source1 = "<Source: name="file-cmap", type="null", skeleton=false>", source2 = "<Source: name="file-cmap", type="null", skeleton=false>"
> 
> Fair enough, but file-cmap was run two sources ago! And, yes, there ARE two records that violate the key definitions IN the production database from file-cmap:
> 
> soymine=> select * from qtl where secondaryidentifier='Pod maturity 22-3';
> -[ RECORD 1 ]--------------+-------------------------------------
> id                         | 10011241
> primaryidentifier          | GmComposite2003_H_Pod maturity 22-3
> secondaryidentifier        | Pod maturity 22-3
> sequenceontologytermid     | 10012811
> organismid                 | 10000001
> class                      | org.intermine.model.bio.QTL
> -[ RECORD 2 ]--------------+-------------------------------------
> id                         | 10012630
> primaryidentifier          | GmComposite2003_C2_Pod maturity 22-3
> secondaryidentifier        | Pod maturity 22-3
> sequenceontologytermid     | 10012811
> organismid                 | 10000001
> class                      | org.intermine.model.bio.QTL
> 
> Here are the key definitions that resulted in the error being thrown well after the culprit datasource was run:
> 
> QTL.key_primaryidentifier=primaryIdentifier
> QTL.key_secondaryidentifier=secondaryIdentifier,organism
> 
> The second of these is violated by these two records.
> 
> This seems fundamentally wrong (and very confusing) to me. I should not be able to load duplicate objects into the production database which violate the unique keys that I've defined.
> 
> Is there a reason why the uniqueness rules defined in DATASOURCE_keys.properties aren't applied when loading data from a datasource into the production database, but rather only throw an error during a merge later on? Is there something I can change so that my uniqueness rules are enforced during data loading, so I get an informative error when it happens, rather than down the line when I'm running a totally different data source?
> 
> _______________________________________________
> dev mailing list
> dev at intermine.org
> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev



