[InterMine Dev] doc on primary keys

Sam Hokin shokin at ncgr.org
Thu Apr 28 12:16:08 BST 2016

I had a bit of confusion re. keys between data _sources_ (individual entries in project.xml) versus the data _type_ to which they 
belong (given by the "type" attribute in the project.xml entry). I've got a whole bundle of processors, each in a separate data 
_source_, but they're all under bio/sources/legfed; here are four of them:

     <source name="chado-genomics" type="legfed" dump="true">...</source>
     <source name="chado-genetics" type="legfed" dump="true">...</source>
     <source name="chado-featureprop" type="legfed" dump="true">...</source>
     <source name="chado-go" type="legfed" dump="true">...</source>

I thought from the docs that I could use a single keys file, bio/sources/legfed/resources/legfed_keys.properties, rather than an 
individual one for each source. And it seemed to hold up fairly well for a while. But then I discovered that changes to 
legfed_keys.properties didn't "take" (despite ant clean, etc.), while if I created a separate <source>_keys.properties file for each 
_source_ in legfed, they were read correctly during integration.

It may just have been my misunderstanding of the docs, but I was certainly confused on this one. I do see the advantage of separate 
keys files for each source (for example, when a particular data source provides the secondaryIdentifier, not the primaryIdentifier 
to merge on), but when that's not the case, it'd be super handy to be able to define a default keys file for all _sources_ within a 
_type_. Not a huge deal, but it crossed me up for quite a while.

A second aspect of this is that you have to duplicate keys files when you're using the same processor but a different source 
database - because the source name is different. I'm merging data from two different chado databases (one specifically for peanuts, 
the other for other legumes). So, even though the only difference in the data source definitions is the organisms and the db.name, I 
have to create a new keys file for each re-used source, for example:

     <!-- chado genomics - bean, soybean -->
     <source name="chado-genomics" type="legfed" dump="true">
       <property name="source.db.name" value="tripal"/>
       <property name="organisms" value="3847 3885 3398"/>
       <property name="processors" value="org.intermine.bio.dataconversion.SequenceProcessor"/>
     </source>

     <!-- chado genomics - peanut -->
     <source name="chado-genomics2" type="legfed" dump="true">
       <property name="source.db.name" value="peanutbase"/>
       <property name="organisms" value="3398 3817 3818 130453 130454"/>
       <property name="processors" value="org.intermine.bio.dataconversion.SequenceProcessor"/>
     </source>

This requires a chado-genomics2_keys.properties file as well as the original chado-genomics_keys.properties. If there were a 
properly working default keys file under bio/sources/legfed/resources, I wouldn't have to duplicate the keys file, provided the 
default were sufficient.
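
Until something like a default keys file exists, the duplication could be scripted. What follows is only a minimal sketch of that 
workaround: duplicate_keys_file is a hypothetical helper, not part of InterMine, and it assumes the per-source files are named 
<source>_keys.properties and live next to legfed_keys.properties in the resources directory, as described above.

```python
import shutil
from pathlib import Path


def duplicate_keys_file(resources_dir, default_name, source_names):
    """Copy a default keys file to <source>_keys.properties for each source.

    Hypothetical helper (not part of InterMine). Existing per-source files
    are left untouched so hand-edited keys are never overwritten.
    Returns the list of files that were actually created.
    """
    resources = Path(resources_dir)
    default_file = resources / default_name
    created = []
    for name in source_names:
        target = resources / f"{name}_keys.properties"
        if target.exists():
            # A source with its own keys (e.g. one that merges on
            # secondaryIdentifier) keeps its existing file.
            continue
        shutil.copyfile(default_file, target)
        created.append(target)
    return created


# Example call, using the directory and source names from this message:
# duplicate_keys_file(
#     "bio/sources/legfed/resources",
#     "legfed_keys.properties",
#     ["chado-genomics", "chado-genomics2"],
# )
```

Sources that need different keys simply keep their own file; only the missing ones get a copy of the default.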

Picky, picky, I know. But I like to share my pain with y'all. :)
