[InterMine Dev] failure to merge entities from two processors with same primaryIdentifier

Sam Hokin shokin at ncgr.org
Wed Dec 30 21:34:46 GMT 2015


Hi, devs. I know this is an old topic, but I've not been able to figure out how to fix it in my case. I've got a data source, 
chado-legfed, with two processors running off of a chado database: GeneticMapProcessor, which processes genetic data, and a spin of 
Kim Rutherford's SequenceProcessor, which processes genomic data for the specific and rather weird chado that we're running (Legume 
Federation).

The one common element between genetics and genomics is GeneticMarker, which extends SequenceFeature as follows:

<!-- genetic markers are both genetic and genomic -->
<class name="GeneticMarker" extends="SequenceFeature" is-interface="true">
     <!-- these from featurepos -->
     <collection name="linkageGroupPositions" referenced-type="LinkageGroupPosition"/>
</class>

It's not important for this discussion, but if you're curious, I'm locating genetic markers on linkage groups (pieces of genetic 
maps) using a LinkageGroupPosition entity which defines a centiMorgan position on the underlying LinkageGroup. These are purely 
genetic data:

<class name="LinkageGroupPosition" is-interface="true">
     <!-- these from featurepos -->
     <attribute name="position" type="java.lang.Double"/>
     <reference name="linkageGroup" referenced-type="LinkageGroup"/>
</class>

<class name="LinkageGroup" extends="BioEntity" is-interface="true">
     <attribute name="length" type="java.lang.Double"/>
     <reference name="geneticMap" referenced-type="GeneticMap" reverse-reference="linkageGroups"/>
     <collection name="geneticMarkers" referenced-type="GeneticMarker"/>
     <reference name="sequenceOntologyTerm" referenced-type="SOTerm"/>
</class>

So, I run GeneticMapProcessor, which creates the geneticmarker records along with the geneticmarkerlinkagegrouppositions records 
flawlessly. There is no issue with GeneticMapProcessor that I can see. (It does other genetic stuff as well, all fine.)

Then, I run SequenceProcessor, which in addition to all the strictly genomic data, creates geneticmarker records with sequences and 
locations on chromosomes, as it should.

I've declared the primary key for GeneticMarker in chado-legfed/resources/chado-legfed_keys.properties as follows:

GeneticMap.key_primaryidentifier=primaryIdentifier

The duplicate geneticmarker records indeed have the same primaryidentifier value; sequenceid is missing from the 
GeneticMapProcessor-created records as expected and name is missing from the SequenceProcessor-created records:

   id    |   primaryidentifier    |   secondaryidentifier    |   name   | sequenceid
---------+------------------------+--------------------------+----------+------------
 1009888 | 118M3-phavu            | 118M3                    | 118M3    |
 2026865 | 118M3-phavu            | 118M3                    |          |    2026866
 1009889 | 11M1-phavu             | 11M1                     | 11M1     |
 2026868 | 11M1-phavu             | 11M1                     |          |    2026869
 1009082 | 11M-Gm-phavu           | 11M-Gm                   | 11M-Gm   |
 2024452 | 11M-Gm-phavu           | 11M-Gm                   |          |    2024453

This is after all post-processing.

What else do I need to do to get the geneticmarker records from the two processors merged, while preserving the relationships in 
geneticmarkerlinkagegrouppositions (from GeneticMapProcessor) and the DNA sequences and locations (from SequenceProcessor)? I do have 
genomic priorities set as follows:

GeneticMarker.name = bean-genetics, *
GeneticMarker.secondaryIdentifier = bean-genetics, *
GeneticMarker.length = bean-genomics, *

But that has no effect.
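
To make the goal concrete, here is a rough Java sketch of the merge I'm expecting. It is NOT InterMine code and does not use any InterMine API: the class MarkerMergeSketch, the Row type and the method names are all made up for illustration, and I'm assuming that bean-genetics and bean-genomics are the source names attached to the GeneticMapProcessor and SequenceProcessor output respectively. Rows are grouped on primaryIdentifier, and for each field the source listed earliest in the priorities wins, with "*" standing for any other source.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: a hypothetical stand-in for the merge I expect, not InterMine code.
public class MarkerMergeSketch {

    /** One geneticmarker row as written by a single processor (hypothetical type). */
    static final class Row {
        final String source;               // assumed source name, e.g. "bean-genetics"
        final Map<String, String> fields;  // only the non-empty columns

        Row(String source, String... keyValuePairs) {
            this.source = source;
            this.fields = new LinkedHashMap<>();
            for (int i = 0; i + 1 < keyValuePairs.length; i += 2) {
                fields.put(keyValuePairs[i], keyValuePairs[i + 1]);
            }
        }
    }

    /** Lower rank wins; "*" matches any source; fields with no priority rank last. */
    static int rank(List<String> order, String source) {
        if (order == null) {
            return Integer.MAX_VALUE;
        }
        int idx = order.indexOf(source);
        if (idx < 0) {
            idx = order.indexOf("*");
        }
        return idx < 0 ? Integer.MAX_VALUE : idx;
    }

    /** Group rows on primaryIdentifier and keep, per field, the best-ranked value. */
    static Map<String, Map<String, String>> mergeByPrimaryIdentifier(
            List<Row> rows, Map<String, List<String>> priorities) {
        Map<String, Map<String, String>> merged = new LinkedHashMap<>();
        Map<String, Map<String, Integer>> bestRank = new HashMap<>();
        for (Row row : rows) {
            String key = row.fields.get("primaryIdentifier");
            Map<String, String> target = merged.computeIfAbsent(key, k -> new LinkedHashMap<>());
            Map<String, Integer> ranks = bestRank.computeIfAbsent(key, k -> new HashMap<>());
            for (Map.Entry<String, String> e : row.fields.entrySet()) {
                int r = rank(priorities.get(e.getKey()), row.source);
                Integer current = ranks.get(e.getKey());
                // The first value seen is kept unless a better-ranked source supplies one.
                if (current == null || r < current) {
                    target.put(e.getKey(), e.getValue());
                    ranks.put(e.getKey(), r);
                }
            }
        }
        return merged;
    }

    /** The two 118M3-phavu rows from the table above, with my priorities. */
    static Map<String, Map<String, String>> demo() {
        Map<String, List<String>> priorities = new HashMap<>();
        priorities.put("name", Arrays.asList("bean-genetics", "*"));
        priorities.put("secondaryIdentifier", Arrays.asList("bean-genetics", "*"));
        priorities.put("length", Arrays.asList("bean-genomics", "*"));
        List<Row> rows = Arrays.asList(
            new Row("bean-genetics", "primaryIdentifier", "118M3-phavu",
                    "secondaryIdentifier", "118M3", "name", "118M3"),
            new Row("bean-genomics", "primaryIdentifier", "118M3-phavu",
                    "secondaryIdentifier", "118M3", "sequenceid", "2026866"));
        return mergeByPrimaryIdentifier(rows, priorities);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In other words, I'd expect a single geneticmarker record per primaryIdentifier, carrying both the name from the genetic side and the sequenceid from the genomic side, rather than the two half-filled rows shown above.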

Thanks! And Happy New Year!
Sam Hokin


