[InterMine Dev] Merging duplicate entries

Thomas TRIPLET thomastriplet at gmail.com
Mon Mar 11 15:49:18 GMT 2013

I'm importing two data sources describing proteins. One is a typical FASTA
file; the other is a CSV file imported using a custom source, which is
based on examples from FlyMine:

    public void process(Reader reader) throws Exception {
        Iterator<?> lineIter = FormattedTextParser.parseTabDelimitedReader(reader);

        if (lineIter.hasNext()) { // skip first header line
            lineIter.next();
        }

        while (lineIter.hasNext()) {
            try {
                String[] line = (String[]) lineIter.next();

                // Make sure the line isn't empty or commented out
                if (line == null || line[0].startsWith("#")) {
                    continue;
                }

                String proteinId = line[PROTEIN_IDX];
                Item protein = getProtein(proteinId);

                protein.setAttribute("name", line[NAME_IDX]);
            } catch (Exception e) {
                System.out.println("ERROR occurred while converting aniger-protein-name ("
                        + e.getMessage() + ")");
            }
        }

        // Store every protein collected while reading the file
        for (Item protein : proteins.values()) {
            store(protein);
        }
    } // eo process()

    /**
     * Creates a protein or fetches it if it exists
     * @param id ID of the protein
     * @return The protein as an Item
     */
    private Item getProtein(String id) throws ObjectStoreException {
        Item protein = proteins.get(id);
        if (protein == null) {
            protein = createItem("Protein");
            protein.setAttribute("primaryIdentifier", id);
            proteins.put(id, protein);
        }
        return protein;
    } // eo getProtein()
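
For readers unfamiliar with the pattern, the get-or-create lookup in getProtein() can be sketched in plain Java as below. This is only an illustration: ProteinRecord and ProteinCache are hypothetical stand-ins, not InterMine classes, and the real converter works with Item objects. Note that a map like this only prevents duplicates within a single converter run; it has no knowledge of records loaded by another source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for an InterMine Item; holds only what the sketch needs.
class ProteinRecord {
    final Map<String, String> attributes = new HashMap<>();

    void setAttribute(String name, String value) {
        attributes.put(name, value);
    }
}

// Hypothetical cache mirroring the "proteins" map used by the converter.
class ProteinCache {
    private final Map<String, ProteinRecord> proteins = new HashMap<>();

    // Returns the cached record for this ID, creating it on first use.
    ProteinRecord getProtein(String id) {
        return proteins.computeIfAbsent(id, key -> {
            ProteinRecord created = new ProteinRecord();
            created.setAttribute("primaryIdentifier", key);
            return created;
        });
    }

    int size() {
        return proteins.size();
    }

    public static void main(String[] args) {
        ProteinCache cache = new ProteinCache();
        ProteinRecord first = cache.getProtein("P1");
        ProteinRecord second = cache.getProtein("P1");
        System.out.println(first == second); // true: the same instance is returned
        System.out.println(cache.size());    // 1: only one record was created
    }
}
```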

In project.xml, I have:

<source name="aniger-protein-fasta" type="fasta">
  <property name="fasta.className" value="org.intermine.model.bio.Protein"/>
  <property name="fasta.classAttribute" value="primaryIdentifier"/>
  <property name="fasta.sequenceType" value="protein"/>
  <property name="fasta.dataSourceName" value="CSFG"/>
  <property name="fasta.dataSetTitle" value="Protein sequences in A. niger"/>
  <property name="fasta.taxonId" value="5061"/>
  <property name="fasta.includes" value="Aspni3p4.representatives.faa"/>
  <property name="src.data.dir"
<source name="aniger-protein-name" type="csfg-protein-name">
  <property name="src.data.dir"
  <property name="src.data.dir.includes"

The IDs in the two sources match. Yet after a successful build, the UI shows
two proteins with the same primaryIdentifier. Is there a way to force
entities with the same ID to merge?
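
To make the expected outcome concrete, here is a minimal sketch of what merging on a shared identifier would mean. It is plain Java and purely conceptual: it is not InterMine's integration mechanism, MergeByIdentifier is a hypothetical class, and the attribute values are made up for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration: combine per-source attribute maps by primaryIdentifier.
class MergeByIdentifier {

    // Each input maps primaryIdentifier -> (attribute name -> value).
    // Records sharing an identifier end up as one entry holding all attributes.
    static Map<String, Map<String, String>> merge(
            Map<String, Map<String, String>> fastaSource,
            Map<String, Map<String, String>> csvSource) {
        Map<String, Map<String, String>> merged = new LinkedHashMap<>();
        for (Map<String, Map<String, String>> source : Arrays.asList(fastaSource, csvSource)) {
            for (Map.Entry<String, Map<String, String>> entry : source.entrySet()) {
                merged.computeIfAbsent(entry.getKey(), id -> new LinkedHashMap<>())
                      .putAll(entry.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up sample data: one protein seen by both sources.
        Map<String, Map<String, String>> fasta =
                Collections.singletonMap("P1", Collections.singletonMap("sequence", "MKV"));
        Map<String, Map<String, String>> csv =
                Collections.singletonMap("P1", Collections.singletonMap("name", "example protein"));

        // Prints {P1={sequence=MKV, name=example protein}}
        System.out.println(merge(fasta, csv));
    }
}
```

The desired result is the single combined entry, rather than two separate Protein objects that each carry only part of the data.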


Thomas Triplet, Jr. Eng., Ph.D.