[InterMine Dev] Merging duplicate entries

Thomas TRIPLET thomastriplet at gmail.com
Mon Mar 11 15:49:18 GMT 2013


Hello,
I'm importing 2 data sources describing proteins. One is a typical FASTA
file; the other is a CSV file imported using a custom source, which is
based on examples from FlyMine:


    public void process(Reader reader) throws Exception {
        Iterator<?> lineIter = FormattedTextParser.parseTabDelimitedReader(reader);

        // Skip the header line
        if (lineIter.hasNext()) {
            lineIter.next();
        }

        while (lineIter.hasNext()) {
            try {
                String[] line = (String[]) lineIter.next();

                // Skip empty lines and lines that are commented out
                if (line == null || line.length == 0 || line[0].startsWith("#")) {
                    continue;
                }

                String proteinId = line[PROTEIN_IDX];
                Item protein = getProtein(proteinId);
                protein.setAttribute("name", line[NAME_IDX]);
            } catch (Exception e) {
                System.out.println("ERROR occurred while converting aniger-protein-name ("
                        + e.getMessage() + ")");
                e.printStackTrace();
                System.exit(-1);
            }
        }

        // Store every protein collected from the file
        for (Item protein : proteins.values()) {
            store(protein);
        }
    } // eo process()

    /**
     * Creates a protein, or fetches it if it already exists.
     * @param id ID of the protein
     * @return The protein as an Item
     */
    private Item getProtein(String id) throws ObjectStoreException {
        Item protein = proteins.get(id);
        if (protein == null) {
            protein = createItem("Protein");
            protein.setAttribute("primaryIdentifier", id);
            proteins.put(id, protein);
        }
        return protein;
    } // eo getProtein()
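
To rule out duplication inside the converter itself, the create-or-fetch
pattern used by getProtein() can be sketched in isolation as below. This is
only a minimal sketch: ProteinCacheSketch and ProteinStub are hypothetical
stand-ins (not InterMine classes), and the ID "P1" is made up for
illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ProteinCacheSketch {

    /** Hypothetical stand-in for an InterMine Item; not part of the InterMine API. */
    static final class ProteinStub {
        final String primaryIdentifier;
        String name;

        ProteinStub(String primaryIdentifier) {
            this.primaryIdentifier = primaryIdentifier;
        }
    }

    // One entry per protein ID, mirroring the "proteins" map in the converter
    private final Map<String, ProteinStub> proteins = new LinkedHashMap<>();

    /** Returns the cached protein for this ID, creating it on first request. */
    ProteinStub getProtein(String id) {
        return proteins.computeIfAbsent(id, ProteinStub::new);
    }

    int size() {
        return proteins.size();
    }

    public static void main(String[] args) {
        ProteinCacheSketch cache = new ProteinCacheSketch();
        ProteinStub first = cache.getProtein("P1");   // made-up ID
        first.name = "example name";
        ProteinStub second = cache.getProtein("P1");  // same ID again

        System.out.println(first == second); // prints "true"
        System.out.println(cache.size());    // prints "1"
    }
}
```

Because a repeated ID returns the same object, the custom source on its own
should store each protein only once, which suggests the two entries come
from the two sources rather than from this converter.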


In project.xml, I have:

<source name="aniger-protein-fasta" type="fasta">
  <property name="fasta.className" value="org.intermine.model.bio.Protein"/>
  <property name="fasta.classAttribute" value="primaryIdentifier"/>
  <property name="fasta.sequenceType" value="protein"/>
  <property name="fasta.dataSourceName" value="CSFG"/>
  <property name="fasta.dataSetTitle" value="Protein sequences in A. niger"/>
  <property name="fasta.taxonId" value="5061"/>
  <property name="fasta.includes" value="Aspni3p4.representatives.faa"/>
  <property name="src.data.dir" location="/home/intermine/data/csfg/a_niger/"/>
</source>
<source name="aniger-protein-name" type="csfg-protein-name">
  <property name="src.data.dir" location="/home/intermine/data/csfg/a_niger/"/>
  <property name="src.data.dir.includes" value="Aspni3p4_annotations_wf_march2013.csv"/>
</source>

The IDs in the 2 sources match. Yet, after a successful build, the UI shows
2 proteins with the same primaryIdentifier. Is there a way to force
entities with the same ID to merge?

Thanks






--
Thomas Triplet, Jr. Eng., Ph.D.
http://www.thomastriplet.net