[InterMine Dev] Merging duplicate entries

Julie Sullivan julie at flymine.org
Mon Mar 11 16:14:10 GMT 2013


Do you have the integration key for Protein set to be "primary identifier"?

http://intermine.readthedocs.org/en/latest/database/database-building/primary-keys
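
To illustrate why the key matters, here is a minimal, self-contained sketch of key-based merging. It is not InterMine's implementation: the class `KeyMergeSketch`, the method `integrate`, and the sample values are hypothetical, and records are modelled as plain maps. The idea it shows is that records from different sources collapse into one object only when a key field is defined and its values match; with no key, each source's record is kept separately, which is the symptom you describe.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only (hypothetical names, not InterMine code).
public class KeyMergeSketch {

    /**
     * Merges records that share the same value for keyField.
     * If keyField is null, or a record lacks that field, the record is kept as-is.
     */
    public static List<Map<String, String>> integrate(
            List<Map<String, String>> records, String keyField) {
        Map<String, Map<String, String>> byKey = new LinkedHashMap<>();
        List<Map<String, String>> result = new ArrayList<>();
        for (Map<String, String> record : records) {
            String key = (keyField == null) ? null : record.get(keyField);
            if (key == null) {
                // No usable key: nothing to match on, so store a separate copy.
                result.add(new LinkedHashMap<>(record));
                continue;
            }
            Map<String, String> merged = byKey.get(key);
            if (merged == null) {
                merged = new LinkedHashMap<>();
                byKey.put(key, merged);
                result.add(merged);
            }
            // Same key value: fold this record's attributes into the existing one.
            merged.putAll(record);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> fromFasta = new LinkedHashMap<>();
        fromFasta.put("primaryIdentifier", "P1"); // sample values, made up
        fromFasta.put("sequence", "MKT");

        Map<String, String> fromCsv = new LinkedHashMap<>();
        fromCsv.put("primaryIdentifier", "P1");
        fromCsv.put("name", "example protein");

        List<Map<String, String>> records = Arrays.asList(fromFasta, fromCsv);

        // With a key on primaryIdentifier the two records merge into one.
        System.out.println(integrate(records, "primaryIdentifier").size()); // prints 1
        // Without a key they remain two separate proteins.
        System.out.println(integrate(records, null).size()); // prints 2
    }
}
```

In the merged case the single record carries both the sequence from the FASTA source and the name from the CSV source.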

On 11/03/13 15:49, Thomas TRIPLET wrote:
> Hello,
> I'm importing two data sources describing proteins. One is a typical FASTA
> file; the other is a CSV file imported using a custom source, which is
> based on examples from FlyMine:
>
>
>     public void process(Reader reader) throws Exception {
>         Iterator<?> lineIter = FormattedTextParser.parseTabDelimitedReader(reader);
>
>         if (lineIter.hasNext()) { // skip the header line
>             lineIter.next();
>         }
>
>         while (lineIter.hasNext()) {
>             try {
>                 String[] line = (String[]) lineIter.next();
>
>                 // Skip lines that are empty or commented out
>                 if (line == null || line[0].startsWith("#")) {
>                     continue;
>                 }
>
>                 String proteinId = line[PROTEIN_IDX];
>                 Item protein = getProtein(proteinId);
>
>                 protein.setAttribute("name", line[NAME_IDX]);
>             } catch (Exception e) {
>                 System.out.println("ERROR occurred while converting aniger-protein-name ("
>                         + e.getMessage() + ")");
>                 e.printStackTrace();
>                 System.exit(-1);
>             }
>         }
>         for (Item protein : proteins.values()) {
>             store(protein);
>         }
>     } // eo process()
>
>     /**
>      * Creates a protein, or fetches it if it already exists.
>      * @param id ID of the protein
>      * @return The protein as an Item
>      */
>     private Item getProtein(String id) throws ObjectStoreException {
>         Item protein = proteins.get(id);
>         if (protein == null) {
>             protein = createItem("Protein");
>             protein.setAttribute("primaryIdentifier", id);
>             proteins.put(id, protein);
>         }
>         return protein;
>     } // eo getProtein()
>
>
> In project.xml, I have:
>
> <source name="aniger-protein-fasta" type="fasta">
>   <property name="fasta.className" value="org.intermine.model.bio.Protein"/>
>   <property name="fasta.classAttribute" value="primaryIdentifier"/>
>   <property name="fasta.sequenceType" value="protein"/>
>   <property name="fasta.dataSourceName" value="CSFG"/>
>   <property name="fasta.dataSetTitle" value="Protein sequences in A. niger"/>
>   <property name="fasta.taxonId" value="5061"/>
>   <property name="fasta.includes" value="Aspni3p4.representatives.faa"/>
>   <property name="src.data.dir" location="/home/intermine/data/csfg/a_niger/"/>
> </source>
> <source name="aniger-protein-name" type="csfg-protein-name">
>   <property name="src.data.dir" location="/home/intermine/data/csfg/a_niger/"/>
>   <property name="src.data.dir.includes" value="Aspni3p4_annotations_wf_march2013.csv"/>
> </source>
>
> The IDs in the two sources match. Yet, after a successful build, the UI shows
> two proteins with the same primaryIdentifier. Is there a way to force
> entities with the same ID to merge?
>
> Thanks
>
>
>
>
>
>
> --
> Thomas Triplet, Jr. Eng., Ph.D.
> http://www.thomastriplet.net
>
>
>
>


