[InterMine Dev] how to define LocatedOn attribute of Location and avoid duplicate objects error

Richard Smith richard at flymine.org
Mon Mar 5 15:01:34 GMT 2012


Hi Intikhab,
It looks like there are a couple of issues here.

1. The error message you quote below: 'There are duplicate objects in the
source being loaded, multiple items are identical according to the
primary key being used....'

This means that there are two Organism items in your XML file with
the same taxonId attribute.  For data loading to work correctly, the
items XML should contain only one item for each value of the integration
keys you're using.  So your script needs to create the organism only
once and make sure every item that references an organism points to
that same item identifier.
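
As a minimal sketch of that idea (not your actual script): keep a hash
from taxonId to the Organism item and go through a small helper instead
of calling add_item for the organism each time.  StubDocument, the helper
name organism_item and the chromosome names are hypothetical and exist
only so the example runs on its own; in your script $doc would be the
items document you already use.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the items document so this sketch runs on its
# own. In the real script, $doc is the document object whose add_item
# method is already being called.
package StubDocument;
sub new { return bless { items => [] }, shift }

sub add_item {
    my ( $self, $class, %attrs ) = @_;
    my $id   = '0_' . ( @{ $self->{items} } + 1 );
    my $item = { class => $class, id => $id, %attrs };
    push @{ $self->{items} }, $item;
    return $item;
}

package main;

my $doc = StubDocument->new;
my %organism_for;    # taxonId => the single Organism item created for it

# Return the Organism item for a taxon id, creating it on first use only.
sub organism_item {
    my ($taxon_id) = @_;
    $organism_for{$taxon_id} //=
      $doc->add_item( 'Organism', 'taxonId' => $taxon_id );
    return $organism_for{$taxon_id};
}

# Every chromosome references the same Organism item.
my @chromosomes;
for my $chr (qw(2L 2R X)) {
    push @chromosomes,
      $doc->add_item(
        'Chromosome',
        'primaryIdentifier' => $chr,
        'organism'          => organism_item('7227'),
      );
}

my $organism_count = grep { $_->{class} eq 'Organism' } @{ $doc->{items} };
print "Organism items written: $organism_count\n";
```

With the helper in place the organism is written once no matter how many
chromosomes or features refer to it.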

2. Once this issue is fixed, the organism you create should merge with
the one already in the database, and its dataSets collection should
contain both the existing DataSet and the one you created.

Once the organism problem is fixed let me know if you're still having
issues with DataSets.

Cheers,
Richard.


On 05/03/2012 11:40, Dr. Intikhab Alam wrote:
> Hi,
>
> If I want to use my own data to describe Location's locatedOn attribute
> for Dmel genome, what is the best way to define the locatedOn attribute
> and avoid duplicate objects e.g. The Organism object.
>
> I am trying to build flymine with my own data and followed the
> instructions at
> http://intermine.org/wiki/FlyMineOwnData (by using pg_restore). What is
> the best way to define a locatedOn feature of my cageTags data. I can read
> all the flymine data and my cageTags data but at the integration stage I
> get the duplicated items error:
>
> BUILD FAILED
> /home/intikhab/biosoft/intermine_0_99/imbuild/integrate.xml:54: The
> following error occurred while executing this line:
> /home/intikhab/biosoft/intermine_0_99/imbuild/source.xml:330:
> java.lang.RuntimeException: Exception while dataloading - to allow
> multiple errors, set the property "dataLoader.allowMultipleErrors" to true
> Problem while loading item identifier 0_1 because
> There are duplicate objects in the source being loaded, multiple items are
> identical according to the primary key being used. Storing again to id
> 1738000001 object from source Organism [commonName="null", genus="null",
> id="1", name="Drosophila melanogaster", shortName="D. melanogaster",
> species="null", taxonId="7227"]
>
>
>
> When I load my data from large-xml type source, I declare, to have a
> locatedOn attribute of my cageTags:
>
>
> my ( $taxonid, $longname, $shortname ) =
>   ( "7227", "Drosophila melanogaster", "D. melanogaster" );
> #get_organism_detail($taxonfile);
>
> my $organism = $doc->add_item(
>        "Organism",
>        "taxonId"   =>   $taxonid,
>        "name"      =>   $longname,
>        "shortName" =>   $shortname
> );
>
>
>       $chromosome = $doc->add_item(
>
>            'Chromosome',
>            'primaryIdentifier' =>   $chr,
>            'dataSets'          =>   [$data_set_item],
>            #'sequence'          =>   $CHRseq,
>            #'length'            =>   $chromlen,
>            'organism'          =>   $organism,
>          );
>
>
>
>      $doc->add_item(
>            'Location',
>                    'start'     => $st,
>                    'end'       => $end,
>                    'strand'    => $strand,
>                    'feature'   => $cagecluster,
>                    'locatedOn' => $chromosome,
>            );
>
>
>
> Obviously, The 'Organism' object would be in the database already, before
> my addition but how could I use the locatedOn feature that goes to the
> right dataSet?
>
>
> Similar issue with the KEGG data I loaded in my other project, redseamine:
> I cannot see the Pathways widget displayed, apart from one section in my
> Gene report page where it lists the pathways involved but displays the
> source of the data as my project name.
>
> There I declare the source as:
>
> my $keggdata_source_item =
>      $doc->add_item( DataSource =>   ( name =>   'KEGG', ), );
>
> my $keggdata_set_item = $doc->add_item(
>        DataSet =>   (
>            name =>   "KEGG",
>            description =>
>              "KEGG",
>            'dataSource' =>   $keggdata_source_item,
>
>          ),
>      );
>
>
>
>                    my $pathway = $doc->add_item(
>                    'Pathway',
>                            'identifier' => $kmapid,
>                            'name'       => $kmapdesc,
>                            'genes'      => [$gene],
>                            'dataSets'   => [$keggdata_set_item],
>                    );
>
>
>
> Any help on properly defining the dataSets to avoid duplicate entries?
>
> Regards,
>
> Intikhab
>
>
>
>
> _______________________________________________
> dev mailing list
> dev at intermine.org
> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>



