[InterMine Dev] how to define LocatedOn attribute of Location and avoid duplicate objects error

Dr. Intikhab Alam intikhab.alam at kaust.edu.sa
Mon Mar 5 15:23:25 GMT 2012


Dear Richard,

Thanks for your email and possible solution.

What is the primary key on which Organism data is merged? Is it taxonId?

My XML sources define the Organism only once, and there are no duplicate
occurrences of taxonIds, though I load multiple XML files. My XML files
contained information other than taxonIds as well, which perhaps the
integrate step doesn't like, as this information is already in the database
when the flymine data is loaded.

What do you think?

If I recall correctly, when I was loading InterPro domains, if I provided
any attribute other than the primaryId, it failed with a duplicate objects
error at the data integration stage; when I provided only the primaryId,
all went fine.

Now I am formatting my XML file to use only the taxonId for Organism and
will try again. It takes a few hours to load the flymine data, and if it
fails at the integration stage, I load it again, update the model and try
to integrate again, as shown on the "build flymine with own data" page.
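
A minimal sketch of that taxonId-only approach, for illustration. It assumes
the Organism is merged on taxonId alone, and it uses a small stand-in class
(SketchDoc, hypothetical) in place of the real items document so that it runs
on its own; organism_item is likewise a hypothetical helper, not part of
InterMine. The idea is just that every item referring to the organism reuses
the same Organism item, which is created once and carries only the key field.

```perl
use strict;
use warnings;

# Stand-in for the real items document so this sketch is self-contained.
# In the actual script, $doc is the items document whose add_item method
# is used elsewhere in this thread.
package SketchDoc;

sub new { return bless { items => [] }, shift }

sub add_item {
    my ( $self, $class, %attrs ) = @_;
    my $id   = '0_' . ( scalar( @{ $self->{items} } ) + 1 );
    my $item = { class => $class, id => $id, %attrs };
    push @{ $self->{items} }, $item;
    return $item;
}

package main;

my $doc = SketchDoc->new;
my %organism_for;    # cache: taxonId => the single Organism item

# Hypothetical helper: create the Organism once per taxonId, key field only.
sub organism_item {
    my ($taxon_id) = @_;
    $organism_for{$taxon_id} //=
      $doc->add_item( 'Organism', taxonId => $taxon_id );
    return $organism_for{$taxon_id};
}

# Every chromosome points at the same Organism item.
for my $chr (qw(2L 2R 3L)) {
    $doc->add_item(
        'Chromosome',
        primaryIdentifier => $chr,
        organism          => organism_item('7227'),
    );
}

my @organisms = grep { $_->{class} eq 'Organism' } @{ $doc->{items} };
print scalar(@organisms), " Organism item(s) written\n";
```

With the three example chromosomes above, only one Organism item ends up in
the document, however many times organism_item('7227') is called.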


GBrowse on the current flymine release is not working. When I build a
local copy of flymine, do I need to build a local GBrowse instance? What
I am specifically looking for is to view a range of Locations from e.g.
the Dmel genome and see how many of my cageTag clusters appear at or around
this location. Furthermore, what is the conservation in other Drosophila
species for the region in question? UCSC shows this conservation track
from other Drosophila species. Can we do this for flymine GBrowse? We
already load the orthology information from other species; the only thing
missing is the conservation, which is a score for each Dmel position
conserved in each of the other Drosophila species.

Best Wishes,

Intikhab



On 3/5/12 3:01 PM, "Richard Smith" <richard at flymine.org> wrote:

>Hi Intikhab,
>It looks like there are a couple of issues here.
>
>1. The error message you included below: 'There are duplicate objects in
>the source being loaded, multiple items are identical according to the
>primary key being used....'
>
>This means that there are two organism items in your XML file with
>the same taxonId attribute.  For data loading to work correctly the
>items XML should only contain unique items according to the integration
>keys you're using.  So your script will need to make sure all items that
>reference an organism point to the same item identifier and the organism
>is only created once.
>
>2. Once this issue is fixed, the organism you create should merge with
>the one already in the database, and it should have the existing DataSet
>and the one you created in its DataSets collection.
>
>Once the organism problem is fixed let me know if you're still having
>issues with DataSets.
>
>Cheers,
>Richard.
>
>
>On 05/03/2012 11:40, Dr. Intikhab Alam wrote:
>> Hi,
>>
>> If I want to use my own data to describe Location's locatedOn attribute
>> for the Dmel genome, what is the best way to define the locatedOn
>> attribute and avoid duplicate objects, e.g. the Organism object?
>>
>> I am trying to build flymine with my own data and followed the
>> instructions at
>> http://intermine.org/wiki/FlyMineOwnData (by using pg_restore). What is
>> the best way to define a locatedOn feature of my cageTags data? I can
>> read all the flymine data and my cageTags data, but at the integration
>> stage I get the duplicated items error:
>>
>> BUILD FAILED
>> /home/intikhab/biosoft/intermine_0_99/imbuild/integrate.xml:54: The
>> following error occurred while executing this line:
>> /home/intikhab/biosoft/intermine_0_99/imbuild/source.xml:330:
>> java.lang.RuntimeException: Exception while dataloading - to allow
>> multiple errors, set the property "dataLoader.allowMultipleErrors" to true
>> Problem while loading item identifier 0_1 because
>> There are duplicate objects in the source being loaded, multiple items are
>> identical according to the primary key being used. Storing again to id
>> 1738000001 object from source Organism [commonName="null", genus="null",
>> id="1", name="Drosophila melanogaster", shortName="D. melanogaster",
>> species="null", taxonId="7227"]
>>
>>
>>
>> When I load my data from a large-xml type source, I declare the following
>> so that my cageTags have a locatedOn attribute:
>>
>>
>> my ( $taxonid, $longname, $shortname ) =
>>     ( "7227", "Drosophila melanogaster", "D. melanogaster" );
>> # get_organism_detail($taxonfile);
>>
>> my $organism = $doc->add_item(
>>        "Organism",
>>        "taxonId"   =>   $taxonid,
>>        "name"      =>   $longname,
>>        "shortName" =>   $shortname
>> );
>>
>>
>>       $chromosome = $doc->add_item(
>>
>>            'Chromosome',
>>            'primaryIdentifier' =>   $chr,
>>            'dataSets'          =>   [$data_set_item],
>>            #'sequence'          =>   $CHRseq,
>>            #'length'            =>   $chromlen,
>>            'organism'          =>   $organism,
>>          );
>>
>>
>>
>>       $doc->add_item(
>>            'Location',
>>                    'start'     => $st,
>>                    'end'       => $end,
>>                    'strand'    => $strand,
>>                    'feature'   => $cagecluster,
>>                    'locatedOn' => $chromosome,
>>            );
>>
>>
>>
>> Obviously, the 'Organism' object would already be in the database before
>> my addition, but how could I use the locatedOn feature so that it goes to
>> the right dataSet?
>>
>>
>> A similar issue occurs with the KEGG data I loaded in my other project,
>> redseamine: I cannot see the Pathways widget displayed, apart from one
>> section in my Gene report page where it lists the pathways involved but
>> displays the source of the data as my project name.
>>
>> There I declare the source as:
>>
>> my $keggdata_source_item =
>>      $doc->add_item( DataSource =>   ( name =>   'KEGG', ), );
>>
>> my $keggdata_set_item = $doc->add_item(
>>        DataSet =>   (
>>            name =>   "KEGG",
>>            description =>
>>              "KEGG",
>>            'dataSource' =>   $keggdata_source_item,
>>
>>          ),
>>      );
>>
>>
>>
>>                    my $pathway = $doc->add_item(
>>                    'Pathway',
>>                            'identifier' => $kmapid,
>>                            'name'       => $kmapdesc,
>>                            'genes'      => [$gene],
>>                            'dataSets'   => [$keggdata_set_item],
>>                    );
>>
>>
>>
>> Any help on properly defining the dataSets to avoid duplicate entries?
>>
>> Regards,
>>
>> Intikhab
>>
>>
>>
>>
>> _______________________________________________
>> dev mailing list
>> dev at intermine.org
>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>
>
>



