[InterMine Dev] how to define LocatedOn attribute of Location and avoid duplicate objects error

Richard Smith richard at flymine.org
Mon Mar 5 15:39:44 GMT 2012

On 05/03/2012 15:23, Dr. Intikhab Alam wrote:
> Dear Richard,
> Thanks for your email and possible solution.
> What is the primaryKey on which organism data is merged? Is it taxonId
> right?


> My xml sources have the Organism defined only once, there are no duplicate
> occurrences of taxonIds, though I load multiple xml files. My xml files
> contained Information other than taxonIds as well, which perhaps integrate
> doesn't like as this information is already in the database when flymine
> data is loaded.

Having multiple XML files is the problem.  The data loading code will
read all of the files in one load so if each file contains an organism
it will still count as a duplicate.  It would be best to create one file
with all the data in and only one organism.
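As a quick check before loading, something like the following rough sketch counts the Organism items across all of the files in one source. It assumes each <item ...> start tag sits on a single line and carries a class attribute ending in Organism"; the function name is made up for illustration, so adjust the pattern to match what your script actually writes.

```shell
#!/bin/sh
# Sketch: count Organism items across all items XML files of one source.
# Assumption: every <item ...> start tag is on one line and has a class
# attribute whose value ends in Organism (with or without a namespace).
count_organism_items() {
    cat "$@" | grep '<item ' | grep -c 'class="[^"]*Organism"'
}
```

For example, `count_organism_items file1.xml file2.xml` prints a single number. If it is larger than the number of distinct organisms you expect, the same organism is being created more than once.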

> What do you think?
> If I recall correctly when I was loading InterPro Domains, if I provide
> any attribute other than the primaryId, it failed with duplicate objects
> error at the data integration stage; When I provided only the primaryId,
> all went fine.
> Now I am trying to use only the taxonId for Organism when I format my xml
> file and try again. It takes a few hours to load the flymine data and if
> it fails at integration stage, I load it again, update the model and try
> integrate again, as shown in the build flymine with own data page.

You could speed this up by creating a database copy once the dump is
loaded up, e.g.

createdb -T flymine-db flymine-db-backup

You can only do this if there are no connections to the database but
it's much faster than loading up a dump.  If anything goes wrong you
drop the broken database and run the same command in reverse to
restore from the backup.
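The copy-and-restore cycle could be wrapped up roughly like this. It is only a sketch: the database names are the ones from the example above, and the function names and the DRY_RUN switch are invented here for illustration. createdb and dropdb are the standard PostgreSQL command-line tools.

```shell
#!/bin/sh
# Sketch: snapshot and restore a PostgreSQL database via template copies.
# DB and BACKUP default to the names used in the example; override as needed.
DB="${DB:-flymine-db}"
BACKUP="${BACKUP:-${DB}-backup}"

# Run a command, or only print it when DRY_RUN=1 (hypothetical helper).
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy the freshly loaded database; fails if anything is connected to it.
snapshot() {
    run createdb -T "$DB" "$BACKUP"
}

# Discard the failed build and recreate it from the backup copy.
restore() {
    run dropdb "$DB"
    run createdb -T "$BACKUP" "$DB"
}
```

Source the file, call `snapshot` once after the dump is loaded, and call `restore` whenever integration fails. Setting DRY_RUN=1 first prints the commands without touching the database.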

> Gbrowse on the current flymine release is not working. When I do build a
> local copy of flymine, do I need to build a local Gbrowse instance? What
> specifically I am looking for is to view a range of Location from e.g.
> Dmel genome and see how many of my cageTag clusters appear at or around
> this location. Furthermore, what is the conservation from other Drosophila
> species for the region in question. UCSC shows this conservation track
> from other drosophila species. Can we do this for flymine Gbrowse? We
> already load the orthology information from other species, the only thing
> missing is the conservation, which is a score for each Dmel position
> conserved in each of the other Drosophila species.

GBrowse will be back soon on the current FlyMine.  For your local
database it sounds like region search would be more useful than using
GBrowse:  http://www.flymine.org/release-33.0/genomicRegionSearch.do

Once the issues with GBrowse are fixed you'll be able to set up a local
GBrowse by exporting features from your mine according to these
instructions:  http://intermine.org/wiki/GBrowseConfiguration


> Best Wishes,
> Intikhab
> On 3/5/12 3:01 PM, "Richard Smith"<richard at flymine.org>  wrote:
>> Hi Intikhab,
>> It looks like there are a couple of issues here.
>> 1. The error message you quote below: 'There are duplicate objects in the
>> source being loaded, multiple items are identical according to the
>> primary key being used....'
>> This means that there are two organism items in your XML file with
>> the same taxonId attribute.  For data loading to work correctly the
>> items XML should only contain unique items according to the integration
>> keys you're using.  So your script will need to make sure all items that
>> reference an organism point to the same item identifier and the organism
>> is only created once.
>> 2. Once this issue is fixed the organism you create should merge with
>> the one already in the database and it should have the existing DataSet
>> and the one you created in its DataSets collection.
>> Once the organism problem is fixed let me know if you're still having
>> issues with DataSets.
>> Cheers,
>> Richard.
>> On 05/03/2012 11:40, Dr. Intikhab Alam wrote:
>>> Hi,
>>> If I want to use my own data to describe Location's locatedOn attribute
>>> for Dmel genome, what is the best way to define the locatedOn attribute
>>> and avoid duplicate objects e.g. The Organism object.
>>> I am trying to build flymine with my own data and followed the
>>> instructions at
>>> http://intermine.org/wiki/FlyMineOwnData (by using pg_restore). What is
>>> the best way to define a locatedOn feature of my cageTags data. I can
>>> read all the flymine data and my cageTags data but at the integration
>>> stage I get the duplicated items error:
>>> /home/intikhab/biosoft/intermine_0_99/imbuild/integrate.xml:54: The
>>> following error occurred while executing this line:
>>> /home/intikhab/biosoft/intermine_0_99/imbuild/source.xml:330:
>>> java.lang.RuntimeException: Exception while dataloading - to allow
>>> multiple errors, set the property "dataLoader.allowMultipleErrors" to
>>> true
>>> Problem while loading item identifier 0_1 because
>>> There are duplicate objects in the source being loaded, multiple items
>>> are identical according to the primary key being used. Storing again to
>>> id 1738000001 object from source Organism [commonName="null", genus="null",
>>> id="1", name="Drosophila melanogaster", shortName="D. melanogaster",
>>> species="null", taxonId="7227"]
>>> When I load my data from large-xml type source, I declare, to have a
>>> locatedOn attribute of my cageTags:
>>> my ( $taxonid, $longname, $shortname ) =
>>>     ("7227", "Drosophila melanogaster", "D. melanogaster"); #get_organism_detail($taxonfile);
>>> my $organism = $doc->add_item(
>>>         "Organism",
>>>         "taxonId"   =>    $taxonid,
>>>         "name"      =>    $longname,
>>>         "shortName" =>    $shortname
>>> );
>>> $chromosome = $doc->add_item(
>>>         'Chromosome',
>>>         'primaryIdentifier' =>    $chr,
>>>         'dataSets'          =>    [$data_set_item],
>>>         #'sequence'          =>    $CHRseq,
>>>         #'length'            =>    $chromlen,
>>>         'organism'          =>    $organism,
>>> );
>>> $doc->add_item(
>>>         'Location',
>>>         'start'     =>    $st,
>>>         'end'       =>    $end,
>>>         'strand'    =>    $strand,
>>>         'feature'   =>    $cagecluster,
>>>         'locatedOn' =>    $chromosome,
>>> );
>>> Obviously, the 'Organism' object would be in the database already,
>>> before my addition, but how could I use the locatedOn feature that goes
>>> to the right dataSet?
>>> Similar issue with the KEGG data I loaded in my other project,
>>> redseamine: I cannot see the Pathways widget displayed, apart from one
>>> section in my Gene report page where it lists the pathways involved but
>>> displays the source of the data as my project name.
>>> There I declare the source as:
>>> my $keggdata_source_item =
>>>       $doc->add_item( DataSource =>    ( name =>    'KEGG', ), );
>>> my $keggdata_set_item = $doc->add_item(
>>>         DataSet =>    (
>>>             name =>    "KEGG",
>>>             description =>
>>>               "KEGG",
>>>             'dataSource' =>    $keggdata_source_item,
>>>           ),
>>>       );
>>> my $pathway = $doc->add_item(
>>>         'Pathway',
>>>         'identifier' =>    $kmapid,
>>>         'name'       =>    $kmapdesc,
>>>         'genes'      =>    [$gene],
>>>         'dataSets'   =>    [$keggdata_set_item],
>>> );
>>> Any help on properly defining the dataSets to avoid duplicate entries?
>>> Regards,
>>> Intikhab
>>> _______________________________________________
>>> dev mailing list
>>> dev at intermine.org
>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev