[InterMine Dev] how to define LocatedOn attribute of Location and avoid duplicate objects error

Dr. Intikhab Alam intikhab.alam at kaust.edu.sa
Mon Mar 5 16:30:22 GMT 2012

Dear Richard,

Thank you for your email.

On 3/5/12 3:39 PM, "Richard Smith" <richard at flymine.org> wrote:

>On 05/03/2012 15:23, Dr. Intikhab Alam wrote:
>> Dear Richard,
>> Thanks for your email and possible solution.
>> What is the primaryKey on which organism data is merged? Is it taxonId
>> right?
>> My xml sources have the Organism defined only once, there are no
>> occurrences of taxonIds, though I load multiple xml files. My xml files
>> contained information other than taxonIds as well, which the loader
>> perhaps doesn't like, as this information is already in the database when
>> flymine data is loaded.
>Having multiple XML files is the problem.  The data loading code will
>read all of the files in one load so if each file contains an organism
>it will still count as a duplicate.  It would be best to create one file
>with all the data in and only one organism.
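
One quick way to check for the situation Richard describes is to count how often the same taxonId appears across the items XML files that are loaded together. This is only a sketch: `count_taxon_items` is a hypothetical helper name, and it assumes attributes are serialised as `<attribute name="taxonId" value="7227"/>`, which may differ between InterMine versions.

```shell
#!/bin/sh
# Sketch: count occurrences of a taxonId attribute across items XML files.
# ASSUMPTION: attributes look like <attribute name="taxonId" value="7227"/>.
# count_taxon_items is a hypothetical helper, not part of InterMine.

count_taxon_items() {
    taxon="$1"
    shift
    # grep -o prints every match on its own line; wc -l counts the matches.
    # tr strips the padding some wc implementations add.
    cat "$@" | grep -o "name=\"taxonId\" value=\"$taxon\"" | wc -l | tr -d ' '
}
```

For example, `count_taxon_items 7227 cagetags_*.xml` (file names are placeholders) printing a number greater than 1 would suggest that more than one Organism item with that taxonId is present in the files being loaded.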

Here each XML file is a separate source of type large-xml; how can I combine
them into one? For example, there are cageTags files for nucleotide lengths
26, 27 and 28, each for the forward and reverse strand, etc.

>> What do you think?
>> If I recall correctly, when I was loading InterPro domains, if I provided
>> any attribute other than the primaryId, it failed with a duplicate objects
>> error at the data integration stage; when I provided only the primaryId,
>> all went fine.
>> Now I am trying to use only the taxonId for Organism when I format my
>> file, and will try again. It takes a few hours to load the flymine data,
>> and if it fails at the integration stage, I load it again, update the
>> model and try to integrate again, as shown on the "build flymine with own
>> data" page.
>You could speed this up by creating a database copy once the dump is
>loaded up, e.g.
>createdb -T flymine-db flymine-db-backup

I did the following when I loaded flymine for the first time:

pg_dump -c -f updated-cageflymine cageflymine

And to load I do:

cat updated-cageflymine |psql cageflymine -U cageflyminer

Though I got some errors like:

ERROR:  index "utr__symbol_like" does not exist
ERROR:  index "utr__symbol_equals" does not exist
ERROR:  index "utr__secondaryidentifier_like" does not exist
ERROR:  index "utr__secondaryidentifier_equals" does not exist
ERROR:  index "utr__scoretype_like" does not exist
ERROR:  index "utr__scoretype_equals" does not exist
ERROR:  index "utr__score" does not exist
ERROR:  index "utr__primaryidentifier_like" does not exist
ERROR:  index "utr__primaryidentifier_equals" does not exist


>You can only do this if there are no connections to the database but
>it's much faster than loading up a dump.  If anything goes wrong you
>do the same command in reverse to restore from the backup.
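
The template-copy approach Richard describes can be sketched as a pair of shell functions. This is a minimal sketch under stated assumptions: the database names are taken from his example, `run`, `backup_db` and `restore_db` are hypothetical helper names, and the restore step (drop the working database, then recreate it from the backup) is one reading of "the same command in reverse". Setting `DRY_RUN=1` prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: snapshot and restore a PostgreSQL database via template copy.
# run, backup_db and restore_db are hypothetical helpers for illustration.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Copy database $1 into a new database $2.
# createdb -T fails if other sessions are connected to the template database.
backup_db() {
    run createdb -T "$1" "$2"
}

# Replace database $1 with a fresh copy of backup $2.
# ASSUMPTION: dropping and recreating is what "in reverse" means here.
restore_db() {
    run dropdb "$1" && run createdb -T "$2" "$1"
}
```

With `DRY_RUN=1` set, `backup_db flymine-db flymine-db-backup` prints `createdb -T flymine-db flymine-db-backup`, which matches the command quoted above.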
>> Gbrowse on the current flymine release is not working. When I do build a
>> local copy of flymine, do I need to build a local Gbrowse instance? What
>> specifically I am looking for is to view a range of Location from e.g.
>> Dmel genome and see how many of my cageTag clusters appear at or around
>> this location. Furthermore, what is the conservation in other
>> species for the region in question? UCSC shows this conservation track
>> from other Drosophila species. Can we do this for flymine Gbrowse? We
>> already load the orthology information from other species; the only
>> thing missing is the conservation, which is a score for each Dmel position
>> conserved in each of the other Drosophila species.
>GBrowse will be back soon on the current FlyMine.  For your local
>database it sounds like region search would be more useful than using
>GBrowse:  http://www.flymine.org/release-33.0/genomicRegionSearch.do

You mean for given locations I should be able to see overlapping features?

>Once the issues with GBrowse are fixed you'll be able to set up a local
>GBrowse by exporting features from your mine according to these
>instructions:  http://intermine.org/wiki/GBrowseConfiguration

Do you mean that once the issues are fixed at the FlyMine end I can query
the FlyMine GBrowse, or do I need to set one up locally?

Thanks for your help, really.

Best Wishes,

>> Best Wishes,
>> Intikhab
>> On 3/5/12 3:01 PM, "Richard Smith"<richard at flymine.org>  wrote:
>>> Hi Intikhab,
>>> It looks like there are a couple of issues here.
>>> 1. The error message you give below: 'There are duplicate objects in the
>>> source being loaded, multiple items are identical according to the
>>> primary key being used....'
>>> This means that there are two organism items in your XML file with
>>> the same taxonId attribute.  For data loading to work correctly the
>>> items XML should only contain unique items according to the integration
>>> keys you're using.  So your script will need to make sure all items
>>> that reference an organism point to the same item identifier and the
>>> organism is only created once.
>>> 2. Once this issue is fixed the organism you create should merge with
>>> the one already in the database and it should have the existing DataSet
>>> and the one you created in its DataSets collection.
>>> Once the organism problem is fixed let me know if you're still having
>>> issues with DataSets.
>>> Cheers,
>>> Richard.
>>> On 05/03/2012 11:40, Dr. Intikhab Alam wrote:
>>>> Hi,
>>>> If I want to use my own data to describe a Location's locatedOn
>>>> for the Dmel genome, what is the best way to define the locatedOn
>>>> and avoid duplicate objects, e.g. the Organism object?
>>>> I am trying to build flymine with my own data and followed the
>>>> instructions at
>>>> http://intermine.org/wiki/FlyMineOwnData (by using pg_restore). What is
>>>> the best way to define a locatedOn feature for my cageTags data? I can
>>>> read all the flymine data and my cageTags data, but at the integration
>>>> stage I get the duplicated items error:
>>>> /home/intikhab/biosoft/intermine_0_99/imbuild/integrate.xml:54: The
>>>> following error occurred while executing this line:
>>>> /home/intikhab/biosoft/intermine_0_99/imbuild/source.xml:330:
>>>> java.lang.RuntimeException: Exception while dataloading - to allow
>>>> multiple errors, set the property "dataLoader.allowMultipleErrors" to
>>>> true
>>>> Problem while loading item identifier 0_1 because
>>>> There are duplicate objects in the source being loaded, multiple items
>>>> are
>>>> identical according to the primary key being used. Storing again to id
>>>> 1738000001 object from source Organism [commonName="null",
>>>> id="1", name="Drosophila melanogaster", shortName="D. melanogaster",
>>>> species="null", taxonId="7227"]
>>>> When I load my data from a large-xml type source, I declare the
>>>> following so that my cageTags have a locatedOn attribute:
>>>> my ( $taxonid, $longname, $shortname ) =
>>>>     ( "7227", "Drosophila melanogaster", "D. melanogaster" );
>>>>     # get_organism_detail($taxonfile);
>>>> # The Organism item, identified by its taxonId
>>>> my $organism = $doc->add_item(
>>>>     'Organism',
>>>>     'taxonId'   => $taxonid,
>>>>     'name'      => $longname,
>>>>     'shortName' => $shortname,
>>>> );
>>>> # The Chromosome the cageTag clusters are located on
>>>> $chromosome = $doc->add_item(
>>>>     'Chromosome',
>>>>     'primaryIdentifier' => $chr,
>>>>     'dataSets'          => [$data_set_item],
>>>>     #'sequence'         => $CHRseq,
>>>>     #'length'           => $chromlen,
>>>>     'organism'          => $organism,
>>>> );
>>>> # The Location linking a cageTag cluster to its chromosome
>>>> $doc->add_item(
>>>>     'Location',
>>>>     'start'     => $st,
>>>>     'end'       => $end,
>>>>     'strand'    => $strand,
>>>>     'feature'   => $cagecluster,
>>>>     'locatedOn' => $chromosome,
>>>> );
>>>> Obviously, the 'Organism' object would be in the database already,
>>>> before my addition, but how could I use the locatedOn feature so that
>>>> it goes to the right dataSet?
>>>> I have a similar issue with the KEGG data I loaded in my other project,
>>>> redseamine: I cannot see the Pathways widget displayed, apart from one
>>>> section in the Gene report page where it lists the pathways involved but
>>>> displays the source of the data as my project name.
>>>> There I declare the source as:
>>>> # DataSource and DataSet items describing where the pathway data came from
>>>> my $keggdata_source_item =
>>>>     $doc->add_item( DataSource => ( name => 'KEGG' ) );
>>>> my $keggdata_set_item = $doc->add_item(
>>>>     DataSet => (
>>>>         name         => 'KEGG',
>>>>         description  => 'KEGG',
>>>>         'dataSource' => $keggdata_source_item,
>>>>     ),
>>>> );
>>>> # Each Pathway references its genes and the KEGG DataSet
>>>> my $pathway = $doc->add_item(
>>>>     'Pathway',
>>>>     'identifier' => $kmapid,
>>>>     'name'       => $kmapdesc,
>>>>     'genes'      => [$gene],
>>>>     'dataSets'   => [$keggdata_set_item],
>>>> );
>>>> Any help on properly defining the dataSets to avoid duplicate entries?
>>>> Regards,
>>>> Intikhab
>>>> _______________________________________________
>>>> dev mailing list
>>>> dev at intermine.org
>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
