[InterMine Dev] how to define LocatedOn attribute of Location and avoid duplicate objects error

Dr. Intikhab Alam intikhab.alam at kaust.edu.sa
Mon Mar 5 16:30:22 GMT 2012


Dear Richard,

Thank you for your email.



On 3/5/12 3:39 PM, "Richard Smith" <richard at flymine.org> wrote:

>On 05/03/2012 15:23, Dr. Intikhab Alam wrote:
>> Dear Richard,
>>
>> Thanks for your email and possible solution.
>>
>> What is the primaryKey on which organism data is merged? Is it taxonId
>> right?
>
>Yes.
>
>> My xml sources have the Organism defined only once, there are no duplicate
>> occurrences of taxonIds, though I load multiple xml files. My xml files
>> contained information other than taxonIds as well, which perhaps integrate
>> doesn't like as this information is already in the database when flymine
>> data is loaded.
>
>Having multiple XML files is the problem.  The data loading code will
>read all of the files in one load so if each file contains an organism
>it will still count as a duplicate.  It would be best to create one file
>with all the data in and only one organism.
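
One way to see which files would collide is to count the Organism items in each file before loading. This is only a sketch: it assumes the items XML writes each item's opening tag on a single line in the form <item ... class="Organism" ...>, and the function name and the file names in the usage line are made up for illustration.

```shell
#!/bin/sh
# Print "<count> <file>" for every items XML file given as an argument,
# where <count> is the number of lines that open an Organism item.
# Assumption: each opening tag sits on one line and carries class="Organism".
count_organism_items() {
    for f in "$@"; do
        # grep -c exits non-zero when the count is 0, so ignore its status.
        n=$(grep -c '<item [^>]*class="Organism"' "$f" || true)
        printf '%s %s\n' "$n" "$f"
    done
}
```

Usage, with hypothetical file names: count_organism_items cagetags_26_fwd.xml cagetags_26_rev.xml. A count above 1 in a single file, or a count of 1 in several files that are read in the same load, would match the situation described above.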

Here each xml file is a separate source of type large-xml; how can I combine
them into one? For example, I have cageTags files for nucleotide lengths 26,
27 and 28, each for the forward and reverse strand.


>
>> What do you think?
>>
>> If I recall correctly when I was loading InterPro Domains, if I provide
>> any attribute other than the primaryId, it failed with duplicate objects
>> error at the data integration stage; When I provided only the primaryId,
>> all went fine.
>>
>> Now I am trying to use only the taxonId for Organism when I format my xml
>> file and try again. It takes a few hours to load the flymine data and if
>> it fails at integration stage, I load it again, update the model and try
>> integrate again, as shown in the build flymine with own data page.
>
>You could speed this up by creating a database copy once the dump is
>loaded up, e.g.
>
>createdb -T flymine-db flymine-db-backup


I did the following when I loaded flymine for the first time:

pg_dump -c -f updated-cageflymine cageflymine

And to load I do:

cat updated-cageflymine |psql cageflymine -U cageflyminer

Though I got some errors like:

SET
SET
SET
SET
SET
SET
SET
ERROR:  index "utr__symbol_like" does not exist
ERROR:  index "utr__symbol_equals" does not exist
DROP INDEX
DROP INDEX
ERROR:  index "utr__secondaryidentifier_like" does not exist
ERROR:  index "utr__secondaryidentifier_equals" does not exist
ERROR:  index "utr__scoretype_like" does not exist
ERROR:  index "utr__scoretype_equals" does not exist
ERROR:  index "utr__score" does not exist
ERROR:  index "utr__primaryidentifier_like" does not exist
ERROR:  index "utr__primaryidentifier_equals" does not exist


...
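
The "does not exist" errors most likely come from the -c option: it makes pg_dump write DROP statements ahead of the CREATE statements, and each DROP fails when the object is not present in the target database. Those failures are normally harmless. To check that nothing else went wrong, the ERROR lines can be filtered. This is a sketch that assumes the psql output was saved to a log file (restore.log is a made-up name) and that errors start the line as in the output above; the function name is also made up.

```shell
#!/bin/sh
# Print every ERROR line in a psql log except the "does not exist" errors
# raised by the DROP statements that "pg_dump -c" puts in the dump.
# Prints nothing when those were the only errors.
unexpected_restore_errors() {
    grep '^ERROR' "$1" | grep -v 'does not exist$' || true
}
```

Usage: cat updated-cageflymine | psql cageflymine -U cageflyminer > restore.log 2>&1, then unexpected_restore_errors restore.log. Newer pg_dump versions (PostgreSQL 9.4 and later) also accept --if-exists together with -c, which avoids these messages in the first place.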


>
>You can only do this if there are no connections to the database but
>it's much faster than loading up a dump.  If anything goes wrong you
>do the same command in reverse to restore from the backup.
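
The snapshot and restore cycle described above could be wrapped up roughly as follows. This is a sketch, not a tested procedure: the function names and the DRY_RUN switch are made up for illustration, and the database names in the usage line are the ones from the createdb example. Note that createdb will not overwrite an existing database, so the "reverse" step needs a dropdb first.

```shell
#!/bin/sh
# Snapshot and restore a database by cloning it with "createdb -T".
# Neither step works while there are open connections to the databases.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# snapshot_db LIVE BACKUP: copy LIVE to a new database called BACKUP.
snapshot_db() {
    run createdb -T "$1" "$2"
}

# restore_db LIVE BACKUP: replace LIVE with a fresh copy of BACKUP.
# LIVE is dropped first because createdb refuses to overwrite it.
restore_db() {
    run dropdb "$1" && run createdb -T "$2" "$1"
}
```

Usage: snapshot_db flymine-db flymine-db-backup once the dump is loaded, and restore_db flymine-db flymine-db-backup after a failed integration run.
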
>
>>
>> Gbrowse on the current flymine release is not working. When I do build a
>> local copy of flymine, do I need to build a local Gbrowse instance? What
>> specifically I am looking for is to view a range of Location from e.g.
>> Dmel genome and see how many of my cageTag clusters appear at or around
>> this location. Furthermore, what is the conservation from other Drosophila
>> species for the region in question? UCSC shows this conservation track
>> from other Drosophila species. Can we do this for flymine Gbrowse? We
>> already load the orthology information from other species, the only thing
>> missing is the conservation, which is a score for each Dmel position
>> conserved in each of the other Drosophila species.
>
>GBrowse will be back soon on the current FlyMine.  For your local
>database it sounds like region search would be more useful than using
>GBrowse:  http://www.flymine.org/release-33.0/genomicRegionSearch.do

You mean for given locations I should be able to see overlapping features?

>
>Once the issues with GBrowse are fixed you'll be able to set up a local
>GBrowse by exporting features from your mine according to these
>instructions:  http://intermine.org/wiki/GBrowseConfiguration

Do you mean that once the issues are fixed at the flymine end I can query the
flymine Gbrowse, or will I still need to set one up locally?

Thanks for your help, really.

Best Wishes,

Intikhab
>
>Cheers,
>Richard.
>
>
>> Best Wishes,
>>
>> Intikhab
>>
>>
>>
>> On 3/5/12 3:01 PM, "Richard Smith"<richard at flymine.org>  wrote:
>>
>>> Hi Intikhab,
>>> It looks like there are a couple of issues here.
>>>
>>> 1. The error message you quote below: 'There are duplicate objects in the
>>> source being loaded, multiple items are identical according to the
>>> primary key being used....'
>>>
>>> This means that there are two organism items in your XML file with
>>> the same taxonId attribute.  For data loading to work correctly the
>>> items XML should only contain unique items according to the integration
>>> keys you're using.  So your script will need to make sure all items that
>>> reference an organism point to the same item identifier and the organism
>>> is only created once.
>>>
>>> 2. Once this issue is fixed the organism you create should merge with
>>> the one already in the database and it should have the existing DataSet
>>> and the one you created in its DataSets collection.
>>>
>>> Once the organism problem is fixed let me know if you're still having
>>> issues with DataSets.
>>>
>>> Cheers,
>>> Richard.
>>>
>>>
>>> On 05/03/2012 11:40, Dr. Intikhab Alam wrote:
>>>> Hi,
>>>>
>>>> If I want to use my own data to describe Location's locatedOn attribute
>>>> for Dmel genome, what is the best way to define the locatedOn attribute
>>>> and avoid duplicate objects, e.g. the Organism object?
>>>>
>>>> I am trying to build flymine with my own data and followed the
>>>> instructions at
>>>> http://intermine.org/wiki/FlyMineOwnData (by using pg_restore). What is
>>>> the best way to define a locatedOn feature of my cageTags data? I can
>>>> read all the flymine data and my cageTags data but at the integration
>>>> stage I get the duplicated items error:
>>>>
>>>> BUILD FAILED
>>>> /home/intikhab/biosoft/intermine_0_99/imbuild/integrate.xml:54: The
>>>> following error occurred while executing this line:
>>>> /home/intikhab/biosoft/intermine_0_99/imbuild/source.xml:330:
>>>> java.lang.RuntimeException: Exception while dataloading - to allow
>>>> multiple errors, set the property "dataLoader.allowMultipleErrors" to
>>>> true
>>>> Problem while loading item identifier 0_1 because
>>>> There are duplicate objects in the source being loaded, multiple items
>>>> are
>>>> identical according to the primary key being used. Storing again to id
>>>> 1738000001 object from source Organism [commonName="null",
>>>>genus="null",
>>>> id="1", name="Drosophila melanogaster", shortName="D. melanogaster",
>>>> species="null", taxonId="7227"]
>>>>
>>>>
>>>>
>>>> When I load my data from large-xml type source, I declare, to have a
>>>> locatedOn attribute of my cageTags:
>>>>
>>>>
>>>> my ( $taxonid, $longname, $shortname ) =
>>>>     ("7227", "Drosophila melanogaster", "D. melanogaster"); #get_organism_detail($taxonfile);
>>>>
>>>> my $organism = $doc->add_item(
>>>>         "Organism",
>>>>         "taxonId"   =>    $taxonid,
>>>>         "name"      =>    $longname,
>>>>         "shortName" =>    $shortname
>>>> );
>>>>
>>>>
>>>>        $chromosome = $doc->add_item(
>>>>
>>>>             'Chromosome',
>>>>             'primaryIdentifier' =>    $chr,
>>>>             'dataSets'          =>    [$data_set_item],
>>>>             #'sequence'          =>    $CHRseq,
>>>>             #'length'            =>    $chromlen,
>>>>             'organism'          =>    $organism,
>>>>           );
>>>>
>>>>
>>>>
>>>>        $doc->add_item(
>>>>             'Location',
>>>>             'start'     =>    $st,
>>>>             'end'       =>    $end,
>>>>             'strand'    =>    $strand,
>>>>             'feature'   =>    $cagecluster,
>>>>             'locatedOn' =>    $chromosome,
>>>>           );
>>>>
>>>>
>>>>
>>>> Obviously, The 'Organism' object would be in the database already,
>>>> before
>>>> my addition but how could I use the locatedOn feature that goes to the
>>>> right dataSet?
>>>>
>>>>
>>>> I have a similar issue with the KEGG data I loaded in my other project,
>>>> redseamine: I cannot see the Pathways widget. Pathways only appear in
>>>> one section of my Gene report page, which lists the pathways involved
>>>> but displays the source of the data as my project name.
>>>>
>>>> There I declare the source as:
>>>>
>>>> my $keggdata_source_item =
>>>>       $doc->add_item( DataSource =>    ( name =>    'KEGG', ), );
>>>>
>>>> my $keggdata_set_item = $doc->add_item(
>>>>         DataSet =>    (
>>>>             name =>    "KEGG",
>>>>             description =>
>>>>               "KEGG",
>>>>             'dataSource' =>    $keggdata_source_item,
>>>>
>>>>           ),
>>>>       );
>>>>
>>>>
>>>>
>>>>                     my $pathway = $doc->add_item(
>>>>                     'Pathway',
>>>>                             'identifier' =>$kmapid,
>>>>                             'name' =>$kmapdesc,
>>>>                             'genes' =>[$gene],
>>>>                             'dataSets' =>[$keggdata_set_item],
>>>>
>>>>                     );
>>>>
>>>>
>>>>
>>>> Any help on properly defining the dataSets to avoid duplicate entries?
>>>>
>>>> Regards,
>>>>
>>>> Intikhab
>>>>
>>>>
>>>>
>>>>
>>>> _______________________________________________
>>>> dev mailing list
>>>> dev at intermine.org
>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>>>
>>>
>>>
>>
>>
>


