[InterMine Dev] failed data integration for malariamine example

Pengcheng Yang pengchy at gmail.com
Thu Jul 16 04:41:51 BST 2015


Dear Julie,

Thank you for your kind reply. It helped me a great deal.
Following your suggestion, I have successfully deployed Malariamine
and am now working on tutorial section 1.3.2.
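
For anyone who finds this thread later, the fix comes down to renaming the
GFF3 source in project.xml. Below is a minimal sketch of the corrected entry.
It assumes that the name and type should both be "malaria-gff", which is
inferred from the source directory bio/sources/example-sources/malaria-gff
mentioned further down and is not stated explicitly in this thread. The
properties are copied unchanged from the original entry.

```xml
<!-- Sketch only. Assumption: name and type are both "malaria-gff", matching
     the bio/sources/example-sources/malaria-gff directory. The properties
     are copied unchanged from the original gff-malaria entry. -->
<source name="malaria-gff" type="malaria-gff">
  <property name="gff3.taxonId" value="36329"/>
  <property name="gff3.seqClsName" value="Chromosome"/>
  <property name="gff3.dataSourceName" value="PlasmoDB"/>
  <property name="gff3.seqDataSourceName" value="PlasmoDB"/>
  <property name="gff3.dataSetTitle" value="PlasmoDB P.falciparum genome"/>
  <property name="src.data.dir" location="/home/pengchy/Soft/05.SystemBiology/malaria/genome/gff/"/>
</source>
```

With the source named this way, the build system should pick up the keys file
that belongs to that source, so Gene objects from GFF3 and UniProt are merged
on primary identifier.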

Best,
Pengcheng Yang

On 2015/7/15 18:07, Julie Sullivan wrote:
> The data sources UniProt and GFF3 use the same integration keys for 
> the Gene class: primary identifier. This value is the same for each 
> source so integration of the gene objects will work correctly.
>
> In your project XML you have a different data source name for the GFF3 
> source, so the correct integration keys are not being used. If you 
> update your data source to the correct name, the build system will use 
> the associated keys file and data integration will be successful.
>
> Here are the docs on data integration, which might be useful:
>
> http://intermine.readthedocs.org/en/latest/database/database-building/data-integration/ 
>
>
> I hope that helps!
> Julie
>
> On 15/07/15 07:19, Pengcheng Yang wrote:
>> Hi,
>>
>> From the InterMine documentation (pages 27-28), I learned that this
>> problem is caused by gff3 and uniprot using different values for
>> gene.primaryidentifier. By default, the gff3 source takes
>> gene.primaryidentifier from ID and gene.symbol from Name. My question is
>> how to make gene.primaryidentifier come from Name instead. The
>> documentation mentions this briefly, and it seems to involve modifying
>> the MalariaGFF3RecordHandler class in
>> bio/sources/example-sources/malaria-gff/main/src/org/intermine/bio/dataconversion/MalariaGFF3RecordHandler.java.
>>
>> Is there a detailed description of how to modify this file so that
>> gene.primaryidentifier is taken from the Name attribute?
>>
>> Best,
>> Pengcheng Yang
>>
>>
>> On 2015/7/15 9:22, Pengcheng Yang wrote:
>>> Hi,
>>>
>>> I have successfully loaded uniprot and gff3 following the tutorial in
>>> the InterMine documentation. However, when I ran the commands to check
>>> data integration, the results showed that the two data sets were not
>>> integrated through primaryidentifier.
>>>
>>> The psql commands and the contents of my project.xml file are below.
>>>
>>> Best,
>>> Pengcheng Yang
>>>
>>> [1] the psql commands:
>>> malariamine=# select id, primaryidentifier, secondaryidentifier, symbol, length, chromosomeid, chromosomelocationid, organismid from gene where primaryIdentifier = 'PFL1385c';
>>>    id    | primaryidentifier | secondaryidentifier | symbol | length | chromosomeid | chromosomelocationid | organismid
>>> ---------+-------------------+---------------------+--------+--------+--------------+----------------------+------------
>>>  1000581 | PFL1385c          |                     | ABRA   |        |              |                      |    1000026
>>> (1 row)
>>>
>>>
>>>
>>> malariamine=# select * from gene where primaryIdentifier = 'PFL1385c';
>>>  briefdescription | score | description | scoretype |   id    | symbol | length | name | primaryidentifier | secondaryidentifier | upstreamintergenicregionid | downstreamintergenicregionid | sequenceontologytermid | organismid | chromosomelocationid | sequenceid | chromosomeid |            class
>>> ------------------+-------+-------------+-----------+---------+--------+--------+------+-------------------+---------------------+----------------------------+------------------------------+------------------------+------------+----------------------+------------+--------------+------------------------------
>>>                   |       |             |           | 1000581 | ABRA   |        |      | PFL1385c          |                     |                            |                              |                1000081 |    1000026 |                      |            |              | org.intermine.model.bio.Gene
>>> (1 row)
>>>
>>> [2] the project.xml file content:
>>> <project type="bio">
>>>   <property name="target.model" value="genomic"/>
>>>   <property name="source.location" location="../bio/sources/"/>
>>>   <property name="common.os.prefix" value="common"/>
>>>   <property name="intermine.properties.file" value="malariamine.properties"/>
>>>   <property name="default.intermine.properties.file" location="../default.intermine.integrate.properties"/>
>>>   <sources>
>>>     <source name="uniprot-malaria" type="uniprot">
>>>       <property name="uniprot.organisms" value="36329"/>
>>>       <property name="src.data.dir" location="/home/pengchy/Soft/05.SystemBiology/malaria/uniprot/"/>
>>>     </source>
>>>     <source name="go-malaria" type="go">
>>>       <property name="go.organisms" value="36329"/>
>>>       <property name="src.data.dir" location="/home/pengchy/Soft/05.SystemBiology/malaria/go/"/>
>>>     </source>
>>>     <source name="go-annotation-malaria" type="go-annotation">
>>>       <property name="go-annotation.organisms" value="36329"/>
>>>       <property name="src.data.dir" location="/home/pengchy/Soft/05.SystemBiology/malaria/go-annotation/"/>
>>>     </source>
>>>     <source name="malaria-chromosome-fasta" type="fasta">
>>>       <property name="fasta.taxonId" value="36329"/>
>>>       <property name="fasta.dataSourceName" value="PlasmoDB"/>
>>>       <property name="fasta.dataSetTitle" value="PlasmoDB chromosome sequence"/>
>>>       <property name="fasta.className" value="org.intermine.model.bio.Chromosome"/>
>>>       <property name="fasta.sequenceType" value="dna"/>
>>>       <property name="fasta.includes" value="MAL*fasta"/>
>>>       <property name="src.data.dir" location="/home/pengchy/Soft/05.SystemBiology/malaria/genome/fasta/"/>
>>>     </source>
>>>     <source name="gff-malaria" type="gff">
>>>       <property name="gff3.taxonId" value="36329"/>
>>>       <property name="gff3.seqClsName" value="Chromosome"/>
>>>       <property name="gff3.dataSourceName" value="PlasmoDB"/>
>>>       <property name="gff3.seqDataSourceName" value="PlasmoDB"/>
>>>       <property name="gff3.dataSetTitle" value="PlasmoDB P.falciparum genome"/>
>>>       <property name="src.data.dir" location="/home/pengchy/Soft/05.SystemBiology/malaria/genome/gff/"/>
>>>     </source>
>>>     <source name="interpro-malaria" type="interpro">
>>>       <property name="interpro.organisms" value="36329"/>
>>>       <property name="src.data.dir" location="/home/pengchy/Soft/05.SystemBiology/malaria/interpro/"/>
>>>     </source>
>>>     <source name="kegg-pathway-malaria" type="kegg-pathway">
>>>       <property name="kegg-pathway.organisms" value="36329"/>
>>>       <property name="src.data.dir" location="/home/pengchy/Soft/05.SystemBiology/malaria/kegg/"/>
>>>     </source>
>>>   </sources>
>>>
>>>   <post-processing>
>>>   </post-processing>
>>>
>>> </project>
>>
>>
>> _______________________________________________
>> dev mailing list
>> dev at intermine.org
>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>