[InterMine Dev] failed data integration for malariamine example

Julie Sullivan julie at flymine.org
Wed Jul 15 11:07:24 BST 2015


The data sources UniProt and GFF3 use the same integration keys for the 
Gene class: primary identifier. This value is the same for each source 
so integration of the gene objects will work correctly.

In your project XML the GFF3 source is declared as name="gff-malaria" 
type="gff", which is not the source name the tutorial uses, so the 
correct integration keys are not being used. If you update your data 
source to the correct name, the build system will use the associated 
keys file and data integration will be successful.
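
As a rough sketch, the corrected GFF3 entry might look like the fragment 
below. The name and type "malaria-gff" are an assumption taken from the 
tutorial's example source (the one that lives under 
bio/sources/example-sources/malaria-gff); the properties and the data 
path are copied unchanged from your own project.xml, so adjust them if 
your setup differs.

```xml
<!-- Sketch only: name/type "malaria-gff" is assumed from the tutorial's
     example source; all properties are reused from the original entry. -->
<source name="malaria-gff" type="malaria-gff">
  <property name="gff3.taxonId" value="36329"/>
  <property name="gff3.seqClsName" value="Chromosome"/>
  <property name="gff3.dataSourceName" value="PlasmoDB"/>
  <property name="gff3.seqDataSourceName" value="PlasmoDB"/>
  <property name="gff3.dataSetTitle" value="PlasmoDB P.falciparum genome"/>
  <property name="src.data.dir" location="/home/pengchy/Soft/05.SystemBiology/malaria/genome/gff/"/>
</source>
```

After a change like this the database would need to be rebuilt before 
re-running the psql checks.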

Here are the docs on data integration, which might be useful:

http://intermine.readthedocs.org/en/latest/database/database-building/data-integration/

I hope that helps!
Julie

On 15/07/15 07:19, Pengcheng Yang wrote:
> Hi,
>
> From the InterMine documentation (pages 27-28), I learned that this
> problem is caused by gff3 and uniprot assigning different values to
> gene.primaryidentifier. By default, gff3 takes gene.primaryidentifier
> from the ID attribute and gene.symbol from the Name attribute. My
> question is: how do I set gene.primaryidentifier from Name instead? The
> documentation mentions this problem, and it seems the fix is to modify
> the MalariaGFF3RecordHandler class in
> bio/sources/example-sources/malaria-gff/main/src/org/intermine/bio/dataconversion/MalariaGFF3RecordHandler.java.
> Is there a detailed description of how to modify this file so that
> gene.primaryidentifier is taken from the Name attribute?
>
> Best,
> Pengcheng Yang
>
>
> On 2015/7/15 9:22, Pengcheng Yang wrote:
>> Hi,
>>
>> I have successfully loaded uniprot and gff3 following the tutorial in
>> the InterMine documentation. However, when I run the commands to check
>> data integration, the results show that the two data sets were not
>> integrated through primaryidentifier.
>>
>> The psql commands and the project.xml file are included below.
>>
>> Best,
>> Pengcheng Yang
>>
>> [1] the psql commands:
>> malariamine=# select id, primaryidentifier, secondaryidentifier,
>> symbol, length , chromosomeid, chromosomelocationid, organismid from
>> gene where primaryIdentifier = 'PFL1385c';
>>    id    | primaryidentifier | secondaryidentifier | symbol | length | chromosomeid | chromosomelocationid | organismid
>> ---------+-------------------+---------------------+--------+--------+--------------+----------------------+------------
>>  1000581 | PFL1385c          |                     | ABRA   |        |              |                      |    1000026
>> (1 row)
>>
>>
>>
>> malariamine=# select * from gene where primaryIdentifier = 'PFL1385c';
>>  briefdescription | score | description | scoretype |   id    | symbol | length | name | primaryidentifier | secondaryidentifier | upstreamintergenicregionid | downstreamintergenicregionid | sequenceontologytermid | organismid | chromosomelocationid | sequenceid | chromosomeid |            class
>> ------------------+-------+-------------+-----------+---------+--------+--------+------+-------------------+---------------------+----------------------------+------------------------------+------------------------+------------+----------------------+------------+--------------+------------------------------
>>                   |       |             |           | 1000581 | ABRA   |        |      | PFL1385c          |                     |                            |                              |                1000081 |    1000026 |                      |            |              | org.intermine.model.bio.Gene
>> (1 row)
>>
>> [2] the project.xml file content:
>> <project type="bio">
>>   <property name="target.model" value="genomic"/>
>>   <property name="source.location" location="../bio/sources/"/>
>>   <property name="common.os.prefix" value="common"/>
>>   <property name="intermine.properties.file"
>> value="malariamine.properties"/>
>>   <property name="default.intermine.properties.file"
>> location="../default.intermine.integrate.properties"/>
>>   <sources>
>>                 <source name="uniprot-malaria" type="uniprot">
>>                         <property name="uniprot.organisms"
>> value="36329"/>
>>                         <property name="src.data.dir"
>> location="/home/pengchy/Soft/05.SystemBiology/malaria/uniprot/"/>
>>                 </source>
>>                 <source name="go-malaria" type="go">
>>                         <property name="go.organisms" value="36329"/>
>>                         <property name="src.data.dir"
>> location="/home/pengchy/Soft/05.SystemBiology/malaria/go/"/>
>>                 </source>
>>                 <source name="go-annotation-malaria"
>> type="go-annotation">
>>                         <property name="go-annotation.organisms"
>> value="36329"/>
>>                         <property name="src.data.dir"
>> location="/home/pengchy/Soft/05.SystemBiology/malaria/go-annotation/"/>
>>                 </source>
>>                 <source name="malaria-chromosome-fasta" type="fasta">
>>                         <property name="fasta.taxonId" value="36329"/>
>>                         <property name="fasta.dataSourceName"
>> value="PlasmoDB"/>
>>                         <property name="fasta.dataSetTitle"
>> value="PlasmoDB chromosome sequence"/>
>>                         <property name="fasta.className"
>> value="org.intermine.model.bio.Chromosome"/>
>>                         <property name="fasta.sequenceType" value="dna"/>
>>                         <property name="fasta.includes"
>> value="MAL*fasta"/>
>>                         <property name="src.data.dir"
>> location="/home/pengchy/Soft/05.SystemBiology/malaria/genome/fasta/"/>
>>                 </source>
>>                 <source name="gff-malaria" type="gff">
>>                         <property name="gff3.taxonId" value="36329"/>
>>                         <property name="gff3.seqClsName"
>> value="Chromosome"/>
>>                         <property name="gff3.dataSourceName"
>> value="PlasmoDB"/>
>>                         <property name="gff3.seqDataSourceName"
>> value="PlasmoDB"/>
>>                         <property name="gff3.dataSetTitle"
>> value="PlasmoDB P.falciparum genome"/>
>>                         <property name="src.data.dir"
>> location="/home/pengchy/Soft/05.SystemBiology/malaria/genome/gff/"/>
>>                 </source>
>>                 <source name="interpro-malaria" type="interpro">
>>                         <property name="interpro.organisms"
>> value="36329"/>
>>                         <property name="src.data.dir"
>> location="/home/pengchy/Soft/05.SystemBiology/malaria/interpro/"/>
>>                 </source>
>>                 <source name="kegg-pathway-malaria" type="kegg-pathway">
>>                         <property name="kegg-pathway.organisms"
>> value="36329"/>
>>                         <property name="src.data.dir"
>> location="/home/pengchy/Soft/05.SystemBiology/malaria/kegg/"/>
>>                 </source>
>>
>>
>>   </sources>
>>
>>   <post-processing>
>>
>>
>>
>>   </post-processing>
>>
>> </project>
>
>
> _______________________________________________
> dev mailing list
> dev at intermine.org
> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>


