[InterMine Dev] How to link genes and proteins from gff3 files and fasta files

Fengyuan Hu fh293 at cam.ac.uk
Thu Aug 29 12:16:28 BST 2013


Hi Sebastien,

It looks like you are missing one step to link genes and proteins. Because 
a gene has a collection of proteins (and vice versa) in the model, you need 
a source that parses gene and protein information together. 
One example is the UniProt 
<http://intermine.readthedocs.org/en/latest/database/data-sources/library/proteins/uniprot/> 
source.

The code that connects them in that source is here: 
https://github.com/intermine/intermine/blob/dev/bio/sources/uniprot/main/src/org/intermine/bio/dataconversion/UniprotConverter.java#L997
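
If you go the custom-loader route instead, the gene for each protein is 
already in your peptide FASTA headers (the gn= field). Below is a minimal 
sketch of the pairing logic only, written in Python for brevity (a real 
InterMine converter would be Java). The function name gene_to_proteins is 
hypothetical, and the header layout ">PROTEIN_ID gn=GENE_ID" is assumed 
from your sample:

```python
import re
from collections import defaultdict

# Assumed header layout, taken from the peptide FASTA sample in this thread:
#   >HaT13l000001_1_AA gn=HaT13l000001
HEADER_RE = re.compile(r"^>(\S+)\s+gn=(\S+)")


def gene_to_proteins(fasta_lines):
    """Map each gene identifier to the protein identifiers that name it.

    Hypothetical helper for illustration; it is not part of InterMine.
    """
    links = defaultdict(list)
    for line in fasta_lines:
        match = HEADER_RE.match(line)
        if match:  # sequence lines and headers without gn= are skipped
            protein_id, gene_id = match.groups()
            links[gene_id].append(protein_id)
    return dict(links)


sample = [
    ">HaT13l000001_1_AA gn=HaT13l000001",
    "TVYLLACLIHFHSSSSLLINNLPQITHTTHLFNFNFFITAVTDTKNPNRKPLKSNRRLIS",
    ">HaT13l000001_2_AA gn=HaT13l000001",
    "MKLRRRRQSEVPPKIKSFINGVIAVPLENIEEPLKSFFWDFDKGDFHHWVDLFNHFDTFFEKYIKPRKDLQLDDGFLESD",
]
links = gene_to_proteins(sample)
print(links)
```

In a converter, each pair would then presumably become a Protein item plus 
a Gene item that carries only its primaryIdentifier, with the gene added to 
the protein's "genes" collection, along the lines of the UniprotConverter 
code linked above. Whether that stub Gene merges with the one loaded from 
the GFF depends on the keys configured for the source, so treat that part 
as an assumption to verify.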


I'd suggest trying the UniProt source first, if sunflower data is available there.

Thanks
Fengyuan

On 29/08/13 09:39, Sebastien Carrere wrote:
> Thanks for your answer. I tried what you said but the problem remains.
> I think the problem is in the GFF file loading step.
> So I started again from scratch, using the default gff source with the 
> Chromosome model:
>
>
> <source name="helianthus-gff" type="gff">
>   <property name="gff3.taxonId" value="4232"/>
>   <property name="gff3.seqDataSourceName" value="HelianthusDB"/>
>   <property name="gff3.dataSourceName" value="HelianthusDB"/>
>   <property name="gff3.seqClsName" value="Chromosome"/> <!-- Seq Ontology Term Camelized -->
>   <property name="gff3.dataSetTitle" value="HaT13l FrameDP predictions"/>
>   <property name="src.data.dir" location="/path/to/data_helianthus/gff3"/>
> </source>
>
> tHaT13l000001    LIPM    gene    1    11546    0    .    .    ID=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
> tHaT13l000001    LIPM    mRNA    1    11546    0    .    .    ID=mRNA:HaT13l000001.HaT13l000001.1;Parent=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
>
>
>
> mymine=# SELECT id from gene where primaryidentifier = 'gene:HaT13l000001.HaT13l000001.1' ;
>    id
> ---------
>  1000007
> (1 row)
>
> mymine=# SELECT id,geneid from mrna where primaryidentifier = 'mRNA:HaT13l000001.HaT13l000001.1' ;
>    id    | geneid
> ---------+--------
>  1000010 |
> (1 row)
>
>
> So the links through the Parent values do not seem to be created, even 
> after the postprocessing steps.
> Any idea what's wrong?
>
>
> Sebastien
>
>
> On 28/08/2013 19:02, Jayaraman, Pushkala wrote:
>>
>> If I'm not wrong, the one difference between your protein FASTA 
>> loader and the UniProt FASTA loader is this line:
>>
>> <property name="fasta.classAttribute" value="primaryAccession"/>
>>
>> From this link:
>>
>> http://intermine.readthedocs.org/en/latest/database/data-sources/library/proteins/uniprot/#fasta
>>
>> I'm not sure if this helps, but would changing that classAttribute value 
>> help?
>>
>> Pushkala
>>
>> From: dev-bounces at intermine.org [mailto:dev-bounces at intermine.org] 
>> On Behalf Of Sebastien Carrere
>> Sent: Wednesday, August 28, 2013 10:35 AM
>> To: dev at intermine.org
>> Subject: [InterMine Dev] How to link genes and proteins from gff3 
>> files and fasta files
>>
>> Hi all,
>>
>> I'm a newbie using InterMine. So first of all sorry if my question is 
>> stupid but I wasn't able to find a clear answer in the doc or the 
>> different examples.
>>
>> Here's my problem. I've got 3 datasources:
>>
>> 1. a *gff* file describing transcripts (gene/mRNA/exon/CDS, possibly 
>> with many mRNAs per gene): I wrote my own GFF3 loader 
>> (gff-bfc) to use the Name attribute as primaryIdentifier for each feature
>> _sample:_
>> ##gff-version 3
>> ##sequence-region HaT13l000001 1 11546
>> tHaT13l000001    LIPM    gene    1    11546    0    .    .    ID=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
>> tHaT13l000001    LIPM    mRNA    1    11546    0    .    .    ID=mRNA:HaT13l000001.HaT13l000001.1;Parent=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
>> tHaT13l000001    FrameDP    exon    1    185    0    +    .    ID=exon:HaT13l000001.HaT13l000001.1;Parent=mRNA:HaT13l000001.HaT13l000001.1;Name=HaT13l000001_1_AA;
>> tHaT13l000001    FrameDP    CDS    1    185    0    +    .    ID=CDS:HaT13l000001_1_AA.HaT13l000001_1_AA.1;Parent=exon:HaT13l000001.HaT13l000001.1;Name=HaT13l000001_1_AA;Note=hypothetical;
>> ##sequence-region HaT13l000002 1 7394
>> tHaT13l000002    LIPM    gene    1    7394    0    .    .    ID=gene:HaT13l000002.HaT13l000002.1;Name=HaT13l000002;Note=OPT;
>> tHaT13l000002    LIPM    mRNA    1    7394    0    .    .    ID=mRNA:HaT13l000002.HaT13l000002.1;Parent=gene:HaT13l000002.HaT13l000002.1;Name=HaT13l000002;Note=OPT;
>> tHaT13l000002    FrameDP    exon    1782    5231    0    +    .    ID=exon:HaT13l000002.HaT13l000002.1;Parent=mRNA:HaT13l000002.HaT13l000002.1;Name=HaT13l000002_3_AA;
>> tHaT13l000002    FrameDP    CDS    1782    5231    0    +    .    ID=CDS:HaT13l000002_3_AA.HaT13l000002_3_AA.1;Parent=exon:HaT13l000002.HaT13l000002.1;Name=HaT13l000002_3_AA;Note=DNA polymerase;
>>
>>
>> 2. a *fasta* file for the *transcripts* sequences
>> _sample:_
>> >tHaT13l000001
>> GCACAGTATACCTTCTTGCTTGTCTAATTCACTTTCATTCTTCATCTTCTCTCTTAATCAACAATCTTCCGCAAATCACA
>> CACACAACACACCTTTTCAATTTCAATTTCTTCATAACCGCCGTAACAGACACAAAAAACCCCAACCGAAAACCCTTGAA
>> ATCGAACCGCCGGTTGATTTCGTAATCTCGATTCGATTGTTTGTTGGTTCGATCAATCAATCGTGACGCTAATCGGTTGT
>> ATACAGTGGTTATTGATTGTACGATTAACGAATGCTTTTTGTTAGGTTTGTTTTAAGATGAAGTTGAGGAGAAGGAGGCA
>> [...]
>>
>>
>> 3. a *fasta* file for the *protein* sequences (CDS translations) (ids 
>> are the same as mRNA:Name or CDS:Name)
>> _sample:_
>> >HaT13l000001_1_AA gn=HaT13l000001
>> TVYLLACLIHFHSSSSLLINNLPQITHTTHLFNFNFFITAVTDTKNPNRKPLKSNRRLIS
>> >HaT13l000001_2_AA gn=HaT13l000001
>> MKLRRRRQSEVPPKIKSFINGVIAVPLENIEEPLKSFFWDFDKGDFHHWVDLFNHFDTFFEKYIKPRKDLQLDDGFLESD
>> PPFPREAVLQILRVVRTILDNCTNKHFYSSYEHHLSSLLASTDADVVEACLQTLSSFLRKSIGKHIARDTSLSSKLFAFA
>>
>>
>> Everything seems to be loaded (I can see gene/mRNA/CDS and proteins 
>> with the right primaryIdentifier in the database) but the link 
>> between genes and proteins is not processed:
>>
>> mymine=# SELECT * from genesproteins ;
>>  proteins | genes
>> ----------+-------
>> (0 rows)
>>
>>
>> So my questions are:
>> - am I doing something wrong (certainly;) ) ?
>> - do I have to write my own FastaLoader and use the 
>> protein.addToCollection("genes", geneRefId) to fill this table ?
>>     - and in this case, what is the method to get the geneRefId from 
>> its primaryIdentifier value ?
>>
>> Here is my project.xml file:
>>
>> <sources>
>>     <source name="helianthus-gff" type="gff-bfc">
>>       <property name="gff3.taxonId" value="4232"/>
>>       <property name="gff3.seqDataSourceName" value="HelianthusDB"/>
>>       <property name="gff3.dataSourceName" value="HelianthusDB"/>
>>       <property name="gff3.seqClsName" value="Transcript"/> <!-- Seq Ontology Term Camelized -->
>>       <property name="gff3.dataSetTitle" value="HaT13l FrameDP predictions"/>
>>       <property name="src.data.dir" location="/path/to/data_helianthus/gff3"/>
>>     </source>
>>     <source name="helianthus-transcripts-fasta" type="fasta">
>>       <property name="fasta.className" value="org.intermine.model.bio.Transcript"/>
>>       <property name="fasta.classAttribute" value="primaryIdentifier"/>
>>       <property name="fasta.dataSourceName" value="HelianthusDB"/>
>>       <property name="fasta.dataSetTitle" value="HelianthusDB transcripts sequences"/>
>>       <property name="fasta.taxonId" value="4232"/>
>>       <property name="fasta.includes" value="*.fna"/>
>>       <property name="src.data.dir" location="/path/to/data_helianthus/fasta/"/>
>>     </source>
>>     <source name="helianthus-peptides-fasta" type="fasta">
>>       <property name="fasta.className" value="org.intermine.model.bio.Protein"/>
>>       <property name="fasta.classAttribute" value="primaryIdentifier"/>
>>       <property name="fasta.sequenceType" value="protein"/>
>>       <property name="fasta.dataSourceName" value="HelianthusDB"/>
>>       <property name="fasta.dataSetTitle" value="HelianthusDB peptide sequences"/>
>>       <property name="fasta.taxonId" value="4232"/>
>>       <property name="fasta.includes" value="*.faa"/>
>>       <property name="src.data.dir" location="/path/to/data_helianthus/fasta"/>
>>     </source>
>>
>> Thanks for your help,
>>
>> Sebastien
>>
>
>
>