[InterMine Dev] How to link genes and proteins from gff3 files and fasta files

Sebastien Carrere Sebastien.Carrere at toulouse.inra.fr
Fri Aug 30 10:26:44 BST 2013


Hi Fengyuan,

Thanks for your answer.
I'm looking at writing my own source handler to do such a mapping.
I just thought that this mechanism was done by default in the generic 
GFF3 loader using the SO model.
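
A minimal sketch of the mapping such a handler would need, written in Python for brevity rather than against InterMine's Java API. It assumes the protein FASTA headers follow the `>PROTEIN_ID gn=GENE_ID` pattern from the sample quoted below. `protein_to_gene` and `genes_to_proteins` are hypothetical helper names, not InterMine functions.

```python
import re

# Assumed header layout (taken from the quoted sample):
#   >HaT13l000001_1_AA gn=HaT13l000001
HEADER = re.compile(r"^>(\S+)\s+gn=(\S+)")


def protein_to_gene(fasta_lines):
    """Map each protein identifier to the gene named in its gn= tag."""
    mapping = {}
    for line in fasta_lines:
        match = HEADER.match(line.strip())
        if match:  # sequence lines and headers without gn= are skipped
            protein_id, gene_id = match.groups()
            mapping[protein_id] = gene_id
    return mapping


def genes_to_proteins(mapping):
    """Invert the mapping: one gene may have several proteins."""
    grouped = {}
    for protein_id, gene_id in mapping.items():
        grouped.setdefault(gene_id, []).append(protein_id)
    return grouped


if __name__ == "__main__":
    sample = [
        ">HaT13l000001_1_AA gn=HaT13l000001",
        "TVYLLACLIHFHSSSSLLINNLPQITHTTHLFNFNFFITAVTDTKNPNRKPLKSNRRLIS",
        ">HaT13l000001_2_AA gn=HaT13l000001",
        "MKLRRRRQSEVPPKIKSFINGVIAVPLENIEEPLKSFFWDFDKGDFHHWVDLFNHFDTFFEKYIKPRKDLQLDDGFLESD",
    ]
    print(genes_to_proteins(protein_to_gene(sample)))
```

In a custom converter, each (protein, gene) pair would presumably become a gene reference on the protein item, with both sides keyed on primaryIdentifier.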

Sebastien


On 29/08/2013 13:16, Fengyuan Hu wrote:
> Hi Sebastien,
>
> It looks like you are missing one step to link genes and proteins. Because 
> a gene has a collection of proteins (and vice versa) in the model, you 
> need a source that can parse gene and protein information 
> together. One example is the UniProt source: 
> <http://intermine.readthedocs.org/en/latest/database/data-sources/library/proteins/uniprot/>
>
> The code that connects them is here: 
> https://github.com/intermine/intermine/blob/dev/bio/sources/uniprot/main/src/org/intermine/bio/dataconversion/UniprotConverter.java#L997
>
>
> I'd suggest trying UniProt first, if sunflower data is available there.
>
> Thanks
> Fengyuan
>
> On 29/08/13 09:39, Sebastien Carrere wrote:
>> Thanks for your answer. I tried what you said but the problem remains.
>> I think the problem is in the GFF file loading step.
>> So I started again from scratch, using the default gff source with 
>> the Chromosome model:
>>
>>
>> <source name="helianthus-gff" type="gff">
>>   <property name="gff3.taxonId" value="4232"/>
>>   <property name="gff3.seqDataSourceName" value="HelianthusDB"/>
>>   <property name="gff3.dataSourceName" value="HelianthusDB"/>
>>   <property name="gff3.seqClsName" value="Chromosome"/> <!-- Seq Ontology Term Camelized -->
>>   <property name="gff3.dataSetTitle" value="HaT13l FrameDP predictions"/>
>>   <property name="src.data.dir" location="/path/to/data_helianthus/gff3"/>
>> </source>
>>
>> tHaT13l000001    LIPM    gene    1    11546    0    .    .    ID=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
>> tHaT13l000001    LIPM    mRNA    1    11546    0    .    .    ID=mRNA:HaT13l000001.HaT13l000001.1;Parent=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
>>
>>
>>
>> mymine=# SELECT id from gene where primaryidentifier = 'gene:HaT13l000001.HaT13l000001.1';
>>    id
>> ---------
>>  1000007
>> (1 row)
>>
>> mymine=# SELECT id, geneid from mrna where primaryidentifier = 'mRNA:HaT13l000001.HaT13l000001.1';
>>    id    | geneid
>> ---------+--------
>>  1000010 |
>> (1 row)
>>
>>
>> So the links defined by the Parent values do not seem to be created, 
>> even after the postprocessing steps.
>> Any idea what's wrong?
>>
>>
>> Sebastien
>>
>>
>> On 28/08/2013 19:02, Jayaraman, Pushkala wrote:
>>>
>>> If I'm not wrong, the one difference between your protein FASTA 
>>> loader and the UniProt FASTA loader is this line:
>>>
>>> <property name="fasta.classAttribute" value="primaryAccession"/>
>>>
>>> From this link here:
>>>
>>> http://intermine.readthedocs.org/en/latest/database/data-sources/library/proteins/uniprot/#fasta
>>>
>>> I'm not sure if this helps, but would changing that classAttribute 
>>> value help?
>>>
>>> Pushkala
>>>
>>> From: dev-bounces at intermine.org [mailto:dev-bounces at intermine.org] 
>>> On Behalf Of Sebastien Carrere
>>> Sent: Wednesday, August 28, 2013 10:35 AM
>>> To: dev at intermine.org
>>> Subject: [InterMine Dev] How to link genes and proteins from gff3 
>>> files and fasta files
>>>
>>> Hi all,
>>>
>>> I'm new to InterMine, so first of all, sorry if my question 
>>> is stupid, but I wasn't able to find a clear answer in the docs or the 
>>> different examples.
>>>
>>> Here's my problem. I've got 3 datasources:
>>>
>>> 1. a gff file describing transcripts (gene/mRNA/exon/CDS, with 
>>> possibly many mRNAs per gene): I wrote my own GFF3 loader 
>>> (gff-bfc) to use the Name attribute as the primaryIdentifier of each 
>>> feature
>>> sample:
>>> ##gff-version 3
>>> ##sequence-region HaT13l000001 1 11546
>>> tHaT13l000001    LIPM    gene    1    11546    0    .    .    ID=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
>>> tHaT13l000001    LIPM    mRNA    1    11546    0    .    .    ID=mRNA:HaT13l000001.HaT13l000001.1;Parent=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
>>> tHaT13l000001    FrameDP    exon    1    185    0    +    .    ID=exon:HaT13l000001.HaT13l000001.1;Parent=mRNA:HaT13l000001.HaT13l000001.1;Name=HaT13l000001_1_AA;
>>> tHaT13l000001    FrameDP    CDS    1    185    0    +    .    ID=CDS:HaT13l000001_1_AA.HaT13l000001_1_AA.1;Parent=exon:HaT13l000001.HaT13l000001.1;Name=HaT13l000001_1_AA;Note=hypothetical;
>>> ##sequence-region HaT13l000002 1 7394
>>> tHaT13l000002    LIPM    gene    1    7394    0    .    .    ID=gene:HaT13l000002.HaT13l000002.1;Name=HaT13l000002;Note=OPT;
>>> tHaT13l000002    LIPM    mRNA    1    7394    0    .    .    ID=mRNA:HaT13l000002.HaT13l000002.1;Parent=gene:HaT13l000002.HaT13l000002.1;Name=HaT13l000002;Note=OPT;
>>> tHaT13l000002    FrameDP    exon    1782    5231    0    +    .    ID=exon:HaT13l000002.HaT13l000002.1;Parent=mRNA:HaT13l000002.HaT13l000002.1;Name=HaT13l000002_3_AA;
>>> tHaT13l000002    FrameDP    CDS    1782    5231    0    +    .    ID=CDS:HaT13l000002_3_AA.HaT13l000002_3_AA.1;Parent=exon:HaT13l000002.HaT13l000002.1;Name=HaT13l000002_3_AA;Note=DNA polymerase;
>>>
>>>
>>> 2. a fasta file for the transcript sequences
>>> sample:
>>> >tHaT13l000001
>>> GCACAGTATACCTTCTTGCTTGTCTAATTCACTTTCATTCTTCATCTTCTCTCTTAATCAACAATCTTCCGCAAATCACA
>>> CACACAACACACCTTTTCAATTTCAATTTCTTCATAACCGCCGTAACAGACACAAAAAACCCCAACCGAAAACCCTTGAA
>>> ATCGAACCGCCGGTTGATTTCGTAATCTCGATTCGATTGTTTGTTGGTTCGATCAATCAATCGTGACGCTAATCGGTTGT
>>> ATACAGTGGTTATTGATTGTACGATTAACGAATGCTTTTTGTTAGGTTTGTTTTAAGATGAAGTTGAGGAGAAGGAGGCA
>>> [...]
>>>
>>>
>>> 3. a fasta file for the protein sequences (CDS translations) 
>>> (ids are the same as mRNA:Name or CDS:Name)
>>> sample:
>>> >HaT13l000001_1_AA gn=HaT13l000001
>>> TVYLLACLIHFHSSSSLLINNLPQITHTTHLFNFNFFITAVTDTKNPNRKPLKSNRRLIS
>>> >HaT13l000001_2_AA gn=HaT13l000001
>>> MKLRRRRQSEVPPKIKSFINGVIAVPLENIEEPLKSFFWDFDKGDFHHWVDLFNHFDTFFEKYIKPRKDLQLDDGFLESD
>>> PPFPREAVLQILRVVRTILDNCTNKHFYSSYEHHLSSLLASTDADVVEACLQTLSSFLRKSIGKHIARDTSLSSKLFAFA
>>>
>>>
>>> Everything seems to be loaded (I can see gene/mRNA/CDS and proteins 
>>> with the right primaryIdentifier in the database) but the link 
>>> between genes and proteins is not processed:
>>>
>>> mymine=# SELECT * from genesproteins ;
>>>  proteins | genes
>>> ----------+-------
>>> (0 rows)
>>>
>>>
>>> So my questions are:
>>> - Am I doing something wrong (certainly ;) )?
>>> - Do I have to write my own FastaLoader and call 
>>> protein.addToCollection("genes", geneRefId) to fill this table?
>>>     - If so, how do I get the geneRefId from the gene's 
>>> primaryIdentifier value?
>>>
>>> Here is my project.xml file:
>>>
>>> <sources>
>>>     <source name="helianthus-gff" type="gff-bfc">
>>>       <property name="gff3.taxonId" value="4232"/>
>>>       <property name="gff3.seqDataSourceName" value="HelianthusDB"/>
>>>       <property name="gff3.dataSourceName" value="HelianthusDB"/>
>>>       <property name="gff3.seqClsName" value="Transcript"/> <!-- Seq Ontology Term Camelized -->
>>>       <property name="gff3.dataSetTitle" value="HaT13l FrameDP predictions"/>
>>>       <property name="src.data.dir" location="/path/to/data_helianthus/gff3"/>
>>>     </source>
>>>     <source name="helianthus-transcripts-fasta" type="fasta">
>>>       <property name="fasta.className" value="org.intermine.model.bio.Transcript"/>
>>>       <property name="fasta.classAttribute" value="primaryIdentifier"/>
>>>       <property name="fasta.dataSourceName" value="HelianthusDB"/>
>>>       <property name="fasta.dataSetTitle" value="HelianthusDB transcripts sequences"/>
>>>       <property name="fasta.taxonId" value="4232"/>
>>>       <property name="fasta.includes" value="*.fna"/>
>>>       <property name="src.data.dir" location="/path/to/data_helianthus/fasta/"/>
>>>     </source>
>>>     <source name="helianthus-peptides-fasta" type="fasta">
>>>       <property name="fasta.className" value="org.intermine.model.bio.Protein"/>
>>>       <property name="fasta.classAttribute" value="primaryIdentifier"/>
>>>       <property name="fasta.sequenceType" value="protein"/>
>>>       <property name="fasta.dataSourceName" value="HelianthusDB"/>
>>>       <property name="fasta.dataSetTitle" value="HelianthusDB peptide sequences"/>
>>>       <property name="fasta.taxonId" value="4232"/>
>>>       <property name="fasta.includes" value="*.faa"/>
>>>       <property name="src.data.dir" location="/path/to/data_helianthus/fasta"/>
>>>     </source>
>>> </sources>
>>>
>>> Thanks for your help,
>>>
>>> Sebastien
>>>
>
