[InterMine Dev] How to link genes and proteins from gff3 files and fasta files

Fengyuan Hu fh293 at cam.ac.uk
Fri Aug 30 10:34:42 BST 2013


On 30/08/13 10:26, Sebastien Carrere wrote:
> Hi Fengyuan,
>
> Thanks for your answer.
> I'm looking at writing my own source handler to do such a mapping.
> I just thought that this mechanism was done by default in the generic 
> GFF3 loader using the SO model.

Unfortunately the generic GFF3 loader won't do that for you. Please let 
us know if you need help with your parser.
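
If it helps as a starting point, here is a minimal sketch of the first thing such a parser would need: reading the gene identifier out of each protein FASTA header. It assumes the ">PROTEIN_ID gn=GENE_ID" header layout from the sample in your first mail, with nothing between the protein id and the gn= tag. The class and method names (FastaGeneLinks, parse) are made up for illustration, and this is plain Java, not InterMine API.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of InterMine.
final class FastaGeneLinks {
    // Assumed header layout (from the sample in this thread): ">PROTEIN_ID gn=GENE_ID"
    private static final Pattern HEADER = Pattern.compile("^>\\s*(\\S+)\\s+gn=(\\S+)");

    /** Returns protein id -> gene id for every header line that carries a gn= tag. */
    static Map<String, String> parse(Iterable<String> lines) {
        Map<String, String> proteinToGene = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = HEADER.matcher(line);
            if (m.find()) {
                // Sequence lines and headers without gn= simply do not match.
                proteinToGene.put(m.group(1), m.group(2));
            }
        }
        return proteinToGene;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList(
            ">HaT13l000001_1_AA gn=HaT13l000001",
            "TVYLLACLIHFHSSSSLLINNLPQITHTTHLFNFNFFITAVTDTKNPNRKPLKSNRRLIS",
            ">HaT13l000001_2_AA gn=HaT13l000001",
            "MKLRRRRQSEVPPKIKSFINGVIAVPLENIEEPLKSFFWDFDKGDFHHWVDLFNHFDTFFEKYIKPRKDLQLDDGFLESD");
        for (Map.Entry<String, String> e : parse(sample).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Each (protein, gene) pair found this way is what a custom source would then turn into a protein item with the gene's reference id added to its "genes" collection.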

Cheers
Fengyuan

>
> Sebastien
>
>
> On 29/08/2013 13:16, Fengyuan Hu wrote:
>> Hi Sebastien,
>>
>> It looks like you are missing one step to link genes and proteins. 
>> Because a gene has a collection of proteins (and vice versa) in the 
>> model, you need a source that parses genes and proteins together. 
>> One example is the uniprot 
>> <http://intermine.readthedocs.org/en/latest/database/data-sources/library/proteins/uniprot/> 
>> source.
>>
>> Here is the code that connects them: 
>> https://github.com/intermine/intermine/blob/dev/bio/sources/uniprot/main/src/org/intermine/bio/dataconversion/UniprotConverter.java#L997. 
>>
>>
>> I'd suggest trying UniProt first, if sunflower data is available there.
>>
>> Thanks
>> Fengyuan
>>
>> On 29/08/13 09:39, Sebastien Carrere wrote:
>>> Thanks for your answer. I tried what you said but the problem remains.
>>> I think the problem is in the GFF file loading step.
>>> So I started again from scratch, using the default gff source with 
>>> the Chromosome model:
>>>
>>>
>>> <source name="helianthus-gff" type="gff">
>>>   <property name="gff3.taxonId" value="4232"/>
>>>   <property name="gff3.seqDataSourceName" value="HelianthusDB"/>
>>>   <property name="gff3.dataSourceName" value="HelianthusDB"/>
>>>   <property name="gff3.seqClsName" value="Chromosome"/> <!-- Seq Ontology Term Camelized -->
>>>   <property name="gff3.dataSetTitle" value="HaT13l FrameDP predictions"/>
>>>   <property name="src.data.dir" location="/path/to/data_helianthus/gff3"/>
>>> </source>
>>>
>>> tHaT13l000001    LIPM    gene    1    11546    0    .    .    ID=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
>>> tHaT13l000001    LIPM    mRNA    1    11546    0    .    .    ID=mRNA:HaT13l000001.HaT13l000001.1;Parent=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
>>>
>>>
>>>
>>> mymine=# SELECT id from gene where primaryidentifier = 'gene:HaT13l000001.HaT13l000001.1' ;
>>>    id
>>> ---------
>>>  1000007
>>> (1 row)
>>>
>>> mymine=# SELECT id,geneid from mrna where primaryidentifier = 'mRNA:HaT13l000001.HaT13l000001.1' ;
>>>    id    | geneid
>>> ---------+--------
>>>  1000010 |
>>> (1 row)
>>>
>>>
>>> So the links through the Parent values do not seem to be created, 
>>> even after the postprocessing steps.
>>> Any idea what's wrong?
>>>
>>>
>>> Sebastien
>>>
>>>
>>> On 28/08/2013 19:02, Jayaraman, Pushkala wrote:
>>>>
>>>> If I'm not wrong, the one difference between your protein 
>>>> fasta loader and the UniProt fasta loader is this line:
>>>>
>>>> <property name="fasta.classAttribute" value="primaryAccession"/>
>>>>
>>>> From this link here:
>>>>
>>>> http://intermine.readthedocs.org/en/latest/database/data-sources/library/proteins/uniprot/#fasta
>>>>
>>>> Not sure if this helps, but would changing that classAttribute 
>>>> value help?
>>>>
>>>> Pushkala
>>>>
>>>> From: dev-bounces at intermine.org [mailto:dev-bounces at intermine.org] 
>>>> On Behalf Of Sebastien Carrere
>>>> Sent: Wednesday, August 28, 2013 10:35 AM
>>>> To: dev at intermine.org
>>>> Subject: [InterMine Dev] How to link genes and proteins from gff3 
>>>> files and fasta files
>>>>
>>>> Hi all,
>>>>
>>>> I'm a newbie using InterMine. So first of all sorry if my question 
>>>> is stupid but I wasn't able to find a clear answer in the doc or 
>>>> the different examples.
>>>>
>>>> Here's my problem. I've got 3 datasources:
>>>>
>>>> 1. a *gff* file describing transcripts (gene/mRNA/exon/CDS with 
>>>> possibility of many mRNAs per genes): I wrote my own GFF3 loader 
>>>> (gff-bfc) to use the Name attribute as primaryIdentifier for each 
>>>> feature
>>>> _sample:_
>>>> ##gff-version 3
>>>> ##sequence-region HaT13l000001 1 11546
>>>> tHaT13l000001    LIPM    gene    1    11546    0    .    .    ID=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
>>>> tHaT13l000001    LIPM    mRNA    1    11546    0    .    .    ID=mRNA:HaT13l000001.HaT13l000001.1;Parent=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
>>>> tHaT13l000001    FrameDP    exon    1    185    0    +    .    ID=exon:HaT13l000001.HaT13l000001.1;Parent=mRNA:HaT13l000001.HaT13l000001.1;Name=HaT13l000001_1_AA;
>>>> tHaT13l000001    FrameDP    CDS    1    185    0    +    .    ID=CDS:HaT13l000001_1_AA.HaT13l000001_1_AA.1;Parent=exon:HaT13l000001.HaT13l000001.1;Name=HaT13l000001_1_AA;Note=hypothetical;
>>>> ##sequence-region HaT13l000002 1 7394
>>>> tHaT13l000002    LIPM    gene    1    7394    0    .    .    ID=gene:HaT13l000002.HaT13l000002.1;Name=HaT13l000002;Note=OPT;
>>>> tHaT13l000002    LIPM    mRNA    1    7394    0    .    .    ID=mRNA:HaT13l000002.HaT13l000002.1;Parent=gene:HaT13l000002.HaT13l000002.1;Name=HaT13l000002;Note=OPT;
>>>> tHaT13l000002    FrameDP    exon    1782    5231    0    +    .    ID=exon:HaT13l000002.HaT13l000002.1;Parent=mRNA:HaT13l000002.HaT13l000002.1;Name=HaT13l000002_3_AA;
>>>> tHaT13l000002    FrameDP    CDS    1782    5231    0    +    .    ID=CDS:HaT13l000002_3_AA.HaT13l000002_3_AA.1;Parent=exon:HaT13l000002.HaT13l000002.1;Name=HaT13l000002_3_AA;Note=DNA polymerase;
>>>>
>>>>
>>>> 2. a *fasta* file for the *transcripts* sequences
>>>> _sample:_
>>>> >tHaT13l000001
>>>> GCACAGTATACCTTCTTGCTTGTCTAATTCACTTTCATTCTTCATCTTCTCTCTTAATCAACAATCTTCCGCAAATCACA
>>>> CACACAACACACCTTTTCAATTTCAATTTCTTCATAACCGCCGTAACAGACACAAAAAACCCCAACCGAAAACCCTTGAA
>>>> ATCGAACCGCCGGTTGATTTCGTAATCTCGATTCGATTGTTTGTTGGTTCGATCAATCAATCGTGACGCTAATCGGTTGT
>>>> ATACAGTGGTTATTGATTGTACGATTAACGAATGCTTTTTGTTAGGTTTGTTTTAAGATGAAGTTGAGGAGAAGGAGGCA
>>>> [...]
>>>>
>>>>
>>>> 3. a *fasta* file for the *protein* sequences (CDS translations) 
>>>> (ids are the same as mRNA:Name or CDS:Name)
>>>> _sample:_
>>>> >HaT13l000001_1_AA gn=HaT13l000001
>>>> TVYLLACLIHFHSSSSLLINNLPQITHTTHLFNFNFFITAVTDTKNPNRKPLKSNRRLIS
>>>> >HaT13l000001_2_AA gn=HaT13l000001
>>>> MKLRRRRQSEVPPKIKSFINGVIAVPLENIEEPLKSFFWDFDKGDFHHWVDLFNHFDTFFEKYIKPRKDLQLDDGFLESD
>>>> PPFPREAVLQILRVVRTILDNCTNKHFYSSYEHHLSSLLASTDADVVEACLQTLSSFLRKSIGKHIARDTSLSSKLFAFA
>>>>
>>>>
>>>> Everything seems to be loaded (I can see gene/mRNA/CDS and proteins 
>>>> with the right primaryIdentifier in the database) but the link 
>>>> between genes and proteins is not processed:
>>>>
>>>> mymyne=# SELECT * from  genesproteins ;
>>>>  proteins | genes
>>>> ----------+-------
>>>> (0 rows)
>>>>
>>>>
>>>> So my questions are:
>>>> - am I doing something wrong (certainly;) ) ?
>>>> - do I have to write my own FastaLoader and use the 
>>>> protein.addToCollection("genes", geneRefId) to fill this table ?
>>>>     - and in this case, what is the method to get the geneRefId 
>>>> from its primaryIdentifier value ?
>>>>
>>>> Here is my project.xml file:
>>>>
>>>> <sources>
>>>>     <source name="helianthus-gff" type="gff-bfc">
>>>>       <property name="gff3.taxonId" value="4232"/>
>>>>       <property name="gff3.seqDataSourceName" value="HelianthusDB"/>
>>>>       <property name="gff3.dataSourceName" value="HelianthusDB"/>
>>>>       <property name="gff3.seqClsName" value="Transcript"/> <!-- Seq Ontology Term Camelized -->
>>>>       <property name="gff3.dataSetTitle" value="HaT13l FrameDP predictions"/>
>>>>       <property name="src.data.dir" location="/path/to/data_helianthus/gff3"/>
>>>>     </source>
>>>>     <source name="helianthus-transcripts-fasta" type="fasta">
>>>>       <property name="fasta.className" value="org.intermine.model.bio.Transcript"/>
>>>>       <property name="fasta.classAttribute" value="primaryIdentifier"/>
>>>>       <property name="fasta.dataSourceName" value="HelianthusDB"/>
>>>>       <property name="fasta.dataSetTitle" value="HelianthusDB transcripts sequences"/>
>>>>       <property name="fasta.taxonId" value="4232"/>
>>>>       <property name="fasta.includes" value="*.fna"/>
>>>>       <property name="src.data.dir" location="/path/to/data_helianthus/fasta/"/>
>>>>     </source>
>>>>     <source name="helianthus-peptides-fasta" type="fasta">
>>>>       <property name="fasta.className" value="org.intermine.model.bio.Protein"/>
>>>>       <property name="fasta.classAttribute" value="primaryIdentifier"/>
>>>>       <property name="fasta.sequenceType" value="protein"/>
>>>>       <property name="fasta.dataSourceName" value="HelianthusDB"/>
>>>>       <property name="fasta.dataSetTitle" value="HelianthusDB peptide sequences"/>
>>>>       <property name="fasta.taxonId" value="4232"/>
>>>>       <property name="fasta.includes" value="*.faa"/>
>>>>       <property name="src.data.dir" location="/path/to/data_helianthus/fasta"/>
>>>>     </source>
>>>> </sources>
>>>>
>>>> Thanks for your help,
>>>>
>>>> Sebastien
>>>>
>>>
>>>
>>>
>>
>
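
On the original question of how to get a gene's reference id from its primaryIdentifier value: a common approach in a custom source is to keep a map from identifier to reference id and only create the gene the first time an identifier is seen. Below is a minimal, self-contained sketch of that idea. RefIdCache and refIdFor are hypothetical names, and the creation step is passed in as a plain function; in a real InterMine converter that step would presumably create and store a Gene item and return its identifier, which is an assumption here rather than something shown in this thread.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper, not part of InterMine.
final class RefIdCache {
    private final Map<String, String> refIds = new HashMap<>();
    private final Function<String, String> createAndStore;

    // createAndStore: given a primaryIdentifier, creates the object once
    // and returns its reference id (stand-in for the converter's own logic).
    RefIdCache(Function<String, String> createAndStore) {
        this.createAndStore = createAndStore;
    }

    /** Returns the reference id for an identifier, creating it only on first use. */
    String refIdFor(String primaryIdentifier) {
        return refIds.computeIfAbsent(primaryIdentifier, createAndStore);
    }

    public static void main(String[] args) {
        int[] created = {0};
        // Fake creation step that just hands out sequential ids.
        RefIdCache genes = new RefIdCache(id -> "gene_ref_" + (++created[0]));

        String first = genes.refIdFor("HaT13l000001");
        String again = genes.refIdFor("HaT13l000001"); // served from the cache
        String other = genes.refIdFor("HaT13l000002");

        System.out.println(first + " " + again + " " + other);
        System.out.println("genes created: " + created[0]);
    }
}
```

The reference id returned here is the kind of value that would be passed to protein.addToCollection("genes", geneRefId), so two proteins that name the same gene end up pointing at a single gene object.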
