[InterMine Dev] How to link genes and proteins from gff3 files and fasta files

Sebastien Carrere Sebastien.Carrere at toulouse.inra.fr
Thu Aug 29 09:39:36 BST 2013


Thanks for your answer. I tried what you said but the problem remains.
I think the problem is in the GFF file loading step.
So I started again from scratch, using the default gff source with the 
Chromosome model:


<source name="helianthus-gff" type="gff">
   <property name="gff3.taxonId" value="4232"/>
   <property name="gff3.seqDataSourceName" value="HelianthusDB"/>
   <property name="gff3.dataSourceName" value="HelianthusDB"/>
   <property name="gff3.seqClsName" value="Chromosome"/> <!-- Seq Ontology Term Camelized -->
   <property name="gff3.dataSetTitle" value="HaT13l FrameDP predictions"/>
   <property name="src.data.dir" location="/path/to/data_helianthus/gff3"/>
</source>

tHaT13l000001    LIPM    gene    1    11546    0    .    .    ID=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
tHaT13l000001    LIPM    mRNA    1    11546    0    .    .    ID=mRNA:HaT13l000001.HaT13l000001.1;Parent=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;



mymine=# SELECT id from gene where primaryidentifier = 'gene:HaT13l000001.HaT13l000001.1';
    id
---------
  1000007
(1 row)

mymine=# SELECT id,geneid from mrna where primaryidentifier = 'mRNA:HaT13l000001.HaT13l000001.1';
    id    | geneid
---------+--------
  1000010 |
(1 row)


So the links defined by the Parent values do not seem to be created: the
mRNA row has an empty geneid, even after the postprocessing steps.
Any idea what's wrong?
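
One thing that can be ruled out without InterMine is whether every Parent
value in the file matches an ID declared in the same file. A minimal sketch
in Python, assuming a tab-separated, nine-column GFF3 file; the function
names are hypothetical and nothing here uses the InterMine API:

```python
def parse_attributes(column9):
    """Split a GFF3 attributes column into a dict of key -> value."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_unresolved_parents(gff_lines):
    """Return (feature ID, parent ID) pairs whose parent ID is never declared."""
    declared = set()
    references = []
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives and comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # not a well-formed feature line
        attrs = parse_attributes(columns[8])
        feature_id = attrs.get("ID")
        if feature_id:
            declared.add(feature_id)
        # A feature may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                references.append((feature_id, parent))
    # Resolve only after reading everything, so parent order does not matter.
    return [(fid, parent) for fid, parent in references if parent not in declared]


# The two feature lines quoted above, rebuilt with real tab separators.
sample = [
    "##gff-version 3",
    "\t".join(["tHaT13l000001", "LIPM", "gene", "1", "11546", "0", ".", ".",
               "ID=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;"]),
    "\t".join(["tHaT13l000001", "LIPM", "mRNA", "1", "11546", "0", ".", ".",
               "ID=mRNA:HaT13l000001.HaT13l000001.1;"
               "Parent=gene:HaT13l000001.HaT13l000001.1;"
               "Name=HaT13l000001;Note=UBX;"]),
]

unresolved = find_unresolved_parents(sample)
print(unresolved)
```

For the two lines above this reports no unresolved Parent, so the file
itself looks consistent and the missing link would have to come from the
loading step.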


Sebastien


On 28/08/2013 19:02, Jayaraman, Pushkala wrote:
>
> If I'm not mistaken, the one difference between your protein FASTA
> loader and the UniProt FASTA loader is this line:
>
> <property name="fasta.classAttribute" value="primaryAccession"/>
>
> From this link here:
>
> http://intermine.readthedocs.org/en/latest/database/data-sources/library/proteins/uniprot/#fasta
>
> Not sure if this helps, but would changing that classAttribute value
> help?
>
> Pushkala
>
> From: dev-bounces at intermine.org [mailto:dev-bounces at intermine.org]
> On Behalf Of Sebastien Carrere
> Sent: Wednesday, August 28, 2013 10:35 AM
> To: dev at intermine.org
> Subject: [InterMine Dev] How to link genes and proteins from gff3
> files and fasta files
>
> Hi all,
>
> I'm a newbie using InterMine. So first of all sorry if my question is 
> stupid but I wasn't able to find a clear answer in the doc or the 
> different examples.
>
> Here's my problem. I've got 3 datasources:
>
> 1. a GFF file describing transcripts (gene/mRNA/exon/CDS, possibly
> with several mRNAs per gene): I wrote my own GFF3 loader (gff-bfc)
> to use the Name attribute as primaryIdentifier for each feature
> sample:
> ##gff-version 3
> ##sequence-region HaT13l000001 1 11546
> tHaT13l000001    LIPM    gene    1    11546    0    .    .    ID=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
> tHaT13l000001    LIPM    mRNA    1    11546    0    .    .    ID=mRNA:HaT13l000001.HaT13l000001.1;Parent=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
> tHaT13l000001    FrameDP    exon    1    185    0    +    .    ID=exon:HaT13l000001.HaT13l000001.1;Parent=mRNA:HaT13l000001.HaT13l000001.1;Name=HaT13l000001_1_AA;
> tHaT13l000001    FrameDP    CDS    1    185    0    +    .    ID=CDS:HaT13l000001_1_AA.HaT13l000001_1_AA.1;Parent=exon:HaT13l000001.HaT13l000001.1;Name=HaT13l000001_1_AA;Note=hypothetical;
> ##sequence-region HaT13l000002 1 7394
> tHaT13l000002    LIPM    gene    1    7394    0    .    .    ID=gene:HaT13l000002.HaT13l000002.1;Name=HaT13l000002;Note=OPT;
> tHaT13l000002    LIPM    mRNA    1    7394    0    .    .    ID=mRNA:HaT13l000002.HaT13l000002.1;Parent=gene:HaT13l000002.HaT13l000002.1;Name=HaT13l000002;Note=OPT;
> tHaT13l000002    FrameDP    exon    1782    5231    0    +    .    ID=exon:HaT13l000002.HaT13l000002.1;Parent=mRNA:HaT13l000002.HaT13l000002.1;Name=HaT13l000002_3_AA;
> tHaT13l000002    FrameDP    CDS    1782    5231    0    +    .    ID=CDS:HaT13l000002_3_AA.HaT13l000002_3_AA.1;Parent=exon:HaT13l000002.HaT13l000002.1;Name=HaT13l000002_3_AA;Note=DNA polymerase;
>
>
> 2. a FASTA file for the transcript sequences
> sample:
>tHaT13l000001
> GCACAGTATACCTTCTTGCTTGTCTAATTCACTTTCATTCTTCATCTTCTCTCTTAATCAACAATCTTCCGCAAATCACA
> CACACAACACACCTTTTCAATTTCAATTTCTTCATAACCGCCGTAACAGACACAAAAAACCCCAACCGAAAACCCTTGAA
> ATCGAACCGCCGGTTGATTTCGTAATCTCGATTCGATTGTTTGTTGGTTCGATCAATCAATCGTGACGCTAATCGGTTGT
> ATACAGTGGTTATTGATTGTACGATTAACGAATGCTTTTTGTTAGGTTTGTTTTAAGATGAAGTTGAGGAGAAGGAGGCA
> [...]
>
>
> 3. a FASTA file for the protein sequences (CDS translations); the ids
> are the same as mRNA:Name or CDS:Name
> sample:
>HaT13l000001_1_AA gn=HaT13l000001
> TVYLLACLIHFHSSSSLLINNLPQITHTTHLFNFNFFITAVTDTKNPNRKPLKSNRRLIS
> >HaT13l000001_2_AA gn=HaT13l000001
> MKLRRRRQSEVPPKIKSFINGVIAVPLENIEEPLKSFFWDFDKGDFHHWVDLFNHFDTFFEKYIKPRKDLQLDDGFLESD
> PPFPREAVLQILRVVRTILDNCTNKHFYSSYEHHLSSLLASTDADVVEACLQTLSSFLRKSIGKHIARDTSLSSKLFAFA
>
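
The gn= tag in each protein header already names the gene, so the
protein-to-gene relation can be read directly from the FASTA file. A minimal
Python sketch of that mapping, assuming headers of the form
">proteinId gn=geneId" as in the sample; the function name is hypothetical,
no InterMine API is used, and whether the stock fasta source reads the gn=
tag is not established here:

```python
def protein_to_gene(fasta_lines):
    """Map each protein identifier to the gene named in its gn= tag."""
    mapping = {}
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header
        protein_id = fields[0]
        for field in fields[1:]:
            if field.startswith("gn="):
                mapping[protein_id] = field[len("gn="):]
                break
    return mapping


# Headers and (truncated) sequences taken from the protein sample above.
sample = [
    ">HaT13l000001_1_AA gn=HaT13l000001",
    "TVYLLACLIHFHSSSSLLINNLPQITHTTHLFNFNFFITAVTDTKNPNRKPLKSNRRLIS",
    ">HaT13l000001_2_AA gn=HaT13l000001",
    "MKLRRRRQSEVPPKIKSFINGVIAVPLENIEEPLKSFFWDFDKGDFHHWVDLFNHFDTFFEKYIKPRKDLQLDDGFLESD",
]

for protein_id, gene_id in protein_to_gene(sample).items():
    print(protein_id, "->", gene_id)
```

Both sample proteins map to HaT13l000001, which is the pair of identifiers a
custom loader would need in order to fill the genesproteins table.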
>
> Everything seems to be loaded (I can see gene/mRNA/CDS and proteins 
> with the right primaryIdentifier in the database) but the link between 
> genes and proteins is not processed:
>
> mymyne=# SELECT * from genesproteins ;
>  proteins | genes
> ----------+-------
> (0 rows)
>
>
> So my questions are:
> - am I doing something wrong (certainly;) ) ?
> - do I have to write my own FastaLoader and use the 
> protein.addToCollection("genes", geneRefId) to fill this table ?
>     - and in this case, what is the method to get the geneRefId from 
> its primaryIdentifier value ?
>
> Here is my project.xml file:
>
> <sources>
>     <source name="helianthus-gff" type="gff-bfc">
>       <property name="gff3.taxonId" value="4232"/>
>       <property name="gff3.seqDataSourceName" value="HelianthusDB"/>
>       <property name="gff3.dataSourceName" value="HelianthusDB"/>
>       <property name="gff3.seqClsName" value="Transcript"/> <!-- Seq Ontology Term Camelized -->
>       <property name="gff3.dataSetTitle" value="HaT13l FrameDP predictions"/>
>       <property name="src.data.dir" location="/path/to/data_helianthus/gff3"/>
>     </source>
>     <source name="helianthus-transcripts-fasta" type="fasta">
>       <property name="fasta.className" value="org.intermine.model.bio.Transcript"/>
>       <property name="fasta.classAttribute" value="primaryIdentifier"/>
>       <property name="fasta.dataSourceName" value="HelianthusDB"/>
>       <property name="fasta.dataSetTitle" value="HelianthusDB transcripts sequences"/>
>       <property name="fasta.taxonId" value="4232"/>
>       <property name="fasta.includes" value="*.fna"/>
>       <property name="src.data.dir" location="/path/to/data_helianthus/fasta/"/>
>     </source>
>     <source name="helianthus-peptides-fasta" type="fasta">
>       <property name="fasta.className" value="org.intermine.model.bio.Protein"/>
>       <property name="fasta.classAttribute" value="primaryIdentifier"/>
>       <property name="fasta.sequenceType" value="protein"/>
>       <property name="fasta.dataSourceName" value="HelianthusDB"/>
>       <property name="fasta.dataSetTitle" value="HelianthusDB peptide sequences"/>
>       <property name="fasta.taxonId" value="4232"/>
>       <property name="fasta.includes" value="*.faa"/>
>       <property name="src.data.dir" location="/path/to/data_helianthus/fasta"/>
>     </source>
>
> Thanks for your help,
>
> Sebastien
>
