[InterMine Dev] How to link genes and proteins from gff3 files and fasta files

Sebastien Carrere Sebastien.Carrere at toulouse.inra.fr
Wed Aug 28 16:35:21 BST 2013


Hi all,

I'm new to InterMine, so apologies if this is a basic question, but I 
wasn't able to find a clear answer in the documentation or in the 
examples.

Here's my problem. I have three data sources:

1. a *gff* file describing transcripts (gene/mRNA/exon/CDS, possibly 
with several mRNAs per gene). I wrote my own GFF3 loader (gff-bfc) so 
that the Name attribute is used as the primaryIdentifier of each feature
_sample:_
##gff-version 3
##sequence-region HaT13l000001 1 11546
*tHaT13l000001* LIPM    gene    1    11546    0    .    . 
ID=gene:HaT13l000001.HaT13l000001.1;Name=*HaT13l000001*;Note=UBX;
tHaT13l000001    LIPM    mRNA    1    11546    0    . . 
ID=mRNA:HaT13l000001.HaT13l000001.1;Parent=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
tHaT13l000001    FrameDP    exon    1    185    0    + . 
ID=exon:HaT13l000001.HaT13l000001.1;Parent=mRNA:HaT13l000001.HaT13l000001.1;*Name=HaT13l000001_1_AA*;
tHaT13l000001    FrameDP    CDS    1    185    0    + . 
ID=CDS:HaT13l000001_1_AA.HaT13l000001_1_AA.1;Parent=exon:HaT13l000001.HaT13l000001.1;Name=*HaT13l000001_1_AA*;Note=hypothetical;
##sequence-region HaT13l000002 1 7394
tHaT13l000002    LIPM    gene    1    7394    0    . . 
ID=gene:HaT13l000002.HaT13l000002.1;Name=HaT13l000002;Note=OPT;
tHaT13l000002    LIPM    mRNA    1    7394    0    . . 
ID=mRNA:HaT13l000002.HaT13l000002.1;Parent=gene:HaT13l000002.HaT13l000002.1;Name=HaT13l000002;Note=OPT;
tHaT13l000002    FrameDP    exon    1782    5231    0 +    . 
ID=exon:HaT13l000002.HaT13l000002.1;Parent=mRNA:HaT13l000002.HaT13l000002.1;Name=HaT13l000002_3_AA;
tHaT13l000002    FrameDP    CDS    1782    5231    0 +    . 
ID=CDS:HaT13l000002_3_AA.HaT13l000002_3_AA.1;Parent=exon:HaT13l000002.HaT13l000002.1;Name=HaT13l000002_3_AA;Note=DNA polymerase;


2. a *fasta* file for the *transcript* sequences
_sample:_
 >*tHaT13l000001 *
GCACAGTATACCTTCTTGCTTGTCTAATTCACTTTCATTCTTCATCTTCTCTCTTAATCAACAATCTTCCGCAAATCACA
CACACAACACACCTTTTCAATTTCAATTTCTTCATAACCGCCGTAACAGACACAAAAAACCCCAACCGAAAACCCTTGAA
ATCGAACCGCCGGTTGATTTCGTAATCTCGATTCGATTGTTTGTTGGTTCGATCAATCAATCGTGACGCTAATCGGTTGT
ATACAGTGGTTATTGATTGTACGATTAACGAATGCTTTTTGTTAGGTTTGTTTTAAGATGAAGTTGAGGAGAAGGAGGCA
[...]


3. a *fasta* file for the *protein* sequences (CDS translations) (ids 
are the same as mRNA:Name or CDS:Name)
_sample:_
 >*HaT13l000001_1_AA* gn=*HaT13l000001*
TVYLLACLIHFHSSSSLLINNLPQITHTTHLFNFNFFITAVTDTKNPNRKPLKSNRRLIS
 >HaT13l000001_2_AA gn=HaT13l000001
MKLRRRRQSEVPPKIKSFINGVIAVPLENIEEPLKSFFWDFDKGDFHHWVDLFNHFDTFFEKYIKPRKDLQLDDGFLESD
PPFPREAVLQILRVVRTILDNCTNKHFYSSYEHHLSSLLASTDADVVEACLQTLSSFLRKSIGKHIARDTSLSSKLFAFA
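
To make the link I expect more explicit, here is a minimal Python sketch (not InterMine code) that derives the gene of each protein in two independent ways: from the gn= field of the protein FASTA headers, and by following the Parent attributes from each CDS up to its gene in the GFF. The function names are hypothetical. The sketch assumes tab-separated GFF columns, a single Parent per feature, and identifiers without the emphasis asterisks used in the samples above.

```python
def parse_attributes(column9):
    """Split a GFF3 attributes column into a dict (no URL-decoding)."""
    attrs = {}
    for field in column9.strip().split(";"):
        if "=" in field:
            key, value = field.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def genes_from_gff(gff_lines):
    """Map each CDS Name to the Name of the gene it descends from."""
    features = {}  # feature ID -> (type, Name, Parent)
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a feature line
        attrs = parse_attributes(cols[8])
        if "ID" in attrs:
            features[attrs["ID"]] = (cols[2], attrs.get("Name"), attrs.get("Parent"))

    links = {}
    for ftype, name, parent in features.values():
        if ftype != "CDS":
            continue
        # Follow Parent pointers until a gene is reached or the chain ends.
        seen = set()
        while parent in features and parent not in seen:
            seen.add(parent)
            ptype, pname, pparent = features[parent]
            if ptype == "gene":
                links[name] = pname
                break
            parent = pparent
    return links


def genes_from_fasta(fasta_lines):
    """Map each protein ID to the gene named in its 'gn=' header field."""
    links = {}
    for line in fasta_lines:
        line = line.strip()
        if not line.startswith(">"):
            continue  # sequence line
        fields = line[1:].split()
        if not fields:
            continue
        for field in fields[1:]:
            if field.startswith("gn="):
                links[fields[0]] = field[len("gn="):]
                break
    return links


gff = [
    "##gff-version 3",
    "tHaT13l000001\tLIPM\tgene\t1\t11546\t0\t.\t.\t"
    "ID=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;",
    "tHaT13l000001\tLIPM\tmRNA\t1\t11546\t0\t.\t.\t"
    "ID=mRNA:HaT13l000001.HaT13l000001.1;Parent=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;",
    "tHaT13l000001\tFrameDP\texon\t1\t185\t0\t+\t.\t"
    "ID=exon:HaT13l000001.HaT13l000001.1;Parent=mRNA:HaT13l000001.HaT13l000001.1;Name=HaT13l000001_1_AA;",
    "tHaT13l000001\tFrameDP\tCDS\t1\t185\t0\t+\t.\t"
    "ID=CDS:HaT13l000001_1_AA.HaT13l000001_1_AA.1;Parent=exon:HaT13l000001.HaT13l000001.1;Name=HaT13l000001_1_AA;",
]
fasta = [
    ">HaT13l000001_1_AA gn=HaT13l000001",
    "TVYLLACLIHFHSSSSLLINNLPQITHTTHLFNFNFFITAVTDTKNPNRKPLKSNRRLIS",
]

print(genes_from_gff(gff))
print(genes_from_fasta(fasta))
```

Comparing the two dictionaries is a quick way to check that the identifiers in both files agree before loading them.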


Everything seems to be loaded (I can see genes, mRNAs, CDSs and proteins 
with the right primaryIdentifier in the database), but no links between 
genes and proteins are created:

mymyne=# SELECT * from  genesproteins ;
  proteins | genes
----------+-------
(0 rows)


So my questions are:
- Am I doing something wrong (certainly ;) )?
- Do I have to write my own FastaLoader and call 
protein.addToCollection("genes", geneRefId) to fill this table?
     - If so, how do I get the geneRefId from the gene's 
primaryIdentifier value?
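
To make the last question concrete, this is the kind of lookup I have in mind, written as a minimal Python sketch rather than against the InterMine API. The class RefIdCache and the "gene_N" reference format are hypothetical: the idea is only that each primaryIdentifier gets exactly one reference ID, created the first time it is seen and reused afterwards.

```python
class RefIdCache:
    """Hand out one stable reference ID per primaryIdentifier (hypothetical helper)."""

    def __init__(self, prefix):
        self.prefix = prefix
        self._ids = {}  # primaryIdentifier -> reference ID

    def get(self, primary_identifier):
        # Create a new reference ID only the first time an identifier is seen.
        if primary_identifier not in self._ids:
            self._ids[primary_identifier] = f"{self.prefix}_{len(self._ids) + 1}"
        return self._ids[primary_identifier]


gene_refs = RefIdCache("gene")
protein_links = {}
for protein_id, gene_id in [
    ("HaT13l000001_1_AA", "HaT13l000001"),
    ("HaT13l000001_2_AA", "HaT13l000001"),
]:
    # Both proteins of the same gene receive the same gene reference.
    protein_links[protein_id] = gene_refs.get(gene_id)

print(protein_links)
```

Whether InterMine offers such a lookup itself, or whether the loader has to keep this map, is exactly what I could not work out.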

Here is my project.xml file:

<sources>
     <source name="helianthus-gff" type="gff-bfc">
       <property name="gff3.taxonId" value="4232"/>
       <property name="gff3.seqDataSourceName" value="HelianthusDB"/>
       <property name="gff3.dataSourceName" value="HelianthusDB"/>
       <property name="gff3.seqClsName" value="Transcript"/> <!-- Sequence Ontology term, camel-cased -->
       <property name="gff3.dataSetTitle" value="HaT13l FrameDP predictions"/>
       <property name="src.data.dir" location="/path/to/data_helianthus/gff3"/>
     </source>
     <source name="helianthus-transcripts-fasta" type="fasta">
       <property name="fasta.className" value="org.intermine.model.bio.Transcript"/>
       <property name="fasta.classAttribute" value="primaryIdentifier"/>
       <property name="fasta.dataSourceName" value="HelianthusDB"/>
       <property name="fasta.dataSetTitle" value="HelianthusDB transcripts sequences"/>
       <property name="fasta.taxonId" value="4232"/>
       <property name="fasta.includes" value="*.fna"/>
       <property name="src.data.dir" location="/path/to/data_helianthus/fasta/"/>
     </source>
     <source name="helianthus-peptides-fasta" type="fasta">
       <property name="fasta.className" value="org.intermine.model.bio.Protein"/>
       <property name="fasta.classAttribute" value="primaryIdentifier"/>
       <property name="fasta.sequenceType" value="protein"/>
       <property name="fasta.dataSourceName" value="HelianthusDB"/>
       <property name="fasta.dataSetTitle" value="HelianthusDB peptide sequences"/>
       <property name="fasta.taxonId" value="4232"/>
       <property name="fasta.includes" value="*.faa"/>
       <property name="src.data.dir" location="/path/to/data_helianthus/fasta"/>
     </source>
</sources>

Thanks for your help,

Sebastien
