[InterMine Dev] How to link genes and proteins from gff3 files and fasta files

Jayaraman, Pushkala pjayaraman at mcw.edu
Wed Aug 28 18:02:19 BST 2013


If I'm not wrong, the one difference between your protein FASTA loader and the UniProt FASTA loader is this line:

<property name="fasta.classAttribute" value="primaryAccession"/>


From this link:
http://intermine.readthedocs.org/en/latest/database/data-sources/library/proteins/uniprot/#fasta


I'm not sure, but would changing that classAttribute value help?


Pushkala

From: dev-bounces at intermine.org [mailto:dev-bounces at intermine.org] On Behalf Of Sebastien Carrere
Sent: Wednesday, August 28, 2013 10:35 AM
To: dev at intermine.org
Subject: [InterMine Dev] How to link genes and proteins from gff3 files and fasta files

Hi all,

I'm a newbie using InterMine. So first of all sorry if my question is stupid but I wasn't able to find a clear answer in the doc or the different examples.

Here's my problem. I've got 3 datasources:

1. a gff file describing transcripts (gene/mRNA/exon/CDS, with possibly several mRNAs per gene). I wrote my own GFF3 loader (gff-bfc) that uses the Name attribute as primaryIdentifier for each feature.
sample:
##gff-version 3
##sequence-region HaT13l000001 1 11546
tHaT13l000001    LIPM    gene    1    11546    0    .    .    ID=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
tHaT13l000001    LIPM    mRNA    1    11546    0    .    .    ID=mRNA:HaT13l000001.HaT13l000001.1;Parent=gene:HaT13l000001.HaT13l000001.1;Name=HaT13l000001;Note=UBX;
tHaT13l000001    FrameDP    exon    1    185    0    +    .    ID=exon:HaT13l000001.HaT13l000001.1;Parent=mRNA:HaT13l000001.HaT13l000001.1;Name=HaT13l000001_1_AA;
tHaT13l000001    FrameDP    CDS    1    185    0    +    .    ID=CDS:HaT13l000001_1_AA.HaT13l000001_1_AA.1;Parent=exon:HaT13l000001.HaT13l000001.1;Name=HaT13l000001_1_AA;Note=hypothetical;
##sequence-region HaT13l000002 1 7394
tHaT13l000002    LIPM    gene    1    7394    0    .    .    ID=gene:HaT13l000002.HaT13l000002.1;Name=HaT13l000002;Note=OPT;
tHaT13l000002    LIPM    mRNA    1    7394    0    .    .    ID=mRNA:HaT13l000002.HaT13l000002.1;Parent=gene:HaT13l000002.HaT13l000002.1;Name=HaT13l000002;Note=OPT;
tHaT13l000002    FrameDP    exon    1782    5231    0    +    .    ID=exon:HaT13l000002.HaT13l000002.1;Parent=mRNA:HaT13l000002.HaT13l000002.1;Name=HaT13l000002_3_AA;
tHaT13l000002    FrameDP    CDS    1782    5231    0    +    .    ID=CDS:HaT13l000002_3_AA.HaT13l000002_3_AA.1;Parent=exon:HaT13l000002.HaT13l000002.1;Name=HaT13l000002_3_AA;Note=DNA polymerase;
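
To see which Name values a loader keyed on the Name attribute would pick up as primaryIdentifier, column 9 of each line can be split into key/value pairs. This is a minimal Python sketch for inspecting the file, not InterMine code; the function names `parse_gff3_attributes` and `feature_names` are hypothetical. It assumes tab-separated columns as in real GFF3, and falls back to whitespace splitting for lines pasted as in the sample above.

```python
def parse_gff3_attributes(column9):
    """Split a GFF3 attributes column into a dict of key -> value."""
    attributes = {}
    for entry in column9.strip().split(";"):
        if "=" not in entry:
            continue  # empty trailing entry, or a tag without '='
        key, value = entry.split("=", 1)
        attributes[key.strip()] = value.strip()
    return attributes


def feature_names(gff_lines):
    """Yield (feature_type, Name) for every feature line, skipping '#' lines."""
    for line in gff_lines:
        line = line.rstrip("\n")
        if not line.strip() or line.startswith("#"):
            continue
        if "\t" in line:
            columns = line.split("\t")
        else:
            # pasted sample: split on whitespace, keeping column 9 in one piece
            columns = line.split(None, 8)
        if len(columns) < 9:
            continue  # not a complete 9-column feature line
        attrs = parse_gff3_attributes(columns[8])
        yield columns[2], attrs.get("Name")
```

Running `feature_names` over the file and grouping by feature type shows that, in this sample, gene and mRNA share one Name while exon and CDS carry the `_AA` name that also appears in the protein FASTA headers.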


2. a fasta file for the transcript sequences
sample:
>tHaT13l000001
GCACAGTATACCTTCTTGCTTGTCTAATTCACTTTCATTCTTCATCTTCTCTCTTAATCAACAATCTTCCGCAAATCACA
CACACAACACACCTTTTCAATTTCAATTTCTTCATAACCGCCGTAACAGACACAAAAAACCCCAACCGAAAACCCTTGAA
ATCGAACCGCCGGTTGATTTCGTAATCTCGATTCGATTGTTTGTTGGTTCGATCAATCAATCGTGACGCTAATCGGTTGT
ATACAGTGGTTATTGATTGTACGATTAACGAATGCTTTTTGTTAGGTTTGTTTTAAGATGAAGTTGAGGAGAAGGAGGCA
[...]


3. a fasta file for the protein sequences (CDS translations) (ids are the same as mRNA:Name or CDS:Name)
sample:
>HaT13l000001_1_AA gn=HaT13l000001
TVYLLACLIHFHSSSSLLINNLPQITHTTHLFNFNFFITAVTDTKNPNRKPLKSNRRLIS
>HaT13l000001_2_AA gn=HaT13l000001
MKLRRRRQSEVPPKIKSFINGVIAVPLENIEEPLKSFFWDFDKGDFHHWVDLFNHFDTFFEKYIKPRKDLQLDDGFLESD
PPFPREAVLQILRVVRTILDNCTNKHFYSSYEHHLSSLLASTDADVVEACLQTLSSFLRKSIGKHIARDTSLSSKLFAFA
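
Each header in this file carries the protein id followed by the gene id in a `gn=` tag, so the protein-to-gene mapping can be read straight from the headers. This is a small Python sketch for checking that mapping outside InterMine; the helpers `parse_protein_header` and `protein_to_gene` are hypothetical names, and the `gn=` convention is assumed from the sample above.

```python
def parse_protein_header(header):
    """Return (protein_id, gene_id) from a FASTA header; gene_id is None without gn=."""
    if not header.startswith(">"):
        raise ValueError("not a FASTA header: %r" % header)
    fields = header[1:].split()
    if not fields:
        raise ValueError("FASTA header has no identifier")
    protein_id = fields[0]
    gene_id = None
    for field in fields[1:]:
        if field.startswith("gn="):
            gene_id = field[len("gn="):]
            break
    return protein_id, gene_id


def protein_to_gene(fasta_lines):
    """Map every protein id in a FASTA file to the gene id named in its header."""
    mapping = {}
    for line in fasta_lines:
        if line.startswith(">"):
            protein_id, gene_id = parse_protein_header(line)
            mapping[protein_id] = gene_id
    return mapping
```

For the two records above this maps both `HaT13l000001_1_AA` and `HaT13l000001_2_AA` to `HaT13l000001`, which is the pairing one would expect to find in the genesproteins table.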


Everything seems to be loaded (I can see gene/mRNA/CDS and proteins with the right primaryIdentifier in the database) but the link between genes and proteins is not processed:

mymyne=# SELECT * from  genesproteins ;
 proteins | genes
----------+-------
(0 rows)


So my questions are:
- Am I doing something wrong (certainly ;) )?
- Do I have to write my own FastaLoader and use protein.addToCollection("genes", geneRefId) to fill this table?
    - If so, how do I get the geneRefId from the gene's primaryIdentifier value?

Here is my project.xml file:

 <sources>
    <source name="helianthus-gff" type="gff-bfc">
      <property name="gff3.taxonId" value="4232"/>
      <property name="gff3.seqDataSourceName" value="HelianthusDB"/>
      <property name="gff3.dataSourceName" value="HelianthusDB"/>
      <property name="gff3.seqClsName" value="Transcript"/> <!-- Seq Ontology Term Camelized-->
      <property name="gff3.dataSetTitle" value="HaT13l FrameDP predictions"/>
      <property name="src.data.dir" location="/path/to/data_helianthus/gff3"/>
    </source>
    <source name="helianthus-transcripts-fasta" type="fasta" >
      <property name="fasta.className" value="org.intermine.model.bio.Transcript"/>
      <property name="fasta.classAttribute" value="primaryIdentifier"/>
      <property name="fasta.dataSourceName" value="HelianthusDB"/>
      <property name="fasta.dataSetTitle" value="HelianthusDB transcripts sequences"/>
      <property name="fasta.taxonId" value="4232"/>
      <property name="fasta.includes" value="*.fna"/>
      <property name="src.data.dir" location="/path/to/data_helianthus/fasta/"/>
    </source>
    <source name="helianthus-peptides-fasta" type="fasta" >
      <property name="fasta.className" value="org.intermine.model.bio.Protein"/>
      <property name="fasta.classAttribute" value="primaryIdentifier"/>
      <property name="fasta.sequenceType" value="protein" />
      <property name="fasta.dataSourceName" value="HelianthusDB"/>
      <property name="fasta.dataSetTitle" value="HelianthusDB peptide sequences"/>
      <property name="fasta.taxonId" value="4232"/>
      <property name="fasta.includes" value="*.faa"/>
      <property name="src.data.dir" location="/path/to/data_helianthus/fasta"/>
    </source>
 </sources>

Thanks for your help,

Sebastien