[InterMine Dev] orthologues in InterMine -
julie at flymine.org
Wed Oct 24 09:21:59 BST 2012
On 24/10/12 08:58, Dr Intikhab Alam wrote:
> Dear Julie,
> Thank you so much for your response. I will definitely get the new release of intermine and give it a go.
> I have obtained my orthologues using OrthoMCL software that reports one line per orthologous group as the following line:
> ORTHOMCL6227(2 genes,2 taxa): AAA831C14_00243(AAA831C14) AAA831J21_00092(AAA831J21)
> where the first column is the orthologous cluster id, with the number of genes and taxa, followed by a tab and then the gene names (organism) delimited by spaces. Is there any
> source which can handle this type of data format? Or can you tell me the format
> into which I should transform my data so that one of your sources is able to load my
No, none of our orthologue sources uses this format; you'll have to write a
It will be similar to the parser you use to generate the XML for your
annotations. I think that's written in Perl?
1. write script, generate items XML file
a. parse the file that `OrthoMCL` created.
b. use the values from the file to create genes, organisms and homologues.
Here's the Java code for storing a homologue:
Item homologue = createItem("Homologue");
and a gene:
Item gene = createItem("Gene");
You'll set the same attributes and references in Perl. You can use your current
script as a guide.
Remember - you'll need to keep a map of genes so you don't store duplicate
copies of the same gene. I think that caused us problems last time.
If you want help with your script, we're happy to take a look! :) (I know it's
been a while!)
2. create a new source
./bio/scripts/make_source orthomcl-items-xml intermine-items-xml-file
3. add new source to project XML
<source name="orthomcl-items-xml" type="orthomcl-items-xml">
  <property name="src.data.file" location="/DATA/orthomcl-items-xml.xml"/>
</source>
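
For step 1, here is a rough sketch of the parsing and of the gene map that avoids duplicate genes. It is written in Java only to match the snippets above; the same logic carries over to a Perl script. It assumes every line looks like the sample OrthoMCL line quoted earlier. The class OrthoMclSketch and its addLine method are made up for illustration and are not part of InterMine, and the sketch stops short of writing the items XML. Whether you also need to store each pair in the reverse direction depends on your data model, so only one direction is recorded here.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an InterMine class): collects genes and homologue pairs. */
class OrthoMclSketch {
    // Cluster id, then "(N genes,M taxa):", then the list of members.
    private static final Pattern GROUP =
            Pattern.compile("^(\\S+?)\\(\\d+ genes,\\s*\\d+ taxa\\):\\s*(.*)$");
    // One member, written as geneId(organism).
    private static final Pattern MEMBER = Pattern.compile("(\\S+?)\\(([^)]+)\\)");

    // Gene identifier -> organism label. A gene seen in several groups is kept once.
    final Map<String, String> genes = new LinkedHashMap<>();
    // Each entry is {geneId, homologueId}, one entry per unordered pair in a group.
    final List<String[]> pairs = new ArrayList<>();

    /** Parses one group line; returns the cluster id, or null if the line does not match. */
    String addLine(String line) {
        Matcher group = GROUP.matcher(line.trim());
        if (!group.matches()) {
            return null;
        }
        List<String> ids = new ArrayList<>();
        Matcher member = MEMBER.matcher(group.group(2));
        while (member.find()) {
            // putIfAbsent plays the role of the "map of genes" mentioned above.
            genes.putIfAbsent(member.group(1), member.group(2));
            ids.add(member.group(1));
        }
        // Pair every gene in the group with every later gene in the same group.
        for (int i = 0; i < ids.size(); i++) {
            for (int j = i + 1; j < ids.size(); j++) {
                pairs.add(new String[] {ids.get(i), ids.get(j)});
            }
        }
        return group.group(1);
    }
}
```

Calling addLine once per line of the OrthoMCL output leaves one entry in genes for each distinct gene and one entry in pairs for each homologue to create; those are the values you would then turn into Gene, Organism and Homologue items.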
> Organisms and all annotations are already loaded using a large XML file for each organism. If I load the orthologous data using any of the sources you have mentioned, do you expect
> any conflict, e.g. an organism being loaded a second time, or do I need to do
> anything in the priorities file?
No, there will be no conflicts. But make sure you have keys for every type of
object you are storing.
Your keys file will be here (where `orthomcl-items-xml` is your source name):
You'll need keys for:
gene (eg Gene.key_primaryidentifier=primaryIdentifier)
organism (eg Organism.key_taxonid=taxonId)
Plus whatever else you store.
The keys will tell the build system to merge your genes with the genes already
in the database - the ones stored by your other source.
There won't be any conflicts because you are only storing the identifiers for
the genes. You might get a conflict if you were storing other information, eg.
gene.name. If that gene.name didn't match with something in the database, you
would get an error. Actually, you may get a conflict if the gene in the database
has a different organism than the gene you are trying to store. If that happens,
there may be issues with your data!