[InterMine Dev] orthologues in InterMine -

Julie Sullivan julie at flymine.org
Wed Oct 24 09:21:59 BST 2012



On 24/10/12 08:58, Dr Intikhab Alam wrote:
> Dear Julie,
>
> Thank you so much for your response. I will definitely get the new release of intermine and give it a go.
>
> I have obtained my orthologues using OrthoMCL software that reports one line per orthologous group as the following line:
>
> ORTHOMCL6227(2 genes,2 taxa):    AAA831C14_00243(AAA831C14) AAA831J21_00092(AAA831J21)
>
> where the first column is the orthologous cluster id, with information on the number of genes and taxa, followed by a tab and then gene names (organism) delimited by a space. Is there any
> source which can handle this type of data format? Or can you tell me the format into which I should transform my data so that one of your sources is able to load it?

No, none of our orthologue sources uses this format, so you'll have to write a 
parser yourself.

It will be similar to the parser you use to generate the XML for your 
annotations. I think that's written in Perl?

  1. write script, generate items XML file

	a. parse the file that `OrthoMCL` created.
	b. use the values from the file to create genes, organisms and homologues.

	here's the Java code for storing a homologue:

		Item homologue = createItem("Homologue");
		homologue.setReference("gene", gene1);
		homologue.setReference("homologue", gene2);
		homologue.setAttribute("type", "homologue");
		store(homologue);

	and a gene:

		Item gene = createItem("Gene");
		gene.setAttribute(identifierType, identifier);
		gene.setReference("organism", getOrganism(taxonId));

You'll set the same attributes and references in Perl. You can use your current 
script as a guide.
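
As a rough illustration of steps a and b, here is a minimal sketch of parsing one 
OrthoMCL line, in Java to match the snippets above (a Perl version would follow 
the same shape). The class and method names are made up for this example and are 
not part of InterMine. It assumes every line looks like the example quoted above, 
and it only extracts the organism label in parentheses; mapping that label to a 
taxon ID is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not an InterMine class.
final class OrthoMclLineParser {

    /** One cluster member: a gene identifier and the organism label in parentheses. */
    static final class Member {
        final String geneId;
        final String organism;

        Member(String geneId, String organism) {
            this.geneId = geneId;
            this.organism = organism;
        }
    }

    /** One parsed line: the cluster id and its members. */
    static final class Cluster {
        final String clusterId;
        final List<Member> members;

        Cluster(String clusterId, List<Member> members) {
            this.clusterId = clusterId;
            this.members = members;
        }
    }

    // "ORTHOMCL6227(2 genes,2 taxa):" followed by whitespace and the member list.
    private static final Pattern LINE =
            Pattern.compile("^([^\\s(]+)\\([^)]*\\):\\s*(.*)$");
    // "AAA831C14_00243(AAA831C14)": gene identifier, then organism in parentheses.
    private static final Pattern MEMBER =
            Pattern.compile("([^\\s(]+)\\(([^)]+)\\)");

    static Cluster parseLine(String line) {
        Matcher lineMatcher = LINE.matcher(line.trim());
        if (!lineMatcher.matches()) {
            throw new IllegalArgumentException("Unrecognised OrthoMCL line: " + line);
        }
        List<Member> members = new ArrayList<>();
        Matcher memberMatcher = MEMBER.matcher(lineMatcher.group(2));
        while (memberMatcher.find()) {
            members.add(new Member(memberMatcher.group(1), memberMatcher.group(2)));
        }
        return new Cluster(lineMatcher.group(1), members);
    }

    public static void main(String[] args) {
        Cluster cluster = parseLine(
                "ORTHOMCL6227(2 genes,2 taxa):\t AAA831C14_00243(AAA831C14) AAA831J21_00092(AAA831J21)");
        System.out.println(cluster.clusterId);
        for (Member member : cluster.members) {
            System.out.println(member.geneId + "\t" + member.organism);
        }
    }
}
```

Each Member would then become a gene (with its organism), and the members of one 
cluster would be linked to each other as homologues.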

Remember - you'll need to keep a map of genes so you don't store duplicate 
copies of the same gene. I think that caused us problems last time.
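
A minimal sketch of that gene map, again in plain Java with made-up names. The 
string references ("gene_1", ...) only stand in for the point where the real 
script would create and store a Gene item. Creating one homologue for each 
ordered pair of genes in a cluster is an assumption here; adjust it if you only 
want one direction.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not an InterMine class.
final class HomologuePairs {

    // Gene identifier -> reference of the gene already created for it.
    private final Map<String, String> geneRefs = new LinkedHashMap<>();
    // Each entry is {gene reference, homologue reference}.
    private final List<String[]> homologues = new ArrayList<>();

    /** Returns the existing reference for a gene, creating one only on first sight. */
    String geneRef(String identifier) {
        String ref = geneRefs.get(identifier);
        if (ref == null) {
            // A real script would create and store the Gene item here.
            ref = "gene_" + (geneRefs.size() + 1);
            geneRefs.put(identifier, ref);
        }
        return ref;
    }

    /** Links every gene in a cluster to every other gene in that cluster. */
    void addCluster(List<String> geneIds) {
        for (String gene : geneIds) {
            for (String other : geneIds) {
                if (!gene.equals(other)) {
                    // A real script would create and store a Homologue item here.
                    homologues.add(new String[] {geneRef(gene), geneRef(other)});
                }
            }
        }
    }

    int geneCount() {
        return geneRefs.size();
    }

    int homologueCount() {
        return homologues.size();
    }

    public static void main(String[] args) {
        HomologuePairs pairs = new HomologuePairs();
        // Two clusters that share gene "A": "A" must be created only once.
        pairs.addCluster(Arrays.asList("A", "B"));
        pairs.addCluster(Arrays.asList("A", "C"));
        System.out.println("genes: " + pairs.geneCount());
        System.out.println("homologues: " + pairs.homologueCount());
    }
}
```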

If you want help with your script, we're happy to take a look! :) (I know it's 
been a while!)

  2. create a new source
	
	http://intermine.org/wiki/SourceHowto#a1.1Runmake_sourcescript
	for example:
	 ./bio/scripts/make_source orthomcl-items-xml intermine-items-xml-file


  3. add new source to project XML

for example:
     <source name="orthomcl-items-xml" type="orthomcl-items-xml">
       <property name="src.data.file" location="/DATA/orthomcl-items-xml.xml"/>
     </source>


> Organisms and all annotations are already loaded using a large XML file for each organism. If I load the orthologue data using any of the sources you have mentioned, do you expect
> any conflict, e.g. an organism being loaded a second time, or do I need to do anything in the priorities file?

No, there will be no conflicts. But make sure you have keys for every type of 
object you are storing.

Your keys file will be here (where `orthomcl-items-xml` is your source name):

bio/sources/orthomcl-items-xml/resources/orthomcl-items-xml_keys.properties

You'll need keys for:

	gene (eg Gene.key_primaryidentifier=primaryIdentifier)
	organism (eg Organism.key_taxonid=taxonId)

Plus whatever else you store.

The keys will tell the build system to merge your genes with the genes already 
in the database - the ones stored by your other source.

There won't be any conflicts because you are only storing the identifiers for 
the genes. You might get a conflict if you were storing other information, e.g. 
gene.name. If that gene.name didn't match the value already in the database, you 
would get an error. Actually, you may also get a conflict if the gene in the 
database has a different organism from the gene you are trying to store. If that 
happens, there may be issues with your data!





