[InterMine Dev] seeking advice, best practices

Richard Smith richard at flymine.org
Wed Sep 14 11:34:49 BST 2011

Thanks for the comments Andrew.  Some more below.

On 13/09/2011 14:55, Joel Richardson wrote:
> Hi all,
> The more I dig into Intermine, the more possibilities I see
> but also the more questions I have. I'd really appreciate
> any help/insight/opinions as to the best ways to deal with
> the following issues.
> Core model + bio extensions. Is there flexibility here? Can any
> of this be changed without breaking things? Which parts?
> Reasons:
> - a fair amount of stuff is irrelevant for our data and so
> will remain unpopulated in the mine. I know we can just ignore these
> parts (and that's fine for now), but it seems a bit awkward, e.g,
> to have "dead" classes available in the query builder.
> - some aspects conflict with the data we have. For example,
> neither Alleles nor Proteins are subclasses of BioEntity, and so
> cannot have OntologyAnnotations. Again, I know we can define
> our own subclasses of BioEntity (MGIAllele, MGIProtein, or whatever),
> but that seems messy.
> A larger question is whether/how the different mines (at least, the
> InterMOD ones) coordinate their model extensions. I'm assuming everyone
> pretty much extends the core model for their own purposes, and it's
> a great strength of Intermine that this is possible. But it also
> raises issues for interoperability as the mines' models diverge.

We try to hide empty classes and fields in the interface to some
extent, and we could do more of this in future.

Between the MODs we have discussed keeping models in sync as much as
possible.  We have a web page that compares models from the different
Mines and highlights any differences - right now it doesn't seem to
be working :(  but it's usually at intermine.org/reports.

I think Andrew's approach of looking at other Mines before adding a
new source is a good one.

> Source control/versioning. I'm wondering how people are approaching
> version control of their mines' components (config files, source
> code, etc.) as distinct from Intermine itself?

We aim to move all of InterMine to git at some point.  We certainly
need to isolate the site-specific config and sources from the core
code as much as possible - this is 90% there but still inconvenient.

> Loading lots of data from a relational db. Most of our data
> will come out of MGI. There's lots of it and lots of different types.
> Should this be one big load or lots of little ones? Should the
> loads connect to the db directly, or should the db get dumped
> in ItemXml format and we load that? If loading ItemsXml, is it
> better to load one big file, or a directory of smaller ones?

Certainly loading as much data as possible from one source is a good
idea.  InterMine supports some options for integrating data from
multiple sources, but they rely on running queries against the
production database, which is slower than a load that only has to
write.

If you load items XML, it needs to be one file per source, so one big
file is best.
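
As a rough illustration of the "one file per source" point, here is a
minimal Java sketch that writes rows of several data types into a
single items document.  It is not InterMine code: the class
ItemsXmlSketch, the Row holder, and the ids and values are made up,
and the element and attribute names (items, item, attribute, id,
class, name, value) are assumptions about the items XML layout that
should be checked against a file your own build produces.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

public class ItemsXmlSketch {

    /** Hypothetical holder for one record exported from the source database. */
    static final class Row {
        final String id;
        final String className;
        final Map<String, String> attributes;

        Row(String id, String className, Map<String, String> attributes) {
            this.id = id;
            this.className = className;
            this.attributes = attributes;
        }
    }

    /** Renders all rows, whatever their type, into ONE items document. */
    static String render(List<Row> rows) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter xml =
                XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            xml.writeStartElement("items");          // assumed root element
            for (Row row : rows) {
                xml.writeStartElement("item");       // assumed element name
                xml.writeAttribute("id", row.id);
                xml.writeAttribute("class", row.className);
                for (Map.Entry<String, String> e : row.attributes.entrySet()) {
                    xml.writeEmptyElement("attribute");
                    xml.writeAttribute("name", e.getKey());
                    xml.writeAttribute("value", e.getValue());
                }
                xml.writeEndElement();               // closes item
            }
            xml.writeEndElement();                   // closes items
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("could not write items XML", e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Map<String, String> gene = new LinkedHashMap<>();
        gene.put("symbol", "Pax6");                  // made-up example values
        Map<String, String> allele = new LinkedHashMap<>();
        allele.put("symbol", "Pax6<Sey>");           // escaped by the writer
        System.out.println(render(List.of(
            new Row("1_1", "Gene", gene),
            new Row("1_2", "Allele", allele))));
    }
}
```

For a real load the same writer would be pointed at a file rather
than a StringWriter, so the whole export streams into one document
instead of a directory of smaller ones.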

Writing to items XML adds an extra step, so connecting to the database
directly from Java code and writing straight into the items database
is my preference.


> Many thanks in advance,
> Joel
