[InterMine Dev] Prevent post-processor from running multiple times when datasource has multiple source defs in project.xml?

Sam Hokin shokin at ncgr.org
Thu Apr 14 22:47:42 BST 2016


Hi, devs. It's me again. Here's another one. I've got four source entries in my project.xml that all come from the same datasource, 
but using different processors, which need to be run separately for merging reasons. I also have a single post-processor that I'd 
like to run once. (Since all the data sources are run before post-processing, it doesn't make any sense to run a post-processor more 
than once.)

However, since there are four entries in project.xml (chado-genomics, chado-genetics, chado-go and chado-featureprop), the 
corresponding post-processor is run four consecutive times by the do-sources post-process task. (Fortunately, in this case it's a 
quick post-process, but if it took many hours, it would be hugely annoying to have it run four times. And yes, I know I can add 
some sort of flag or test so the post-processor exits early if it has already been run, but I'm trying to make InterMine better here, or 
at least my use of it.)
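
For concreteness, the flag-or-test workaround could look roughly like the minimal sketch below. It is plain Java and uses no InterMine 
classes; RunOnceGuard, tryAcquire and the marker file name are hypothetical names made up for illustration, and I'm assuming a real 
post-processor would consult the guard before doing any work. It only shows the idea of "first caller runs, later callers skip".

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical run-once guard backed by a marker file; not part of InterMine. */
class RunOnceGuard {
    private final Path marker;

    RunOnceGuard(Path marker) {
        this.marker = marker;
    }

    /**
     * Atomically creates the marker file.
     * Returns true for the first caller, false if the marker already exists.
     */
    boolean tryAcquire() throws IOException {
        try {
            Files.createFile(marker);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Assumption: one marker per build, placed in a fresh temporary directory.
        Path marker = Files.createTempDirectory("legfed").resolve("postprocess.done");
        RunOnceGuard guard = new RunOnceGuard(marker);

        // Simulate the post-processor being invoked once per source entry.
        String[] sources = {"chado-genomics", "chado-genetics", "chado-go", "chado-featureprop"};
        for (String source : sources) {
            if (guard.tryAcquire()) {
                System.out.println(source + ": running post-process");
            } else {
                System.out.println(source + ": skipping, already run");
            }
        }
    }
}
```

A marker file rather than a static field keeps the guard working even if each invocation happens in a separate JVM, but the marker 
then has to be removed before the next build, which is exactly the kind of bookkeeping I'd rather not need.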

I don't see any way of limiting the number of times the post-processor is run. It looks to me like this must also happen with FlyMine's 
chado-db FlyBasePostProcess for the same reason: FlyMine has many chado-db sources defined in its project.xml.

Here's my DATASOURCE/project.properties which defines the source-related post-processor:

compile.dependencies = intermine/objectstore/main,bio/core/main,\
                        intermine/integrate/main, \
                        bio/sources/legfed/main

have.db.tgt = true
converter.class = org.intermine.bio.dataconversion.ChadoDBConverter
postprocessor.class = org.intermine.bio.postprocess.LegfedPostProcess

And here's the project.xml which results in do-sources running LegfedPostProcess four times:

     <!-- chado genomics - has merge priority, so run first -->
     <source name="chado-genomics" type="legfed" dump="true">
       <property name="source.db.name" value="tripal"/>
       <property name="organisms" value="3885 3398"/>
       <property name="dataSetTitle" value="LIS Phaseolus vulgaris (3885) data"/>
       <property name="dataSourceName" value="LIS Tripal database"/>
       <property name="converter.class" value="org.intermine.bio.dataconversion.ChadoDBConverter"/>
       <property name="processors" value="org.intermine.bio.dataconversion.SequenceProcessor"/>
     </source>

     <!-- chado genetics -->
     <source name="chado-genetics" type="legfed" dump="true">
       <property name="source.db.name" value="tripal"/>
       <property name="organisms" value="3885 3398"/>
       <property name="dataSetTitle" value="LIS Phaseolus vulgaris (3885) data"/>
       <property name="dataSourceName" value="LIS Tripal database"/>
       <property name="converter.class" value="org.intermine.bio.dataconversion.ChadoDBConverter"/>
       <property name="processors" value="org.intermine.bio.dataconversion.GeneticProcessor"/>
     </source>

     <!-- chado GO annotation -->
     <source name="chado-go" type="legfed" dump="true">
       <property name="source.db.name" value="tripal"/>
       <property name="organisms" value="3885 3398"/>
       <property name="dataSetTitle" value="LIS Phaseolus vulgaris (3885) data"/>
       <property name="dataSourceName" value="LIS Tripal database"/>
       <property name="converter.class" value="org.intermine.bio.dataconversion.ChadoDBConverter"/>
       <property name="processors" value="org.intermine.bio.dataconversion.GOProcessor"/>
     </source>

     <!-- chado featureprop attributes -->
     <source name="chado-featureprop" type="legfed" dump="true">
       <property name="source.db.name" value="tripal"/>
       <property name="organisms" value="3885 3398"/>
       <property name="dataSetTitle" value="LIS Phaseolus vulgaris (3885) data"/>
       <property name="dataSourceName" value="LIS Tripal database"/>
       <property name="converter.class" value="org.intermine.bio.dataconversion.ChadoDBConverter"/>
       <property name="processors" value="org.intermine.bio.dataconversion.FeaturePropProcessor"/>
     </source>

Is there any way to tell InterMine to run LegfedPostProcess only once, even though its data source appears four times in project.xml?