[InterMine Dev] Prevent post-processor from running multiple times when datasource has multiple source defs in project.xml?

sergio contrino sergio at modencode.org
Mon Apr 18 15:25:51 BST 2016


dear sam,
would it be reasonable to add a specific post-process for your chado 
data, and add it to the list of post-processes at the end of your 
project file (while removing the post-processes involved in the 
do-sources one)?
depending on what your post-process does, you could refer to different 
cases already in the repository.
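as a rough sketch of what i mean, the end of project.xml could look 
something like the fragment below. the name "legfed-post-process" is 
only a placeholder i made up for illustration: it would have to be 
wired up to your LegfedPostProcess class in the same way as the 
existing cases in the repository, and the postprocessor.class line 
would come out of the source's project.properties so that do-sources 
no longer triggers it for each of the four source entries. the other 
entries shown are just examples of a typical list, not a required set.

```xml
<!-- sketch only: "legfed-post-process" is a hypothetical name,
     not an existing InterMine post-process -->
<post-processing>
  <post-process name="create-references"/>
  <!-- do-sources stays for any other source that still defines
       postprocessor.class in its project.properties -->
  <post-process name="do-sources"/>
  <!-- listed once here, so it runs once regardless of how many
       chado source entries the project file contains -->
  <post-process name="legfed-post-process"/>
  <post-process name="summarise-objectstore"/>
</post-processing>
```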
if this is not reasonable, i'll make a ticket.
thanks!
sergio


On 14/04/16 22:47, Sam Hokin wrote:
> Hi, devs. It's me again. Here's another one. I've got four source
> entries in my project.xml that all come from the same datasource, but
> using different processors, which need to be run separately for merging
> reasons. I also have a single post-processor that I'd like to run once.
> (Since all the data sources are run before post-processing, it doesn't
> make any sense to run a post-processor more than once.)
>
> However, since there are four entries in project.xml (chado-genomics,
> chado-genetics, chado-go and chado-featureprop), the corresponding
> post-processor is run four consecutive times by the do-sources
> post-process task. (Fortunately in this case it's a quick post-process,
> but if it took many hours, it would be hugely annoying that it runs four
> times. And yes, I know, I can create some sort of flag or test so the
> post-processor exits out if it's already been run, but I'm trying to
> make InterMine better here, or at least my use of it.)
>
> I don't see any way of limiting the times the post-processor is run. It
> looks to me like this must happen with FlyMine's chado-db
> FlyBasePostProcess for the same reason: FlyMine has many chado-db
> sources defined in their project.xml.
>
> Here's my DATASOURCE/project.properties which defines the source-related
> post-processor:
>
> compile.dependencies = intermine/objectstore/main,bio/core/main,\
>                         intermine/integrate/main, \
>                         bio/sources/legfed/main
>
> have.db.tgt = true
> converter.class = org.intermine.bio.dataconversion.ChadoDBConverter
> postprocessor.class = org.intermine.bio.postprocess.LegfedPostProcess
>
> And here's the project.xml which results in do-sources running
> LegfedPostProcess four times:
>
>      <!-- chado genomics - has merge priority, so run first -->
>      <source name="chado-genomics" type="legfed" dump="true">
>        <property name="source.db.name" value="tripal"/>
>        <property name="organisms" value="3885 3398"/>
>        <property name="dataSetTitle" value="LIS Phaseolus vulgaris (3885) data"/>
>        <property name="dataSourceName" value="LIS Tripal database"/>
>        <property name="converter.class" value="org.intermine.bio.dataconversion.ChadoDBConverter"/>
>        <property name="processors" value="org.intermine.bio.dataconversion.SequenceProcessor"/>
>      </source>
>
>      <!-- chado genetics -->
>      <source name="chado-genetics" type="legfed" dump="true">
>        <property name="source.db.name" value="tripal"/>
>        <property name="organisms" value="3885 3398"/>
>        <property name="dataSetTitle" value="LIS Phaseolus vulgaris (3885) data"/>
>        <property name="dataSourceName" value="LIS Tripal database"/>
>        <property name="converter.class" value="org.intermine.bio.dataconversion.ChadoDBConverter"/>
>        <property name="processors" value="org.intermine.bio.dataconversion.GeneticProcessor"/>
>      </source>
>
>      <!-- chado GO annotation -->
>      <source name="chado-go" type="legfed" dump="true">
>        <property name="source.db.name" value="tripal"/>
>        <property name="organisms" value="3885 3398"/>
>        <property name="dataSetTitle" value="LIS Phaseolus vulgaris (3885) data"/>
>        <property name="dataSourceName" value="LIS Tripal database"/>
>        <property name="converter.class" value="org.intermine.bio.dataconversion.ChadoDBConverter"/>
>        <property name="processors" value="org.intermine.bio.dataconversion.GOProcessor"/>
>      </source>
>
>      <!-- chado featureprop attributes -->
>      <source name="chado-featureprop" type="legfed" dump="true">
>        <property name="source.db.name" value="tripal"/>
>        <property name="organisms" value="3885 3398"/>
>        <property name="dataSetTitle" value="LIS Phaseolus vulgaris (3885) data"/>
>        <property name="dataSourceName" value="LIS Tripal database"/>
>        <property name="converter.class" value="org.intermine.bio.dataconversion.ChadoDBConverter"/>
>        <property name="processors" value="org.intermine.bio.dataconversion.FeaturePropProcessor"/>
>      </source>
>
> Is there any way to tell InterMine to only run LegfedPostProcess once,
> even though its data source appears four times in project.xml??
>

-- 
sergio contrino                  InterMine, University of Cambridge
https://sergiocontrino.github.io           http://www.intermine.org


