[InterMine Dev] postprocessing speedup?

Richard Smith richard at flymine.org
Thu Mar 29 15:34:49 BST 2012


Hi Intikhab,
Thanks for the detailed information.

The data you're loading is very different to the situation
TransferSequences was written and optimised to solve.  We would
normally expect a small number of chromosomes with large numbers of
features located on them.

Chromosome sequence is stored in the database as a clob object.  The
transfer-sequences process iterates over all features on a chromosome
that have a contiguous location and don't have a sequence set, and
creates a reference to the clob for that location.
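
To make that concrete, here is a rough sketch in plain Java.  It is not
the TransferSequences code: the Feature class, its fields and the
1-based inclusive coordinates are assumptions made for illustration,
and it copies a substring where the real step stores a reference to
the clob.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: Feature is a hypothetical stand-in, not an InterMine model class.
class TransferSketch {
    static class Feature {
        final String id;
        final int start;   // assumed 1-based, inclusive
        final int end;     // assumed 1-based, inclusive
        String sequence;   // null until a sequence has been set

        Feature(String id, int start, int end, String sequence) {
            this.id = id;
            this.start = start;
            this.end = end;
            this.sequence = sequence;
        }
    }

    // Sets the sequence of every feature that lacks one and returns the ids updated.
    static List<String> transfer(String chromosomeSequence, List<Feature> features) {
        List<String> updated = new ArrayList<>();
        for (Feature f : features) {
            if (f.sequence != null) {
                continue; // already has a sequence: nothing to do
            }
            if (f.start < 1 || f.start > f.end || f.end > chromosomeSequence.length()) {
                continue; // location does not fit on this chromosome: skip it
            }
            // Simplification: copy the bases rather than referencing the stored clob.
            f.sequence = chromosomeSequence.substring(f.start - 1, f.end);
            updated.add(f.id);
        }
        return updated;
    }

    public static void main(String[] args) {
        List<Feature> features = Arrays.asList(
                new Feature("orf1", 2, 4, null),
                new Feature("orf2", 1, 3, "AAA"));
        System.out.println(transfer("ACGTAC", features));
    }
}
```

The point of the sketch is that the work per feature is small; with
only a handful of features per chromosome, most of the time goes on
the per-chromosome setup rather than on this loop.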

It looks like your process is taking about half a second per chromosome,
but as there are four million of them that will take about 23 days!
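
As a sanity check on that estimate, here is the arithmetic as a small
sketch.  The read counts are the ones from your message below; the
half-second figure is only my rough reading of the log, and the class
and method names are made up for the example.  It comes out at a
little over three weeks.

```java
// Back-of-the-envelope estimate only; not a measurement.
class TransferTimeEstimate {
    // Reads per dataset, as listed in the quoted message.
    static final long[] READS = {1177604L, 510874L, 313217L, 586482L, 151334L, 1242393L};

    static long totalReads() {
        long total = 0;
        for (long n : READS) {
            total += n;
        }
        return total;
    }

    // Total run time in days for a fixed cost per chromosome.
    static double estimateDays(long chromosomes, double secondsPerChromosome) {
        return chromosomes * secondsPerChromosome / 86400.0;
    }

    public static void main(String[] args) {
        long total = totalReads();
        System.out.printf("%d chromosomes, about %.1f days%n",
                total, estimateDays(total, 0.5));
    }
}
```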

1. Do you need to use InterMine to set the sequences of the annotation
features?  You will need the sequences set if you want to export fasta
from your InterMine but if that isn't a priority then you could remove
the transfer-sequences step from your project.xml.

2. If the pipeline you used to generate the annotation also output
the sequence, you could include it in your items XML from the start.  If
a feature already has a sequence, transfer-sequences won't do anything.

3. For each chromosome that TransferSequences processes, we first create
a precomputed table, as we expect a large amount of data to read.  In
your case the overhead of creating and indexing the table probably isn't
necessary.  You could try commenting out lines 175 & 176 in
bio/postprocess/main/src/org/intermine/bio/postprocess/TransferSequences.java:

  ((ObjectStoreInterMineImpl) os).precompute(q, indexesToCreate,
             Constants.PRECOMPUTE_CATEGORY);

This means killing the current run and starting again.  However, if you
run ant -Daction=transfer-sequences in the mine/postprocess directory,
rather than using the project build script to restore from the previous
backup, it shouldn't re-do the reads it has already finished.

4. It looks like there are some minor improvements we could make to
TransferSequences for your situation but the use case is so different
it may need a completely new method and some experimentation to get it
to work quickly.  So we'll only do this when we have time and if it's
essential to use this step for your Mine.


Cheers,
Richard.

On 28/03/2012 19:35, Dr. Intikhab Alam wrote:
> Dear Richard,
>
> There are 6 metagenomic datasets from the Red Sea; the following is
> some general information:
>
> 5.8G 2012-03-16 06:01 AT0050m01/AT0050m01_annotations_items.xml
> 2.1G 2012-03-16 05:18 AT0200m01A1/AT0200m01A1_annotations_items.xml
> 1.5G 2012-03-16 18:00 AT0200m01B1/AT0200m01B1_annotations_items.xml
> 2.7G 2012-03-16 05:36 AT0700m01A1/AT0700m01A1_annotations_items.xml
> 687M 2012-03-16 04:07 AT0700m01B1/AT0700m01B1_annotations_items.xml
> 6.3G 2012-03-16 06:25 AT1500m01/AT1500m01_annotations_items.xml
>
> The following is the number of reads treated as chromosomes in each
> dataset:
>
> AT0050m01/fasta/AT0050m01_annotations_chromosome.fasta:1177604
> AT0200m01A1/fasta/AT0200m01A1_annotations_chromosome.fasta:510874
> AT0200m01B1/fasta/AT0200m01B1_annotations_chromosome.fasta:313217
> AT0700m01A1/fasta/AT0700m01A1_annotations_chromosome.fasta:586482
> AT0700m01B1/fasta/AT0700m01B1_annotations_chromosome.fasta:151334
> AT1500m01/fasta/AT1500m01_annotations_chromosome.fasta:1242393
>
>
> Each read is treated as a chromosome, and most of them have 1-5 ORFs
> predicted. For the database I include other features like Protein, Gene
> and Transcript, and their attributes like cross references, Pathways,
> InterPro domains, GO terms etc.
>
> The last item id in each file shows the number of items stored from each
> of these data sets:
> ==>  ./AT0200m01A1/AT0200m01A1_annotations_items.xml<==
>     <item id="0_4074235" class="Location" implements="">
> ==>  ./AT0200m01B1/AT0200m01B1_annotations_items.xml<==
>     <item id="0_2772721" class="Location" implements="">
> ==>  ./AT1500m01/AT1500m01_annotations_items.xml<==
>     <item id="0_11891110" class="Location" implements="">
> ==>  ./AT0700m01A1/AT0700m01A1_annotations_items.xml<==
>     <item id="0_5052071" class="Location" implements="">
> ==>  ./AT0050m01/AT0050m01_annotations_items.xml<==
>     <item id="0_10887508" class="Location" implements="">
> ==>  ./AT0700m01B1/AT0700m01B1_annotations_items.xml<==
>     <item id="0_1313511" class="Chromosome" implements="">
>
>
> Following is the time taken to complete each action :
>
> ../bio/scripts/project_build -v -b localhost rmetagenomic_dup > dump_log 2>&1 &
>
> action prokredsea-AT0200m01A1-largexml took 4286 seconds
> action prokredsea-AT0200m01B1-largexml took 3383 seconds
> action prokredsea-AT0700m01A1-largexml took 7207 seconds
> action prokredsea-AT0700m01B1-largexml took 1798 seconds
> action prokredsea-AT0050m01-largexml took 24509 seconds
> action prokredsea-AT1500m01-largexml took 28475 seconds
> action so took 86 seconds
> action interpro took 15663 seconds
> action go took 650 seconds
> action create-references took 14 seconds
> action make-spanning-locations took 5988 seconds
> action create-chromosome-locations-and-lengths took 10748 seconds
>
>
>
> It's the transfer of sequences that has been running since the last
> update in the project folder:
>
> 18M 2012-03-18 04:38 pbuild.log
>
> In the postprocess folder, it keeps writing intermine.log; when the
> file reaches 101MB it is saved as e.g. intermine.log.1 and logging
> continues in a new intermine.log. Once 10 such files had been
> written, it started overwriting the oldest intermine.log.(number).
> The following is the most recent ls of the postprocess dir:
>
> total 1.1G
> -rw-rw-r-- 1 intikhab intikhab  42M 2012-03-28 21:17 intermine.log
> drwxrwxr-x 8 intikhab intikhab 4.0K 2012-03-28 19:31 ./
> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 19:31 intermine.log.1
> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 15:20 intermine.log.2
> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 11:12 intermine.log.3
> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 07:01 intermine.log.4
> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 02:53 intermine.log.5
> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 22:47 intermine.log.6
> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 18:33 intermine.log.7
> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 14:30 intermine.log.8
> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 10:22 intermine.log.9
> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 06:17 intermine.log.10
>
>
> I have included the last 100 lines from the latest intermine.log in the
> postprocess dir.
>
>
> It seems to be running OK, but I assume post-processing needs
> parallelization to speed it up.
> Have you tried any metagenomic dataset before? Or do you think it needs
> special settings?
>
> What do you think?
>
> Intikhab
>
>
>
> On 3/28/12 4:18 PM, "Richard Smith"<richard at flymine.org>  wrote:
>
>> Hi Intikhab,
>> Sorry we didn't get to this sooner.  That doesn't seem right at all,
>> can you send more of the log from the postprocess directory?
>>
>> Can you explain a bit more about the situation?  How many chromosome
>> objects do you have loaded?  Or have you created each read as a
>> chromosome feature?  What are the feature types loaded?
>>
>> Regards,
>> Richard.
>>
>>
>> On 28/03/2012 16:07, Dr. Intikhab Alam wrote:
>>> Dear Richard/Julie,
>>>
>>> My post-processing (the transfer-sequences step) has been running
>>> since March 18; last write:
>>> 2012-03-28 18:01:05 INFO org.intermine.bio.postprocess.TransferSequences
>>> - Starting transfer for AT0200m01A1 chromosome
>>> AT0200m01A1.GATTB1C02HV5CP
>>> 2012-03-28 18:01:05 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager - Creating new
>>> precomputed table CREATE TABLE precomp_1827553
>>>
>>> Is there a way to speed up post-processing, e.g. using multiple
>>> processors?
>>>
>>> Thanks,
>>>
>>> Intikhab
>>>
>>>
>>> From: Intikhab Alam <intikhab.alam at kaust.edu.sa>
>>> Date: Sat, 24 Mar 2012 17:20:32 +0300
>>> To: "dev at intermine.org" <dev at intermine.org>
>>> Subject: [InterMine Dev] postprocessing speedup?
>>>
>>> Hi,
>>>
>>> I am developing a mine for a metagenomic project, containing 4 million
>>> DNA reads and a similar number of predicted ORFs and related
>>> attributes/features. Post-processing (transfer-sequences) started
>>> 2012-03-18 04:38 and is still going on.
>>> Is there a way to parallelize post-processing to speed up these steps,
>>> e.g. on a multiprocessor machine?
>>>
>>> Thanks,
>>>
>>> Intikhab
>>> --
>>> --
>>> Intikhab Alam, PhD
>>>
>>> Research Scientist
>>> Computational Bioscience Research Centre (CBRC), Building #2, Office
>>> #4336
>>> 4700 King Abdullah University of Science and Technology (KAUST)
>>> Thuwal 23955-6900, KSA
>>> W: http://www.kaust.edu.sa
>>> T +966 (0) 2 808-2423 F +966 (2) 802 0127
>>
>
>



