[InterMine Dev] postprocessing speedup?

Dr. Intikhab Alam intikhab.alam at kaust.edu.sa
Thu Mar 29 16:24:14 BST 2012


Dear Richard,

Thank you so much for the detailed reply.





On 3/29/12 3:34 PM, "Richard Smith" <richard at flymine.org> wrote:

>Hi Intikhab,
>Thanks for the detailed information.
>
>The data you're loading is very different to the situation
>TransferSequences was written and optimised to solve.  We would
>normally expect a small number of chromosomes with large numbers of
>features located on them.

Would you consider, in future, an implementation of InterMine for
metagenomic data?

>
>Chromosome sequence is stored in the database as a clob object, the
>transfer sequences process is to iterate over all features on a
>chromosome that have continuous locations and don't have a sequence
>set and create a reference to the clob for that location.
>
>It looks like your process is taking about half a second per chromosome,
>but as there are four million of them that will take about 24 days!
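
As a rough check of that estimate, here is a minimal Python sketch. The function name is hypothetical, and the inputs are just the round figures quoted above (4 million chromosomes, about 0.5 seconds each), not measurements; it also assumes the run stays strictly serial.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def estimate_days(n_chromosomes: int, seconds_per_chromosome: float) -> float:
    """Return the estimated wall-clock time, in days, for a serial run."""
    return n_chromosomes * seconds_per_chromosome / SECONDS_PER_DAY


if __name__ == "__main__":
    # Round figures from the message above; both are approximations.
    days = estimate_days(4_000_000, 0.5)
    print(f"about {days:.1f} days")
```

With these inputs the result is a little over 23 days, which is in line with the estimate above given how approximate the per-chromosome time is.
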
>
>1. Do you need to use InterMine to set the sequences of the annotation
>features?  You will need the sequences set if you want to export fasta
>from your InterMine but if that isn't a priority then you could remove
>the transfer-sequences step from your project.xml.
>
>2. If the pipeline you used to generate the annotation also output
>the sequence you could include it in your items XML from the start.  If
>a feature already has sequence transfer-sequences won't do anything.

I am already providing the sequence for Chromosome and Protein
translation but not for Gene and Transcript. Will the residues attribute
hold the sequences for Gene and Transcript objects?


>
>3. For each chromosome TransferSequences processes we first create a
>precomputed table as we expect a large amount of data to read.  In your
>case the overhead of creating and indexing the table probably isn't
>necessary.  You could try commenting out lines 175 & 176 in
>bio/postprocess/main/src/org/intermine/bio/postprocess/TransferSequences.java:
>
>  ((ObjectStoreInterMineImpl) os).precompute(q, indexesToCreate,
>             Constants.PRECOMPUTE_CATEGORY);
>
>This means killing the processing and starting again, but if you run
>ant -Daction=transfer-sequences in the mine/postprocess directory rather
>than using the project build script to restore from the previous backup
>it shouldn't re-do the reads it has finished.

If I use the residues attribute for Gene and Transcript to load their
sequences, would region-search work later on?

It looks like I need to reload the data from the beginning and avoid
transfer-sequences. The commenting-out above will work only for the build
script, right?

>
>4. It looks like there are some minor improvements we could make to
>TransferSequences for your situation but the use case is so different
>it may need a completely new method and some experimentation to get it
>to work quickly.  So we'll only do this when we have time and if it's
>essential to use this step for your Mine.


Your InterMine data warehouse system is very useful, and I think it helps
develop interfaces that let people with more biological knowledge than
programming experience explore and analyze data. In future I expect more and
more studies to come from large projects, e.g. metagenomic ones, so help
with parallelizing post-processing would be valuable. For now I guess it
is best to load all sequences from the beginning and avoid the
transfer-sequences postprocess. Does any other postprocess depend on
transfer-sequences?


Many thanks for your help,

Best Wishes,

Intikhab
>
>
>Cheers,
>Richard.
>
>On 28/03/2012 19:35, Dr. Intikhab Alam wrote:
>> Dear Richard,
>>
>> There were 6 metagenomic datasets from the Red Sea, and the following is
>> some general information:
>>
>> 5.8G 2012-03-16 06:01 AT0050m01/AT0050m01_annotations_items.xml
>> 2.1G 2012-03-16 05:18 AT0200m01A1/AT0200m01A1_annotations_items.xml
>> 1.5G 2012-03-16 18:00 AT0200m01B1/AT0200m01B1_annotations_items.xml
>> 2.7G 2012-03-16 05:36 AT0700m01A1/AT0700m01A1_annotations_items.xml
>> 687M 2012-03-16 04:07 AT0700m01B1/AT0700m01B1_annotations_items.xml
>> 6.3G 2012-03-16 06:25 AT1500m01/AT1500m01_annotations_items.xml
>>
>> The following is the number of reads considered as chromosomes in each
>> dataset:
>>
>> AT0050m01/fasta/AT0050m01_annotations_chromosome.fasta:1177604
>> AT0200m01A1/fasta/AT0200m01A1_annotations_chromosome.fasta:510874
>> AT0200m01B1/fasta/AT0200m01B1_annotations_chromosome.fasta:313217
>> AT0700m01A1/fasta/AT0700m01A1_annotations_chromosome.fasta:586482
>> AT0700m01B1/fasta/AT0700m01B1_annotations_chromosome.fasta:151334
>> AT1500m01/fasta/AT1500m01_annotations_chromosome.fasta:1242393
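
To check how close these counts come to the "4 million" figure, here is a minimal Python sketch that sums them. It assumes each line has the form `path:count` (as per-file counting tools typically print); the function name is hypothetical.

```python
# Per-file read counts copied from the listing above.
READ_COUNTS = """\
AT0050m01/fasta/AT0050m01_annotations_chromosome.fasta:1177604
AT0200m01A1/fasta/AT0200m01A1_annotations_chromosome.fasta:510874
AT0200m01B1/fasta/AT0200m01B1_annotations_chromosome.fasta:313217
AT0700m01A1/fasta/AT0700m01A1_annotations_chromosome.fasta:586482
AT0700m01B1/fasta/AT0700m01B1_annotations_chromosome.fasta:151334
AT1500m01/fasta/AT1500m01_annotations_chromosome.fasta:1242393
"""


def total_reads(listing: str) -> int:
    """Sum the trailing count of every non-empty 'path:count' line."""
    total = 0
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the last colon so colons in a path would not matter.
        _path, count = line.rsplit(":", 1)
        total += int(count)
    return total


if __name__ == "__main__":
    print(total_reads(READ_COUNTS))
```

The six files add up to 3,981,904 reads, so "4 million chromosomes" is a fair rounding.
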
>>
>>
>> Each read is considered a chromosome, and most of them have 1-5 ORFs
>> predicted. For the database I include other features like Protein, Gene
>> and Transcript and their attributes like cross references Pathways,
>> Interpro Domains, GO terms etc.
>>
>> The last item id in each file, shown below, indicates the number of items
>> stored from each of these data sets:
>> ==>  ./AT0200m01A1/AT0200m01A1_annotations_items.xml<==
>>     <item id="0_4074235" class="Location" implements="">
>> ==>  ./AT0200m01B1/AT0200m01B1_annotations_items.xml<==
>>     <item id="0_2772721" class="Location" implements="">
>> ==>  ./AT1500m01/AT1500m01_annotations_items.xml<==
>>     <item id="0_11891110" class="Location" implements="">
>> ==>  ./AT0700m01A1/AT0700m01A1_annotations_items.xml<==
>>     <item id="0_5052071" class="Location" implements="">
>> ==>  ./AT0050m01/AT0050m01_annotations_items.xml<==
>>     <item id="0_10887508" class="Location" implements="">
>> ==>  ./AT0700m01B1/AT0700m01B1_annotations_items.xml<==
>>     <item id="0_1313511" class="Chromosome" implements="">
>>
>>
>> The following is the time taken to complete each action:
>>
>> ../bio/scripts/project_build -v -b localhost rmetagenomic_dup>dump_log
>> 2>&1&
>>
>> action prokredsea-AT0200m01A1-largexml took 4286 seconds
>> action prokredsea-AT0200m01B1-largexml took 3383 seconds
>> action prokredsea-AT0700m01A1-largexml took 7207 seconds
>> action prokredsea-AT0700m01B1-largexml took 1798 seconds
>> action prokredsea-AT0050m01-largexml took 24509 seconds
>> action prokredsea-AT1500m01-largexml took 28475 seconds
>> action so took 86 seconds
>> action interpro took 15663 seconds
>> action go took 650 seconds
>> action create-references took 14 seconds
>> action make-spanning-locations took 5988 seconds
>> action create-chromosome-locations-and-lengths took 10748 seconds
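
For comparison with the transfer-sequences step, the timings above can be totalled with a minimal Python sketch. It assumes every line follows the `action NAME took N seconds` pattern shown; the function and constant names are hypothetical.

```python
import re

# Timing lines copied from the build output above.
BUILD_LOG = """\
action prokredsea-AT0200m01A1-largexml took 4286 seconds
action prokredsea-AT0200m01B1-largexml took 3383 seconds
action prokredsea-AT0700m01A1-largexml took 7207 seconds
action prokredsea-AT0700m01B1-largexml took 1798 seconds
action prokredsea-AT0050m01-largexml took 24509 seconds
action prokredsea-AT1500m01-largexml took 28475 seconds
action so took 86 seconds
action interpro took 15663 seconds
action go took 650 seconds
action create-references took 14 seconds
action make-spanning-locations took 5988 seconds
action create-chromosome-locations-and-lengths took 10748 seconds
"""

ACTION_RE = re.compile(r"^action (\S+) took (\d+) seconds$")


def action_times(log_text: str) -> dict:
    """Map each action name to its duration in seconds."""
    times = {}
    for line in log_text.splitlines():
        match = ACTION_RE.match(line.strip())
        if match:
            times[match.group(1)] = int(match.group(2))
    return times


if __name__ == "__main__":
    times = action_times(BUILD_LOG)
    total = sum(times.values())
    slowest = max(times, key=times.get)
    print(f"{len(times)} actions, {total} s ({total / 3600:.1f} h)")
    print(f"slowest: {slowest} ({times[slowest]} s)")
```

The twelve completed actions total 102,807 seconds, about 28.6 hours, so everything before transfer-sequences finished in a little over a day.
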
>>
>>
>>
>> The transfer of sequences has been running since the last update in the
>> project folder:
>>
>> 18M 2012-03-18 04:38 pbuild.log
>>
>> In the postprocess folder, it keeps writing to intermine.log, and when
>> the file size reaches 101MB it saves the file as e.g. intermine.log.1
>> and continues writing to intermine.log. Once 10 such files had been
>> written, it started overwriting the old intermine.log.(number) files.
>> The following is the most recent ls from the postprocess dir:
>>
>> total 1.1G
>> -rw-rw-r-- 1 intikhab intikhab  42M 2012-03-28 21:17 intermine.log
>> drwxrwxr-x 8 intikhab intikhab 4.0K 2012-03-28 19:31 ./
>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 19:31 intermine.log.1
>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 15:20 intermine.log.2
>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 11:12 intermine.log.3
>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 07:01 intermine.log.4
>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 02:53 intermine.log.5
>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 22:47 intermine.log.6
>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 18:33 intermine.log.7
>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 14:30 intermine.log.8
>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 10:22 intermine.log.9
>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 06:17 intermine.log.10
>>
>>
>> I have included the last 100 lines from the latest intermine.log in the
>> postprocess dir.
>>
>>
>> It seems to be running OK, but I assume post-processing needs
>> parallelization for speedup.
>> Have you tried any metagenomic dataset before? Or do you think it needs
>> special settings?
>>
>> What do you think?
>>
>> Intikhab
>>
>>
>>
>> On 3/28/12 4:18 PM, "Richard Smith"<richard at flymine.org>  wrote:
>>
>>> Hi Intikhab,
>>> Sorry we didn't get to this sooner.  That doesn't seem right at all,
>>> can you send more of the log from the postprocess directory?
>>>
>>> Can you explain a bit more about the situation?  How many chromosome
>>> objects do you have loaded?  Or have you created each read as a
>>> chromosome feature?  What are the feature types loaded?
>>>
>>> Regards,
>>> Richard.
>>>
>>>
>>> On 28/03/2012 16:07, Dr. Intikhab Alam wrote:
>>>> Dear Richard/Julie,
>>>>
>>>> My post-processing (the transfer-sequences step) has been running since
>>>> March 18; last write:
>>>> 2012-03-28 18:01:05 INFO
>>>>org.intermine.bio.postprocess.TransferSequences
>>>> - Starting transfer for AT0200m01A1 chromosome
>>>> AT0200m01A1.GATTB1C02HV5CP
>>>> 2012-03-28 18:01:05 INFO
>>>> org.intermine.sql.precompute.PrecomputedTableManager - Creating new
>>>> precomputed table CREATE TABLE precomp_1827553
>>>>
>>>> Is there a way to speed up post-processing, e.g. using multiple
>>>> processors?
>>>>
>>>> Thanks,
>>>>
>>>> Intikhab
>>>>
>>>>
>>>> From: Intikhab Alam <intikhab.alam at kaust.edu.sa>
>>>> Date: Sat, 24 Mar 2012 17:20:32 +0300
>>>> To: "dev at intermine.org" <dev at intermine.org>
>>>> Subject: [InterMine Dev] postprocessing speedup?
>>>>
>>>> Hi,
>>>>
>>>> I am developing a mine for a metagenomic project, containing 4 million
>>>> DNA reads and a similar number of predicted ORFs and related
>>>> attributes/features. Post-processing (transfer-sequences) started
>>>> 2012-03-18 04:38 and is still running.
>>>> Is there a way to parallelize post-processing to speed up these steps,
>>>> e.g. on a multiprocessor machine?
>>>>
>>>> Thanks,
>>>>
>>>> Intikhab
>>>> --
>>>> Intikhab Alam, PhD
>>>>
>>>> Research Scientist
>>>> Computational Bioscience Research Centre (CBRC), Building #2, Office
>>>> #4336
>>>> 4700 King Abdullah University of Science and Technology (KAUST)
>>>> Thuwal 23955-6900, KSA
>>>> W: http://www.kaust.edu.sa
>>>> T +966 (0) 2 808-2423 F +966 (2) 802 0127
>>>
>>
>>
>



