[InterMine Dev] postprocessing speedup?

Richard Smith richard at flymine.org
Thu Mar 29 16:59:09 BST 2012


On 29/03/2012 16:24, Dr. Intikhab Alam wrote:
> Dear Richard,
>
> Thank you so much for the detailed reply.
>
>
>
>
>
> On 3/29/12 3:34 PM, "Richard Smith"<richard at flymine.org>  wrote:
>
>> Hi Intikhab,
>> Thanks for the detailed information.
>>
>> The data you're loading is very different to the situation
>> TransferSequences was written and optimised to solve.  We would
>> normally expect a small number of chromosomes with large numbers of
>> features located on them.
>
> Are you considering a future implementation of InterMine for
> metagenomics data?

We don't have plans to do anything really specific.  Most of the Mines
we and our users build are for human or model organism data.

>>
>> Chromosome sequence is stored in the database as a clob object.  The
>> transfer-sequences process iterates over all features on a
>> chromosome that have a continuous location and don't have a sequence
>> set, and creates a reference to the clob for that location.
>>
>> It looks like your process is taking about half a second per chromosome,
>> but as there are four million of them that will take about 24 days!
>>
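The estimate above can be checked with a few lines of arithmetic. This is
only a sketch: the read counts are copied from the FASTA counts quoted
further down this thread, and the half second per chromosome is the rough
figure from the message, not a measurement. The function and variable
names are made up for illustration.

```python
# Read counts per dataset, copied from the FASTA counts quoted later in this thread.
reads_per_dataset = {
    "AT0050m01": 1177604,
    "AT0200m01A1": 510874,
    "AT0200m01B1": 313217,
    "AT0700m01A1": 586482,
    "AT0700m01B1": 151334,
    "AT1500m01": 1242393,
}

SECONDS_PER_CHROMOSOME = 0.5  # assumption: rough figure from the message, not measured
SECONDS_PER_DAY = 24 * 60 * 60


def estimated_days(chromosome_count, seconds_each=SECONDS_PER_CHROMOSOME):
    """Return the serial running time in days for one pass over all chromosomes."""
    return chromosome_count * seconds_each / SECONDS_PER_DAY


total_reads = sum(reads_per_dataset.values())
print(f"{total_reads} chromosomes, about {estimated_days(total_reads):.1f} days")
```

With just under four million reads this comes to roughly 23 days of
serial processing, in line with the "about 24 days" quoted above.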
>> 1. Do you need to use InterMine to set the sequences of the annotation
>> features?  You will need the sequences set if you want to export fasta
>> from your InterMine but if that isn't a priority then you could remove
>> the transfer-sequences step from your project.xml.
>>
>> 2. If the pipeline you used to generate the annotation also output
>> the sequence you could include it in your items XML from the start.  If
>> a feature already has a sequence, transfer-sequences won't do anything.
>
> I am already providing the sequence for Chromosome and Protein
> Translation but not for Gene and Transcript. Will the residues attribute
> hold the sequences for Gene and Transcript objects?
>

Yes, you could just create a sequence in exactly the same way for gene
and transcript in the items and it should work.
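
As a rough illustration of what that could look like, here is a small
sketch that writes a Gene and a Transcript item, each referencing its own
Sequence item.  The item/attribute/reference layout follows the items XML
shown elsewhere in this thread, but the attribute names
(primaryIdentifier, residues, length), the reference name "sequence", the
identifiers and the helper function are assumptions for the example, so
check them against your own model.

```python
import xml.etree.ElementTree as ET


def add_feature_with_sequence(items, feature_id, seq_id, class_name, identifier, residues):
    """Append a feature item and the Sequence item it references."""
    feature = ET.SubElement(
        items, "item", {"id": feature_id, "class": class_name, "implements": ""}
    )
    # Attribute and reference names below are assumptions about the data model.
    ET.SubElement(feature, "attribute", {"name": "primaryIdentifier", "value": identifier})
    ET.SubElement(feature, "reference", {"name": "sequence", "ref_id": seq_id})

    sequence = ET.SubElement(
        items, "item", {"id": seq_id, "class": "Sequence", "implements": ""}
    )
    ET.SubElement(sequence, "attribute", {"name": "residues", "value": residues})
    ET.SubElement(sequence, "attribute", {"name": "length", "value": str(len(residues))})
    return feature, sequence


items = ET.Element("items")
# Hypothetical identifiers and residues, for illustration only.
add_feature_with_sequence(items, "0_1", "0_2", "Gene", "example_gene_1", "ATGGCGTAA")
add_feature_with_sequence(items, "0_3", "0_4", "Transcript", "example_transcript_1", "ATGGCGTAA")

items_xml = ET.tostring(items, encoding="unicode")
print(items_xml)
```

Each feature gets its own Sequence item here, which mirrors how the
Chromosome and Protein sequences are already being supplied.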

>>
>> 3. For each chromosome that TransferSequences processes, we first create a
>> precomputed table, as we expect a large amount of data to read.  In your
>> case the overhead of creating and indexing the table probably isn't
>> necessary.  You could try commenting out lines 175 & 176 in
>> bio/postprocess/main/src/org/intermine/bio/postprocess/TransferSequences.java:
>>
>>   ((ObjectStoreInterMineImpl) os).precompute(q, indexesToCreate,
>>              Constants.PRECOMPUTE_CATEGORY);
>>
>> This means killing the processing and starting again, but if you run
>> ant -Daction=transfer-sequences in the mine/postprocess directory rather
>> than using the project build script to restore from the previous backup
>> it shouldn't re-do the reads it has finished.
>
> If I use the residues attribute for Gene and Transcript to load their
> sequences, would region-search work later on?

Yes, region search will work fine.  It will even work if you don't set
the sequence at all.  The sequence is only needed for exporting fasta.

> It looks like I need to reload the data from the beginning and avoid
> transfer-sequences. The commenting out above will only work with the
> build script, right?

If you comment out as above it will still work fine from the build
script.  But in fact it sounds like you will either remove
transfer-sequences from your project.xml or set the sequences at the
start, so transfer-sequences won't need to do anything.
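
For the first option, a minimal sketch of dropping the step from a
project.xml.  The embedded XML is a cut-down stand-in, not your real
file: the post-processing/post-process layout is assumed, and the step
names are taken from the actions listed in the build log quoted below.
The helper names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Assumption: a cut-down stand-in for the <post-processing> part of a project.xml.
EXAMPLE_PROJECT_XML = """<project type="bio">
  <post-processing>
    <post-process name="create-references"/>
    <post-process name="make-spanning-locations"/>
    <post-process name="create-chromosome-locations-and-lengths"/>
    <post-process name="transfer-sequences"/>
  </post-processing>
</project>"""


def remove_post_process(project_xml, step_name):
    """Return the project XML with every post-process step named step_name removed."""
    root = ET.fromstring(project_xml)
    for parent in root.iter("post-processing"):
        # Copy the list first so removing children does not disturb iteration.
        for step in list(parent.findall("post-process")):
            if step.get("name") == step_name:
                parent.remove(step)
    return ET.tostring(root, encoding="unicode")


def post_process_names(project_xml):
    """List the remaining post-process step names in document order."""
    return [step.get("name") for step in ET.fromstring(project_xml).iter("post-process")]


trimmed = remove_post_process(EXAMPLE_PROJECT_XML, "transfer-sequences")
print(post_process_names(trimmed))
```

Note that ElementTree drops XML comments when it parses, so for a real
project.xml it is usually simpler to comment out or delete the one line
by hand.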

>>
>> 4. It looks like there are some minor improvements we could make to
>> TransferSequences for your situation but the use case is so different
>> it may need a completely new method and some experimentation to get it
>> to work quickly.  So we'll only do this when we have time and if it's
>> essential to use this step for your Mine.
>
>
> Your InterMine data warehouse system is very useful, and I think it helps
> develop interfaces that let people with more biological knowledge than
> programming experience explore and analyze the data. In future I expect
> more and more studies to come from large projects, e.g. metagenomic ones,
> and some help with parallelizing post-processing would be valuable. For
> now I guess it is best to load all sequences from the beginning and avoid
> the transfer-sequences postprocess. Is there any other post-process that
> depends on transfer-sequences?

No, nothing else depends on it.

Regards,
Richard.


>
> Many Thanks for your help,
>
> Best Wishes,
>
> Intikhab
>>
>>
>> Cheers,
>> Richard.
>>
>> On 28/03/2012 19:35, Dr. Intikhab Alam wrote:
>>> Dear Richard,
>>>
>>> There were 6 metagenomic datasets from the Red Sea; some general
>>> information follows:
>>>
>>> 5.8G 2012-03-16 06:01 AT0050m01/AT0050m01_annotations_items.xml
>>> 2.1G 2012-03-16 05:18 AT0200m01A1/AT0200m01A1_annotations_items.xml
>>> 1.5G 2012-03-16 18:00 AT0200m01B1/AT0200m01B1_annotations_items.xml
>>> 2.7G 2012-03-16 05:36 AT0700m01A1/AT0700m01A1_annotations_items.xml
>>> 687M 2012-03-16 04:07 AT0700m01B1/AT0700m01B1_annotations_items.xml
>>> 6.3G 2012-03-16 06:25 AT1500m01/AT1500m01_annotations_items.xml
>>>
>>> The number of reads treated as chromosomes in each dataset follows:
>>>
>>> AT0050m01/fasta/AT0050m01_annotations_chromosome.fasta:1177604
>>> AT0200m01A1/fasta/AT0200m01A1_annotations_chromosome.fasta:510874
>>> AT0200m01B1/fasta/AT0200m01B1_annotations_chromosome.fasta:313217
>>> AT0700m01A1/fasta/AT0700m01A1_annotations_chromosome.fasta:586482
>>> AT0700m01B1/fasta/AT0700m01B1_annotations_chromosome.fasta:151334
>>> AT1500m01/fasta/AT1500m01_annotations_chromosome.fasta:1242393
>>>
>>>
>>> Each read is treated as a chromosome and most of them have 1-5 ORFs
>>> predicted. For the database I include other features like Protein, Gene
>>> and Transcript and their attributes like cross references Pathways,
>>> Interpro Domains, GO terms etc.
>>>
>>> The last item id in each file shows the number of items stored from
>>> each of these datasets:
>>> ==>   ./AT0200m01A1/AT0200m01A1_annotations_items.xml<==
>>>      <item id="0_4074235" class="Location" implements="">
>>> ==>   ./AT0200m01B1/AT0200m01B1_annotations_items.xml<==
>>>      <item id="0_2772721" class="Location" implements="">
>>> ==>   ./AT1500m01/AT1500m01_annotations_items.xml<==
>>>      <item id="0_11891110" class="Location" implements="">
>>> ==>   ./AT0700m01A1/AT0700m01A1_annotations_items.xml<==
>>>      <item id="0_5052071" class="Location" implements="">
>>> ==>   ./AT0050m01/AT0050m01_annotations_items.xml<==
>>>      <item id="0_10887508" class="Location" implements="">
>>> ==>   ./AT0700m01B1/AT0700m01B1_annotations_items.xml<==
>>>      <item id="0_1313511" class="Chromosome" implements="">
>>>
>>>
>>> Following is the time taken to complete each action :
>>>
>>> ../bio/scripts/project_build -v -b localhost rmetagenomic_dup >dump_log 2>&1 &
>>>
>>> action prokredsea-AT0200m01A1-largexml took 4286 seconds
>>> action prokredsea-AT0200m01B1-largexml took 3383 seconds
>>> action prokredsea-AT0700m01A1-largexml took 7207 seconds
>>> action prokredsea-AT0700m01B1-largexml took 1798 seconds
>>> action prokredsea-AT0050m01-largexml took 24509 seconds
>>> action prokredsea-AT1500m01-largexml took 28475 seconds
>>> action so took 86 seconds
>>> action interpro took 15663 seconds
>>> action go took 650 seconds
>>> action create-references took 14 seconds
>>> action make-spanning-locations took 5988 seconds
>>> action create-chromosome-locations-and-lengths took 10748 seconds
>>>
>>>
>>>
>>> The transfer of sequences has been running since the last update in
>>> the project folder:
>>>
>>> 18M 2012-03-18 04:38 pbuild.log
>>>
>>> In the postprocess folder it keeps writing intermine.log, and when
>>> the file size reaches 101MB it saves the file as e.g. intermine.log.1
>>> and continues logging to intermine.log. Once 10 such files had been
>>> written, it started overwriting the old intermine.log.(number) files.
>>> The most recent ls from the postprocess dir follows:
>>>
>>> total 1.1G
>>> -rw-rw-r-- 1 intikhab intikhab  42M 2012-03-28 21:17 intermine.log
>>> drwxrwxr-x 8 intikhab intikhab 4.0K 2012-03-28 19:31 ./
>>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 19:31 intermine.log.1
>>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 15:20 intermine.log.2
>>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 11:12 intermine.log.3
>>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 07:01 intermine.log.4
>>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-28 02:53 intermine.log.5
>>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 22:47 intermine.log.6
>>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 18:33 intermine.log.7
>>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 14:30 intermine.log.8
>>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 10:22 intermine.log.9
>>> -rw-rw-r-- 1 intikhab intikhab 101M 2012-03-27 06:17 intermine.log.10
>>>
>>>
>>> I have included the last 100 lines from the latest intermine.log in
>>> the postprocess dir.
>>>
>>>
>>> It seems to be running OK, but I assume post-processing needs
>>> parallelization for a speedup.
>>> Have you tried any metagenomic dataset before? Or do you think it needs
>>> special settings?
>>>
>>> What do you think?
>>>
>>> Intikhab
>>>
>>>
>>>
>>> On 3/28/12 4:18 PM, "Richard Smith"<richard at flymine.org>   wrote:
>>>
>>>> Hi Intikhab,
>>>> Sorry we didn't get to this sooner.  That doesn't seem right at all,
>>>> can you send more of the log from the postprocess directory?
>>>>
>>>> Can you explain a bit more about the situation?  How many chromosome
>>>> objects do you have loaded?  Or have you created each read as a
>>>> chromosome feature?  What are the feature types loaded?
>>>>
>>>> Regards,
>>>> Richard.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 28/03/2012 16:07, Dr. Intikhab Alam wrote:
>>>>> Dear Richard/Julie,
>>>>>
>>>>> My post-processing (the transfer-sequences) has been running since
>>>>> March 18; last write:
>>>>> 2012-03-28 18:01:05 INFO org.intermine.bio.postprocess.TransferSequences
>>>>> - Starting transfer for AT0200m01A1 chromosome AT0200m01A1.GATTB1C02HV5CP
>>>>> 2012-03-28 18:01:05 INFO org.intermine.sql.precompute.PrecomputedTableManager
>>>>> - Creating new precomputed table CREATE TABLE precomp_1827553
>>>>>
>>>>> Is there a way to speed up post-processing, e.g. using multiple
>>>>> processors?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Intikhab
>>>>>
>>>>>
>>>>> From: Intikhab Alam <intikhab.alam at kaust.edu.sa>
>>>>> Date: Sat, 24 Mar 2012 17:20:32 +0300
>>>>> To: dev at intermine.org
>>>>> Subject: [InterMine Dev] postprocessing speedup?
>>>>>
>>>>> Hi,
>>>>>
>>>>> I am developing a mine for a metagenomic project, containing 4 million
>>>>> DNA reads and a similar number of predicted ORFs and related
>>>>> attributes/features. Post-processing (transfer-sequences) started
>>>>> 2012-03-18 04:38 and is still going on.
>>>>> Is there a way to parallelize post-processing to speed up these steps,
>>>>> e.g. on a multiprocessor machine?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Intikhab
>>>>> --
>>>>> Intikhab Alam, PhD
>>>>>
>>>>> Research Scientist
>>>>> Computational Bioscience Research Centre (CBRC), Building #2, Office
>>>>> #4336
>>>>> 4700 King Abdullah University of Science and Technology (KAUST)
>>>>> Thuwal 23955-6900, KSA
>>>>> W: http://www.kaust.edu.sa
>>>>> T +966 (0) 2 808-2423 F +966 (2) 802 0127
>>>>
>>>
>>>
>>
>
>



