[InterMine Dev] How long SHOULD TransferSequence take?

Joe Carlson jwcarlson at lbl.gov
Fri Jul 4 18:45:41 BST 2014


Hi Richard,

thanks for the tips. I’ll look into the simple object tip. I think it may be useful from the sounds of it.

Modifying the query worked well for me. I was able to add the remaining 7 million sequences overnight. It took a little while for the initial query, but zoomed through the chromosomes that needed work without the temp table buildup.

Joe 

On Jul 4, 2014, at 9:26 AM, Richard Smith <richard at FLYMINE.ORG> wrote:

> Hi Joe,
> It looks like TransferSequences doesn't explicitly drop the precomputed
> tables it creates, I think it relies on the fact that all precomputed
> tables are dropped when new data is committed to the database. So, as you
> say the tables would only be dropped when a chromosome with features were
> completed.
> 
> This could probably be changed to call goFaster() before the query is run
> and releaseGoFaster() after, this will create a table automatically and
> drop it immediately.  I'll need to check that the generated table gets the
> right indexes in this case.
> 
> Though it sounds like your change to the initial query will make the
> biggest difference as featureless chromosomes should be ignored
> completely.
> 
> Looking at the count of features in your database it may be possible to
> make some speed and size improvements to your build by using Simple
> Objects: http://extrac.sysbiol.cam.ac.uk/wiki/SimpleObjects, These have
> some restrictions but are much faster to load. You may be able convert
> SNPDiversitySample for example to be a Simple Object. And if you're not
> merging e.g. Sequences and Locations from multiple sources you could
> switch off data tracking for those classes.
> 
> I'll look into the counts and timing details you sent a bit more next week
> and see if there are some more concrete recommendations. If you could send
> your merged genomic_model.xml file as well that would be really useful.
> 
> All the best,
> Richard.
> 
> 
> 
>> 
>> 
>> -------- Original Message --------
>> Subject: 	Re: [InterMine Dev] How long SHOULD TransferSequence take?
>> Date: 	Wed, 2 Jul 2014 12:40:07 -0700
>> From: 	Joe Carlson <jwcarlson at lbl.gov>
>> To: 	Julie Sullivan <julie at flymine.org>
>> CC: 	dev at intermine.org <dev at intermine.org>, David Goodstein
>> <dmgoodstein at lbl.gov>
>> 
>> 
>> 
>> Hi Julie,
>> 
>> Thanks for the reply. Just to clarify in case I wasn’t clear before: the
>> operation I’m working on is the post processing step TransferSequences.
>> In particular, it is adding nucleotide sequence to the gene flanking
>> regions. The step to define the regions ran relatively quickly. It’s
>> just adding the genomic sequence to the puppies that is slow.
>> 
>> I spent a little time with it yesterday going through the code and log
>> files. I’m sure part of it has to do with the nature of the data in our
>> mine.
>> 
>> We have ~ 550K scaffolds, but only ~ 100K have any annotations on
>> them. The post processing step that puts sequence on the features is:
>> "give me all chromosomes that have sequence", then "go through them one
>> by one and find features on them that do not have sequence", then,
>> "extract sequence from the chromosome and put it in intermine".
>> 
>> I think a large amount of time spent could have been in iterating though
>> the 400K unannotated scaffolds.  Even if nothing gets inserted, there is
>> a second or two making temporary tables, indexing and then dropping the
>> tables when processing a chromosome with no features. (400K seconds is 4
>> 1/2 days).
>> 
>> I'm starting to see the numbers tick up at a faster click now. There is
>> a pretty clean distinction in the chromosomes that have been processed
>> and those that have not. Even though the number of gene flanking regions
>> with sequence is not huge (about 1/3 of the total), the number of
>> chromosomes in which the gene flanking regions have sequence (50K) is
>> closer to the number of chromosome in which the gene flanking regions do
>> not have sequence (57K). If I'm lucky and the speed depends on the
>> number of chromosomes to process and not the number of features, then we
>> may be closer than extrapolation of the region numbers would suggest.
>> 
>> This makes some sense, but it predicated on the assumption that there
>> was a lot of time spent last week iterating through the 400K unannotated
>> scaffolds without hitting the annotated ones.
>> 
>> For a resolution of this, I’m going to change the first query from "give
>> me all chromosomes that have sequence" to "give me all chromosomes that
>> have sequence and have features located on them without sequence". The
>> extra query time is not too horrible. This ought to speed it up on the
>> next run.
>> 
>> I still don’t quite understand what I was seeing last week; the postgres
>> server was under a reasonable load but not inserting any records. But
>> there was a HUGE number of temporary tables (~ 100K) in the database.
>> There was a day or so in which the only database activity was dropping
>> the tables. Dropping the temp tables seems not to be totally
>> synchronized with their creation. (you can see what I mean below when I
>> have a question about what I see in the log files.)
>> 
>> Here is how I was going to change the query for the chromosomes. This is
>> only in a test program right now. I’ve change the query to:
>> 
>>      Query q = new Query();
>>      QueryClass qcChr = new QueryClass(Chromosome.class);
>>      q.addFrom(qcChr);
>>      q.addToSelect(qcChr);
>>      QueryObjectReference seqRef = new QueryObjectReference(qcChr,
>> "sequence");
>>      QueryClass qcLoc = new QueryClass(Location.class);
>>      q.addFrom(qcLoc);
>>      QueryObjectReference locChromRef = new
>> QueryObjectReference(qcLoc,"locatedOn");
>>      QueryObjectReference locFeatRef = new
>> QueryObjectReference(qcLoc,"feature");
>>      QueryClass qcFeat = new QueryClass(SequenceFeature.class);
>>      q.addFrom(qcFeat);
>>      QueryObjectReference featSeqRef = new QueryObjectReference(qcFeat,
>> "sequence");
>> 
>>      ConstraintSet cSet = new ConstraintSet(ConstraintOp.AND);
>>      cSet.addConstraint(new ContainsConstraint(seqRef,
>> ConstraintOp.IS_NOT_NULL));
>>      cSet.addConstraint(new ContainsConstraint(featSeqRef,
>> ConstraintOp.IS_NULL));
>>      cSet.addConstraint(new
>> ContainsConstraint(locChromRef,ConstraintOp.CONTAINS,qcChr));
>>      cSet.addConstraint(new
>> ContainsConstraint(locFeatRef,ConstraintOp.CONTAINS,qcFeat));
>>      q.setConstraint(cSet);
>>      q.setDistinct(true);
>> 
>> this will help a bit I think. Doing the query and iterating through the
>> results takes 50 minutes instead of 5, but I have much fewer things to
>> go through. I’m guessing that tweaking the batch size parameter in the
>> execute process will help a bit.
>> On Jul 2, 2014, at 2:19 AM, Julie Sullivan <julie at flymine.org
>> <mailto:julie at flymine.org>> wrote:
>> 
>>> Hi Joe
>>> 
>>> 1 million / day is too slow, there is optimisation we can do. For
>>> comparison, my SNP source (for human intermine) gets speeds of 1
>>> million / minute. The average is 200,000 objects / minute.
>>> 
>>> To diagnose where the problem is (hardware, postgres, etc), can you
>>> provide us with a little bit more detail?
>>> 
>>> 1. intermine log files from intermine/integrate
>> 
>> 
>> Here is a snippet from intermine.log in post processing that is going on
>> right now, with some interspersed comments:
>> 
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.bio.postprocess.TransferSequences     - Starting
>>> transfer for P. virgatum chromosome contig235520
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.objectstore.intermine.ObjectStoreInterMineImpl     -
>>> Creating precomputed table for query SELECT a2_, a3_ FROM
>>> org.intermine.model.bio.Chromosome AS a1_,
>>> org.intermine.model.bio.SequenceFeature AS a2_,
>>> org.intermine.model.bio.Location AS a3_ WHERE (a1_.id = 121304156 AND
>>> a3_.locatedOn CONTAINS a1_ AND a3_.feature CONTAINS a2_ AND
>>> a2_.sequence IS NULL) ORDER BY a2_ with indexes [a2_id, a3_id, a2_id]
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Creating
>>> new precomputed table CREATE TABLE precomp_463776 AS SELECT a2_.id AS
>>> a2_id, a3_.id AS a3_id, (COALESCE(a2_.id::numeric,
>>> 49999999999999999999) * 100000000000000000000) +
>>> COALESCE(a3_.id::numeric, 49999999999999999999) AS orderby_field FROM
>>> Chromosome AS a1_, SequenceFeature AS a2_, Location AS a3_ WHERE
>>> a1_.id = 121304156 AND a3_.locatedOnId = a1_.id AND a3_.featureId =
>>> a2_.id AND a2_.sequenceId IS NULL ORDER BY a2_.id, a3_.id
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Creating
>>> orderby_field index on precomputed table precomp_463776
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Creating 3
>>> indexes for precomp_463776
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Creating
>>> index on precomp_463776 (orderby_field)
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Creating
>>> index on precomp_463776 (a2_id)
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Creating
>>> index on precomp_463776 (a3_id, a2_id)
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - ANALYSEing
>>> precomputed table precomp_463776
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Finished
>>> creating precomputed table precomp_463776 for category precompute
>>> 2014-07-02 10:17:00 INFO                              sqllogger     -
>>> (VERBOSE) iql: SELECT a2_, a3_ FROM org.intermine.model.bio.Chromosome
>>> AS a1_, org.intermine.model.bio.SequenceFeature AS a2_,
>>> org.intermine.model.bio.Location AS a3_ WHERE (a1_.id = 121304156 AND
>>> a3_.locatedOn CONTAINS a1_ AND a3_.feature CONTAINS a2_ AND
>>> a2_.sequence IS NULL) ORDER BY a2_
>>> generated sql: SELECT a2_.id AS a2_id, a3_.id AS a3_id FROM Chromosome
>>> AS a1_, SequenceFeature AS a2_, Location AS a3_ WHERE a1_.id =
>>> 121304156 AND a3_.locatedOnId = a1_.id AND a3_.featureId = a2_.id AND
>>> a2_.sequenceId IS NULL ORDER BY a2_.id, a3_.id LIMIT 1000
>>> optimised sql: SELECT P278686.a2_id, P278686.a3_id FROM precomp_463776
>>> AS P278686 ORDER BY P278686.orderby_field LIMIT 1000
>>> bag tables: 0 ms, generate: 0 ms, optimise: 7 ms,  ms,  estimate: 0
>>> ms, execute: 1 ms, convert results: 0 ms, extra queries: 0 ms, total:
>>> 8 ms, rows: 0
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.bio.postprocess.TransferSequences     - Finished setting
>>> 0 feature sequences for P. virgatum chromosome contig235520 - took 334
>>> ms.
>> 
>> So P. virgatum chromosome contig235520 is a scaffold with no features on
>> it. But we still spent 334 ms to query/index/query.
>> 
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.bio.postprocess.TransferSequences     - Starting
>>> transfer for P. virgatum chromosome contig235771
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.objectstore.intermine.ObjectStoreInterMineImpl     -
>>> Creating precomputed table for query SELECT a2_, a3_ FROM
>>> org.intermine.model.bio.Chromosome AS a1_,
>>> org.intermine.model.bio.SequenceFeature AS a2_,
>>> org.intermine.model.bio.Location AS a3_ WHERE (a1_.id = 121304147 AND
>>> a3_.locatedOn CONTAINS a1_ AND a3_.feature CONTAINS a2_ AND
>>> a2_.sequence IS NULL) ORDER BY a2_ with indexes [a2_id, a2_id, a3_id,
>>> a3_id]
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Creating
>>> new precomputed table CREATE TABLE precomp_463777 AS SELECT a2_.id AS
>>> a2_id, a3_.id AS a3_id, (COALESCE(a2_.id::numeric,
>>> 49999999999999999999) * 100000000000000000000) +
>>> COALESCE(a3_.id::numeric, 49999999999999999999) AS orderby_field FROM
>>> Chromosome AS a1_, SequenceFeature AS a2_, Location AS a3_ WHERE
>>> a1_.id = 121304147 AND a3_.locatedOnId = a1_.id AND a3_.featureId =
>>> a2_.id AND a2_.sequenceId IS NULL ORDER BY a2_.id, a3_.id
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Creating
>>> orderby_field index on precomputed table precomp_463777
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Creating 3
>>> indexes for precomp_463777
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Creating
>>> index on precomp_463777 (orderby_field)
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Creating
>>> index on precomp_463777 (a2_id, a3_id)
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Creating
>>> index on precomp_463777 (a3_id)
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - ANALYSEing
>>> precomputed table precomp_463777
>>> 2014-07-02 10:17:00 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Finished
>>> creating precomputed table precomp_463777 for category precompute
>>> 2014-07-02 10:17:00 INFO                              sqllogger     -
>>> (VERBOSE) iql: SELECT a1_ FROM org.intermine.model.InterMineObject AS
>>> a1_ WHERE a1_.id IN ? 1: [548522771, 548522770, 548522769, 548522768,
>>> 548522775, 548522774, 548522773, 548522772, 548522779, 548522778,
>>> 548522777, 548522776, 548522783, 548522782, 548522781, 548522780]
>>> generated sql: SELECT a1_.OBJECT AS a1_, a1_.id AS a1_id FROM
>>> InterMineObject AS a1_ WHERE a1_.id IN (548522768, 548522769,
>>> 548522770, 548522771, 548522772, 548522773, 548522774, 548522775,
>>> 548522776, 548522777, 548522778, 548522779, 548522780, 548522781,
>>> 548522782, 548522783) ORDER BY a1_.id
>>> optimised sql: SELECT a1_.OBJECT AS a1_, a1_.id AS a1_id FROM
>>> InterMineObject AS a1_ WHERE a1_.id IN (548522768, 548522769,
>>> 548522770, 548522771, 548522772, 548522773, 548522774, 548522775,
>>> 548522776, 548522777, 548522778, 548522779, 548522780, 548522781,
>>> 548522782, 548522783) ORDER BY a1_.id
>>> bag tables: 0 ms, generate: 0 ms, optimise: 0 ms,  ms,  estimate: 0
>>> ms, execute: 8 ms, convert results: 0 ms, extra queries: 0 ms, total:
>>> 8 ms, rows: 16
>>> 2014-07-02 10:17:00 INFO                              sqllogger     -
>>> (VERBOSE) iql: SELECT a2_, a3_ FROM org.intermine.model.bio.Chromosome
>>> AS a1_, org.intermine.model.bio.SequenceFeature AS a2_,
>>> org.intermine.model.bio.Location AS a3_ WHERE (a1_.id = 121304147 AND
>>> a3_.locatedOn CONTAINS a1_ AND a3_.feature CONTAINS a2_ AND
>>> a2_.sequence IS NULL) ORDER BY a2_
>>> generated sql: SELECT a2_.id AS a2_id, a3_.id AS a3_id FROM Chromosome
>>> AS a1_, SequenceFeature AS a2_, Location AS a3_ WHERE a1_.id =
>>> 121304147 AND a3_.locatedOnId = a1_.id AND a3_.featureId = a2_.id AND
>>> a2_.sequenceId IS NULL ORDER BY a2_.id, a3_.id LIMIT 1000
>>> optimised sql: SELECT P278687.a2_id, P278687.a3_id FROM precomp_463777
>>> AS P278687 ORDER BY P278687.orderby_field LIMIT 1000
>>> bag tables: 0 ms, generate: 0 ms, optimise: 8 ms,  ms,  estimate: 0
>>> ms, execute: 1 ms, convert results: 0 ms, extra queries: 8 ms, total:
>>> 17 ms, rows: 8
>>> 2014-07-02 10:17:00 INFO                              sqllogger     -
>>> (VERBOSE) iql: SELECT DISTINCT CLOB(121304149)
>>> generated sql: SELECT value AS a1_ FROM clob WHERE clobid = 121304149
>>> AND clobpage >= 0 AND clobpage < 20 ORDER BY clobpage
>>> optimised sql: SELECT value AS a1_ FROM clob WHERE clobid = 121304149
>>> AND clobpage >= 0 AND clobpage < 20 ORDER BY clobpage
>>> bag tables: 0 ms, generate: 0 ms, optimise: 0 ms,  ms,  estimate: 0
>>> ms, execute: 12 ms, convert results: 0 ms, extra queries: 0 ms, total:
>>> 12 ms, rows: 1
>>> 2014-07-02 10:17:00 INFO     org.intermine.sql.writebatch.Batch     -
>>> Batch flusher has spent 11492472 ms waiting for the database (duty
>>> cycle 1%) (current duty cycle 0%)
>>> 2014-07-02 10:17:03 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463765
>>> 2014-07-02 10:17:05 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463766
>>> 2014-07-02 10:17:07 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463767
>>> 2014-07-02 10:17:09 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463768
>>> 2014-07-02 10:17:11 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463769
>>> 2014-07-02 10:17:13 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463770
>>> 2014-07-02 10:17:15 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463771
>>> 2014-07-02 10:17:18 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463772
>>> 2014-07-02 10:17:20 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463773
>>> 2014-07-02 10:17:22 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463774
>>> 2014-07-02 10:17:24 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463775
>>> 2014-07-02 10:17:26 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463776
>>> 2014-07-02 10:17:28 INFO
>>> org.intermine.sql.precompute.PrecomputedTableManager     - Dropped
>>> precomputed table precomp_463777
>>> 2014-07-02 10:17:28 INFO
>>> org.intermine.bio.postprocess.TransferSequences     - Finished setting
>>> 8 feature sequences for P. virgatum chromosome contig235771 - took
>>> 28424 ms.
>> 
>> P. virgatum chromosome contig235771 has 1 gene on it. And we set the
>> sequence for 8 flanking regions in 28 sec. That seems very long. Maybe
>> what is happening is that my db is getting dirty from all the created
>> and dropped temp tables. I’ve done some vacuuming to see it this helped.
>> Am trying more right now.
>> 
>> The temp table precomp_463776 was empty. but it still took a couple
>> seconds to drop it.
>> 
>> And this goes back to what I had mentioned before about the temp tables
>> getting dropped asynchronously. precomp_463776 was not dropped right
>> away. it was only dropped after I was done with precomp_463777. What
>> triggers the precomp table to be dropped? Was the fact that I was seeing
>> a large number of precomp tables a few days ago a consequence of things
>> not getting dropped in time?
>> 
>>> 2. the final output printed to the console giving the speeds of each
>>> source
>> 
>> I’m afraid I don’t have these anymore. The genomes load reasonably
>> quickly (~ 1 to 2 hours each). SNP loading is the bigger deal but we can
>> deal with it. Everything else is pretty fast.
>> 
>>> 3. the results of this query:
>>> 
>>> select class, count(*) from intermineobject group by class;
>> 
>> Here is my breakdown. The biggest group of data is related to SNPs. The
>> model I’m using is that there is a SNPLocation that can have one or more
>> SNPs associated with it. The SNPs are connected to the DiversitySample
>> with SNPDiversitySample (the biggest table). And Consequence is the
>> snpeff call. The SNPs take a couple days to load but it’s manageable.
>> 
>> 
>> phytomine-load=# select class, count(*) from intermineobject group by
>> class;
>>                       class                      |   count
>> ------------------------------------------------+-----------
>>   org.intermine.model.bio.GOTerm                 |     37509
>>   org.intermine.model.bio.Exon                   |  10004173
>>   org.intermine.model.bio.ProteinDomain          |     23013
>>   org.intermine.model.bio.MRNA                   |   1862631
>>   org.intermine.model.bio.GOAnnotation           |   6562551
>>   org.intermine.model.bio.Ontology               |        18
>>   org.intermine.model.bio.GOEvidenceCode         |         1
>>   org.intermine.model.bio.OntologyTerm           |    197085
>>   org.intermine.model.bio.Sequence               |  35498614
>>   org.intermine.model.bio.Organism               |        47
>>   org.intermine.model.bio.Protein                |   1862583
>>   org.intermine.model.bio.SOTerm                 |         8
>>   org.intermine.model.bio.CrossReference         |     87950
>>   org.intermine.model.bio.FivePrimeUTR           |   1073065
>>   org.intermine.model.bio.RNAseqExperiment       |       128
>>   org.intermine.model.bio.OntologyTermSynonym    |     86862
>>   org.intermine.model.bio.GeneFlankingRegion     |  11804480
>>   org.intermine.model.bio.Location               |  58098059
>>   org.intermine.model.bio.ProteinFamilyOrganism  |   5109450
>>   org.intermine.model.bio.DataSet                |        30
>>   org.intermine.model.bio.ProteinAnalysisFeature |  12698750
>>   org.intermine.model.bio.DiversitySample        |      1133
>>   org.intermine.model.bio.SNPDiversitySample     | 262465686
>>   org.intermine.model.bio.Synonym                |   1851645
>>   org.intermine.model.bio.Publication            |     18506
>>   org.intermine.model.bio.ProteinFamily          |   3128889
>>   org.intermine.model.bio.Intron                 |   7919403
>>   org.intermine.model.bio.OntologyRelation       |    895799
>>   org.intermine.model.bio.Gene                   |   1499335
>>   org.intermine.model.bio.DataSource             |        31
>>   org.intermine.model.bio.Chromosome             |    553403
>>   org.intermine.model.bio.ConsequenceType        |        25
>>   org.intermine.model.bio.CufflinksScore         |   5971936
>>   org.intermine.model.bio.GOEvidence             |   6534468
>>   org.intermine.model.bio.CDS                    |   1864728
>>   org.intermine.model.bio.MSA                    |    243896
>>   org.intermine.model.bio.SNPLocation            |  20142531
>>   org.intermine.model.bio.ThreePrimeUTR          |   1132071
>>   org.intermine.model.bio.SNP                    |  20180101
>>   org.intermine.model.bio.Consequence            |   2876596
>> (40 rows)
>> 
>>> 
>>> That will tell us what sort of data you are loading and how long each
>>> source is taking, allowing us to pinpoint areas of your build that can
>>> be optimised.
>>> 
>>> Julie
>>> 
>>> On 30/06/14 18:37, Joe Carlson wrote:
>>>> Hi Julie (and others)
>>>> 
>>>> What is your experience with the postprocessing step TransferSequences?
>>>> 
>>>> I'm using it here to add the sequence to the gene flanking region
>>>> objects and am wondering why our performance is not-too-impressive.
>>>> 
>>>> I cut down on the sizes of the different regions; so we have only 500bp
>>>> and 5kb regions (but still up- and down-stream, with and not-with the
>>>> gene). But even with the cutback we still have close to 12 million
>>>> regions.
>>>> 
>>>> Adding the sequence to these features has been slower than I expected.
>>>> After running a week I didn't notice any sequence actually appearing
>>>> (but I had 100K temporary tables in the db!). Now things are getting
>>>> filled in at a rate of ~ 1 million a day. This is slower than when I
>>>> had
>>>> run things before from what I can recall.
>>>> 
>>>> I recently re-synced with the master branch 12 days ago. Were there
>>>> changes after 1.2 that may affect this postprocessing steps?
>>>> 
>>>> I'm just curious. And was wondering what sorts of experience you have
>>>> with the transfering sequences. Other than reducing the number of
>>>> regions (will do), or precomputing and storing in our chado backend
>>>> (considering it), do you have any suggestions?
>>>> 
>>>> Thanks,
>>>> 
>>>> Joe
>>>> 
>>>> _______________________________________________
>>>> dev mailing list
>>>> dev at intermine.org <mailto:dev at intermine.org>
>>>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>>>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> _______________________________________________
> dev mailing list
> dev at intermine.org
> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.intermine.org/pipermail/dev/attachments/20140704/4f8455a0/attachment-0001.html>


More information about the dev mailing list