[InterMine Dev] How long SHOULD TransferSequence take?

Joe Carlson jwcarlson at lbl.gov
Wed Jul 2 20:40:07 BST 2014


Hi Julie,

Thanks for the reply. Just to clarify in case I wasn’t clear before: the operation I’m working on is the post processing step TransferSequences. In particular, it is adding nucleotide sequence to the gene flanking regions. The step to define the regions ran relatively quickly. It’s just adding the genomic sequence to the puppies that is slow.

I spent a little time with it yesterday going through the code and log files. I’m sure part of it has to do with the nature of the data in our mine.

We have ~ 550K scaffolds, but only ~ 100K have any annotations on them. The post processing step that puts sequence on the features is: "give me all chromosomes that have sequence", then "go through them one by one and find features on them that do not have sequence", then, "extract sequence from the chromosome and put it in intermine".

I think a large amount of time spent could have been in iterating though the 400K unannotated scaffolds.  Even if nothing gets inserted, there is a second or two making temporary tables, indexing and then dropping the tables when processing a chromosome with no features. (400K seconds is 4 1/2 days).

I'm starting to see the numbers tick up at a faster click now. There is a pretty clean distinction in the chromosomes that have been processed and those that have not. Even though the number of gene flanking regions with sequence is not huge (about 1/3 of the total), the number of chromosomes in which the gene flanking regions have sequence (50K) is closer to the number of chromosome in which the gene flanking regions do not have sequence (57K). If I'm lucky and the speed depends on the number of chromosomes to process and not the number of features, then we may be closer than extrapolation of the region numbers would suggest.

This makes some sense, but it predicated on the assumption that there was a lot of time spent last week iterating through the 400K unannotated scaffolds without hitting the annotated ones.

For a resolution of this, I’m going to change the first query from "give me all chromosomes that have sequence" to "give me all chromosomes that have sequence and have features located on them without sequence". The extra query time is not too horrible. This ought to speed it up on the next run.

I still don’t quite understand what I was seeing last week; the postgres server was under a reasonable load but not inserting any records. But there was a HUGE number of temporary tables (~ 100K) in the database. There was a day or so in which the only database activity was dropping the tables. Dropping the temp tables seems not to be totally synchronized with their creation. (you can see what I mean below when I have a question about what I see in the log files.)

Here is how I was going to change the query for the chromosomes. This is only in a test program right now. I’ve change the query to:

    Query q = new Query();
    QueryClass qcChr = new QueryClass(Chromosome.class);
    q.addFrom(qcChr);
    q.addToSelect(qcChr);
    QueryObjectReference seqRef = new QueryObjectReference(qcChr, "sequence");        
    QueryClass qcLoc = new QueryClass(Location.class);
    q.addFrom(qcLoc);
    QueryObjectReference locChromRef = new QueryObjectReference(qcLoc,"locatedOn");
    QueryObjectReference locFeatRef = new QueryObjectReference(qcLoc,"feature");
    QueryClass qcFeat = new QueryClass(SequenceFeature.class);
    q.addFrom(qcFeat);
    QueryObjectReference featSeqRef = new QueryObjectReference(qcFeat, "sequence");

    ConstraintSet cSet = new ConstraintSet(ConstraintOp.AND);
    cSet.addConstraint(new ContainsConstraint(seqRef, ConstraintOp.IS_NOT_NULL));
    cSet.addConstraint(new ContainsConstraint(featSeqRef, ConstraintOp.IS_NULL));
    cSet.addConstraint(new ContainsConstraint(locChromRef,ConstraintOp.CONTAINS,qcChr));
    cSet.addConstraint(new ContainsConstraint(locFeatRef,ConstraintOp.CONTAINS,qcFeat));
    q.setConstraint(cSet);
    q.setDistinct(true);

this will help a bit I think. Doing the query and iterating through the results takes 50 minutes instead of 5, but I have much fewer things to go through. I’m guessing that tweaking the batch size parameter in the execute process will help a bit.
 
On Jul 2, 2014, at 2:19 AM, Julie Sullivan <julie at flymine.org> wrote:

> Hi Joe
> 
> 1 million / day is too slow, there is optimisation we can do. For comparison, my SNP source (for human intermine) gets speeds of 1 million / minute. The average is 200,000 objects / minute.
> 
> To diagnose where the problem is (hardware, postgres, etc), can you provide us with a little bit more detail?
> 
> 1. intermine log files from intermine/integrate


Here is a snippet from intermine.log in post processing that is going on right now, with some interspersed comments:

> 2014-07-02 10:17:00 INFO  org.intermine.bio.postprocess.TransferSequences     - Starting transfer for P. virgatum chromosome contig235520
> 2014-07-02 10:17:00 INFO  org.intermine.objectstore.intermine.ObjectStoreInterMineImpl     - Creating precomputed table for query SELECT a2_, a3_ FROM org.intermine.model.bio.Chromosome AS a1_, org.intermine.model.bio.SequenceFeature AS a2_, org.intermine.model.bio.Location AS a3_ WHERE (a1_.id = 121304156 AND a3_.locatedOn CONTAINS a1_ AND a3_.feature CONTAINS a2_ AND a2_.sequence IS NULL) ORDER BY a2_ with indexes [a2_id, a3_id, a2_id]
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Creating new precomputed table CREATE TABLE precomp_463776 AS SELECT a2_.id AS a2_id, a3_.id AS a3_id, (COALESCE(a2_.id::numeric, 49999999999999999999) * 100000000000000000000) + COALESCE(a3_.id::numeric, 49999999999999999999) AS orderby_field FROM Chromosome AS a1_, SequenceFeature AS a2_, Location AS a3_ WHERE a1_.id = 121304156 AND a3_.locatedOnId = a1_.id AND a3_.featureId = a2_.id AND a2_.sequenceId IS NULL ORDER BY a2_.id, a3_.id
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Creating orderby_field index on precomputed table precomp_463776
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Creating 3 indexes for precomp_463776
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Creating index on precomp_463776 (orderby_field)
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Creating index on precomp_463776 (a2_id)
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Creating index on precomp_463776 (a3_id, a2_id)
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - ANALYSEing precomputed table precomp_463776
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Finished creating precomputed table precomp_463776 for category precompute
> 2014-07-02 10:17:00 INFO                              sqllogger     - (VERBOSE) iql: SELECT a2_, a3_ FROM org.intermine.model.bio.Chromosome AS a1_, org.intermine.model.bio.SequenceFeature AS a2_, org.intermine.model.bio.Location AS a3_ WHERE (a1_.id = 121304156 AND a3_.locatedOn CONTAINS a1_ AND a3_.feature CONTAINS a2_ AND a2_.sequence IS NULL) ORDER BY a2_
> generated sql: SELECT a2_.id AS a2_id, a3_.id AS a3_id FROM Chromosome AS a1_, SequenceFeature AS a2_, Location AS a3_ WHERE a1_.id = 121304156 AND a3_.locatedOnId = a1_.id AND a3_.featureId = a2_.id AND a2_.sequenceId IS NULL ORDER BY a2_.id, a3_.id LIMIT 1000
> optimised sql: SELECT P278686.a2_id, P278686.a3_id FROM precomp_463776 AS P278686 ORDER BY P278686.orderby_field LIMIT 1000
> bag tables: 0 ms, generate: 0 ms, optimise: 7 ms,  ms,  estimate: 0 ms, execute: 1 ms, convert results: 0 ms, extra queries: 0 ms, total: 8 ms, rows: 0
> 2014-07-02 10:17:00 INFO  org.intermine.bio.postprocess.TransferSequences     - Finished setting 0 feature sequences for P. virgatum chromosome contig235520 - took 334 ms.

So P. virgatum chromosome contig235520 is a scaffold with no features on it. But we still spent 334 ms to query/index/query. 

> 2014-07-02 10:17:00 INFO  org.intermine.bio.postprocess.TransferSequences     - Starting transfer for P. virgatum chromosome contig235771
> 2014-07-02 10:17:00 INFO  org.intermine.objectstore.intermine.ObjectStoreInterMineImpl     - Creating precomputed table for query SELECT a2_, a3_ FROM org.intermine.model.bio.Chromosome AS a1_, org.intermine.model.bio.SequenceFeature AS a2_, org.intermine.model.bio.Location AS a3_ WHERE (a1_.id = 121304147 AND a3_.locatedOn CONTAINS a1_ AND a3_.feature CONTAINS a2_ AND a2_.sequence IS NULL) ORDER BY a2_ with indexes [a2_id, a2_id, a3_id, a3_id]
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Creating new precomputed table CREATE TABLE precomp_463777 AS SELECT a2_.id AS a2_id, a3_.id AS a3_id, (COALESCE(a2_.id::numeric, 49999999999999999999) * 100000000000000000000) + COALESCE(a3_.id::numeric, 49999999999999999999) AS orderby_field FROM Chromosome AS a1_, SequenceFeature AS a2_, Location AS a3_ WHERE a1_.id = 121304147 AND a3_.locatedOnId = a1_.id AND a3_.featureId = a2_.id AND a2_.sequenceId IS NULL ORDER BY a2_.id, a3_.id
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Creating orderby_field index on precomputed table precomp_463777
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Creating 3 indexes for precomp_463777
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Creating index on precomp_463777 (orderby_field)
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Creating index on precomp_463777 (a2_id, a3_id)
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Creating index on precomp_463777 (a3_id)
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - ANALYSEing precomputed table precomp_463777
> 2014-07-02 10:17:00 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Finished creating precomputed table precomp_463777 for category precompute
> 2014-07-02 10:17:00 INFO                              sqllogger     - (VERBOSE) iql: SELECT a1_ FROM org.intermine.model.InterMineObject AS a1_ WHERE a1_.id IN ? 1: [548522771, 548522770, 548522769, 548522768, 548522775, 548522774, 548522773, 548522772, 548522779, 548522778, 548522777, 548522776, 548522783, 548522782, 548522781, 548522780]
> generated sql: SELECT a1_.OBJECT AS a1_, a1_.id AS a1_id FROM InterMineObject AS a1_ WHERE a1_.id IN (548522768, 548522769, 548522770, 548522771, 548522772, 548522773, 548522774, 548522775, 548522776, 548522777, 548522778, 548522779, 548522780, 548522781, 548522782, 548522783) ORDER BY a1_.id
> optimised sql: SELECT a1_.OBJECT AS a1_, a1_.id AS a1_id FROM InterMineObject AS a1_ WHERE a1_.id IN (548522768, 548522769, 548522770, 548522771, 548522772, 548522773, 548522774, 548522775, 548522776, 548522777, 548522778, 548522779, 548522780, 548522781, 548522782, 548522783) ORDER BY a1_.id
> bag tables: 0 ms, generate: 0 ms, optimise: 0 ms,  ms,  estimate: 0 ms, execute: 8 ms, convert results: 0 ms, extra queries: 0 ms, total: 8 ms, rows: 16
> 2014-07-02 10:17:00 INFO                              sqllogger     - (VERBOSE) iql: SELECT a2_, a3_ FROM org.intermine.model.bio.Chromosome AS a1_, org.intermine.model.bio.SequenceFeature AS a2_, org.intermine.model.bio.Location AS a3_ WHERE (a1_.id = 121304147 AND a3_.locatedOn CONTAINS a1_ AND a3_.feature CONTAINS a2_ AND a2_.sequence IS NULL) ORDER BY a2_
> generated sql: SELECT a2_.id AS a2_id, a3_.id AS a3_id FROM Chromosome AS a1_, SequenceFeature AS a2_, Location AS a3_ WHERE a1_.id = 121304147 AND a3_.locatedOnId = a1_.id AND a3_.featureId = a2_.id AND a2_.sequenceId IS NULL ORDER BY a2_.id, a3_.id LIMIT 1000
> optimised sql: SELECT P278687.a2_id, P278687.a3_id FROM precomp_463777 AS P278687 ORDER BY P278687.orderby_field LIMIT 1000
> bag tables: 0 ms, generate: 0 ms, optimise: 8 ms,  ms,  estimate: 0 ms, execute: 1 ms, convert results: 0 ms, extra queries: 8 ms, total: 17 ms, rows: 8
> 2014-07-02 10:17:00 INFO                              sqllogger     - (VERBOSE) iql: SELECT DISTINCT CLOB(121304149)
> generated sql: SELECT value AS a1_ FROM clob WHERE clobid = 121304149 AND clobpage >= 0 AND clobpage < 20 ORDER BY clobpage
> optimised sql: SELECT value AS a1_ FROM clob WHERE clobid = 121304149 AND clobpage >= 0 AND clobpage < 20 ORDER BY clobpage
> bag tables: 0 ms, generate: 0 ms, optimise: 0 ms,  ms,  estimate: 0 ms, execute: 12 ms, convert results: 0 ms, extra queries: 0 ms, total: 12 ms, rows: 1
> 2014-07-02 10:17:00 INFO     org.intermine.sql.writebatch.Batch     - Batch flusher has spent 11492472 ms waiting for the database (duty cycle 1%) (current duty cycle 0%)
> 2014-07-02 10:17:03 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463765
> 2014-07-02 10:17:05 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463766
> 2014-07-02 10:17:07 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463767
> 2014-07-02 10:17:09 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463768
> 2014-07-02 10:17:11 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463769
> 2014-07-02 10:17:13 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463770
> 2014-07-02 10:17:15 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463771
> 2014-07-02 10:17:18 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463772
> 2014-07-02 10:17:20 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463773
> 2014-07-02 10:17:22 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463774
> 2014-07-02 10:17:24 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463775
> 2014-07-02 10:17:26 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463776
> 2014-07-02 10:17:28 INFO  org.intermine.sql.precompute.PrecomputedTableManager     - Dropped precomputed table precomp_463777
> 2014-07-02 10:17:28 INFO  org.intermine.bio.postprocess.TransferSequences     - Finished setting 8 feature sequences for P. virgatum chromosome contig235771 - took 28424 ms.

P. virgatum chromosome contig235771 has 1 gene on it. And we set the sequence for 8 flanking regions in 28 sec. That seems very long. Maybe what is happening is that my db is getting dirty from all the created and dropped temp tables. I’ve done some vacuuming to see it this helped. Am trying more right now.

The temp table precomp_463776 was empty. but it still took a couple seconds to drop it.

And this goes back to what I had mentioned before about the temp tables getting dropped asynchronously. precomp_463776 was not dropped right away. it was only dropped after I was done with precomp_463777. What triggers the precomp table to be dropped? Was the fact that I was seeing a large number of precomp tables a few days ago a consequence of things not getting dropped in time?

> 2. the final output printed to the console giving the speeds of each source

I’m afraid I don’t have these anymore. The genomes load reasonably quickly (~ 1 to 2 hours each). SNP loading is the bigger deal but we can deal with it. Everything else is pretty fast.

> 3. the results of this query:
> 
> select class, count(*) from intermineobject group by class;

Here is my breakdown. The biggest group of data is related to SNPs. The model I’m using is that there is a SNPLocation that can have one or more SNPs associated with it. The SNPs are connected to the DiversitySample with SNPDiversitySample (the biggest table). And Consequence is the snpeff call. The SNPs take a couple days to load but it’s manageable.


phytomine-load=# select class, count(*) from intermineobject group by class;
                     class                      |   count   
------------------------------------------------+-----------
 org.intermine.model.bio.GOTerm                 |     37509
 org.intermine.model.bio.Exon                   |  10004173
 org.intermine.model.bio.ProteinDomain          |     23013
 org.intermine.model.bio.MRNA                   |   1862631
 org.intermine.model.bio.GOAnnotation           |   6562551
 org.intermine.model.bio.Ontology               |        18
 org.intermine.model.bio.GOEvidenceCode         |         1
 org.intermine.model.bio.OntologyTerm           |    197085
 org.intermine.model.bio.Sequence               |  35498614
 org.intermine.model.bio.Organism               |        47
 org.intermine.model.bio.Protein                |   1862583
 org.intermine.model.bio.SOTerm                 |         8
 org.intermine.model.bio.CrossReference         |     87950
 org.intermine.model.bio.FivePrimeUTR           |   1073065
 org.intermine.model.bio.RNAseqExperiment       |       128
 org.intermine.model.bio.OntologyTermSynonym    |     86862
 org.intermine.model.bio.GeneFlankingRegion     |  11804480
 org.intermine.model.bio.Location               |  58098059
 org.intermine.model.bio.ProteinFamilyOrganism  |   5109450
 org.intermine.model.bio.DataSet                |        30
 org.intermine.model.bio.ProteinAnalysisFeature |  12698750
 org.intermine.model.bio.DiversitySample        |      1133
 org.intermine.model.bio.SNPDiversitySample     | 262465686
 org.intermine.model.bio.Synonym                |   1851645
 org.intermine.model.bio.Publication            |     18506
 org.intermine.model.bio.ProteinFamily          |   3128889
 org.intermine.model.bio.Intron                 |   7919403
 org.intermine.model.bio.OntologyRelation       |    895799
 org.intermine.model.bio.Gene                   |   1499335
 org.intermine.model.bio.DataSource             |        31
 org.intermine.model.bio.Chromosome             |    553403
 org.intermine.model.bio.ConsequenceType        |        25
 org.intermine.model.bio.CufflinksScore         |   5971936
 org.intermine.model.bio.GOEvidence             |   6534468
 org.intermine.model.bio.CDS                    |   1864728
 org.intermine.model.bio.MSA                    |    243896
 org.intermine.model.bio.SNPLocation            |  20142531
 org.intermine.model.bio.ThreePrimeUTR          |   1132071
 org.intermine.model.bio.SNP                    |  20180101
 org.intermine.model.bio.Consequence            |   2876596
(40 rows)

> 
> That will tell us what sort of data you are loading and how long each source is taking, allowing us to pinpoint areas of your build that can be optimised.
> 
> Julie
> 
> On 30/06/14 18:37, Joe Carlson wrote:
>> Hi Julie (and others)
>> 
>> What is your experience with the postprocessing step TransferSequences?
>> 
>> I'm using it here to add the sequence to the gene flanking region
>> objects and am wondering why our performance is not-too-impressive.
>> 
>> I cut down on the sizes of the different regions; so we have only 500bp
>> and 5kb regions (but still up- and down-stream, with and not-with the
>> gene). But even with the cutback we still have close to 12 million regions.
>> 
>> Adding the sequence to these features has been slower than I expected.
>> After running a week I didn't notice any sequence actually appearing
>> (but I had 100K temporary tables in the db!). Now things are getting
>> filled in at a rate of ~ 1 million a day. This is slower than when I had
>> run things before from what I can recall.
>> 
>> I recently re-synced with the master branch 12 days ago. Were there
>> changes after 1.2 that may affect this postprocessing steps?
>> 
>> I'm just curious. And was wondering what sorts of experience you have
>> with the transfering sequences. Other than reducing the number of
>> regions (will do), or precomputing and storing in our chado backend
>> (considering it), do you have any suggestions?
>> 
>> Thanks,
>> 
>> Joe
>> 
>> _______________________________________________
>> dev mailing list
>> dev at intermine.org
>> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev
>> 

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.intermine.org/pipermail/dev/attachments/20140702/2d85d2a1/attachment-0001.html>


More information about the dev mailing list