[InterMine Dev] Why don't subclasses use postgres table inheritance?

Joe Carlson jwcarlson at lbl.gov
Fri Apr 8 18:15:14 BST 2016


On Apr 8, 2016, at 7:57 AM, Sam Hokin <shokin at ncgr.org> wrote:

> Hrm, I'm not sure that explains it, since you're free to define indexes and foreign key constraints on the child tables just as we are now doing on the standalone tables (which makes sense, since the child tables often have additional fields). But perhaps that creates a complexity that I'm not seeing. Maybe the constraints and keys don't apply to the data pulled up from children? That would mess things up.
> 
> In any case, it's great that you're "looking under the hood" (or bonnet?) this summer. I'd vote for implementing table inheritance for subclasses being a top priority, since that will decrease the size of the database enormously. And, I think, improve performance, since the performance of inherited tables is something they continue to work on, I think, and smaller is usually better. Thanks!
> 


I was biding my time on this until I head Julie’s comment. Like everyone (?) in this thread, I started using the code long after it was architected. I had a couple observations:

1) there are hints in the codebase that this was at one point running on an oracle server. If this is correct, then there may have been a conscious decision early on to make it vendor neutral.

2) But the main thing is performance. Table inheritance has its advantages, but performance is not one of them. The basic issue is that you effectively have a table join when you select from the child table when using information in the parent table. Here is an illustration. Implement a parent-child table and a single table of the same thing:
> 
> CREATE TABLE parent (
>   id integer not null primary key unique,
>   data1 varchar(32)
> );
> 
> CREATE TABLE child (
>   data2 varchar(32)
> ) INHERITS (parent);
> 
> CREATE TABLE single (
>   id integer not null primary key unique,
>   data1 varchar(32),
>   data2 varchar(32)
> );

These tables have a structure similar to intermine’s (id is non-null primary key integer)
Throw in some data:

> demo=# insert into single select generate_series(1,10000000),md5(random()::text),md5(random()::text);
> INSERT 0 10000000
> demo=# insert into child select generate_series(1,10000000),md5(random()::text),md5(random()::text);
> INSERT 0 10000000
> demo=# analyze single;
> ANALYZE
> demo=# analyze child;
> ANALYZE
> 

If you’re just doing a “select *”, then both implementations are the same:

> demo=# explain select * from single;
>                             QUERY PLAN                             
> -------------------------------------------------------------------
>  Seq Scan on single  (cost=0.00..223457.17 rows=10000017 width=70)
> (1 row)
> 
> demo=# explain select * from child;
>                             QUERY PLAN                            
> ------------------------------------------------------------------
>  Seq Scan on child  (cost=0.00..223457.17 rows=10000017 width=70)
> (1 row)


But bad things happen if you want to sort on id:

> demo=# explain select * from single order by id;
>                                       QUERY PLAN                                       
> ---------------------------------------------------------------------------------------
>  Index Scan using single_pkey on single  (cost=0.43..383144.69 rows=10000017 width=70)
> (1 row)
> 
> demo=# explain select * from child order by id;
>                                QUERY PLAN                               
> ------------------------------------------------------------------------
>  Sort  (cost=1386134.10..1411134.14 rows=10000017 width=70)
>    Sort Key: id
>    ->  Seq Scan on child  (cost=0.00..223457.17 rows=10000017 width=70)
> (3 rows)

This is bad. There is a scan on the child table. That’s going to take time. 

> demo=# \timing
> demo=# select * from single order by id limit 10 offset 100000;
>    id   |              data1               |              data2               
> --------+----------------------------------+----------------------------------
>  100001 | bd1be1e5b6ec5588d3ee151da7845c3b | d2455b73aee7eec7002ea4cdd88c628e
>  100002 | 0f7f0cc11ddbfba76726889eab0d0c3d | 71be2f0c1e174949a4f7518a9ac94745
>  100003 | d9d49e20a1a0d1a3e56de9fb7caeb88b | 122bcdc320e92a8fb11f85034afebcdb
>  100004 | 2d7f84a0050a984eec7a9ffb08473e37 | 8ad77cab8a90fee26110a71e4b26d377
>  100005 | 0f5347b7443de75e91e65788e3580a0f | 12bd265c48bb90d166453b71dd94d9ac
>  100006 | 0c95a1a3934a1549662dbf9bf9cb8523 | 06021763edb88defaf30847f921f88ef
>  100007 | 0e6f2aa8a8d5a71b248c19906cd63042 | d190abd8085fb8d88132dce3f5454401
>  100008 | 0ce76733fca8f8fbe5576a3de92cb221 | 62f0e9b0ca70b5758ebd2ed8fa127a3e
>  100009 | 8e310f9573ff2c2902ada186f9c1b8f0 | a9afd8cafd5a8d5820e95239a323c4a3
>  100010 | 3d31a7d75cedf514e478b8cc73df940e | dd156d0750a817f979ed201e82c38ba5
> (10 rows)
> 
> Time: 13.000 ms
> demo=# select * from child  order by id limit 10 offset 100000;
>    id   |              data1               |              data2               
> --------+----------------------------------+----------------------------------
>  100001 | 5e4da88d7f8369255964d21d7565b561 | eb05b644e9956d2262a0732a3b9b6441
>  100002 | 4a125147a3a344b8278b5c041a004fd0 | a944b9f53ae3416823c9bdc2eceb29f3
>  100003 | 22ae00720bf2aee0ddb8392726a5b464 | 736d23f8d8252be82329aa75817bc8e8
>  100004 | 2d931c518d18a5140c16bf20cf331cbc | 5d0589c82cc4d194e0e346afab0a2d3d
>  100005 | f90690aae2d50757a257a9a28ddb53ab | 6a75c4989a2a34df96944bdeb1d78014
>  100006 | e7e3d2ca171160fcf098b3ada6f90bb5 | 62576fc9bd8b63530bf46d933d1f683a
>  100007 | d684706e08029b7a21477ee1329f7dfa | 8106a09d4aaea117824754c7dfbe2154
>  100008 | 192fa62a8f68ce0baa7239f220c098e8 | b48f7199250a8b1e9347cf18f1751d05
>  100009 | 756ca7a425cece977d473ab280bd360b | bfb8141fa685ab9a6fb60d6c23fc9a8d
>  100010 | 4c82925ca9b7e5e1dea40e1b11c0ac27 | fa463ccc861eda7e3df37110786eda78
> (10 rows)
> 
> Time: 1135.699 ms



And similarly:

> demo=# explain select * from child where id=124356;
>                         QUERY PLAN                         
> -----------------------------------------------------------
>  Seq Scan on child  (cost=0.00..248457.21 rows=1 width=70)
>    Filter: (id = 124356)
> (2 rows)
> 
> demo=# explain select * from single where id=124356;
>                                 QUERY PLAN                                 
> ---------------------------------------------------------------------------
>  Index Scan using single_pkey on single  (cost=0.43..8.45 rows=1 width=70)
>    Index Cond: (id = 124356)
> (2 rows)


Another seq scan on the inherited table with an index scan on the single table.

> demo=# \timing
> Timing is on.
> demo=# select * from single where id=124356;
>    id   |              data1               |              data2               
> --------+----------------------------------+----------------------------------
>  124356 | 3b30b5e85d1485f21a70819f8bcd97ff | 1866f76642259dcb4a9d903c032af130
> (1 row)
> 
> Time: 0.969 ms
> demo=# select * from child where id=124356;
>    id   |              data1               |              data2               
> --------+----------------------------------+----------------------------------
>  124356 | 3ee5d23891aa596cb796183b1e28d4e0 | 53da5823f4244bd591255678ef41e267
> (1 row)
> 
> Time: 1114.170 ms



Since intermine uses a batching strategy for queries in which the id of the parent table is the criterion for including in a batch, this will have a big impact on the select performance. Let me know if there is some indexing trick that I’m not aware of.

Otherwise, please don’t do it! Disk is cheap. Time is precious. (Sam, you mention that your Bioentity table is 1M records. Mine is 300M records. And I’m expecting a lot more data in the next couple months that will make it - to quote a candidate - YUGE!)

The one thing that table inheritance would give you is the ability to manually do SQL without corrupting tables. I have a solution for that. I’ve shared it with Julie and am hoping to clean it up a bit and make a pull request.

Joe

> On 04/08/2016 03:26 AM, Julie Sullivan wrote:
>> Hi Sam,
>> 
>> Yes, as Justin says, this design decision was made prior to us. We are assuming the InterMine team knew about this feature and
>> decided not to use it!
>> 
>> Looking at the docs, I see one problem:
>> 
>> http://www.postgresql.org/docs/current/static/ddl-inherit.html
>> 
>> "A serious limitation of the inheritance feature is that indexes (including unique constraints) and foreign key constraints only
>> apply to single tables, not to their inheritance children."
>> 
>> Being a data warehouse, indexes are our everything. In some cases, when querying we don't even go to the tables, we use the index
>> only.
>> 
>> https://wiki.postgresql.org/wiki/Index-only_scans
>> 
>> Maybe that's why? Just a guess. However, there is a ray of hope....
> 
> _______________________________________________
> dev mailing list
> dev at intermine.org
> http://mail.intermine.org/cgi-bin/mailman/listinfo/dev

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.intermine.org/pipermail/dev/attachments/20160408/4f48d4f2/attachment-0001.html>


More information about the dev mailing list