I store arrays of double-precision floats (1.4 million values each) in a PostgreSQL table, created with the following statement.
CREATE TABLE analysis.expression
(
    celfile_name character varying NOT NULL,
    core double precision[],
    extended double precision[],
    "full" double precision[],
    probeset double precision[],
    CONSTRAINT expression_pkey PRIMARY KEY (celfile_name)
)
WITH (
    OIDS=FALSE
);
ALTER TABLE analysis.expression ALTER COLUMN core SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN extended SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN "full" SET STORAGE EXTERNAL;
ALTER TABLE analysis.expression ALTER COLUMN probeset SET STORAGE EXTERNAL;
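As a sanity check (not part of my original setup, just a standard catalog lookup), the effective storage mode can be confirmed via pg_attribute, where attstorage = 'e' means EXTERNAL:

SELECT attname, attstorage
FROM pg_attribute
WHERE attrelid = 'analysis.expression'::regclass
  AND attnum > 0
  AND NOT attisdropped;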
Each entry in this table is written only once, but may be read many times at random indexes. Based on my testing with PostgreSQL, query speed does not scale well as the vector length grows, even with STORAGE set to EXTERNAL (element access still appears to cost O(n) in the array length). This makes queries like the one below, which select many individual values from the array, very slow (minutes or even hours).
SELECT probeset[2], probeset[15], probeset[102], probeset[1007], probeset[10033], probeset[200101], probeset[1004000] FROM expression LIMIT 1000;
If enough individual indexes are pulled, it can even be slower than pulling the entire array.
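For reference, a minimal sketch of the timing comparison behind that claim (plain EXPLAIN ANALYZE; the queries mirror the ones above):

-- element access at scattered indexes
EXPLAIN (ANALYZE, BUFFERS)
SELECT probeset[2], probeset[15], probeset[102], probeset[1007],
       probeset[10033], probeset[200101], probeset[1004000]
FROM expression LIMIT 1000;

-- whole-array fetch for comparison
EXPLAIN (ANALYZE, BUFFERS)
SELECT probeset FROM expression LIMIT 1000;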
Is there a way to make queries like this faster?
Edit:
- I am using PostgreSQL 9.3.
- All the queries I am running are simple SELECTs, possibly with a join, for example:
SELECT probeset[2], probeset[15], probeset[102], probeset[1007], probeset[10033], probeset[200101], probeset[1004000] FROM expression JOIN samples s USING (celfile_name) WHERE s.study = 'x';
- In one scenario the results of these queries are fed through prediction models, and the prediction probabilities are stored back into the DB in another table. In other cases selected items are pulled from the arrays for downstream analysis.
- Currently the longest single array is 1.4 million items; the others are shorter, with the smallest being 22 thousand items and the average around 100 thousand.
- Ideally I would store the array data as a wide table, but with 1.4 million entries this isn't feasible, and long tables (i.e. rows of celfile_name, index, value) are much slower than PostgreSQL arrays when we want to pull a full array from the DB. We do that to load our downstream data stores when we analyze the full dataset. A sketch of the long-table layout is below for reference.
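For concreteness, here is a minimal sketch of the long-table layout I mean; table and column names are illustrative, not my actual schema:

-- Hypothetical "long" layout: one row per array element.
CREATE TABLE analysis.expression_long
(
    celfile_name character varying NOT NULL,
    idx integer NOT NULL,
    value double precision NOT NULL,
    PRIMARY KEY (celfile_name, idx)
);

-- Point lookups at arbitrary indexes use the primary-key index:
SELECT value FROM analysis.expression_long
WHERE celfile_name = 'x' AND idx IN (2, 15, 102, 1007);

-- But rebuilding a full array aggregates over ~1.4 million rows,
-- which is where this layout loses to a native array column:
SELECT array_agg(value ORDER BY idx)
FROM analysis.expression_long
WHERE celfile_name = 'x';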