TimescaleDB: Efficiently selecting the last row

Question

14

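I have a PostgreSQL database with the TimescaleDB extension.

My primary index is a timestamp, and I would like to select the latest row.

If I happen to know that the latest row occurred after a certain time, I can use a query such as: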

query = 'select * from prices where time > %(dt)s'

Here I specify a datetime and execute the query using psycopg2:

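import datetime
import psycopg2

# 2018-01-10 11:15:00
dt = datetime.datetime(2018, 1, 10, 11, 15, 0)

# params: dict of connection keyword arguments, defined elsewhere
with psycopg2.connect(**params) as conn: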
    cur = conn.cursor()
    # start timing
    beg = datetime.datetime.now()
    # execute query
    cur.execute(query, {'dt':dt})
    rows = cur.fetchall()
    # stop timing
    end = datetime.datetime.now()

print('took {} ms'.format((end-beg).total_seconds() * 1e3))

Timing output:

took 2.296 ms

However, if I don't know what time to put into the query above, I can use a query such as:

query = 'select * from prices order by time desc limit 1'

I execute the query in a similar fashion:

with psycopg2.connect(**params) as conn:
    cur = conn.cursor()
    # start timing
    beg = datetime.datetime.now()
    # execute query
    cur.execute(query)
    rows = cur.fetchall()
    # stop timing
    end = datetime.datetime.now()

print('took {} ms'.format((end-beg).total_seconds() * 1e3))

Timing output:

took 19.173 ms

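So that is more than 8 times slower.

I'm no expert in SQL, but I would have thought the query planner would figure out that "limit 1" combined with "order by the primary index" amounts to an O(1) operation.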
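
One way to test that assumption is to ask PostgreSQL for the plan it actually chose. The following is a minimal sketch and not part of the measurements above: `explain_plan` is a hypothetical helper name, and it assumes a DB-API cursor such as the one returned by psycopg2's `conn.cursor()`.

```python
def explain_plan(cur, query, params=None):
    """Return the plan PostgreSQL chose for `query`, one string per plan line.

    `cur` is assumed to be a DB-API cursor (for example from psycopg2).
    EXPLAIN ANALYZE really executes the statement, so this should only be
    used with read-only queries such as the SELECTs in this question.
    """
    cur.execute("EXPLAIN (ANALYZE, BUFFERS) " + query, params)
    # EXPLAIN returns one row per plan line, each with a single text column.
    return [row[0] for row in cur.fetchall()]


# Usage sketch (needs a live connection, so it is left commented out):
# with psycopg2.connect(**params) as conn:
#     cur = conn.cursor()
#     for line in explain_plan(cur, 'select * from prices order by time desc limit 1'):
#         print(line)
```

Running it for both queries and comparing how many `_hyper_12_*_chunk` tables appear in each plan should show whether the planner visits more chunks in the slower case.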

Question:

Is there a more efficient way to select the last row in my table?

In case it is useful, here is the description of my table:

# \d+ prices

                                           Table "public.prices"
 Column |            Type             | Collation | Nullable | Default | Storage | Stats target | Description 
--------+-----------------------------+-----------+----------+---------+---------+--------------+-------------
 time   | timestamp without time zone |           | not null |         | plain   |              | 
 AAPL   | double precision            |           |          |         | plain   |              | 
 GOOG   | double precision            |           |          |         | plain   |              | 
 MSFT   | double precision            |           |          |         | plain   |              | 
Indexes:
    "prices_time_idx" btree ("time" DESC)
Child tables: _timescaledb_internal._hyper_12_100_chunk,
              _timescaledb_internal._hyper_12_101_chunk,
              _timescaledb_internal._hyper_12_102_chunk,
              ...

- user123456789

1

In 2022, the Timescale blog published this article: https://www.timescale.com/blog/select-the-most-recent-record-of-many-items-with-postgresql/

4 Answers

5

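Your first query can exclude all but the last chunk, while your second query has to look in every chunk, since there is no information to help the planner exclude chunks. So it is not an O(1) operation but an O(n) operation, with n being the number of chunks.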
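
To make the chunk-exclusion argument concrete, here is a toy model in plain Python. It is only an illustration of the reasoning, not how TimescaleDB is implemented; the helper names, the ten one-day chunks, and the start date are all made up for the example.

```python
from datetime import datetime, timedelta


def make_chunks(start, chunk_interval, count):
    """Return `count` consecutive (range_start, range_end) chunk boundaries."""
    return [
        (start + i * chunk_interval, start + (i + 1) * chunk_interval)
        for i in range(count)
    ]


def chunks_to_scan(chunks, lower_bound=None):
    """Chunks that could hold rows matching `time > lower_bound`.

    Without a bound nothing can be ruled out, so every chunk is returned.
    """
    if lower_bound is None:
        return list(chunks)
    # A chunk can be skipped when its whole range ends at or before the bound.
    return [chunk for chunk in chunks if chunk[1] > lower_bound]


# Hypothetical hypertable: ten one-day chunks covering 2018-01-01 to 2018-01-11.
chunks = make_chunks(datetime(2018, 1, 1), timedelta(days=1), 10)
dt = datetime(2018, 1, 10, 11, 15, 0)

print(len(chunks_to_scan(chunks)))      # 10: no constraint, every chunk is a candidate
print(len(chunks_to_scan(chunks, dt)))  # 1: only the newest chunk survives
```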

You can give the planner a time bound to work with by writing your query in the following form:

select * from prices WHERE time > now() - interval '1day' order by time desc limit 1

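You might have to choose a different interval depending on your chunk time interval.

Starting with TimescaleDB 1.2, this is an O(1) operation if an entry can be found in the most recent chunk, and the explicit time constraint in the WHERE clause is no longer needed if you order by time and have a LIMIT.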
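
On versions where the explicit time constraint still matters and the age of the newest row is not known in advance, one possible workaround is to try progressively wider windows and only fall back to the unconstrained query at the end. This is a sketch under assumptions: `latest_row`, the window sizes, and the query constants are illustrative, and `cur` is assumed to be a psycopg2 cursor (psycopg2 adapts `datetime.timedelta` to a PostgreSQL interval).

```python
from datetime import timedelta

# Same shape as the query above, with the interval passed as a parameter.
BOUNDED = (
    "select * from prices where time > now() - %(lookback)s "
    "order by time desc limit 1"
)
UNBOUNDED = "select * from prices order by time desc limit 1"


def latest_row(cur, lookbacks=(timedelta(days=1), timedelta(days=7), timedelta(days=30))):
    """Return the newest row, trying narrow time windows before a full scan."""
    for lookback in lookbacks:
        cur.execute(BOUNDED, {"lookback": lookback})
        row = cur.fetchone()
        if row is not None:
            return row
    # Nothing recent enough: fall back to the query that considers every chunk.
    cur.execute(UNBOUNDED)
    return cur.fetchone()
```

The windows should probably be chosen as multiples of the chunk time interval, so that each retry adds whole chunks rather than fractions of one.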

- Sven Klemm

2

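While your suggested query does improve performance, its main problem is that I don't have the "now() - interval '1 day'" information. Sometimes the latest row will be within a day, and other times it will be older. Is there any way to get the latest timestamp regardless of what the actual time is? - user123456789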

1

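If you have a set of IDs and want the last timestamp for each ID, how would you do that? - PirateApp

@user123456789 Do you have the IDs you want to group by as metadata in a separate table? - Sven Klemm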

1

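I tried to solve this problem in several ways: using last(), and trying to create indexes to fetch the last items faster. In the end, I simply created another table in which I store the first and last item inserted into the hypertable, keyed by the WHERE condition, which in my case is a relationship.

The database writer also updates this table whenever it inserts entries into the hypertable.
I get the first and last item with a simple BTree lookup, with no need to go to the hypertable at all.

Here is my SQLAlchemy code: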

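import datetime

import sqlalchemy as sa
from sqlalchemy import orm, text
from sqlalchemy.dialects.postgresql import insert
from sqlalchemy.orm import Session

# Base (the declarative base) and Pair (the trading pair model) are defined elsewhere.

class PairState(Base):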
    """Cache the timespan endpoints for intervals we are generating with hypertable.

    Getting the first / last row (timestamp) from hypertable is very expensive:
    https://dev59.com/jlUK5IYBdhLWcg3wmw4S

    Here data is denormalised per trading pair, and being updated when data is written to the database.
    Save some resources by not using true NULL values.
    """

    __tablename__ = "pair_state"

    # This table has 1-to-1 relationship with Pair
    pair_id = sa.Column(sa.ForeignKey("pair.id"), nullable=False, primary_key=True, unique=True)
    pair = orm.relationship(Pair,
                        backref=orm.backref("pair_state",
                                        lazy="dynamic",
                                        cascade="all, delete-orphan",
                                        single_parent=True, ), )

    # First raw event in data stream
    first_event_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))

    # Last raw event in data stream
    last_event_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))

    # The last hypertable entry added
    last_interval_at = sa.Column(sa.TIMESTAMP(timezone=True), nullable=False, server_default=text("TO_TIMESTAMP(0)"))

    @staticmethod
    def create_first_event_if_not_exist(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Sets the first event value if not exist yet."""
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, first_event_at=ts).
            on_conflict_do_nothing()
        )

    @staticmethod
    def update_last_event(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Replaces the column last_event_at for a named pair."""
        # Based on the original example of https://dev59.com/xarka4cB1Zd3GeqPib1o#49917004
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, last_event_at=ts).
            on_conflict_do_update(constraint=PairState.__table__.primary_key, set_={"last_event_at": ts})
        )

    @staticmethod
    def update_last_interval(dbsession: Session, pair_id: int, ts: datetime.datetime):
        """Replaces the column last_interval_at for a named pair."""
        dbsession.execute(
            insert(PairState).
            values(pair_id=pair_id, last_interval_at=ts).
            on_conflict_do_update(constraint=PairState.__table__.primary_key, set_={"last_interval_at": ts})
        )

- Mikko Ohtamaa

0

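Create a table that stores the latest timestamp after each insert, and use that timestamp in your query. This is the approach that worked best for me.

SELECT <COLUMN> FROM <TABLE_NAME>, <TABLE_WITH_TIMESTAMPS> WHERE <TABLE_NAME>.time = <TABLE_WITH_TIMESTAMPS>.time;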
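
As a rough sketch of how such a timestamp table might be maintained from Python, assuming a psycopg2 cursor and the `prices` table from the question: the table name `prices_latest`, its columns, and the helper names below are all hypothetical. The lookup is done in two steps so that the hypertable query receives the timestamp as a constant.

```python
# Single-row side table: the boolean primary key with CHECK (id) allows only one row.
CREATE_LATEST = """
CREATE TABLE IF NOT EXISTS prices_latest (
    id          boolean PRIMARY KEY DEFAULT true CHECK (id),
    latest_time timestamp without time zone NOT NULL
)
"""

# Keep whichever timestamp is newer, so out-of-order inserts cannot move it backwards.
UPSERT_LATEST = """
INSERT INTO prices_latest (id, latest_time) VALUES (true, %(ts)s)
ON CONFLICT (id) DO UPDATE
SET latest_time = GREATEST(prices_latest.latest_time, EXCLUDED.latest_time)
"""


def record_insert(cur, ts):
    """Call alongside every insert into prices, in the same transaction."""
    cur.execute(UPSERT_LATEST, {"ts": ts})


def fetch_latest_row(cur):
    """Read the stored timestamp, then fetch that row from the hypertable."""
    cur.execute("SELECT latest_time FROM prices_latest")
    found = cur.fetchone()
    if found is None:
        return None  # nothing has been recorded yet
    cur.execute("SELECT * FROM prices WHERE time = %(ts)s", {"ts": found[0]})
    return cur.fetchone()
```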

- ВАСЯ ВОРОН


- DarkDiamonD · Accepted Answer

An efficient way to get the last / first record in TimescaleDB:

First record:

SELECT <COLUMN>, time FROM <TABLE_NAME> ORDER BY time ASC LIMIT 1 ;

Last record:

SELECT <COLUMN>, time FROM <TABLE_NAME> ORDER BY time DESC LIMIT 1 ;

This question has already been answered, but I think this may be useful to others who land here. Using first() and last() in TimescaleDB takes longer.
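
To check this on your own data, the two approaches can be timed side by side. This is a sketch under assumptions: `time_query_ms` and `CANDIDATES` are illustrative names, the queries use the `prices` table and the `"AAPL"` column from the question (quoted because the column name is upper case), and `cur` is assumed to be a psycopg2 cursor. `last(value, time)` is the TimescaleDB aggregate mentioned above.

```python
import statistics
import time


def time_query_ms(cur, query, params=None, repeats=5):
    """Run `query` `repeats` times and return the median wall-clock time in ms."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        cur.execute(query, params)
        cur.fetchall()  # include the time needed to fetch the result
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


CANDIDATES = {
    "order by / limit": 'SELECT "AAPL", time FROM prices ORDER BY time DESC LIMIT 1',
    "last()": 'SELECT last("AAPL", time) FROM prices',
}

# Usage sketch (needs a live connection, so it is left commented out):
# with psycopg2.connect(**params) as conn:
#     cur = conn.cursor()
#     for name, sql in CANDIDATES.items():
#         print('{}: {:.3f} ms'.format(name, time_query_ms(cur, sql)))
```

Taking the median over several runs helps to reduce the effect of a cold cache on the first execution.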