Hacker News — vinext + Cloudflare Workers

new
past
show
ask
show
jobs
submit

▲Optimizing Top K in Postgres (paradedb.com)

84 points by philippemnoel 1 days ago | 11 comments

bob1029 3 hours ago [-]

Lucene really does feel like magic sometimes. It was designed expressly to solve the top K problem at hyper scale. It's incredibly mature technology. You can go from zero to a billion documents without thinking too much about anything other than the amount of mass storage you have available.

Every time I've used Lucene I have combined it with a SQL provider. It's not necessarily about one or the other. The FTS facilities within the various SQL providers are convenient, but not as capable by comparison. I don't think mixing these into the same thing makes sense. They are two very different animals that are better joined by way of the document ids.

jmgimeno 4 hours ago [-]

Maybe I'm wrong, but for this query:

SELECT * FROM benchmark_logs WHERE severity < 3 ORDER BY timestamp DESC LIMIT 10;

this index

CREATE INDEX ON benchmark_logs (severity, timestamp);

cannot be used as proposed: "Postgres can jump directly to the portion of the tree matching severity < 3 and then walk the timestamps in descending order to get the top K rows."

Postgres with this index can walk to a part of the tree with severity < 3, but timestamps are sorted only for the same severity.

Cervisia 2 hours ago [-]

The SQLite documentation explains how (and how well) this works: https://www.sqlite.org/optoverview.html#the_skip_scan_optimi...

igorw 1 hours ago [-]

While Postgres did introduce skip scan in 18, it only works for equality matching: https://www.crunchydata.com/blog/get-excited-about-postgres-...

dragon96 3 hours ago [-]

If severity is a low cardinality enum, it still seems acceptable

mattashii 47 minutes ago [-]

The order returned from the Index Scan is not the ordering requested by the user, so there would still have to be a full (or topk) Sort over the dataset returned from the index scan, which could negate the gains you get from using an Index Scan; PostgreSQL itself does not produce merge join plans that merge a spread of index scans to get suffix-ordered data out of an index.

davidelettieri 4 hours ago [-]

The "But Wait, We Need Filters Too" paragraph mentions "US" filter which is introduced only later on.

GrayShade 4 hours ago [-]

And footnote 3 is unreferenced.

h1fra 1 hours ago [-]

Postgres is really good at a lot of things, but it's very unfortunate that it's really bad at simple analytics. I wish there was a plugin instead of having to have N databases

Vadim_samokhin 2 hours ago [-]

Just in case, there is a btree_gin extension which can be used in queries combining gin-indexable column and btree-indexable column. It doesn’t solve top-K ordering problem though.

JEONSEWON 5 hours ago [-]

[flagged]

bbshfishe 4 hours ago [-]

[dead]

tacone 3 hours ago [-]

The issue here is the row based format. You simply can't filter on arbitrary columns with that. Either use an external warehouse or a columnar plug-in like Timescale.

45 minutes ago [-]

Rendered at 10:51:55 GMT+0000 (Coordinated Universal Time) with Cloudflare Workers.