Thursday, 5 September 2013

How to retrieve the new rows of a table every minute


I have a table to which rows are only appended (never updated or deleted),
always inside transactions (I'll explain why this matters), and I need to
fetch the new, previously unfetched rows of this table every minute with
a cron job.
How am I going to do this? In any programming language (I use Perl, but
that's irrelevant).
I'll list the ways I thought of to solve this problem, and ask you to
show me the correct one (there HAS to be one...)
The first way that popped into my head was to save (in a file) the largest
auto_incrementing id of the rows fetched, so in the next minute I can
fetch with: WHERE id > $last_id. But that can miss rows. Because new rows
are inserted in transactions, it's possible that the transaction that
saves the row with id = 5 commits before the transaction that saves the
row with id = 4. It's therefore possible that the cron script retrieves
row 5 but not row 4, and when row 4 gets committed a split second later,
it will never get fetched (because 4 is not greater than 5, which is now
the $last_id).
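For reference, a minimal sketch of this first approach (Python with sqlite3 standing in for the real database; the state-file name and the "events" table schema are assumptions for illustration):

```python
import os
import sqlite3

STATE_FILE = "last_id.txt"  # hypothetical path holding the saved high-water mark

def load_last_id():
    # Largest id fetched so far; 0 if the cron job has never run.
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return int(f.read().strip())
    return 0

def fetch_new_rows(conn):
    # Fetch everything above the saved high-water mark...
    last_id = load_last_id()
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE id > ? ORDER BY id", (last_id,)
    ).fetchall()
    # ...and persist the new high-water mark for the next cron run.
    if rows:
        with open(STATE_FILE, "w") as f:
            f.write(str(rows[-1][0]))
    return rows
```

The gap described above is visible in the code: if row 5 is visible but row 4's transaction hasn't committed yet, the high-water mark jumps to 5 and row 4 is skipped forever.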
Then I thought I could make the cron job fetch all rows that have a date
field within the last TWO minutes, check which of these rows were already
retrieved in the previous run of the cron job (to do this I would need to
save somewhere which row ids were retrieved), compare, and process only
the new ones. Unfortunately this is complicated, and it still doesn't
solve the problem that occurs if a certain inserting transaction takes
TWO AND A HALF minutes to commit for some weird database reason: by then
the row's date is too old for the next iteration of the cron job to
fetch.
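A sketch of this windowed variant, under the same illustrative assumptions (sqlite3, an "events" table with a "created_at" epoch column; a real cron job would persist the seen-id set between runs instead of keeping it in memory):

```python
import sqlite3
import time

seen_ids = set()  # in a real cron job this set must be persisted between runs

def fetch_window(conn, window_seconds=120):
    # Pull every row stamped inside the two-minute look-back window...
    cutoff = time.time() - window_seconds
    rows = conn.execute(
        "SELECT id, payload FROM events WHERE created_at > ?", (cutoff,)
    ).fetchall()
    # ...then drop the rows already handled in a previous run.
    new_rows = [r for r in rows if r[0] not in seen_ids]
    seen_ids.update(r[0] for r in new_rows)
    return new_rows
```

Note that a row whose transaction commits with a timestamp older than the window (the two-and-a-half-minute case above) never appears in any fetch, so the flaw survives the extra bookkeeping.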
Then I thought of installing a message queue (MQ) like RabbitMQ or any
other. The same process that performs the inserting transaction would
notify RabbitMQ of the new row, and RabbitMQ would then notify an
always-running process that processes new rows. So instead of getting a
batch of rows inserted in the last minute, that process would get the new
rows one by one as they are written. This sounds good, but it has too many
points of failure: RabbitMQ might be down for a second (during a restart,
for example), and in that case the insert transaction will have committed
without the receiving process ever receiving the new row. So the new row
will be missed. Not good.
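To make the failure mode concrete, here is the notify pattern sketched with Python's in-process queue.Queue standing in for RabbitMQ (a real broker runs out of process and can be unreachable at the exact moment a transaction commits; everything here is illustrative):

```python
import queue

# queue.Queue stands in for the message broker in this sketch.
mq = queue.Queue()

def insert_row(db, row):
    # Step 1: commit the row (simulated with a plain list).
    db.append(row)
    # Step 2: announce it on the queue. If the broker is down right here,
    # the row is committed but never announced -- the gap described above.
    mq.put(row)

def drain():
    # The always-running consumer, collapsed into a single drain for the sketch:
    # it receives rows one by one, in the order they were announced.
    rows = []
    while not mq.empty():
        rows.append(mq.get())
    return rows
```

The two steps in insert_row are not atomic, which is exactly why a broker restart between them loses a row.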
I just thought of one more solution: the receiving processes (there are 30
of them, doing exactly the same job on exactly the same data, so the same
rows get processed 30 times, once by each receiving process) could write
to another table that they have processed row X when they process it;
then, when the time comes, they can ask for all rows in the main table
that don't exist in the "have_processed" table with an OUTER JOIN query.
But I believe (correct me if I'm wrong) that such a query will consume a
lot of CPU and disk I/O on the DB server, since it has to compare the
entire lists of ids of the two tables to find the new entries (and the
table is huge and getting bigger every minute). It would have been fast if
there were only one receiving process: then I would have been able to add
an indexed field named "have_read" in the main table, which would make
looking for new rows extremely fast and easy on the DB server.
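A sketch of that OUTER JOIN query (sqlite3 again; the table and column names are assumptions). Rows with no matching "have_processed" entry for a given worker come back with a NULL on the joined side:

```python
import sqlite3

def unprocessed_rows(conn, worker_id):
    # LEFT JOIN leaves p.row_id NULL for main-table rows this worker
    # has not yet recorded, so the WHERE clause keeps exactly those.
    return conn.execute(
        """
        SELECT e.id, e.payload
        FROM events e
        LEFT JOIN have_processed p
          ON p.row_id = e.id AND p.worker_id = ?
        WHERE p.row_id IS NULL
        ORDER BY e.id
        """,
        (worker_id,),
    ).fetchall()

def mark_processed(conn, worker_id, row_id):
    # Record that this worker has handled this row.
    conn.execute(
        "INSERT INTO have_processed (row_id, worker_id) VALUES (?, ?)",
        (row_id, worker_id),
    )
```

On the cost concern: a composite index on have_processed (worker_id, row_id) should let the join probe the index rather than scan the whole table, though how well that holds up on a huge, growing table depends on the database.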
What is the right way to do it? What do you suggest? The question is
simple, but a solution seems hard (for me) to find.
Thank you.
