S6: Invalid page / page verification failed (data corrupted)

Category: Server Events
SQLSTATE: XX001 (Class XX - Internal Error: data_corrupted)
Urgency: high

Example Postgres Log Output:

When a checksum failure is detected, you will see the following:

WARNING: page verification failed, calculated checksum 20919 but expected 15254

Assuming ignore_checksum_failure is not enabled, you will then see the following:

ERROR: invalid page in block 335458 of relation base/16385/99454
STATEMENT: SELECT * FROM my_table

Or the following when zero_damaged_pages is enabled (see Recommended Action section):

WARNING: invalid page in block 335458 of relation base/16385/99454; zeroing out page
STATEMENT: SELECT * FROM my_table

Explanation:

This error indicates a critical problem with your database server, as one of its data files has become corrupted.

You will only see these messages when checksums are enabled on your database. When checksums are disabled, you might see lower-level errors instead, or simply get back bad data.

Assuming checksums are enabled and you are seeing this log event, treat it as a matter that needs to be addressed immediately, as you may have to restore the corrupted data from a backup, or fail over to an HA standby server.

Recommended Action:

First of all, it's important to keep in mind that this error is about the pages written to the data directory, not about the WAL stream. That means the primary aim here, if a table's data is indeed lost, is to get a copy of the data from the WAL stream.

Before we go into details on how to do that, let's first find out which table or index is affected.

In the log message invalid page in block 335458 of relation base/16385/99454 we get the following information:

  • 99454 is the pg_class.relfilenode value of the index or table affected
  • 16385 is the database OID, i.e. pg_database.oid
  • 335458 is the block number, from which we can determine the file segment as well as the location within the file (we'll get to that later)

That means we can find the affected table by first doing the following to locate the correct database:

postgres=# SELECT datname FROM pg_database WHERE oid = 16385;
 datname  
----------
 mydb
(1 row)

And then connecting to that database (mydb in this case) and running the following, which gives us the affected table or index:

postgres=# SELECT relname, relkind FROM pg_class WHERE relfilenode = 99454;
 relname  | relkind
----------+---------
 my_table | r
(1 row)

In this case we can see it is a table (relkind = r, see the pg_class documentation) that has the name my_table. If this were an index, i.e. relkind = i, we could REINDEX the index to simply regenerate everything, but since it is a table, we need to do a bit more work.
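
For the index case, a minimal sketch, assuming a hypothetical index named my_index, would simply be:

postgres=# REINDEX INDEX my_index;

Since an index contains nothing that cannot be rebuilt from the table, REINDEX writes a completely new copy of it and the corrupted file is no longer used.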

Our available options are:

Option 1: Fail over to an HA standby server / follower database

Because this corruption is in the data directory, there is a good chance that a follower will not have the same corruption, unless you've created the follower from a bad base backup, or encountered a Postgres bug that writes the bad data the same way on both primary and standby.

If you have a follower available, it is worthwhile to first run the statement that accesses the data on the follower (assuming it's a SELECT), and, once you've confirmed that the statement doesn't error out there, fail over to that follower.
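
As a sketch of that check, assuming the affected table is the my_table from the log example above, you could run the following while connected to the follower (pg_is_in_recovery() returning t confirms you are indeed on a standby):

postgres=# SELECT pg_is_in_recovery();
 pg_is_in_recovery 
-------------------
 t
(1 row)

postgres=# SELECT * FROM my_table;

If the SELECT completes without an invalid page error on the follower, its copy of the data file is intact and failing over is a reasonable path forward.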

Option 2: Replay WAL from an old enough base backup

Assuming you store both your base backups and all WAL files between them, you should be able to make a copy/fork of your database from an old base backup and then replay the WAL all the way to the current timestamp.

The tricky part is that the page may have become corrupted some time ago, and you may only have base backups taken after that point. It might take a bit of trial and error to find the most recent base backup that does not yet contain the corrupted page.

Again, it is important to remember that base backups copy any corrupted pages as-is, whereas the WAL stream has a clean version of the data as it was written.
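
Once the restored copy has replayed far enough, a quick sanity check is to re-run the statement from the log example above against it and confirm that the invalid page error is gone:

postgres=# SELECT * FROM my_table;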

Option 3: Accept that some data is lost, and let Postgres move on

Sometimes we have the data in another system, and we just want to zero out the corrupted part of the table.

The first setting that is useful in this scenario is ignore_checksum_failure = on, which, instead of erroring out, lets the statement continue in the face of a checksum failure. This may return bad data, crash the Postgres server, or do other bad things, but it may sometimes allow you to recover some of the rows located in the damaged page.
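
A minimal sketch of that salvage step, assuming the affected table is the my_table from the log example and that you are connected as a superuser (ignore_checksum_failure can only be changed by superusers), might look like this:

postgres=# SET ignore_checksum_failure = on;
postgres=# SELECT * FROM my_table;
WARNING: page verification failed, calculated checksum 20919 but expected 15254

Any rows the statement does return should be copied somewhere safe (for example with INSERT ... SELECT into another table) and treated with suspicion, since the damaged page may yield bad data.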

Once we've recovered whatever there is to recover, we can get rid of the errors by using zero_damaged_pages = on, which destroys corrupted pages as they are accessed. This option destroys data, so you should only enable it with great caution.

Once you've regained the ability to access the table, it is recommended that you make a copy of the table before turning off zero_damaged_pages (the zeroed-out page is not written back to disk in the original table, it is only zeroed out in shared memory).
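
A sketch of those two steps (enabling zero_damaged_pages and making the copy), assuming the affected table is my_table and using the placeholder name my_table_recovered for the copy, might look like this:

postgres=# SET zero_damaged_pages = on;
postgres=# CREATE TABLE my_table_recovered AS SELECT * FROM my_table;
WARNING: invalid page in block 335458 of relation base/16385/99454; zeroing out page
postgres=# RESET zero_damaged_pages;

The copy reads the whole table while zero_damaged_pages is active, so the rows on the damaged page are gone, but everything else ends up in my_table_recovered, which you can then use to repair or replace the original table.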

Learn More - Viewing the affected area of the data file:

Sometimes it can be useful to examine the exact location of the data corruption, to understand better whether the root cause was a hard disk failure, a bit flipped due to cosmic rays, or another mysterious problem, maybe even a Postgres bug.

In the Recommended Action section above we described each part of the invalid page log message. We will re-use that breakdown to first determine the location of the data file, and then the place within the file where the corruption is present:

LOG: invalid page in block 335458 of relation base/16385/99454

First, we need to know where our data directory is (SHOW data_directory), and then we can navigate to the base/16385 directory. There you will find one file that's named exactly like the relfilenode value (99454), and most likely additional files with suffixes like .1, .2 and so forth that indicate additional segments (each segment has a maximum size of 1 GB).
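
If you prefer to let Postgres do part of this lookup, the built-in pg_relation_filepath() function returns the path of a relation's data file relative to the data directory. Assuming again that the affected table is my_table, that looks like this:

postgres=# SHOW data_directory;
        data_directory        
------------------------------
 /var/lib/postgresql/9.6/main
(1 row)

postgres=# SELECT pg_relation_filepath('my_table');
 pg_relation_filepath 
----------------------
 base/16385/99454
(1 row)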

In order to determine which of the segments the affected page is located in, you will need to do the following math:

page_size = 1024 * 8
pages_per_segment = 1024 * 1024 * 1024 / page_size

block_number = 335458
segment_number = block_number / pages_per_segment
page_offset = block_number % pages_per_segment
byte_offset = page_offset * page_size

That means that in this case the segment is 2 (always round down), and the page offset within the file is 73314. On almost all installations a Postgres page is 8 KB in size, which means the byte offset within that segment is 600588288.
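
If you'd rather not do this math by hand, you can let Postgres compute it, including the hex conversion used in the next step. The values here are the ones from the log example (131072 is pages_per_segment and 8192 is page_size from above), and to_hex() is a built-in function:

postgres=# SELECT 335458 / 131072 AS segment_number,
postgres-#        335458 % 131072 AS page_offset,
postgres-#        (335458 % 131072) * 8192 AS byte_offset,
postgres-#        to_hex((335458 % 131072) * 8192) AS byte_offset_hex;
 segment_number | page_offset | byte_offset | byte_offset_hex 
----------------+-------------+-------------+-----------------
              2 |       73314 |   600588288 | 23cc4000
(1 row)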

Combining all of this, using the hex dump tool xxd to access the file, and translating 600588288 into hex (23cc4000), we can view the bad page like this (each xxd output line covers 16 bytes, so 512 lines cover the full 8 KB page):

xxd /var/lib/postgresql/9.6/main/base/16385/99454.2 | grep -A 512 23cc4000

As mentioned before, this is most useful for visually inspecting the corruption, e.g. to determine whether only a single bit is flipped. You may also find pg_hexedit useful for working with the Postgres page structure.

Learn More - Verifying the data_checksums setting / How to enable checksums:

To enable checksums your cluster must have been initialized with the --data-checksums / -k option for initdb. This is not the default for a plain initdb run (including the clusters created by most PGDG and Linux distribution packages), but most cloud providers enable it for their managed Postgres offerings.

To verify whether checksumming is enabled, you can run the following on PostgreSQL 9.4+:

postgres=# SHOW data_checksums;
 data_checksums
----------------
 on
(1 row)

This will return on when they are enabled, and off if they are disabled.

Note that currently checksums can't be enabled on an existing cluster, so you need to dump your database using a tool like pg_dump and then restore into a cluster that already has checksums enabled.

In general it is recommended to operate your PostgreSQL cluster with checksums enabled, as they serve as an important early warning in the case of corruption.
