Optimize GPDB with Vacuum
Just like your house gets dirty and requires cleaning from time to time, your Greenplum Database (GPDB) accumulates dust too: unwanted pieces of data that make the database grow bigger and bigger. Fortunately, GPDB has a mechanism to get rid of this junk data for you.
How Does a DBMS Get “Dirty”?
Let me explain concurrency control in databases. Generally, when multiple users or transactions read or modify the data stored in a database at the same time, they may conflict with each other. For example, one user may want to read a row while another wants to modify the same row. Let's assume they do these operations within different transactions, so each should be able to do its job without worrying about someone else touching the same data. It's the Database Management System's (DBMS) job to handle this. So let's look at this problem from a DBMS's perspective. There are a few possible strategies.
The first one is to play with locks. In this case, if one transaction wants to read the row, it must wait until the other transaction finishes updating it. Similarly, if a transaction wants to update the row, it must wait for all the transactions that are reading it to release the lock. A consequence of such behavior is poor performance; all read queries must wait for an update to finish, and vice versa (an update query must wait for all read operations to end).
Since different databases handle read consistency and locking differently, I will talk a little bit about how Oracle, Microsoft SQL Server, PostgreSQL, and finally Greenplum handle locks.
➠ Oracle:
Oracle handles read consistency using “rollback segments” (rollback segments in Oracle contain the old values of data that have been changed by uncommitted or recently committed transactions). When someone updates a row, the old version of the data is first copied into a rollback segment, then the row is overwritten in place and marked with a timestamp and a pointer to the older version. This mechanism allows users to query the database without being blocked while an update happens.
Let's take a look at a scenario. If a user starts a query at time1 and an update happens at time2, the user will see a snapshot of the data as of time1. To make this work, the query follows the pointer to the rollback segment to get the old version of the rows that were updated.
Known Problems In Oracle:
There are a couple of problems with this solution. The first is disk I/O, because the database has to first read the data file and then read the rollback segment. The second is that the rollback segment must be sized pretty large in a data warehouse environment, where a query can easily execute for a long period of time. If you are actively changing data, you will eventually run out of rollback space and get the “snapshot too old” error on long-running queries.
➠ SQL Server:
By default, SQL Server uses locking to provide read consistency. This means an update, or even an insert, can block other sessions from reading a table. Starting with SQL Server 2005, you can enable Read Committed Snapshot Isolation (RCSI), which makes the database behave more like Oracle, except that it uses tempdb instead of rollback segments for the old versions of updated rows.
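For illustration, here's roughly what enabling RCSI looks like (MyAppDb is a placeholder database name):

    -- Make read queries use row versions kept in tempdb instead of blocking on locks.
    -- Requires that no other sessions are connected to the database when it runs.
    ALTER DATABASE MyAppDb SET READ_COMMITTED_SNAPSHOT ON;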
Known Problems In SQL Server:
To work around locking, many users resort to the “read uncommitted” isolation level, which permits dirty reads (something experts advise against). And since RCSI stores old row versions in tempdb, it involves extra I/O.
➠ PostgreSQL:
PostgreSQL uses the Multi-Version Concurrency Control (MVCC) approach to handle read consistency. The general concept is that each transaction has its own snapshot (or view) of all the data. Rows that are being modified by other transactions remain accessible with the values they had at the time the transaction started. This eliminates the stops caused by locks. The PostgreSQL documentation says: “The main advantage of using the MVCC model of concurrency control rather than locking is that in MVCC locks acquired for querying (reading) data do not conflict with locks acquired for writing data, and so reading never blocks writing and writing never blocks reading. PostgreSQL maintains this guarantee even when providing the strictest level of transaction isolation through the use of an innovative Serializable Snapshot Isolation (SSI) level.”
The question now is how PostgreSQL handles updates. It simply inserts a new row for each row being updated and marks the old row as out of date. Transactions that started before the updating transaction finishes will still see the old values (the same applies to a DELETE statement: the deleted row is just marked as out of date). In effect, an update operation is a combination of a delete and an insert.
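You can watch this happen yourself using the system columns ctid (the physical location of a row version), xmin, and xmax; the accounts table below is just a throwaway example:

    -- Create a toy table with a single row.
    CREATE TABLE accounts (id int, balance int);
    INSERT INTO accounts VALUES (1, 100);

    -- Note the row's physical location (ctid).
    SELECT ctid, xmin, xmax, * FROM accounts;

    -- The update does not modify the row in place...
    UPDATE accounts SET balance = 200 WHERE id = 1;

    -- ...it writes a new row version at a new ctid; the old version
    -- stays on disk, merely marked as out of date.
    SELECT ctid, xmin, xmax, * FROM accounts;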
To a user, PostgreSQL behaves similarly to Oracle, but it doesn't use rollback segments. Instead, an update adds a new row version to the existing table. Because PostgreSQL doesn't use rollback segments, you will never get a “snapshot too old” error on long-running queries.
Known Problems In PostgreSQL:
The downside to PostgreSQL is that the space occupied by old row versions is not automatically reused, even after all reading and modifying transactions finish. Why not? Because to reuse that space on its own, every insert operation would need to scan the whole table looking for free space, which would be prohibitively expensive.
In a nutshell, deleted or updated rows still occupy physical space on disk even though they are not visible to new transactions. If you update an entire table, you effectively double its size on disk, which is highly inefficient. In short, that's why a PostgreSQL database can use much more space than it really needs.
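You can see the effect with a quick before-and-after size check (reusing the hypothetical accounts table from above; the numbers are only meaningful on a table with a decent amount of data):

    -- Physical size before the update.
    SELECT pg_size_pretty(pg_relation_size('accounts'));

    -- Rewrite every row: each one gets a new version, none are removed.
    UPDATE accounts SET balance = balance;

    -- The table is now roughly twice its previous size on disk.
    SELECT pg_size_pretty(pg_relation_size('accounts'));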
➠ Greenplum:
GPDB is an array of individual databases based upon PostgreSQL 8.2 working together to present a single database image. So, it also uses MVCC to handle read consistency. Readers don't block writers and writers don't block readers. Each transaction sees a snapshot (or view) of the database rather than locking tables.
Greenplum MVCC Exception:
Greenplum differs slightly from PostgreSQL for UPDATE and DELETE commands: it acquires an Exclusive lock on the table (while a statement modifies data, its transaction holds this lock, which prevents other transactions from concurrently modifying the same table). However, this lock doesn't block readers; you will likely never even notice it, because the database handles the updates and deletes one at a time while you can still select data. SQL queries will return data, and users will see a clean snapshot of it. The Exclusive lock does, however, block other UPDATE and DELETE commands, which is different from PostgreSQL.
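A sketch of how this plays out across two concurrent sessions (again using the hypothetical accounts table):

    -- Session 1: start an update and leave the transaction open.
    BEGIN;
    UPDATE accounts SET balance = balance + 10 WHERE id = 1;

    -- Session 2: reads are NOT blocked; this returns the pre-update snapshot.
    SELECT * FROM accounts WHERE id = 1;

    -- Session 2: a second update on the same table, however, waits until
    -- session 1 commits or rolls back, because of the table-level Exclusive lock.
    UPDATE accounts SET balance = balance - 10 WHERE id = 1;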
How Do You Clean GPDB?
GPDB comes with the VACUUM operation as a solution to this problem. Here's how it works. Greenplum has a special structure called the Free Space Map (FSM). It holds information about the rows (or, more precisely, pages) that are no longer needed, i.e., those marked as out of date for which all transactions that could potentially use them have already finished. When a new row is inserted or updated, the DBMS looks for free space in the FSM first. This way, the space gets reused.
The operation that updates the FSM is VACUUM. It is very important to run VACUUM frequently on GPDB, because the FSM has limited capacity. If VACUUM runs too rarely, some old pages will remain unused because there is no room in the FSM to track them. In normal GPDB operation, tuples that are deleted or obsoleted by an update are not physically removed from their table; they remain on disk until a VACUUM is done (VACUUM reclaims that storage). It is therefore necessary to run VACUUM periodically, especially on frequently updated tables. If the FSM is not sized large enough, space occupied by dead rows cannot be reclaimed by a regular VACUUM command.
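In practice, that means running something like the following on frequently updated tables. VACUUM VERBOSE reports what was reclaimed, and the max_fsm_pages server setting controls how many pages the FSM can track (the table name is a placeholder):

    -- Reclaim dead-row space in one table and report the activity.
    VACUUM VERBOSE accounts;

    -- Check the FSM capacity (number of disk pages it can track).
    SHOW max_fsm_pages;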
It's important to note that a standard VACUUM operation is non-blocking, so it can be invoked on a production database without requiring maintenance downtime (GPDB recommends that active production databases be vacuumed frequently, at least nightly, to remove expired rows). It does consume additional resources (it performs many I/O operations), so it may temporarily affect overall performance, but it doesn't take exclusive locks and thus doesn't block active transactions. It is therefore advisable to VACUUM the database at low-usage times. Note: VACUUM cannot be executed inside a transaction block.
There is also a VACUUM FULL operation, which does more extensive processing. Instead of reusing old pages, it rebuilds the table structures from scratch and gets rid of all old rows, which gives a better end result. But unlike standard VACUUM, it exclusively locks the tables and takes much more time to finish on large, distributed GPDB tables. That's why VACUUM FULL should be used only during maintenance downtime. Note: if you do get into a situation where the FSM has overflowed, it may be faster to recreate the table with a CREATE TABLE AS statement, then rename the new table and drop the old one, as sketched below. A VACUUM FULL is not recommended on large GPDB tables.
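A sketch of that workaround, with sales as a placeholder table; note that in GPDB you would normally repeat the table's DISTRIBUTED BY clause in the CREATE TABLE AS statement, and that indexes, constraints, and grants must be recreated by hand:

    -- Rebuild the table compactly instead of running VACUUM FULL.
    CREATE TABLE sales_clean AS SELECT * FROM sales;

    -- Swap the rebuilt table in. Do this during a quiet window.
    DROP TABLE sales;
    ALTER TABLE sales_clean RENAME TO sales;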
There are two more things worth mentioning.
The first is that PostgreSQL has a separate optional server process called the autovacuum daemon, whose purpose is to automate the execution of VACUUM and ANALYZE commands (it looks at the statistics of inserts, updates, and deletes and decides when to run). This feature is currently disabled in GPDB. To check when VACUUM and ANALYZE last ran, use the catalog view shown below.
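In GPDB, one place to look is the pg_stat_operations system view, which records maintenance actions against each object; for example:

    -- When were tables last vacuumed or analyzed?
    SELECT objname, actionname, statime
    FROM pg_stat_operations
    WHERE actionname IN ('VACUUM', 'ANALYZE')
    ORDER BY statime DESC;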
The last thing I'd like to point out is planner statistics. VACUUM ANALYZE performs a VACUUM and then an ANALYZE (which updates the statistics used by the planner to determine the most efficient way to execute a query) for each selected table. Note: GPDB best practice is not to run VACUUM ANALYZE as a single command; this is generally discouraged because VACUUM and ANALYZE are different operations with different purposes.
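Following that best practice, issue them as two separate statements (the table name is again a placeholder):

    -- Reclaim space first, then refresh the planner's statistics.
    VACUUM sales;
    ANALYZE sales;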