Let's Be Great With Netezza
When we migrate from Oracle or SQL Server to Netezza, the temptation is to simply migrate our star schema database with the same structures and the same ETL. While you will gain the basic performance improvements of an MPP platform, you will not even approach the potential of a well-designed MPP implementation. So here are a few basic best practices:
1. Distribute, distribute, distribute (the Netezza version of location, location, location). When we create tables we spread the data across the nodes using the DISTRIBUTE ON keyword. There are three goals for distribution: one, to spread the data evenly; two, to colocate data to support query joins; and three, to spread the workload evenly. Be smart. We don't want one node doing all the work. That is why we normally never use transaction date for distribution.
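A minimal sketch of what that looks like (the table and column names here are hypothetical): distribute on a high-cardinality key that spreads rows evenly, rather than on a skewed attribute like transaction date.

```sql
-- Hypothetical fact table: distribute on a high-cardinality key,
-- not on transaction_date, which would pile each day's load onto
-- a handful of data slices.
CREATE TABLE sales_fact (
    sales_header_key BIGINT  NOT NULL,
    customer_key     INTEGER NOT NULL,
    transaction_date DATE    NOT NULL,
    sales_amount     NUMERIC(12,2)
)
DISTRIBUTE ON (sales_header_key);
```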
2. Zone maps are critical to Netezza. A zone map is created automatically for every integer, date, and time attribute; they are not created for string or numeric (decimal) attributes. A zone map records the minimum and maximum values of an attribute within each zone of the table, and Netezza uses it to identify zones it does NOT need to read. On Netezza, performance is about minimizing I/O: the blocks we read into memory and the blocks we transfer to other nodes to support queries.
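To make the idea concrete, here is the kind of query that benefits, reusing the hypothetical sales_fact table above: a range predicate on a zone-mapped date column lets Netezza skip every extent whose min/max falls outside the range.

```sql
-- transaction_date is a DATE, so it has a zone map. Netezza can skip
-- any extent whose recorded min/max range falls outside January 2014.
SELECT SUM(sales_amount)
FROM   sales_fact
WHERE  transaction_date BETWEEN '2014-01-01' AND '2014-01-31';
```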
3. It is not enough to have zone maps; you need well-designed zone maps. If you commonly filter on an attribute of low cardinality, such as transaction type, the zone maps may all look the same and you will still end up reading most of the data blocks for your queries. In this situation, consider the ORGANIZE ON clause: it sorts the data on each node and gives your zone maps more selectivity.
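A sketch of a clustered base table, again with hypothetical names: ORGANIZE ON keeps the rows sorted within each data slice, so the zone maps on those columns become selective even for lower-cardinality attributes.

```sql
-- ORGANIZE ON (up to four columns) keeps rows sorted within each data
-- slice, so each extent holds a narrow range of transaction_type and
-- transaction_date values and the zone maps can actually exclude blocks.
CREATE TABLE sales_fact_org (
    sales_header_key BIGINT  NOT NULL,
    transaction_type INTEGER NOT NULL,
    transaction_date DATE    NOT NULL,
    sales_amount     NUMERIC(12,2)
)
DISTRIBUTE ON (sales_header_key)
ORGANIZE ON (transaction_type, transaction_date);
```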
4. Primary keys are not always the best choice for distribution keys. A great example is sales data, where we have a header and a detail record. The header has a primary key of sales header key. The detail record has a primary key of sales header key plus line item key. If we distribute each table on its full primary key, the rows will not colocate. We need to distribute both tables on just the sales header key, as sketched below.
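Here is that header/detail pattern as DDL (the column lists are hypothetical): both tables distribute on the shared parent key, so matching rows land on the same data slice and the join stays local.

```sql
CREATE TABLE sales_header (
    sales_header_key BIGINT  NOT NULL,
    customer_key     INTEGER,
    transaction_date DATE
)
DISTRIBUTE ON (sales_header_key);

CREATE TABLE sales_detail (
    sales_header_key BIGINT  NOT NULL,
    line_item_key    INTEGER NOT NULL,
    product_key      INTEGER,
    sales_amount     NUMERIC(12,2)
)
-- Deliberately NOT (sales_header_key, line_item_key): the composite key
-- would hash detail rows away from their headers and break colocation.
DISTRIBUTE ON (sales_header_key);
```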
5. Remember that distribution just places the data on the nodes; every query still executes on every node. If you distribute on a varchar attribute, there is no zone map, so you are still executing a full table scan on every node just to find records with specific values. This is why we commonly distribute on an integer surrogate key.
6. Sometimes you need the same data twice. A great example of this is IBM Campaign: treatment data is accessed by either treatmentinstid or cellid/packageid. You can create two copies of the same table, each distributed for one access path. Remember, you have no indexes; distribution is how we optimize for our queries.
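A sketch of the two-copies pattern; the treatment table and its columns follow the IBM Campaign example in the text, but treat the exact names as illustrative.

```sql
-- One copy per access path; CTAS lets us pick a new distribution key.
CREATE TABLE treatment_by_inst AS
SELECT * FROM treatment
DISTRIBUTE ON (treatmentinstid);

CREATE TABLE treatment_by_cell AS
SELECT * FROM treatment
DISTRIBUTE ON (cellid);
```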
7. Sometimes an integer version of a numeric attribute is useful too. A great example of this is sales amount. In our sales header we store it as dollars and cents in a numeric column, so there is no zone map. But when we use the attribute as a filter, we often only need the dollar amount (e.g., all sales over $100), so why not create an integer version of the attribute too?
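A minimal sketch, with hypothetical names: carry a whole-dollar INTEGER alongside the exact NUMERIC(12,2) amount. The numeric column has no zone map; the integer does, so the dollar filter can skip extents.

```sql
CREATE TABLE sales_header2 (
    sales_header_key BIGINT NOT NULL,
    sales_amount     NUMERIC(12,2),
    sales_dollars    INTEGER        -- populated by ETL, e.g. FLOOR(sales_amount)
)
DISTRIBUTE ON (sales_header_key);

-- Filter on the integer column to get zone-map pruning:
SELECT COUNT(*) FROM sales_header2 WHERE sales_dollars > 100;
```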
8. Decouple strings when you can. Sometimes we need to search on strings; if we can, we decouple them into a separate dimension. If our fact has 200 million rows but our string attribute has only one million unique values, we save the space of the string repeated across many rows and also minimize the number of rows we need to search to find a string value.
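Sketched with hypothetical tables: the long string lives once in the one-million-row dimension, and the 200-million-row fact carries only an integer key.

```sql
CREATE TABLE url_dim (
    url_key INTEGER NOT NULL,
    url     VARCHAR(500)
)
DISTRIBUTE ON (url_key);

CREATE TABLE click_fact (
    click_date DATE,
    url_key    INTEGER NOT NULL
)
DISTRIBUTE ON (url_key);

-- Search the small dimension for the string, then join back on the
-- integer key instead of scanning millions of varchar values in the fact.
SELECT f.*
FROM   click_fact f
JOIN   url_dim    d ON d.url_key = f.url_key
WHERE  d.url = 'http://example.com/checkout';
```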
9. Make friends with the Netezza function hash8. It is unique enough for most applications. You can use it to generate integer keys from strings, and of course those keys get zone maps (and you know how much we love zone maps).
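hash8 ships with the IBM Netezza SQL Extensions Toolkit (verify it is installed in your environment); the table and column in the second statement are hypothetical.

```sql
-- hash8 maps a string to a BIGINT that is unique enough for most uses.
SELECT hash8('jane.doe@example.com');

-- Typical use: derive an integer key from a string attribute so joins
-- and filters on it get zone-map support.
SELECT hash8(customer_email) AS customer_email_key,
       customer_email
FROM   customer_dim;
```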
10. Get comfortable with when a dimension is the right size to broadcast and when we need to rethink the data model. When a dimension is too large to broadcast, or needs to be broadcast too often, we are putting our server at risk: everybody will see the disk queues grow as we slam data around the nodes.
11. Groom your data. GROOM restructures your data, reclaiming the space left behind by updates and deletes and re-sorting tables that use ORGANIZE ON. You get the benefit of optimized zone maps and fewer blocks to read. Try to groom at least once a week, especially tables that go through updates.
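The basic commands, shown here against the hypothetical tables from earlier:

```sql
-- Reclaim the space held by logically deleted/updated rows:
GROOM TABLE sales_detail RECORDS ALL;

-- On a table created with ORGANIZE ON, groom also restores the
-- clustering order, which keeps the zone maps tight:
GROOM TABLE sales_fact_org RECORDS ALL;
```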
12. Consider going insert only. Most data warehouse ETL performs updates and inserts, but updates and deletes on Netezza can be very slow if not designed efficiently. In some situations you can replace updates with offsetting inserts.
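A minimal sketch of the offsetting-insert pattern, with hypothetical names: instead of updating a row in place, insert a reversal plus the corrected row, and let SUM() produce the current value.

```sql
CREATE TABLE sales_detail_delta (
    sales_header_key BIGINT  NOT NULL,
    line_item_key    INTEGER NOT NULL,
    sales_amount     NUMERIC(12,2)
)
DISTRIBUTE ON (sales_header_key);

INSERT INTO sales_detail_delta VALUES (1001, 1,  25.00);  -- original load
INSERT INTO sales_detail_delta VALUES (1001, 1, -25.00);  -- reversal
INSERT INTO sales_detail_delta VALUES (1001, 1,  30.00);  -- corrected amount

-- The current value is simply the sum of the offsets (here, 30.00):
SELECT sales_header_key, line_item_key, SUM(sales_amount) AS current_amount
FROM   sales_detail_delta
GROUP  BY sales_header_key, line_item_key;
```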
13. Learn to read execution plans. Learn to query zone maps and data slices. Make friends with Netezza talent - there are some smart people out there.
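Two starting points, using the hypothetical sales_fact table from earlier: EXPLAIN shows you the plan, and the datasliceid pseudo-column shows how evenly your distribution key spreads the rows.

```sql
-- Inspect the plan before you tune:
EXPLAIN VERBOSE
SELECT SUM(sales_amount)
FROM   sales_fact
WHERE  transaction_date >= '2014-01-01';

-- Check for skew: row counts per data slice should be roughly even.
SELECT datasliceid, COUNT(*) AS rows_on_slice
FROM   sales_fact
GROUP  BY datasliceid
ORDER  BY rows_on_slice DESC;
```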
14. Be brave about your choices. People are comfortable with star schemas and traditional data models. A great Netezza model requires a change in mindset: you are designing for the hardware, not building a classical 3NF set of data structures, and not everybody will embrace your ideas.
I can still remember the first time I demonstrated a tuned Netezza database. The business user was so used to poor performance that they thought I hadn't executed their query against the full database. When they realized it was their data and their query, I have never seen a bigger smile!