Designing and Managing Data Partitioning
- Category: Performance
- Published: Thursday, 28 April 2016 22:06
- Written by Super User
Is your table data getting too large to manage? Do queries take too long to execute? If you answer these two questions with YES, then you should consider partitioning your large tables.
When data becomes too large for your system, the solution is often to buy bigger hardware or assign more CPU and memory resources to the machine, when the best, cheapest and most effective solution is to partition the data into more manageable parts.
Modern databases offer a feature called table partitioning, which reduces query execution times and makes the data more manageable.
What is Table Partitioning?
Table partitioning is a data organization scheme that divides large quantities of data into multiple data storage objects (data partitions). These storage objects can be in different table spaces, in the same table space, on the same or different storage paths/disks, or a combination of these. The whole purpose of table partitioning is to divide the table data into smaller, more manageable data units, reduce the need for and duration of reorganizations, and speed up inserts/updates/deletes and query execution times, which on regular tables become slower as the data in the table grows.
Partitioning types
Most modern databases offer table partitioning, and several types of it: range partitioning, partition by growth, partition with MDC, list partitioning, hash partitioning, and partitioning with subpartitions (composite partitioning). In addition to plain columns in a partitioning key, you can build the partitioning key with functions on those columns. An example: create table .... partition by (Year(OrderDate)) .... (Note that databases place different limitations on this.)
Range partitioning
Selects a partition by determining whether the partitioning key falls inside a certain range. An example could be a partition for all rows where the column AccountNo has a value between 100000 and 199999.
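As a sketch, range partitioning on such an AccountNo column could look like this in MySQL-style syntax (table and partition names are illustrative, not taken from a real system; other databases use different DDL):

```sql
-- Illustrative MySQL-style range partitioning on AccountNo.
-- Partition p1 holds rows where AccountNo is 100000-199999.
CREATE TABLE Accounts (
    AccountNo   INT NOT NULL,
    AccountName VARCHAR(100),
    PRIMARY KEY (AccountNo)
)
PARTITION BY RANGE (AccountNo) (
    PARTITION p0   VALUES LESS THAN (100000),
    PARTITION p1   VALUES LESS THAN (200000),
    PARTITION p2   VALUES LESS THAN (300000),
    PARTITION pmax VALUES LESS THAN MAXVALUE
);
```

The MAXVALUE partition catches rows above all defined ranges; without it, inserting such a row would fail.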
List partitioning
A partition is assigned a list of values. If the partitioning key has one of these values, that partition is chosen. For example, all rows where the column Country is either Iceland, Norway, Sweden, Finland, Denmark, Greenland or the Faroe Islands could form a partition for the Nordic countries.
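In MySQL this could be sketched with LIST COLUMNS partitioning (names are illustrative; note that MySQL has no catch-all list partition, so every expected value must appear in some list):

```sql
-- Illustrative list partitioning on a Country column.
CREATE TABLE Customers (
    CustomerID INT NOT NULL,
    Country    VARCHAR(30) NOT NULL,
    PRIMARY KEY (CustomerID, Country)
)
PARTITION BY LIST COLUMNS (Country) (
    PARTITION pNordic VALUES IN ('Iceland', 'Norway', 'Sweden', 'Finland',
                                 'Denmark', 'Greenland', 'Faroe Islands'),
    PARTITION pOther  VALUES IN ('Germany', 'France', 'Spain')
);
```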
Hash partitioning
The value of a hash function determines membership in a partition. Assuming there are four partitions, the hash function could return a value from 0 to 3.
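A minimal sketch in MySQL syntax, where the server applies its hash function to the key and takes the result modulo the partition count:

```sql
-- Illustrative hash partitioning into four partitions (numbered 0-3).
CREATE TABLE Sessions (
    SessionID INT NOT NULL,
    UserID    INT NOT NULL,
    PRIMARY KEY (SessionID)
)
PARTITION BY HASH (SessionID)
PARTITIONS 4;
```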
When does data in a table become a candidate for partitioning?
In my opinion, and that of many others, you should look at all tables that have 5 million or more records and make them partitioned. In many cases you should consider partitioning even when the total number of records is under 5 million, if the statistical distribution of the data shows a good spread. Why? This is all about balancing the workload. Look at your hardware or virtual machine setup, memory, operating system, disks and the filesystem your database resides on. All of these are part of the total workload of your system. As data and the filesystem grow, that part becomes the bottleneck of the workload once it reaches a certain size. This is where table partitioning becomes handy and necessary. By partitioning your data you divide it into smaller, more manageable parts for the whole system to work with, whether that is reorg, runstats, inserts/updates/deletes or SQL queries. Most importantly, you can (in most cases) reduce backup and recovery time, which is a vital part of your SLAs.
What do we have to look at when designing a partitioned table?
The most important factor when designing a partitioned table, or choosing the partitioning key, is the table statistics and the statistical distribution of your data. The statistical distribution gives you a clear vision of how you can divide the data. Look at the column statistics of the table and execute several GROUP BY queries to determine which column(s) give you the most even distribution of data. Each partition should hold roughly the same amount of data, so that the data is evenly distributed across partitions for optimal performance and maintenance. Note that you must also avoid creating hot-spot partitions as much as possible.
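For example, such distribution checks on a hypothetical Orders table (column names are assumptions for illustration) could look like this:

```sql
-- How many rows fall into each year? A good range-partitioning key
-- should give roughly equal counts per bucket.
SELECT YEAR(OrderDate) AS OrderYear, COUNT(*) AS RowCount
FROM Orders
GROUP BY YEAR(OrderDate)
ORDER BY OrderYear;

-- The same check for a candidate list/hash key such as PostalCode.
SELECT PostalCode, COUNT(*) AS RowCount
FROM Orders
GROUP BY PostalCode
ORDER BY RowCount DESC;
```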
Let's look at a table that holds information about Orders. This is a table that can grow really fast and become unmanageable in a short time if nothing is done. The first partitioning key that comes into consideration is the OrderDate column. This column gives the most evenly distributed data over a certain period of time. As an example, by using range partitioning you could have yearly, quarterly or monthly partitions.
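Completing the earlier create table fragment in MySQL-style syntax, yearly range partitions on Year(OrderDate) might look like this (column and partition names are assumptions; DB2 and Oracle use different DDL):

```sql
-- Illustrative yearly range partitioning of an Orders table.
-- MySQL requires the partitioning column in every unique key,
-- hence the composite primary key.
CREATE TABLE Orders (
    OrderID    INT NOT NULL,
    OrderDate  DATE NOT NULL,
    CustomerID INT,
    Amount     DECIMAL(10,2),
    PRIMARY KEY (OrderID, OrderDate)
)
PARTITION BY RANGE (YEAR(OrderDate)) (
    PARTITION p2013 VALUES LESS THAN (2014),
    PARTITION p2014 VALUES LESS THAN (2015),
    PARTITION p2015 VALUES LESS THAN (2016),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);
```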
The next candidate for a partitioning key could be the OrderID or OrderNo column (if it is an integer/identity column), which you could use to partition the data into a fixed number of rows per partition. Another candidate could be PostalCode, to distribute the data across different postal codes.
Indexes and partition look-aside (partition pruning)
Many modern databases with table partitioning capabilities, such as Oracle, DB2 and MySQL, have a feature called partition look-aside (partition pruning). This feature lets the DBMS examine the query and the column values in its WHERE clause to determine, together with the partitioning key's minimum and maximum values for each partition and any partitioned indexes, which partitions need to be read and which can be skipped. This feature obviously reduces query time tremendously on very large to huge tables.
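Assuming an Orders table range-partitioned by year on OrderDate, pruning can be observed in MySQL with EXPLAIN: the partitions column of the plan should list only the partition(s) holding 2015 data, not all of them (a sketch, with assumed table and column names):

```sql
-- Only the partition covering 2015 should be scanned;
-- all other partitions are pruned before any rows are read.
EXPLAIN SELECT OrderID, Amount
FROM Orders
WHERE OrderDate BETWEEN '2015-01-01' AND '2015-12-31';
```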
My experience and advice to you
In my research and work over the past 10-15 years in the area of database tuning, I have seen and experienced tremendous reductions in execution time for select, insert, update and delete queries, reorg and runstats executions, and backup/restore operations.
Some of the reduced times went from days down to minutes. In one case a runstats execution went from 7 hours down to 30 minutes on a table with 300 million rows. In another case a batch select query went from a 23-hour execution time down to 10 seconds on a table holding around 800 million rows.
Remember the general rule of thumb: 5 million records.
So if your data is getting too large to manage and your queries take too long to execute, then I strongly advise you to consider and implement table partitioning.