I was searching online recently and saw this headline clipped from the most recent Gartner Magic Quadrant for Enterprise Backup/Recovery Software (June 2015):
Gartner states that by 2016, 20% of organizations, up from 12% today, will employ only snapshot and replication techniques, abandoning traditional backup/recovery for the majority of their data…
I realize that structured data is maybe only 25% or less of what the typical enterprise is trying to protect, however, the fact that this many organizations are committed to a “snapshot/replication first” strategy is a big deal for anyone that manages large databases or large numbers of databases. Technologies that are applicable across multiple system domains are an important strategy for managing IT ownership and operational costs.
I heavily relied on intelligent storage array snapshots for data management associated with SQL Server proof-of-concept projects I conducted with customers during my 10 years at the Microsoft Technology Center in Silicon Valley. It was a big advantage to customers that were trying to minimize time in the lab and maximize results from their collaboration with Microsoft. Even after the SQL Server product team implemented database snapshots in the 2005 release, there were aspects of array based snapshots including read/write access and the ability to efficiently manage many snapshots per database that kept us using them in many projects.
I have heard a number of very knowledgeable SQL DBAs and developers refer to storage array snapshots as “SAN voodoo”. This is the first in a series of articles that I am writing intended to reduce or eliminate the knowledge gap between how SQL Server database snapshots are implemented vs snapshots created on storage arrays. Not all storage arrays implement snapshots in the same way, however there are some common patterns that you’ll see used. In this article I use some examples from EMC’s VNX and VMAX products. I currently work for EMC and I would not want to accidently misrepresent the implementations of snapshots on another vendor’s product. The views and information expressed here are my own. They have not been reviewed or approved by Microsoft or EMC. See the disclaimer on the website https://gotdisk.wordpress.com.
Snapshots of databases and other oddities
My wife is a photographer. When I told her about what I was working on she asked “snapshots of databases – what?” I was hoping she would help me come up with a clever title for this topic but she had to get school – she also is a teacher. I did a quick Bing search (yes I said Bing search) for database snapshots and for the heck of it clicked on the Images tab. I found this one and thought it would help me make a couple of points.
In this article I want to compare and contrast the implementations of:
- SQL Server Database Snapshots, and
- Intelligent storage array snapshots
Since research shows as many as 80% of database teams and storage teams don’t collaborate, I’m hoping articles like these will be interesting to both communities and start some dialog.
There are a lot more similarities than most storage or database professionals might expect between SQL database snapshots and intelligent storage array snapshots. Please share this content with other teams in your organization and start a conversation. Challenge your organization choose the best combination of SQL Server and intelligent storage array features to reduce IT cost and complexity.
I’m planning on extending coverage of this topic in future articles that will go into more depth on storage array snapshots starting with a deep dive on application consistent versus crash consistent (OS consistent) snapshots and some common myths about intelligent storage array snapshots. Please leave comments at the bottom and let me know what you would like to hear about.
In this article, I’ll first go into more details on the SQL Server database snapshot implementation and then compare and contrast that with a generic discussion of how storage array snapshots work. I’ll describe some specifics that I feel are relevant to many storage platforms. Remember, the intent of this discussion is to help storage and database administrators reach some common understanding of how the various features of applications and storage products can better solve business problems together.
SQL Server Database Snapshots
SQL Server introduced database snapshots in the 2005 release. There have been no significant enhancements including in the current previews of SQL 2016. Some of the advantages identified when the feature released included:
- Protecting your system from user or administrator error
- Offloading reporting
- Maintaining historical data
- System upgrades
- Materializing data on standby servers
- Recovering data
These use cases apply universally to all database “copy” technologies. Since snapshots are not full copies like a backup or storage clone they are generally very low overhead to create but don’t have all the protection features of full copies. This article should give you a good understanding of what make snapshots unique among data copy options.
At the heart of all snapshot technologies is the unit of replication or chunk size (snapshot granularity) that will be utilized for managing changes. In the case of SQL Server Database Snapshots that unit is a single 8k page. The second implementation decision that is common to all snapshot architectures is allocating storage space for the additional data chunks that get generated as changes occur. SQL Server uses a NTFS file structure called a sparse file for storing pages generated during change tracking for database snapshots. A NTFS sparse file can be created with a very small footprint and then expand as new data is written without explicitly setting a file size. Sparse files are also very efficient in handling data with large regions of zeros. I found this background article on Sparse Files if you are looking for a starting place for a deeper understanding.
SQL Server allocates new space in the sparse file in 64K chunks to match the size of a data file extent (8x8k per page). If only one page in an extent is modified, the SQL engine creates a 64K allocation in the sparse file and puts the page in the same slot that it occupies in the source data file extent. The other page locations for the extent are written with zeros. As more pages in that source extent are changed they are moved to the sparse in the same slot order as the original extent. This should provide for more sequential I/O if multiple pages are copied to the sparse file and the read only query needs to retrieve multiple pages from that extent. I have not seen any discussion on how much this approach impacts performance but the algorithm seems to be a good balance between creating unused space in the sparse file and giving some potential to cut down on purely random read access to the sparse file. There are a lot more implementation details of how SQL Server uses NTFS sparse files to manage efficient usage of disk space for database snapshots. Since databases that use SQL snapshots have a latency dependency on a sparse file, you need to factor the performance of that device into the design of an application that will include use of snapshots.
In the picture to the right you can see the relationship between the source database and the sparse file. There is only one database page that has been modified since the snapshot was created. Application users often times refer to a SQL snapshot as if it were a separate copy of the data and use terms like “the snapshot copy”. As this picture shows, there is no “snapshot copy”. There is only the current database and a separate storage area (the sparse file) that holds the original copies of the database pages that have changed since the snapshot was created. SQL Server database snapshots are implemented with read-only access support. You cannot write to a SQL Server snapshot. Users can read from and write to the source database. Writes to the source database will update the pages (shown on the left of the picture). When a page is written to in the source database, SQL Server:
- Puts a latch on the page.
- Copies the unmodified page to the sparse file
- Modifies the source page
- Releases the latch.
Since the duration of the latch includes the time it takes to make a copy of the page in the sparse file, the latency of “copy on write” operation will add to overall transaction latency. The time that SQL holds a latch on the page may result in increased wait time for any other operation that wants to access that page.
The picture shown to the left represents two points in time. The right side is after 30% of the pages have been modified in the original database and the one on the left shows when 80% of the original pages have been modified. The sparse file size is approaching the size of the original database over time.
Notice the blue line down the center of each drawing with the ‘Read operation on the snapshot”. Read activity on the snapshot is moving from the original database file locations to the sparse file. The performance of queries on the snapshot is going to be impacted by the performance of the storage supporting the sparse file.
SQL Server supports multiple snapshots on a single database. Each snapshot will need to be allocated a unique sparse file at the time of creation. In the copy on write process, a separate read only copy of each original version of changed pages will be written to each sparse file for all the active snapshots. The impacts of maintaining parallel updates for multiple snapshots should be evaluated carefully before implementation on any database.
The existence of dedicated sparse files for each snapshot permits matching the storage performance to the use of the snapshot. If DB01 needs a low latency snapshot, we could place that sparse file on local flash drives. If the performance impact from using snapshots with another database is less critical, you could place that NTFS sparse file on less expensive storage with lower performance.
SQL Server snapshots cannot be refreshed. To obtain a more recent snapshot delete the old snapshot and create a new one. SQL Server database snapshot names must be unique but you can reuse a name after the older copy is deleted.
SQL Server database snapshots are given a database ID like any user database. Therefore the total number of databases plus snapshots cannot exceed the limit of databases per instance (32,767 in recent versions).
Intelligent Storage Array Snapshots
In the first part of this article I discussed how SQL Server implemented three important aspects of data snapshot management.
- Chunk size for managing change data, also known as snapshot granularity
- Space allocation for managing the data change differences.
- Read/Write access
There is no standard chuck size across any of the storage arrays that I am familiar with. For the latest version of the EMC VNX line, the snapshot granularity is also 8K. Snapshot granularity on the EMC VMAX 3 is 64K. Setting a size for snapshot granularity involves making a tradeoff between the amount of space overhead consumed (small chunks are better) vs the amount of resources needed to manage the meta data for the total number of snapshots supported by the array (big chunks are better). Remember for SQL Server database snapshots, space is allocated in the sparse file in 64K chunks (1 extent) while data is written to the sparse file in 8K (1page) chunks using copy on write.
Snapshot Space Allocations
Space allocations for snapshot management on most modern storage arrays are made from the same pool of storage that is used for the source for the snapshot (see VNX Snapshot graphic below). This wasn’t always the case. You should check with the array vendor to determine how it is implemented on the equipment you are using or evaluating. When I was first learning about array based snapshots in the early 2000s you had to create a dedicated set of disks for managing snapshot change data. This pool of storage is directly comparable to the role that NTFS sparse files play in support of database snapshots for SQL Server.
The picture on the right shows a comparison of two different approaches to managing snapshots. These examples are options as implemented on the EMC VNX series of intelligent storage arrays. Other arrays may use different approaches.
The left side of the image labeled SnapView Snapshots is an older method that uses a dedicated storage pool for change block management. The blue box represents the Reserved LUN Pool that would be created using physical disks on the array typically dedicated to snapshot management. The RLP serves the same purpose as the NTFS sparse file for SQL Server snapshots, however, there are some significant differences which I will discuss next.
A Reserved LUN Pool (RLP) on VNX is used for all snapshots created on the array compared to one NTFS sparse file per snapshot with SQL Server. The size and resources for the RLP needs to match the expected needs of reading and writing to snapshots. Each array that implements snapshots will have guidance on sizing and managing space and performance. Array snapshots that use a RLP typically use the same COW process that SQL Server does. You can follow the steps for the COW process for VNX in the red numbered boxes in the SnapView Snapshot graphic on the right. Therefore, just as in SQL Server snapshots the latency of writing to a LUN that is using snapshots will be impacted by the performance of the storage location in this type of implementation.
The right of the image above labeled VNX Snapshot shows an example of how snapshots are implemented on most current storage arrays including EMC VNX and VMAX product lines. The storage for managing changes for LUNS using snapshots is the same as the pool that source LUN is allocated too. The process for handling changes is called Redirect on Write (ROW) and is shown in the graphic by the two boxes with red numbers. Straight away you can see there are less steps which is often times a good thing when writing data!
In advanced snapshot implementations, both the source LUN and any snapshot copies are implemented as a set of pointers to a group of storage locations that hold the unique data for that copy. The simple example diagramed in the image on the right shows a source LUN and two snapshots. The source LUN has pointers to 4 allocations so each snapshot needs the same number of pointers. At the time the first snapshot is created and before any changes are made, the source LUN and the snapshot have pointers to the same 4 allocations (A,B,C,D). Sometime after the purple snapshot copy is created the A allocation is written to on the source LUN. The storage processor marks the modified (A) data as requiring a new allocation (A’) and the pointers for the source LUN are updated to (B,C,D,A’). The data does not have to be physically written to disk until there is cache memory pressure. This process replaces the COW process used for RLP snapshots and SQL Server snapshots where copies of the original data are written to disk in the RLP or sparse file before the source LUN can be updated.
Also shown in the picture to the right is the implementation of multiple snapshots on one source LUN. The orange snapshot (Snap2) was created after the purple snapshot (Snap) and after changes to the source LUN created the (A’) allocation. The data stored in the (D) allocation was changed on the source LUN after Snap2 was created. The source LUN now points the allocation set (B,C,A’,D’). The allocation set for Snap is unchanged.
Notice that the (D) allocation was modified after both snapshots were created. The source LUN needed the (D’) allocation to keep the current copy of that data but both snapshots share access to a single snap copy of the (D) allocation. In the case of SQL Server snapshots two copies of the (D) allocation would be made, one for each of the sparse files associated with the two snapshots.
Another important distinction between most storage array based snapshots and SQL Server database snapshots is the ability to have simultaneous read/write access to multiple snapshots. Both VNX and VMAX support 256 snapshots for each LUN that are fully writeable. Based on the change management discussion above, the same process can be used to make writeable snapshots. If a user accessing the first snapshot (Snap) wants to write to the (D) allocation, the storage controller creates a new allocation in memory (D”) and writes the changes there. Then the allocations for the source and 2 snapshots would be as follows:
- SQL Server database snapshots provide read only copies of databases from local storage that supports NTFS sparse files.
- Creating multiple snapshots on a single database requires multiple copies of snap copy pages.
- Array snapshots are dependent on the specific vendor implementation
- Most array based snapshots are writeable.
- The snapshot granularity is comparable between some array based snapshots and SQL Server database snapshots.
- I will be continuing this series with more articles on snapshot technologies for SQL Server data management, check back or follow this site.
- If you are a DBA, contact your storage team and learn more about how array based snapshots are used in your organization
- If you are part of a storage management team, reach out to the SQL DBA team and see if they are using SQL Server database snapshots and learn more about the business needs they are supporting.
Thanks for reading
Phil Hummel - @GotDisk
San Jose, CA