After looking at the challenges involved in ensuring durability, let us see how database engines typically handle it. In general, I have seen three different approaches: the append-only model, the copy-on-write model, and journaling. We discussed the append-only model a few times already; the main advantage is that we always write to the end of the file, and we can commit (and thus make durable) by just calling fsync* on the file after we complete all the writes.
* Nitpicker corner: I'm using fsync as a general term for things like FlushFileBuffers, fdatasync, fsync, etc. I'm focused on the actual mechanics, rather than the specific proper API to use here.
There are some issues here that you need to be aware of, though. A file system (and block devices in general) will freely re-order writes as they wish, so the following bit of code:
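A minimal sketch of the kind of code being described, in Python (the function and file layout here are illustrative, not any particular engine's format): two writes for a transaction, the data first and then the header that marks it as valid, followed by a single fsync.

```python
import os

def append_transaction(path, data, header):
    """Append one transaction: the data pages first, then the header
    that marks the transaction as valid. A sketch, not a real format."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        os.write(fd, data)    # write #1: the transaction's data
        os.write(fd, header)  # write #2: the header that "commits" it
        os.fsync(fd)          # flush both writes to stable storage
    finally:
        os.close(fd)
```

The danger is exactly here: before the fsync completes, the file system is free to push write #2 to disk ahead of write #1, so a crash can leave a valid-looking header sitting in front of garbage data.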
May not actually do what you expect it to do. It is possible that during crash recovery, the second write was committed to disk (fully valid and functioning), but the first write was not. So if you validate just the transaction header, you'll see that you have a valid bit of data, while the file contains some corrupted data.
The other alternative is copy-on-write: instead of modifying the data in place, we write it (typically at the end of the file), fsync that, then point to the new location from a new file, and fsync that in turn. Breaking it apart into two fsyncs means that it is much more costly, but it also forces the file system to put explicit barriers between the operations, so it can't reorder things. Note that you can also do this on a single file, with an fsync between the two operations, but typically you use separate files.
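A sketch of that two-fsync sequence, using a data file plus a separate pointer file (illustrative Python; a real engine tracks pages and roots, not a single offset):

```python
import os

def copy_on_write_update(data_path, ptr_path, new_data):
    # Step 1: write the new version at the end of the data file and fsync.
    fd = os.open(data_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
    try:
        offset = os.lseek(fd, 0, os.SEEK_END)  # where the new copy will land
        os.write(fd, new_data)
        os.fsync(fd)          # first barrier: the new copy is durable
    finally:
        os.close(fd)

    # Step 2: only now publish the new location, and fsync that in turn.
    tmp = ptr_path + ".tmp"
    with open(tmp, "w") as f:
        f.write(str(offset))
        f.flush()
        os.fsync(f.fileno())  # second barrier: the pointer is durable
    os.replace(tmp, ptr_path)  # atomic rename publishes the new pointer
    return offset
```

A fully paranoid version would also fsync the containing directory after the rename, since on some file systems the rename itself may not be durable otherwise.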
Finally, we have the notion of explicit journaling. The idea is that you dedicate a specific file (or set of files), and then you can just write to them as you go along. Each transaction you write is hashed and verified, so both the header and the data can be checked at read time. And after every transaction, you'll fsync the journal, which is how you commit the transaction.
On database startup, you read the journal file and apply all the valid transactions, until you reach the end of the file or a transaction that doesn't match its hash, at which point you know that it wasn't committed properly. In this case, a transaction is the set of operations that needs to be applied to the data file in order to sync it with the state it had before the restart. That can be modifying a single value, or atomically changing a whole bunch of records.
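A minimal sketch of such a journal and its recovery loop. The framing here (a little-endian length prefix plus a SHA-256 hash per entry) is an illustrative choice, not any particular engine's format:

```python
import hashlib
import struct

def write_entry(f, payload):
    """Append one journal entry: 4-byte length, 32-byte SHA-256, payload."""
    digest = hashlib.sha256(payload).digest()
    f.write(struct.pack("<I", len(payload)) + digest + payload)

def replay(f):
    """Yield the payload of every valid entry, stopping at the first entry
    that is truncated or fails its hash check (an uncommitted transaction)."""
    while True:
        head = f.read(4 + 32)
        if len(head) < 36:
            return  # end of journal (or a torn header)
        (size,) = struct.unpack("<I", head[:4])
        payload = f.read(size)
        if len(payload) < size or hashlib.sha256(payload).digest() != head[4:]:
            return  # torn or corrupt entry: not committed, stop here
        yield payload
```

In a real engine, each write_entry would be followed by an fsync to commit the transaction, and replay would run at startup to bring the data file back up to date.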
I like journal files because they allow me to do several nice tricks. Again, we can pre-allocate them in advance, which means that we suffer much less from fragmentation. But more importantly, most of the writes in journal systems are done at the same location (one after another), so we get the benefit of sequential writes, which is pretty much the best thing ever for getting good performance from the hard disk.
There are things that I'm skipping, of course. Append-only and copy-on-write typically write to the data file, which means that you can't do much there; you need the data available. But a journal file is rarely read, so you can do things like compress the data on the fly and reduce the I/O costs you are going to pay. Another thing you can do is release the transaction lock before you actually write to the journal file, let the next transaction start, but not confirm the current transaction to the user until the disk lets us know that the write has completed. That way, we can parallelize the costly part of the old transaction (I/O to disk) with the compute-bound portion of the new transaction, gaining something in the meantime.
This is typically called early lock release, and while we played with it, we didn't really see good enough numbers to actually implement it for production.
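The compression trick for journal writes can be sketched like this (illustrative Python; the hashing discussed above is reduced to a bare length prefix for brevity):

```python
import os
import zlib

class CompressedJournal:
    """Append-only journal that compresses each transaction before writing.
    A sketch: real entries would also carry a hash for validation."""

    def __init__(self, path):
        self.fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)

    def commit(self, tx_bytes):
        # Journal entries are rarely read back, so trading a little CPU
        # for less I/O is usually a win. Use a fast compression level,
        # since commit latency is what we care about here.
        compressed = zlib.compress(tx_bytes, 1)
        os.write(self.fd, len(compressed).to_bytes(4, "little") + compressed)
        os.fsync(self.fd)  # the fsync is still what commits the transaction
        return len(compressed)

    def close(self):
        os.close(self.fd)
```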
This post was authored by Joseph Sirosh, Corporate Vice President of the Data Group at Microsoft.
This week I had the opportunity to represent Microsoft in keynotes at both the Hadoop and Red Hat Summits in San Jose and San Francisco, and to speak to many customers and members of the open source and big data communities. The energy in these communities is incredible and I'm proud to see Microsoft engage as an active participant and contributor. Below I want to share a quick summary of recent news related to both these events and communities.
This year we celebrated 10 years of Hadoop, and along with the rest of the community we're working hard to make big data easier for customers, no matter the type of data, what they need to do or what platform they're on. At Hadoop Summit, Hortonworks also announced Microsoft Azure HDInsight as its Premier Connected Data Platforms solution to deliver Apache Hadoop in a cloud environment. This announcement culminates a long-standing partnership with Hortonworks that started in 2011 when Hortonworks was three months old. It's been rewarding to see how far Hadoop has come, and it's now deployed in thousands of organizations. As an example, Jet.com is using Hadoop with HDInsight to help redefine the e-commerce category by providing consumers with completely transparent pricing that dynamically changes based on the actual costs of the transaction: warehouse location, payment method, and number of items shipped.
Looking back at our journey with Hadoop, it's also been gratifying to see our contributions accelerate its adoption. Members of Microsoft have been contributing to the development of Apache YARN since its inception. We've also been leading or contributing to projects like bringing Hadoop onto Azure and Windows, speeding up the query processing of Hive, making cloud-based stores accessible via WebHDFS, and making Spark execution available through a REST endpoint. Recently, we also announced our commitment to Apache Spark at Spark Summit 2016 including:
- General availability of Spark for Azure HDInsight, a fully managed Spark service from Hortonworks that is enterprise-ready and easy to use
- R Server for HDInsight in the cloud, powered by Spark: in preview today and generally available later this summer, it makes Spark integration easy whether you are working on-premises or in the cloud
- R Server for Hadoop on-premises, now powered by Spark: as the leading solution in the world for running R at scale, R Server for Hadoop now supports both Microsoft R and native Spark execution frameworks, made available this week. Combining R Server with Spark gives users the ability to run R functions over thousands of Spark nodes, letting you train your models on data 1,000x larger and 100x faster than was possible with open source R, and nearly 2x faster than Spark's own MLlib
- Free R Client for Data Scientists, a new free tool for data scientists to build high performance analytics using R.
- General availability of Power BI support for Spark Streaming: Spark support in Power BI now allows you to publish real-time events from Spark Streaming
With our investments in R combined with Spark and Hadoop, statisticians and data scientists can rapidly train a variety of predictive models on large-scale data, limited only by the size of their Spark clusters. With Spark, R Server's compiled code algorithms and transparent parallelization of regression, clustering, decision trees and other statistical algorithms speed up analysis by as much as 100x on terabytes of data.
Red Hat Summit
At Microsoft, we're serious about building an intelligent cloud through a comprehensive approach that includes the open source ecosystem. Today, our cloud offerings range from support for Linux in Azure Virtual Machines (nearly 1 in 3 VMs are running Linux today) to a Hadoop solution in HDInsight, and deep integration of Docker Swarm and Apache Mesos in Azure Container Service; these represent our commitment to the ecosystem and highlight the value of our partnerships. In November, Microsoft and Red Hat announced a partnership to add value to open source investments in the enterprise. At Red Hat Summit, we announced a number of important partnership milestones, including:
- The general availability of .NET Core 1.0 and ASP.NET Core 1.0, a platform for creating modern applications for Windows, Linux and Mac OS X
- In partnership with Red Hat and 21Vianet, this week we are extending support for Red Hat Enterprise Linux to Azure China operated by 21Vianet
- Red Hat is making CloudForms 4.1 generally available, with deep support for Azure including state analysis, metrics, chargeback and retirement, making Azure the best supported cloud in CloudForms
- Availability of a new OpenShift solution template on GitHub that makes it simple to deploy OpenShift in Azure
In March, we announced our plans to bring SQL Server to Linux, starting with a private preview. In the research note Microsoft Diversifies With Linux Support for SQL Server, Gartner wrote "SQL Server on Linux represents a bold statement that the company understands there is more to the overall IT world than just Windows and this flexibility is necessary to compete in the DBMS market." Today, at the Red Hat Summit, I will show SQL Server running on Red Hat Enterprise Linux. Our goal is to make SQL Server the platform of choice to support any data and any application, on-premises or in the cloud, while providing you with platform choice. Bringing SQL Server to Red Hat Enterprise Linux will provide enterprise Linux customers with SQL Server's mission-critical performance, industry-leading TCO, the least vulnerable database, and hybrid cloud innovations like Stretch Database to access data on-premises or in the cloud. We'll first release the core relational database capabilities on Linux, targeting mid-calendar year 2017.
MongoDB on Microsoft Azure
Additionally, this week, MongoDB announced MongoDB Atlas, a new elastic on-demand cloud service that will provide comprehensive infrastructure and management for its popular database. MongoDB Atlas will become available for Azure customers via a strategic partnership between the two companies. This partnership with MongoDB further reinforces Microsoft's commitment to providing customers with open source solutions and the most comprehensive cloud platform on the market.
National Institute of Standards and Technology, National Vulnerability Database statistics as of 2/1/2016.
This post was authored by Ayo Olubeko, Program Manager, Data Developer Group.
The July update for SSMS includes the first substantial improvement in SQL PowerShell in many years. We owe a lot of thanks for this effort to the great collaboration with our community. We have several new CMDLETs to share with you, but firstly, there is a very important change we had to make to be able to ship monthly updates to the SQL PowerShell component.
Historically, SQL Server's PowerShell components have been included in both the SSMS (tools) installer and the SQL Server engine install. In order to be able to ship SQL PowerShell updates, we had to change the identity of the SQL PowerShell module as well as the wrapper known as SQLPS.exe. This change has an impact on scripts doing Import-Module.
This month we introduce CMDLETs for the following areas:
- Always Encrypted
- SQL Agent
- SQL Error Logs
Additionally, we have made some nice improvements to Invoke-SqlCmd and the SQL provider.

New SQL PowerShell module
As alluded to above, in order to ship monthly updates, we have created a new SQL PowerShell module and introduced a new wrapper EXE that SSMS uses to instantiate the SQL PowerShell environment. The SQL PowerShell module that ships with SSMS has changed from SQLPS to SqlServer (there is no change to the module used by SQL Agent). This means that if you have a PowerShell script doing Import-Module SQLPS, it will need to be changed to Import-Module SqlServer in order to take advantage of the new provider functionality and new CMDLETs. The new module will be installed to "%Program Files\WindowsPowerShell\Modules\SqlServer", and hence no update to $env:PSModulePath is required. Additionally, if you happen to have a script that is using a third-party or community version of a module named SqlServer, you should use the Prefix parameter to avoid name collisions.
The motivation for these changes is that the tooling components are being moved to be "application local" and not share any components with the SQL Server engine. This is an important step to enable monthly tooling updates while not negatively impacting the components set up and updated by the SQL Server setup program.
SSMS has been updated to integrate with SQLTOOLSPS.exe rather than SQLPS.exe. Hence, if you launch PowerShell from within SSMS, it will launch PowerShell and configure the session with the new SQL PowerShell module. It is advised to avoid using these EXE wrappers; they exist for legacy reasons within SSMS and are likely to be removed in a future monthly update.
The new version of SQL Server PowerShell included with SSMS does not update the version of PowerShell used by SQL Server. This means that scripts executed by SQL Agent will not be able to use the new CMDLETs. Updates to SQLPS (the version used by SQL Agent) will be done through the traditional SQL Server update mechanisms; more specifically, major changes will be done as part of the next major version of SQL Server as it becomes available.

New CMDLETs
In the July SSMS update, you will find several new CMDLETs. Once again, we owe thanks to our SQL PowerShell community leaders for helping us prioritize these investments. The CMDLETs all provide detailed help within PowerShell. As with the majority of the features in SSMS, the SQL PowerShell CMDLETs work against all supported versions of SQL Server. In some cases, such as Always Encrypted, the CMDLETs obviously only work on versions of SQL Server that support that specific feature set.

- Add-SqlAzureAuthenticationContext – Performs authentication to Azure and acquires an authentication token.
- Add-SqlColumnEncryptionKeyValue – Adds a new encrypted value for an existing column encryption key object in the database.
- Complete-SqlColumnMasterKeyRotation – Completes the rotation of a column master key.
- Get-SqlColumnEncryptionKey – Returns all column encryption key objects defined in the database, or returns one column encryption key object with the specified name.
- Get-SqlColumnMasterKey – Returns the column master key objects defined in the database, or returns one column master key object with the specified name.
- Invoke-SqlColumnMasterKeyRotation – Initiates the rotation of a column master key.
- New-SqlAzureKeyVaultColumnMasterKeySettings – Creates a SqlColumnMasterKeySettings object describing an asymmetric key stored in Azure Key Vault.
- New-SqlCertificateStoreColumnMasterKeySettings – Creates a SqlColumnMasterKeySettings object referencing the specified certificate.
- New-SqlCngColumnMasterKeySettings – Creates a SqlColumnMasterKeySettings object describing an asymmetric key stored in a key store supporting the Cryptography Next Generation (CNG) API.
- New-SqlColumnEncryptionKey – Creates a new column encryption key object in the database.
- New-SqlColumnEncryptionKeyEncryptedValue – Produces an encrypted value of a column encryption key.
- New-SqlColumnEncryptionSettings – Creates a new SqlColumnEncryptionSettings object that encapsulates information about a single column's encryption, including CEK and encryption type.
- New-SqlColumnMasterKey – Creates a new column master key object in the database.
- New-SqlCspColumnMasterKeySettings – Creates a SqlColumnMasterKeySettings object describing an asymmetric key stored in a key store with a Cryptography Service Provider (CSP) supporting the Cryptography API (CAPI).
- Remove-SqlColumnEncryptionKey – Removes the column encryption key object from the database.
- Remove-SqlColumnEncryptionKeyValue – Removes an encrypted value from an existing column encryption key object in the database.
- Remove-SqlColumnMasterKey – Removes the column master key object from the database.
- Set-SqlColumnEncryption – Encrypts, decrypts or re-encrypts specified columns in the database.
- Get-SqlAgent – Returns a SQL Agent (JobServer) object that is present in the target instance of SQL Server.
- Get-SqlAgentJob – Returns a SQL Agent Job object for each job that is present in the target instance of SQL Agent.
- Get-SqlAgentJobHistory – Returns the JobHistory present in the target instance of SQL Agent.
- Get-SqlAgentJobSchedule – Returns a JobSchedule object for each schedule that is present in the target instance of a SQL Agent Job.
- Get-SqlAgentJobStep – Returns a SQL JobStep object for each step that is present in the target instance of a SQL Agent Job.
- Get-SqlAgentSchedule – Returns a SQL JobSchedule object for each schedule that is present in the target instance of SQL Agent.
- Get-SqlErrorLog – Retrieves the SQL Server Logs.
- Set-SqlErrorLog – Sets or resets the maximum number of error log files before they are recycled.

Invoke-SqlCmd improvements
Invoke-SqlCmd now supports an OutputAs parameter (or its alias -As). This parameter allows you to specify DataRows, DataTables or DataSet as the object type to return. These types map to the .Net types you find in System.Data. DataRows is the default, and corresponds to the old behavior.
Additionally, we added the ConnectionString parameter, which allows the script author complete control over the connection context. This unlocks new capabilities such as connecting to SQL Azure using Azure Active Directory authentication.

SQL PowerShell provider enhancements
The SQL PowerShell provider now properly supports the WhatIf and Confirm parameters. This allows you to see the potential impact of a script operation and have the ability to confirm an operation before it is executed.
One of the most important aspects of a database engine is that it needs to support durability. Now, that is a personal opinion as a database author, but I consider this to be a pretty important consideration when selecting a database. Other database engines disagree, from pure in-memory databases that lose all the data on restart to databases that make a "best effort" and will work as long as they don't crash in the wrong place.
Because there are several different metrics for what durable means, I'll provide several levels of possible durability requirements.
- None – if the application / server restarts, all the data is lost. Typically used for in-memory databases, explicitly giving up durability for performance.
- Try – the data is written to disk, but no attempt is made to make sure that it is coherent / up to date. Typically used to start up an in-memory / near in-memory database from a cold state.
- Crash – if the database has confirmed a write and then immediately crashed, the data is still there. Most databases try to get to this level.
- Power loss – if the database has confirmed a write, even complete power loss of the entire machine will still keep the written data. This is where you want to be.
Note that in all those cases, I'm talking about single node databases; distributed stuff is a lot more complex, so I'll not touch it here.
That covers durability, but there is also the notion of atomicity. In other words, a transaction that is composed of multiple operations should either be there in its entirety (certainly if it was confirmed to the client) or not be there at all (rollback); there should never be a situation where some of the records went in and some didn't.
Finally, there is the paranoid mode for durability, in which you don't trust the hardware, so you write to multiple locations, hash everything and validate it. Personally, I think that at that level verification is the file system's job, and that is where the responsibility of the database ends; but if you are stuck with a poor file system choice (like FAT a decade or two ago), that is certainly something that you'll have to consider.
At any rate, one of the major problems with gaining durability is that you have to pay so much for it. In order to actually be durable, you have to write your changes to the actual disk platter, and that may require physical parts to move, which makes it very costly. How costly? A high end (15,000 RPM) hard disk can do a theoretical maximum of 250 such writes per second, and that is an extremely theoretical number. In most cases, even on high end hard disks, you'll see a maximum of 100–150 per second. You can probably double or triple that for high end SSD drives, but those are still very poor numbers compared to the number of operations you can do in memory and in the CPU in the same time frame.
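The 250 figure falls straight out of the rotation speed, since a synchronous write that has to wait for the right sector to come around can complete at most once per revolution:

```python
rpm = 15_000
revolutions_per_second = rpm / 60  # 250.0 revolutions every second

# At most one flushed write can land per revolution, hence the ceiling.
max_sync_writes_per_second = revolutions_per_second
print(max_sync_writes_per_second)  # 250.0
```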
That puts a pretty hard limit on the number of times you can hit the disk, but that is not something that we can tolerate, so the process of making a write in most operating systems goes through a whole chain of buffers: the application's own buffers, the operating system's page cache, the file system and disk driver queues, and finally the disk's on-board cache.
Notice the number of buffers in the middle? Each of them is going to gather as much I/O as it can before issuing a (large) write to the layer below it, the idea being that this way we can amortize the cost of going to the physical disk among many different writes. It works, and it works quite well, in fact, to the point where most of the time you don't really notice how slow the hardware really is.
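To put rough numbers on that amortization (the 250 flushes/sec comes from the 15,000 RPM disk discussed above; the batch size is an arbitrary assumption):

```python
flushes_per_second = 250   # the theoretical ceiling of a 15,000 RPM disk
writes_per_flush = 100     # assumed: buffered writes batched into each flush

# Unbatched, every write pays for its own trip to the platter.
unbatched_throughput = flushes_per_second                    # 250 writes/sec

# Batched, one flush's cost is amortized across the whole batch.
batched_throughput = flushes_per_second * writes_per_flush   # 25,000 writes/sec
print(batched_throughput)  # 25000
```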
But for databases, this is a horrendous problem. To start with, if I'm issuing a write to the disk, I have no idea when it will actually hit the platter. For fun, all through this chain (which is actually simplified), each component is free to re-order the writes in any way it feels like, so we can't even rely on the order of the writes to ensure consistency. No, in order to protect against data loss / corruption in the presence of power loss, we have to tell the operating system to actually skip all those layers and flush directly to disk.
As I said, this is expensive to do. In particular, the normal way is to make your writes, and then call fsync(fd) in order to flush those changes down the chain, all the way to the disk. This has a few issues, however. In particular, note that in the gap between the file system and the disk driver, we lose the correlation between writes made to a particular file and any other writes made to that device. The end result is that the fsync command forces us to flush the entire disk driver buffer (and the disk buffers, etc.). In production systems, that can be hundreds of MB that were innocently sitting there, slowly being written to disk; suddenly you have this disruptive fsync that needs to complete, so everything is flushed to disk, and the fsync (which you expected to be short, because you wrote only a few dozen KB) now takes a minute, because it is actually flushing 100 MB of writes from totally different processes.
This post is getting long enough, so I'll defer the discussion of how databases actually achieve durability to the next one. In the meantime, just consider the complexities involved in making sure that the data is on the disk, and how much of the design of modern databases is spent optimizing just this part.