Hereâ€™s your chance to dive deeper into the new analytics capabilities in SQL Server 2016: The Microsoft Data Science Summit.
On September 26 – 27, the Microsoft Data Science Summit will host an interactive event on the disruptive technologies and innovative solutions in big data, machine learning and the Internet of Things.
Itâ€™s a two-day intensive event featuring:
- SQL Server sessions including U-SQL as the Choice for Processing Massive Batch Workloads and Scalable Data Science in Azure Data Lake with U-SQL: An End-to-End Walkthrough
- Hands-on labs with Cortana Intelligence Suite, Microsoft R Server, SQL Server 2016, and open-source technologies
Real code, real products, and real-world examples
- Keynotes and talks from industry visionaries
- Incredible demos
- Relaxed opportunities to connect with peers and Microsoft product experts
So if youâ€™re a data scientist, big data engineer, machine learning specialist, or team leader who wants insights into the latest big data, machine learning, and advanced analytics technologies, join us. Weâ€™ll have three in-depth tracks you can choose from to get the knowledge and hands-on experience you want for your business: Advanced Analytics, Big Data, and Solutions. Take a look at all three tracks to find the best match for your needs.
Register soon. September 26 – 27 is approaching fast. We look forward to talking data with you in Atlanta!
The Uber Engineering group have posted a really great blog post about their move from Postgres to MySQL. I mean that quite literally, it is a pleasure to read, especially since they went into such details as the on-disk format and the implications of that on their performance.
For fun, there is another great post from Uber, about moving from MySQL to Postgres, which also has interesting content.
Go ahead and read both, and weâ€™ll talk when you are done. I want to compare their discussion to what we have been doing.
In general, Uberâ€™s issue fall into several broad categories:
- Secondary indexes cost on write
- Replication format
- The page cache vs. buffer pool
- Connection handling
Postgres maintain a secondary index that points directly to the data on disk, while MySQL has a secondary index that has another level of indirection. The images show the difference quite clearly:Postgres MySQL
I have to admit that this is the first time that I ever considered the fact that the indirectionâ€™s manner might have any advantage. In most scenarios, it will turn any scan on a secondary index into an O(N * logN) cost, and that can really hurt performance. With Voron, we have actually moved in 4.0 from keeping the primary key in the secondary index to keeping the on disk position, because the performance benefit was so high.
That said, a lot of the pain the Uber is feeling has to do with the way Postgres has implemented MVCC. Because they write new records all the time, they need to update all indexes, all the time, and after a while, they will need to do more work to remove the old version(s) of the record. In contrast, with Voron we donâ€™t need to move the record (unless its size changed), and all other indexes can remain unchanged. We do that by having a copy on write and a page translation table, so while we have multiple copies of the same record, they are all in the same â€śplaceâ€ť, logically, it is just the point of view that changes.
From my perspective, that was the simplest thing to implement, and we get to reap the benefit on multiple fronts because of this.
Postgres send the WAL over the wire (simplified, but easier to explain) while MySQL send commands. When we had to choose how to implement over the wire replication with Voron, we also sent the WAL. It is simple to understand, extremely robust and we already had to write the code to do that. Doing replication using it also allows us to exercise this code routinely, instead of it only running during rare crash recovery.
However, sending the WAL has issues, because it modify the data on disk directly, and issue there can cause severe problems (data corruption, including taking down the whole database). It is also extremely sensitive to versioning issues, and it would be hard if not impossible to make sure that we can support multiple versions replicating to one another. It also means that any change to the on disk format needs to be considered with distributed versioning in mind.
But what killed it for us was the fact that it is almost impossible to handle the scenario of replacing the master server automatically. In order to handle that, you need to be able to deterministically let the old server know that it is demoted and should accept no writes, and the new server that it can now accept writes and send its WAL onward. But if there is a period of time in which both can accept write, then you canâ€™t really merge the WAL, and trying to is going to be really hard. You can try using distributed consensus to run the WAL, but that is really expensive (about 400 writes / second in our benchmark, which is fine, but not great, and impose a high latency requirement).
So it is better to have a replication format that is more resilient to concurrent divergent work.
OS Page Cache vs Buffer Pool
From the post:
Postgres allows the kernel to automatically cache recently accessed disk data via the page cache. â€¦ The problem with this design is that accessing data via the page cache is actually somewhat expensive compared to accessing RSS memory. To look up data from disk, the Postgres process issues lseek(2) and read(2) system calls to locate the data. Each of these system calls incurs a context switch, which is more expensive than accessing data from main memory. â€¦ By comparison, the InnoDB storage engine implements its own LRU in something it calls the InnoDB buffer pool. This is logically similar to the Linux page cache but implemented in userspace. While significantly more complicated than Postgresâ€™s designâ€¦
So Postgres is relying on the OS Page Cache, while InnoDB implements its own. But the problem isnâ€™t with relying on the OS Page Cache, the problem is how you rely on it. And the way Postgres is doing that is by issuing (quite a lot, it seems) system calls to read the memory. And yes, that would be expensive.
On the other hand, InnoDB needs to do the same work as the OS, with less information, and quite a bit of complex code, but it means that it doesnâ€™t need to do so many system calls, and can be faster.
Voron, on the gripping hand, relies on the OS Page Cache to do the heavy lifting, but generally issues very few system calls. That is because Voron memory map the data, so access it is usually a matter of just pointer dereference, the OS Page Cache make sure that the relevant data is in memory and everyone is happy. In fact, because we memory map the data, we donâ€™t have to manage buffers for the system calls, or to do data copies, we can just serve the data directly. This ends up being the cheapest option by far.
Spawning a process per connection is something that I havenâ€™t really seen since the CGI days. It seems pretty harsh to me, but it is probably nice to be able to kill a connection with a kill â€“9, I guess. Thread per connection is also something that you donâ€™t generally see. The common situation today, and what we do with RavenDB, is to have a pool of threads that all manage multiple connections at the same time, often interleaving execution of different connections using async/await on the same thread for better performance.
In the process of working on RavenDB 4.0, we are going over our code and looking for flaws. Both in the actual implementation and in the design of the API. The idea is to clear away the things that we know are bad in practice. And that leads us to todayâ€™s topic. Subscriptions.
Subscriptions are a way to ask RavenDB to give us, in a reliable way, all documents that match a certain criteria. Here is what the code looks like in RavenDB 3.5:
You typically write this in some form of background processing / job style. The key aspect here is that RavenDB is responsible for sending the data to the subscription, and making sure that you will not lose any updates. So the code is resilient for client errors, for connection errors, database restarts, etc.
However, the topic today isnâ€™t actually subscriptions, it is their API. In practice, we noticed several deficiencies in the API above. To start with, the subscription actually started in an async manner, so opening the subscription and subscribing to it might actually be racy (it is possible to subscribe and start getting documents from the server before you attach your delegate to handle those documents). We had to write a bit of code to handle that scenario, and it was complex in the presence of adding / removing subscriptions when the subscription was live.
The other problem was that subscriptions are reliable. This means that if there is an error, the subscription will handle it. A disconnect from the server will automatically reconnect, and as far as the caller is concerned, nothing has really changed. It isnâ€™t aware of any network errors or recovery that the subscription is doing.
Which is all well and good, until you realize that your connection string is wrong, and the subscription is going to forever retry to connect to the wrong location. We have a way to report errors to the caller code, but in most cases, the caller doesnâ€™t care, the subscription is going to handle it anyway. So we have a problem. How do we balance both the need to handle errors and recover internally and let the caller know about our state?
We currently have the OnError() method weâ€™ll call on the caller to let it know about any errors that we have, but that is not really helping it. The caller doesnâ€™t have a good way to know if we can recover or not, and asking the caller to implement an error recovery policy is something that we donâ€™t really want to do. This is complex and hard to do, and not something that the client should do.
What we ended up with is the following. First, we changed the API so you canâ€™t just add / remove observers to a subscription. You create a new subscription, you attach all the observers that you want, and then you explicitly demarcate the stage in which the subscription starts. Having such an explicit stage frees us from the concurrency and race conditions that we previously had to handle. But it also gives us a chance to actually report something explicit to the calling code.
Letâ€™s look at code, then discuss it:
The idea is that if you wait on the task returned from the StartAsync, which is very natural to do, you can tell whether the first time connection was successful or not. Then you can make determination on what to do next. If there is an error, you can dispose the subscription and report it upward, or you can let it recover automatically. This still doesnâ€™t cover some scenarios, unfortunately. We also report all errors to the subscribers, which can make their own determination there.
The problem is that there are some errors that we just canâ€™t recover from. If a userâ€™s password has changed, eventually weâ€™ll recycle the connection, and get an unauthorized error. There is nothing that the subscription can do to fix that, but at the same time, there is a very little chance for it to know that (a bad password is easy, a firewall rule blocking access is very hard to distinguish from the db being down, and in either case there is nothing that the subscription can do).
We are thinking about adding some sort of limit, something like if we retried for certain number of times, or for a certain duration, and couldnâ€™t recover, weâ€™ll report it through an event / method that must be registered (otherwise weâ€™ll just throw it upward and maybe crash the process). The idea is that in this case, we need someone to pay attention, and an unhandled exception (if you didnâ€™t register to catch it) would be appropriate.
I would like to get some feedback on the idea, before going ahead and implementing it.