Yesterday we announced exciting news for Power BI – a cloud-based business analytics service (software-as-a-service) for non-technical business users. The preview introduces a number of new Power BI capabilities including dashboards, new visualizations, support for popular software-as-a-service applications, a native iPad app and live “hybrid” connectivity to on-premises SQL Server Analysis Services tabular models. With just a browser – any browser – or a Power BI mobile app, customers can keep a pulse on their business via live operational dashboards. They can explore their business data through interactive visual reports and enrich it with additional data sources.
How does it work with SQL Server?
To interact with SQL Server data in Power BI, connect to an SSAS server via the ‘Get Data’ menu. From there, you can connect to a model and run queries for visualizations based on that model. Before your users can connect to an SSAS model, an administrator must configure a Power BI Analysis Services connector.
NorthWest Cadence on Configuration Management and Desired State Configuration: Bryon has a discussion around configuration drift and PowerShell DSC. For us science geeks, Bryon breaks out the model of the atom as part of his discussion.
Rene van Osnabrugge on Insert an inline image into a Work Item with the TFS API: Rene shows how to add an image directly to a work item using the TFS API. This will be helpful if you have to build your own custom migration tools to migrate to TFS and have to deal with inline images.
Excerpts from the RavenDB Performance team report: Optimizing Compare, Don’t you shake that branch at me!
By now we have already squeezed out almost all of the obvious inefficiencies we had uncovered through static analysis of the decompiled code, so we will need another strategy. For that we need to analyze the runtime behavior in the average case. We did something like that in an earlier post, where we worked through an example using a 16-byte compare with equal arrays.
To perform that analysis live, we needed to somehow know the size of the typical memory block while running the test under a line-by-line profiler. We built a modified version of the code that recorded the size of each memory chunk to compare and then built a histogram from that data (that’s why exact replicability matters). For our workload, the histogram showed a couple of clusters in the length of the memory being compared. The first cluster was near 0 bytes, but not exactly 0. The other cluster was centered around 12 bytes, which makes sense, as our document keys were around 11 bytes. This gave us a very interesting insight. Armed with that knowledge we made our first statistics-based optimization.
You will notice the if statement at the start of the method, a pretty standard bounding condition: if both memory blocks are empty, they are equal. In a normal environment such a check is so cheap that nobody would bother, but in our case, where we are measuring runtime in nanoseconds, 3 extra instructions and a potential branch miss do count.
That code looks like this:
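The original snippet is not reproduced here; a minimal C sketch of the pattern being described (hypothetical names, not RavenDB's actual implementation) would be:

```c
#include <assert.h>
#include <stddef.h>

/* A memcmp-style compare with the standard bounding condition up front.
   Sketch of the pattern the post describes, not the actual RavenDB code. */
int compare_with_guard(const unsigned char *a, const unsigned char *b, size_t size)
{
    if (size == 0)   /* the branch in question: empty blocks are equal... */
        return 0;

    /* ...but on the hot path this guard costs extra instructions and a
       potential branch miss on every single call. */
    for (size_t i = 0; i < size; i++)
        if (a[i] != b[i])
            return a[i] - b[i];
    return 0;
}
```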
That means that not only are we making the check, we are also forcing a short jump every single time it happens. But our histogram also tells us that memory blocks of size 0 almost never happen, so we are paying 3 instructions and a branch for something that almost never occurs. We also knew there was a cluster near 0 that we could exploit. The problem is that we would be paying 3 cycles (1 nanosecond on our idealized processor) per branch. As our average is 97.5 nanoseconds, getting rid of it buys us a 1% improvement on almost every call (except the statistically unlikely case).
Resistance is futile, that branch needs to go.
In C, in Assembler, and on almost any low-level architecture like GPUs, there are 2 common approaches to optimizing this kind of scenario.
- The ostrich method. (Hide your head in the sand and pray it just works.)
- Use a lookup table.
The first is simple: if you don’t check and the algorithm can handle the condition in the body, zero instructions always beats more than zero instructions (and this case is a corner case anyway, so no damage is done). This approach is usually used in massively parallel computing, where the cost of instructions is negligible while memory access is not. But it has its uses in more traditional superscalar, branch-predicting architectures too (you just don’t have as large an instruction budget to burn).
The second is more involved. You need to be able to “index” the input somehow and pay fewer instructions than the actual branches would cost (at a minimum of 1 nanosecond each, i.e. 3 instructions on our idealized processor). Then you create a branch table and jump to the appropriate index, which itself jumps to the proper code block, using just 2 instructions.
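In C the same idea is usually expressed with an array of function pointers (or computed goto, a GCC extension): the small input value indexes directly into a table of code targets, replacing a chain of compare-and-branch. A minimal sketch, with entirely hypothetical handlers:

```c
#include <assert.h>

/* Sketch of a branch table: an array of code targets indexed directly by a
   small input value, replacing a chain of compare-and-branch instructions. */
typedef int (*handler_t)(int);

static int handle_zero(int x)  { (void)x; return 0; }
static int handle_one(int x)   { return x; }
static int handle_two(int x)   { return x * 2; }
static int handle_other(int x) { (void)x; return -1; }

static handler_t table[] = { handle_zero, handle_one, handle_two };

int dispatch(int n, int x)
{
    /* One bounds check plus one indexed indirect call,
       instead of n sequential compare-and-branch pairs. */
    if (n >= 0 && n < 3)
        return table[n](x);
    return handle_other(x);
}
```

The win is that the cost of dispatch is constant regardless of which case fires, instead of growing with the position of the matching branch.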
Note: Branch tables are very well explained at http://en.wikipedia.org/wiki/Branch_table. If you made it this far you should read it; don’t worry, I will wait.
The key takeaway: if your algorithm has a sequence of 0..n, you are in the best possible position, because you already have your index. Which we did.
I know what you are thinking: Will the C# JIT compiler be smart enough to convert such a pattern into a branch table?
The short answer is yes, if we give it a bit of help. The if-then-elseif pattern won’t cut it, but what about switch-case?
The compiler will create a switch opcode – in short, our branch table – if our values are small and contiguous.
Therefore that is what we did. The impact? Big, but this is just the start. Here is what this looks like in our code:
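The original snippet was an image; here is a hedged C sketch of the same shape – switch on the small, contiguous length values so the compiler can lower the dispatch to a jump table (hypothetical reconstruction, not the actual RavenDB code):

```c
#include <assert.h>
#include <stddef.h>

/* A switch over small, contiguous values - the shape that compilers (and
   the C# JIT the post targets) can lower to a single jump-table dispatch.
   Handles the common tiny lengths directly and falls back to a loop. */
int compare_small(const unsigned char *a, const unsigned char *b, size_t size)
{
    switch (size)   /* 0, 1, 2, 3: small and contiguous -> branch table */
    {
        case 0: return 0;   /* the old guard branch, now just one table slot */
        case 1: return a[0] - b[0];
        case 2: return a[0] != b[0] ? a[0] - b[0] : a[1] - b[1];
        case 3:
            if (a[0] != b[0]) return a[0] - b[0];
            if (a[1] != b[1]) return a[1] - b[1];
            return a[2] - b[2];
        default:
            /* general loop for larger blocks */
            for (size_t i = 0; i < size; i++)
                if (a[i] != b[i])
                    return a[i] - b[i];
            return 0;
    }
}
```

Note how the `size == 0` check is no longer a dedicated up-front branch paid on every call; it is just one slot in the table, costing the same as any other case.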
I’ll talk about the details of branch tables in C# more in the next post, but I didn’t want to leave you hanging too much.