Feed aggregator

Data Science Perspectives: Q&A with Microsoft Data Scientists Val Fontama and Wee Hyong Tok

You can’t read the tech press without seeing news of exciting advancements or opportunities in data science and advanced analytics. We sat down with two of our own Microsoft Data Scientists to learn more about their role in the field, some of the real-world successes they’ve seen, and get their perspective on today’s opportunities in these evolving areas of data analytics.

If you want to learn more about predictive analytics in the cloud or hear more from Val and Wee Hyong, check out their new book, Predictive Analytics with Microsoft Azure Machine Learning: Build and Deploy Actionable Solutions in Minutes.

First, tell us about your roles at Microsoft?

[Val] Principal Data Scientist in the Data and Decision Sciences Group (DDSG) at Microsoft

[Wee Hyong] Senior Program Manager, Azure Data Factory team at Microsoft

And how did you get here? What’s your background in data science?

[Val] I started in data science over 20 years ago when I did a PhD in Artificial Intelligence. I used Artificial Neural Networks to solve challenging engineering problems, such as the measurement of fluid velocities and heat transfer. After my PhD, I applied data mining in environmental science and the credit industry: I did a year’s post-doctoral fellowship before joining Equifax as a New Technology Consultant in their London office. There, I pioneered the application of data mining to risk assessment and marketing in the consumer credit industry. I hand-coded over ten machine learning algorithms, including neural networks, genetic algorithms, and Bayesian belief networks, in C++ and applied them to fraud detection, predicting risk of default, and customer segmentation.

[Wee Hyong] I’ve worked on database systems for over 10 years, from academia to industry. I joined Microsoft after I completed my PhD in Data Streaming Systems. When I started, I worked on shaping the SSIS server from concept to release in SQL Server 2012. I was passionate about data science long before joining Microsoft. Earlier, I wrote code to integrate association rule mining into a relational database management system, which allows users to combine association rule mining queries with SQL queries. I was also a SQL Server Most Valuable Professional (MVP); in that role I ran data mining boot camps for IT professionals in Southeast Asia and showed how to transform raw data into insights using the data mining capabilities in Analysis Services.

What are the common challenges you see with people, companies, or other organizations who are building out their data science skills and practices?

[Val] The first challenge is finding the right talent. Many of the executives we talk to are keen to form their own data science teams but may not know where to start. They are not clear on what skills to hire for: should they hire PhDs in math, statistics, computer science, or another field? Should the data scientist also have strong programming skills? If so, in what programming languages? What domain knowledge is required? We have learned that data science is a team sport, because it spans so many disciplines, including math, statistics, and computer science. Hence it is hard to find all the requisite skills in a single person, and you need to hire people with complementary skills across these disciplines to build a complete team.

The next challenge arises once there is a data science team in place – what’s the best way to organize this team? Should the team be centralized or decentralized? Where should it sit relative to the BI team? Should data scientists be part of the BI team or separate? In our experience at Microsoft, we recommend having a hybrid model with a centralized team of data scientists, plus additional data scientists embedded in the business units. Through the embedded data scientists, the team can build good domain knowledge in specific lines of business. In addition, the central team allows them to share knowledge and best practices easily. Our experience also shows that it is better to have the data science team separate from the BI team. The BI team can focus on descriptive and diagnostic analysis, while the data science team focuses on predictive and prescriptive analysis. Together they will span the full continuum of analytics.

The last major challenge I often hear about is the actual practice of deploying models in production. Once a model is built, it takes time and effort to deploy it in production. Today many organizations rewrite the models to run on their production environments. We’ve found success using Azure Machine Learning, as it simplifies this process significantly and allows you to deploy models to run as web services that can be invoked from any device.

[Wee Hyong] I also hear about challenges in identifying tools and resources to help build these data science skills. There are a significant number of online and printed resources that cover a wide spectrum of data science topics, from the theoretical foundations of machine learning to its practical applications. One of the challenges is navigating this sea of resources and selecting the right ones to begin with.

Another challenge I have often seen is figuring out the right set of tools to model the predictive analytics scenario. Once they have figured out the right set of tools, it is equally important for people and companies to be able to easily operationalize the predictive analytics solutions they have built, so that they create new value for their organization.

What is your favorite data science success story?

[Val] My two favorite projects are the predictive analytics projects for ThyssenKrupp and Pier 1 Imports. I’ll speak today about the Pier 1 project. Last spring my team worked with Pier 1 Imports and their partner, MAX451, to improve cross-selling and upselling with predictive analytics. We built models that predict the next logical product category once a customer makes a purchase. Based on Azure Machine Learning, this solution will lead to a much better experience for Pier 1 customers.

[Wee Hyong] One of my favorite data science success stories is how OSIsoft collaborated with the Carnegie Mellon University (CMU) Center for Building Performance and Diagnostics to build an end-to-end solution that addresses several predictive analytics scenarios. With predictive analytics, they were able to solve many of their business challenges, ranging from predicting energy consumption in different buildings to fault detection. The team was able to effectively operationalize the machine learning models built using Azure Machine Learning, which led to better energy utilization in the buildings at CMU.

What advice would you give to developers looking to grow their data science skills?

[Val] I would highly recommend learning multiple subjects: statistics, machine learning, and data visualization. Statistics is a critical skill for data scientists that offers a good grounding in correct data analysis and interpretation. With good statistical skills we learn best practices that help us avoid pitfalls and misinterpretation of data. This is critical because it is too easy to unwittingly draw the wrong conclusions from data. Statistics provides the tools to avoid this. Machine learning is a critical data science skill that offers great techniques and algorithms for data pre-processing and modeling. And last, data visualization is a very important way to share the results of analysis. A good picture is worth a thousand words – the right chart can help to translate the results of complex modeling into your stakeholder’s language. So it is an important skill for a budding data scientist.

[Wee Hyong] Be obsessed with data, and acquire a good understanding of the problems that can be solved by the different algorithms in the data science toolbox. A good way to get started is to model a business problem in your organization where predictive analytics can help to create value. You might not get it right on the first try, but that’s OK. Keep iterating and figuring out how you can improve the quality of the model. Over time, you will see that these early experiences help build up your data science skills.

Besides your own book, what else are you reading to help sharpen your data science skills?

[Val] I am reading the following books:

  • Data Mining and Business Analytics with R by Johannes Ledolter
  • Data Mining: Practical Machine Learning Tools and Techniques, Third Edition (The Morgan Kaufmann Series in Data Management Systems) by Ian H. Witten, Eibe Frank, and Mark A. Hall
  • Predictive Analytics: The Power to Predict Who Will Click, Buy, Lie or Die by Eric Siegel

[Wee Hyong] I am reading the following books:

  • Super Crunchers: Why Thinking-By-Numbers Is the New Way to Be Smart by Ian Ayres
  • Competing on Analytics: The New Science of Winning by Thomas H. Davenport and Jeanne G. Harris

Any closing thoughts?

[Val]  One of the things we share in the book is that, despite the current hype, data science is not new. In fact, the term data science has been around since 1960. That said, I believe we have many lessons and best practices to learn from other quantitative analytics professions, such as actuarial science. These include the value of peer reviews, the role of domain knowledge, etc. More on this later.

[Wee Hyong] One of the reasons we wrote the book is that we wanted to contribute back to the data science community and offer a good, concise resource that can help fellow data scientists get started with Azure Machine Learning. We hope you find it helpful.

Categories: Companies

The sideways stack trace

[Screenshot: Stack trace from a series of asynchronous calls]

A common pattern in asynchronous programming is the callback pattern. Libraries such as async make use of that pattern to build asynchronous versions of common control flow structures. One unfortunate possible consequence of this is strange stack traces. For instance, here is code that executes five functions serially:

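var async = require('async');

// Five callback-style functions that complete immediately.
var functions = [
    function one(done) {done();},
    function two(done) {done();},
    function three(done) {done();},
    function four(done) {done();},
    function five(done) {done();}
];

async.series(functions, function final() {
    debugger; // break here to inspect the call stack
});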

The screenshot on the right shows the stack trace at the breakpoint. As you can see, this trace is unreasonably deep, with lots of noise from the async library (that the IDE could offer to hide). What should really attract your attention, however, is the presence of all five functions in the call stack. That’s because each function is responsible for calling the next function in the series. As a result, we get this strange, “sideways” stack trace, where functions that are really executed one after the other linger on until the whole series is done calling. If there are lots of functions in the series, this could really become a problem. This is one of the reasons why Crockford and others are recommending that future versions of JavaScript implement tail calls as jumps.

As a point of comparison, here is similar synchronous code:

functions.forEach(function(f) {
    f(function() {
        debugger;
    });
});

And here is the corresponding call stack, expectedly very short:

[Screenshot: A much shorter synchronous stack trace]

But wait. Some of you may have noticed something wrong with our “asynchronous” code: it’s not really asynchronous. It does use the callback pattern, but is really 100% synchronous. Could it be that those stack traces are an artifact of the callback pattern more than of asynchronous programming?
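To see that a callback-style API can still be fully synchronous, here is a minimal sketch in plain Node.js. The helper traceOrder and the two task functions are illustrative names, not part of the async library; the sketch only assumes process.nextTick, which is used later in this post as well.

```javascript
// Records the order in which things happen around one callback-style call.
// (Illustrative helper, not part of any library.)
function traceOrder(task, report) {
  var order = [];
  task(function done() {
    order.push('callback');
    report(order);
  });
  order.push('after call');
}

// Callback API with a synchronous implementation, like the five functions above.
function syncTask(done) {
  done();
}

// Callback API with a genuinely asynchronous implementation.
function asyncTask(done) {
  process.nextTick(done);
}

traceOrder(syncTask, function (order) {
  console.log('sync:', order.join(', '));   // sync: callback
});
traceOrder(asyncTask, function (order) {
  console.log('async:', order.join(', '));  // async: after call, callback
});
```

With the synchronous task, the callback has already run by the time the call returns. With the nextTick version, the calling code finishes first and the callback runs afterwards.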

Let’s try it again with really asynchronous functions:

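async.each(functions, function(f, next) {
    // Defer each call so it runs on a fresh stack.
    process.nextTick(function() {
        f(next);
    });
}, function final() {
    debugger;
    // `done` is not defined in this snippet; presumably it is the
    // completion callback of an enclosing test.
    done();
});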

The stack trace for this code is indeed much more reasonable:

[Screenshot: Real asynchronous functions have better stack traces]

So in the end, the toxic mix is callback-style APIs with synchronous implementations. Without going into premature optimization, if you notice such call stacks in your applications and determine that they hurt performance, or even just make debugging harder, I would recommend wrapping synchronous implementations of callback APIs inside process.nextTick calls. This will not only eliminate the stack problem, it will also yield flow control to the framework, making the application more responsive.
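As a rough illustration of that recommendation, the sketch below wraps synchronous callback-style tasks in process.nextTick and counts how many tasks are on the stack at the same time. The names deferred, series, maxNesting and identity are hypothetical; series is a simplified stand-in for a serial runner, not the async library's implementation.

```javascript
// Hypothetical helper: wraps a callback-style function so that it runs
// on a later tick instead of on the caller's stack.
function deferred(fn) {
  return function (done) {
    process.nextTick(function () {
      fn(done);
    });
  };
}

// Simplified stand-in for a serial runner: each task calls the next one.
function series(tasks, final) {
  var index = 0;
  function next() {
    if (index === tasks.length) {
      return final();
    }
    var task = tasks[index];
    index += 1;
    task(next);
  }
  next();
}

// Runs five synchronous callback-style tasks through `wrap` and reports
// the largest number of tasks that were on the stack at once.
function maxNesting(wrap, report) {
  var active = 0;
  var deepest = 0;
  function task(done) {
    active += 1;
    deepest = Math.max(deepest, active);
    done(); // synchronous implementation behind a callback API
    active -= 1;
  }
  var tasks = [];
  for (var i = 0; i < 5; i += 1) {
    tasks.push(wrap(task));
  }
  series(tasks, function final() {
    report(deepest);
  });
}

function identity(fn) {
  return fn;
}

maxNesting(identity, function (depth) {
  console.log('plain tasks nested', depth, 'deep');    // 5
});
maxNesting(deferred, function (depth) {
  console.log('deferred tasks nested', depth, 'deep'); // 1
});
```

Unwrapped, all five tasks are still on the stack when the final callback runs. Wrapped, each task starts on a fresh stack, so only one is ever active.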

Categories: Blogs

Reading Habits

Ayende @ Rahien - 14 hours 5 min ago

I’m reading a lot, and I thought that I would post a bit about my favorite subjects. I decided to summarize this year with some great books that I really enjoyed and that don’t fall into standard categories.

AlterWorld – By a Russian author, with a great Russian background (the bit about how to identify a Russian was great), and the books are really good. The premise is that you can get stuck in an MMORPG, and it is beautifully done. Unlike in a typical fantasy book, the notion of levels and of gaining strength and power is really nice, especially since the hero isn’t actually taking the direct path to that. There is also a lot of interaction with the real world, and in general, this is a fully featured universe that is really good. It looks like there are going to be 3 more books, which is absolutely wonderful from my point of view.

AlterWorld The Clan The Duty

Those books were good enough that I started playing RPGs again, just because it was so much fun reading the status messages in the books. If you know of other books in the same space, I would love to know about them.

NPCs tells the tale from the point of view of Non Player Characters, which is quite interesting and done in a very believable way.

NPCs

Caverns & Creatures is a series of books (I lost count; there are a lot of short stories as well as full-length books) that deals with the idea of people getting stuck in an RPG world. This one is mostly meant for humor’s sake, I think. It does get to toilet-level humor all too frequently, but it is entertaining.

Critical Failures

Waldo Rabbit tells the tale of a guy who really tries to be an evil overlord, but his idea of a scary beast is a… rabbit. It is really well written, and I’m looking forward to the next book.

The (sort of) Dark Mage (Wa... After The Rabbit (Waldo Rab...

Magic 2.0 talks about finding proof that the entire world is a computer simulation, and what happens when certain people find out about it. My guess is that this is written by a programmer, because the parts where they talk about software and programming weren’t made up out of whole cloth and didn’t piss me off at all. This is also a really good series, and I’m looking forward to reading the 3rd book. I especially liked that there isn’t some big Save The World theme going on; this is just life as you know it, if you are a bunch of pixels.

Off to Be the Wizard (Magic... Spell or High Water (Magic ... An Unwelcome Quest (Magic 2...

Velveteen is a “superhero” novel, but a very different one from the usual. I’m not really sure how to categorize it, but it was a really great read.

Velveteen vs. The Junior Su... Velveteen vs. The Multivers...

Daniel Black is a single-book series so far, with a second book, Black Coven, set to follow Fimbulwinter. It is a great book, with a very well written background and story. What is more, the hero doesn’t rely on brute force or on the author to rescue him when he stupidly gets into trouble; he thinks and plans, and that is quite great to read. I’m eagerly waiting for the next book.

Fimbulwinter (Daniel Black #1)

Categories: Blogs

Using XAML Label in WPF

C-Sharpcorner - Latest Articles - 17 hours 5 min ago
This article demonstrates how to create and use a Label control in WPF using XAML and C#.
Categories: Communities

How to Fetch Google Analytics Statistics and Display it in Your C# Application

C-Sharpcorner - Latest Articles - 17 hours 5 min ago
In this article you will learn how to query Google Analytics data and display it in your C# application.
Categories: Communities

Azure Mobile Services: Creating Development and Production Environments

C-Sharpcorner - Latest Articles - Sun, 12/21/2014 - 08:00
The purpose of this article is to show how to create development and production environments for a .Net Backend from Azure Mobile Services.
Categories: Communities

Resource Request: Content Type Example

C-Sharpcorner - Latest Articles - Sun, 12/21/2014 - 08:00
In this article I would like to share a real-life scenario of using content types.
Categories: Communities

C# Corner Delhi Student's Day 20 December, 2014: Official Recap

C-Sharpcorner - Latest Articles - Sun, 12/21/2014 - 08:00
The C# Corner Delhi Chapter organized its monthly event in the Noida office on 20 December, 2014.
Categories: Communities

ASP.Net Web API Self Hosting

C-Sharpcorner - Latest Articles - Sun, 12/21/2014 - 08:00
In this article, we will discuss the concept of self-hosting the Web API.
Categories: Communities

Microsoft Release Management: Web.Config Tokenization For Website Project

C-Sharpcorner - Latest Articles - Sun, 12/21/2014 - 08:00
Here I have tried to explain one of my implementations of web.config tokenization applicable to website type projects in Visual Studio.
Categories: Communities

Design a Master Page With Header, Footer and Body in MVC Application

C-Sharpcorner - Latest Articles - Sun, 12/21/2014 - 08:00
In this article you will learn how to design a Master Page with a header, footer and body in an MVC Application.
Categories: Communities

Current Location Tracking and Reverse Geocoding in Windows Store Apps

C-Sharpcorner - Latest Articles - Sun, 12/21/2014 - 08:00
This article covers methods to display the user's location using GPS, the usage of Pushpins and Reverse Geocoding.
Categories: Communities

Unleashing Visual States in Expression Blend For VS 2013 (WPF, WP, Windows Store)

C-Sharpcorner - Latest Articles - Sun, 12/21/2014 - 08:00
In this article you will learn how to unleash Visual States in Expression Blend for VS 2013 (WPF, Windows Phone and Windows Store).
Categories: Communities

How to Sync Two SQL Azure Databases

C-Sharpcorner - Latest Articles - Sun, 12/21/2014 - 08:00
This explains Azure SQL Data Sync.
Categories: Communities

Grouped LongListSelector in Windows Phone 8 Silverlight

C-Sharpcorner - Latest Articles - Sun, 12/21/2014 - 08:00
This article explains how to create a LongListSelector with grouping in a very simple and easy manner.
Categories: Communities

Area and SplineArea Chart in ASP.Net

C-Sharpcorner - Latest Articles - Sat, 12/20/2014 - 08:00
In this article we will learn about the Area and SplineArea Chart of ASP.Net.
Categories: Communities

Login Form With SQL in C#

C-Sharpcorner - Latest Articles - Sat, 12/20/2014 - 08:00
In this article I will tell you about database connections in Visual Studio 2012.
Categories: Communities

Microsoft Virtual Academy Links for 2014

Decaying Code - Maxime Rouiller - Sat, 12/20/2014 - 02:13

So I thought that going through a few Microsoft Virtual Academy links could help some of you.

Here are the links I think deserve at least a click. If you find them interesting, let me know!

Categories: Blogs

Temporarily ignore SSL certificate problem in Git under Windows

Decaying Code - Maxime Rouiller - Sat, 12/20/2014 - 02:13

So I've encountered the following issue:

fatal: unable to access 'https://myurl/myproject.git/': SSL certificate problem: unable to get local issuer certificate

Basically, we're working on a local Git Stash project and the certificates changed. While they were working to fix the issues, we had to keep working.

So I know that the server is not compromised (I talked to IT). How do I say "ignore it please"?

Temporary solution

This is temporary because you know they are going to fix it.

PowerShell code:

$env:GIT_SSL_NO_VERIFY = "true"

CMD code:

SET GIT_SSL_NO_VERIFY=true

This will get you up and running as long as you don’t close the command window. The variable is gone as soon as you close it.

Permanent solution

Fix your certificates. Oh… you mean it’s self signed and you will forever use that one? Install it on all machines.

Seriously. I won’t show you how to permanently ignore certificates. Fix your certificate situation, because trusting ALL certificates without caring whether they are valid or not is just plain dangerous.

Fix it.

NOW.

Categories: Blogs

The Yoda Condition

Decaying Code - Maxime Rouiller - Sat, 12/20/2014 - 02:13

So this will be a short post. I would like to introduce a word into my vocabulary, and yours too if it wasn't already there.

I would like to credit Nathan Smith for teaching me that word this morning. First, the tweet:

Chuckling at "disallowYodaConditions" in JSCS… https://t.co/unhgFdMCrh — Awesome way of describing it. pic.twitter.com/KDPxpdB3UE

— Nathan Smith (@nathansmith) November 12, 2014

So... this made me chuckle.

What is the Yoda Condition?

The Yoda Condition can be summarized as "inverting the operands compared in a conditional".

Let's say I have this code:

string sky = "blue";
if (sky == "blue") {
    // do something
}

It can be read easily as "If the sky is blue". Now let's put some Yoda into it!

Our code becomes:

string sky = "blue";
if ("blue" == sky) {
    // do something
}

Now our code reads as "If blue is the sky". And that's why we call it a Yoda condition.

Why would I do that?

First, if you're missing an "=" in your code, it will fail at compile time, since you can't assign a value to a literal string. It can also avoid certain null reference errors.
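The tweet above is about JSCS, a JavaScript style checker, so here is a small JavaScript sketch of the missing "=" case. It relies on the Function constructor to compile source text at run time; the compiles helper and the two snippet strings are made up for illustration.

```javascript
// Returns true when the source text compiles as a function body,
// false when the engine rejects it before running anything.
function compiles(source) {
  try {
    new Function('sky', source);
    return true;
  } catch (err) {
    return false;
  }
}

// Missing "=" in the usual order: this is a valid assignment.
var usualTypo = 'if (sky = "blue") { return "matched"; } return "no match";';
// The same typo in Yoda order: assigning to a literal is rejected.
var yodaTypo = 'if ("blue" = sky) { return "matched"; } return "no match";';

console.log(compiles(usualTypo)); // true
console.log(compiles(yodaTypo));  // false

// The usual-order typo runs and silently matches any input.
var check = new Function('sky', usualTypo);
console.log(check('grey')); // matched
```

In the usual order the typo turns the comparison into an assignment that is always truthy, so the bug only shows up at run time. In Yoda order the same typo never gets past the parser.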

What's the cost of doing this then?

Besides getting on the nerves of all the programmers on your team? You reduce the readability of your code by a huge factor.

Each developer on your team will hit a snag on every if, since they will have to learn how to speak "Yoda" to read your code.

So what should I do?

Avoid it. At all costs. Readability is the most important thing in your code. To be honest, you're not going to be the only guy/girl maintaining that app for years to come. Make it easy for the maintainer and remove that Yoda talk.

The problem this kind of code solves isn't worth the readability you are losing.

Categories: Blogs