It is not easy to monitor how our code behaves on a vast array of different machines. A myriad of configurations can lead to errors that are difficult to reproduce and even more difficult to anticipate. And when a customer calls to complain about a crash, the information provided on what led to the problem is often incomplete or misleading. Fortunately, remote telemetry of software applications is here to help and is going mainstream even on the desktop. Let’s see how easy it is to monitor a desktop Windows application using the new Azure Application Insights service: this article on the Azure site explains all the necessary steps. Summing up, here is what we need to do:
- create an Application Insights resource in the Azure portal
- make a copy of the Instrumentation key, as we will need it later in our app
- add one of the following NuGet packages: Microsoft.ApplicationInsights.WindowsServer for the full set of functionalities, including performance counter collection and dependency monitoring, or Microsoft.ApplicationInsights that includes the core API only
- initialize the TelemetryClient object in your app
- set the instrumentation key in the code of the app:
TelemetryConfiguration.Active.InstrumentationKey = "your key";
- insert telemetry calls, like TrackPageView, TrackException etc.
For additional reference on these steps, check out the official ApplicationInsights repository on GitHub.
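The steps above can be sketched as follows; this is a minimal setup, where the `Telemetry` wrapper class is our own naming convention (not part of the SDK) and "your key" stands for the instrumentation key copied from the Azure portal:

```csharp
using Microsoft.ApplicationInsights;
using Microsoft.ApplicationInsights.Extensibility;

// Minimal telemetry bootstrap: call Telemetry.Initialize() once at startup,
// then use Telemetry.Client for all Track* calls.
public static class Telemetry
{
    public static TelemetryClient Client { get; private set; }

    public static void Initialize()
    {
        // The instrumentation key copied from the Azure portal.
        TelemetryConfiguration.Active.InstrumentationKey = "your key";
        Client = new TelemetryClient();
    }
}
```

A single shared `TelemetryClient` is fine here, as the class is thread-safe and the SDK batches and sends data in the background.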
However, before peppering our code with calls to the TelemetryClient like the following ones
- TrackPageView(pageName) on switching forms, pages, or tabs
- TrackEvent(eventName) for other user actions
- TrackMetric(name, value) to send regular reports of metrics not attached to specific events
- TrackTrace(logEvent) for diagnostic logging
- TrackException(exception) in catch clauses
let’s try to adapt the logging interface we already have in place so that it also carries telemetry data (TrackTrace and TrackException map naturally to the Info and Error log levels), and create a new implementation of that interface which forwards each entry to both the local log and remote telemetry at the same time, with minimal effort. Once the benefits of telemetry are tangible, adding monitoring of users’ actions with TrackPageView and TrackEvent will be easy to justify.
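As a sketch of that idea — the `ILogger` shape and the `TelemetryLogger` name are hypothetical here, since your existing logging interface will differ — the dual-destination logger could look like this:

```csharp
using System;
using Microsoft.ApplicationInsights;

// Hypothetical logging interface; substitute the one already in your code base.
public interface ILogger
{
    void Info(string message);
    void Error(string message, Exception exception = null);
}

// Forwards each entry both to the existing local log and to Application
// Insights: Info maps to TrackTrace, Error to TrackException.
public class TelemetryLogger : ILogger
{
    private readonly ILogger _localLogger;
    private readonly TelemetryClient _client;

    public TelemetryLogger(ILogger localLogger, TelemetryClient client)
    {
        _localLogger = localLogger;
        _client = client;
    }

    public void Info(string message)
    {
        _localLogger.Info(message);
        _client.TrackTrace(message);
    }

    public void Error(string message, Exception exception = null)
    {
        _localLogger.Error(message, exception);
        if (exception != null)
            _client.TrackException(exception);
        else
            _client.TrackTrace(message); // no exception object: fall back to a trace
    }
}
```

Because the rest of the application only depends on the logging interface, swapping in this implementation turns every existing Info and Error call into telemetry with no further code changes.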
Once the telemetry integration is done and the software is deployed, we log into the Azure dashboard and open Application Insights for the test app:
Being a demo app, there is not a lot of usage data available, but the graph on top still shows how many users are interacting with our app and how much. What should catch our attention is the number of crashes in the last 24 hours:
Luckily crashes are trending down, but 587 crashes in a day are still way too many, so let’s click on the Crashes box to gain greater insight into what is happening:
The graph on top shows the distribution of crashes over the last 24 hours, and how many users were impacted by them, but we are eager to get to the culprit, and the table below is very clear:
A specific System.ArgumentException is the major cause for concern here, accounting for 96.9% of crashes, so let’s click on it to expand the crash information:
The first half of the screen shows generic information about the environment hosting our app (OS version, device name) and the troubling version(s) of the app. Let’s scroll down in this panel to get to the log of exceptions:
Here they are, logged and sorted by time of occurrence! So let’s click on the first one to see what happened:
Now we know not only the exception thrown (System.ArgumentException) but also the method throwing the exception (GetThumbnailFileName). However, we’re not done yet, as scrolling down reveals the Call Stack:
Just remember to turn off “Show Just My Code” so that calls into the .NET CLR are visible too, and here we are: GetThumbnailFileName calls System.IO.Path.GetDirectoryName, which in turn calls System.IO.Path.NormalizePath. With all these hints, debugging the code is easy: the call to GetDirectoryName was being passed an invalid path name, causing the exception. Problem solved!
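For illustration, the failure mode can be reproduced in isolation. On the .NET Framework, Path.GetDirectoryName validates its argument and throws System.ArgumentException for paths containing invalid characters (note that .NET Core and later relaxed this check, so the snippet assumes the classic framework the app was running on):

```csharp
using System;
using System.IO;

class Demo
{
    static void Main()
    {
        // '"' is in Path.GetInvalidPathChars() on Windows, so on the
        // .NET Framework this call throws System.ArgumentException
        // ("Illegal characters in path.") from the internal NormalizePath.
        try
        {
            Path.GetDirectoryName("C:\\images\\\"photo\".jpg");
        }
        catch (ArgumentException ex)
        {
            Console.WriteLine(ex.GetType().Name);
        }
    }
}
```

Guarding the path with Path.GetInvalidPathChars() before calling GetDirectoryName is one straightforward way to keep this crash from ever reaching users again.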
And there is more: since the code logs Trace information, we can also see the full log surrounding the exception and retrace all the steps, such as the user’s actions, that led to the crash: