March 11, 2022

How to Achieve System Observability with BlazeMeter: The Guide

With the ever-increasing use of microservices and the rate of change of a platform, it is getting more challenging to monitor applications and quickly identify the reason for performance degradation. As a result, interest in system observability has grown lately. Much has been written and discussed, but what does it all really mean? In this post, we cover observability from the perspective of BlazeMeter Performance Testing. We also describe a common-sense approach to finding the most meaningful metrics, the ones that will enrich your understanding of the health and responsiveness of your application.

What is System Observability? Is it the Same as Monitoring?

System observability means the ability to understand a certain situation and explain it. In order to be able to ‘observe’ something, we need to have enough information to be able to discern what is going on. This information is collected from monitoring. In other words, monitoring is a first step towards observability.

Let’s think of this through an example.

A magician stands in front of you. She shows you that there is nothing up either of her sleeves. Then she takes off her top hat. The hat is tall, black, and deep. She then presents it to you for your inspection. The magician takes a large red handkerchief, waves it over the overturned hat, and says a few choice words. Suddenly, a rabbit pops up out of the hat.

You certainly were present in the situation, but did you really monitor it so you could gain observability and understand what happened?

The key here is that if you had really monitored the entire trick, you might have been able to notice that the hat had a false compartment at the top, and the rabbit was hanging out there until it was pulled out for everyone to view. If you had all that information and the tools to interpret it, you would have been able to observe the trick.

Without this, you were simply left wondering how the magician fooled you!

System Performance Observability

System observability is achieved when we can explain why the performance of a system is what it is. To achieve observability, we need metrics to analyze. Being able to generate predictable and repeatable traffic against systems is a great way to collect these metrics. These will help us observe the performance capabilities of an application under test so we can determine whether performance is acceptable or not.

Getting Started with Performance Observability with BlazeMeter

BlazeMeter is a great tool to help with the generation of performance loads against a system under test, so we can collect metrics that will help us understand and explain performance.  What we really want to find out is this:  “Does the application respond within the acceptance criteria - both for functional and performance tests?”.

By defining functional tests, you are defining how the application should respond. Taking this to the next level with performance tests, you learn whether the application will continue to function as expected while it is under a load of your choosing.

All BlazeMeter components generate synthetic transactions against defined endpoints and record the results. These testing and reporting capabilities, including the ability to compare different runs of load tests against each other, are where the rubber meets the road from an observability perspective.

The primary metrics that are highlighted on the Summary page of a given report are:

  • Test Start / End / Duration
  • Maximum Concurrent Users
  • Hits per second (Average Throughput)
  • Error %
  • Response Times:
    • Overall Average
    • 90th Percentile
  • Average Network Bandwidth (Kilobytes per second)

These metrics are helpful for providing a bird's-eye view of system performance.

As mentioned, metrics are the first step towards achieving observability. To gain a more detailed understanding about system performance, the Graphical Timeline report provides much more information.

This report enables seeing how the metrics from specific calls perform over time. Keep in mind that you can observe response time averages, minimum, maximum and percentile statistics here as well.

If you want to do some deeper analysis, the tabular form of this data can be downloaded from the Request Stats page of the report. In that data, notice how the average response time is typically much lower than the 90th or 95th percentile. This indicates a fairly wide spread in responsiveness.

So often we look only at hits per second and average response times, and fail to see that the same transaction does not perform the same way for all users. Be sure to make use of the observability you have available to you with BlazeMeter’s reporting capabilities.
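To make the gap between averages and percentiles concrete, here is a minimal Python sketch. It assumes a nearest-rank percentile, which may differ slightly from how BlazeMeter calculates its percentiles, and the response times are hypothetical values chosen for illustration.

```python
import math
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(response_times_ms):
    """Return the average next to the tail statistics for one transaction."""
    return {
        "average": mean(response_times_ms),
        "p90": percentile(response_times_ms, 90),
        "p95": percentile(response_times_ms, 95),
        "max": max(response_times_ms),
    }


# Hypothetical sample: most requests are fast, a few are very slow.
sample_ms = [120] * 17 + [900, 1500, 2400]
summary = summarize(sample_ms)
print(summary)
```

In this sample the average stays fairly low while the 90th and 95th percentiles are several times higher, which is exactly the spread that an average alone would hide.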

Have We Achieved System Performance Observability?

Performance testing metrics provide us with visibility into the system. However, if the test is not meeting the goals that have been established, this is an opportunity to dig deeper into the data to reach a more comprehensive understanding of the different components’ behavior.

We will divide this quest into three parts:

  1. Ensuring BlazeMeter isn’t the bottleneck
  2. Speed bump analysis
  3. Third party services

1. Ensure BlazeMeter Is Not The Bottleneck

The first step to understanding a system is to ensure you are really inspecting your system, and not an external application. If your load engines are overwhelmed, they may not be able to efficiently drive the test. So make sure your BlazeMeter engines are not impacting your test results.

Ensure your load engines are using no more than 75% of either CPU or memory. If they exceed this, understand what is driving it. You may have to lower the targeted number of users per engine, or you may have to alter the number of labels that are being managed by a given test.  Here are links to good information about calibration:
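As a rough illustration of the 75% rule, the following Python sketch flags engines whose peak CPU or memory reading is above the limit. The engine names and readings are hypothetical. In practice you would take these values from the engine health data for your test run.

```python
def overloaded_engines(engine_usage, limit_pct=75.0):
    """Return the engines whose CPU or memory usage exceeds limit_pct, with the offending metrics."""
    flagged = {}
    for engine, usage in engine_usage.items():
        over = {name: value for name, value in usage.items() if value > limit_pct}
        if over:
            flagged[engine] = over
    return flagged


# Hypothetical peak readings (percent) per load engine.
readings = {
    "engine-1": {"cpu": 62.0, "memory": 48.5},
    "engine-2": {"cpu": 91.3, "memory": 70.1},
    "engine-3": {"cpu": 40.0, "memory": 88.0},
}
flagged = overloaded_engines(readings)
print(flagged)
```

Any engine that shows up in the result is a candidate for fewer users per engine or fewer labels before you trust the response times it reported.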

2. Speed Bump Analysis

If your test results show issues you need to further understand, we recommend spending the time doing so. You really want to understand what, if anything, is holding up the processing of requests in this precious test environment. It is better to learn the issues here rather than in production when real users and real business transactions are stalled or lost.

Review the metrics as reported by BlazeMeter in the Timeline and Request Statistics pages of the report. Which are the biggest drivers? As mentioned above, don’t just look at averages. Look at the 90th percentile and maximum values as well.

Understanding Errors to Achieve System Observability

Error reporting in BlazeMeter is by design at a summary level. It provides overall counts and types of response codes, but it does not provide the details of the requests and responses. To understand the errors more clearly, you need to know how to dig deeper to get these details.

For example - here we have a run with nearly 20% errors:

Example of run errors

Going to the Errors page of the report, you have options to group errors by Label, Response Code, or Assertion Name.

Grouping errors

You can look at the errors from any of these perspectives, but when you do, you can only go so deep. For example, you can drill down and see errors here that had 304 and 403 response codes:

Specify error codes

You still cannot tell what exactly was called. To see actual calls and responses, go to your logs, and inside of the artifacts.zip file you should find the error.jtl file:

Error logs

Download the artifacts.zip file, unzip it and make sure you have the error.jtl file:

Download the error file

Use JMeter to read the error.jtl file.  Open up JMeter and then create a View Results Tree Listener:

Error logs

To read in the error.jtl file, find the section ‘Write results to file / Read from file’, then provide the path and filename for the file, or click Browse to search for it.

Read the error report

Now you can see that the file has been read by JMeter, and the red labels show the calls that resulted in an error.

 

Confirm the request that was made:

Confirm the request

And now you can see exactly what was provided in the response:

View Results Tree Apache

This should give you the information for a given request and why it failed.
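If you would rather summarize the file in a script before opening it in JMeter, a minimal Python sketch follows. It assumes the error.jtl you downloaded is in JMeter's XML results format, where each sample carries the attributes `s` (success), `lb` (label), and `rc` (response code). If your file is CSV, this parser does not apply. The embedded XML is a hypothetical excerpt, not output from a real run.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_errors(jtl_xml):
    """Count failed samples by (label, response code) in XML-format JTL text."""
    root = ET.fromstring(jtl_xml)
    counts = Counter()
    for sample in root.iter():
        # JMeter's XML JTL writes samples as <httpSample> or <sample> elements.
        if sample.tag not in ("httpSample", "sample"):
            continue
        if sample.get("s") == "false":  # s = success flag
            counts[(sample.get("lb"), sample.get("rc"))] += 1  # lb = label, rc = response code
    return counts


# Hypothetical excerpt standing in for a real error.jtl file.
SAMPLE_JTL = """<testResults version="1.2">
  <httpSample t="210" ts="1646990000000" s="false" lb="Login" rc="403" rm="Forbidden"/>
  <httpSample t="180" ts="1646990001000" s="false" lb="Login" rc="403" rm="Forbidden"/>
  <httpSample t="95" ts="1646990002000" s="false" lb="Get Cart" rc="304" rm="Not Modified"/>
</testResults>"""

errors = count_errors(SAMPLE_JTL)
for (label, code), count in errors.most_common():
    print(f"{label}: HTTP {code} x {count}")
```

For a real file, read its contents first, for example with `open("error.jtl", encoding="utf-8").read()`, and pass the text to `count_errors`. The counts tell you which labels to inspect first in the View Results Tree.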

3. Third Party Services

When you are dealing with a third party service, you typically are calling an endpoint over which you have little control. Put yourself in a good position to work with your service provider by being able to nail down the frequency and performance of calls. Provide the right information so that whoever is supporting the service has a chance of figuring out what, if anything, is wrong. Don't just call them up and say 'it does not work'. Explain what was attempted and what the response was.

Be ready with the following information:

  1. Time of the test - This would include both beginning and ending and the timezone. You just want to make sure you are giving them enough information so they are looking at exactly the same timeframe you are interested in.
    • If there are specific windows of time within the test when the performance deviated from an acceptable range - highlight these times as well.
  2. Number of calls during the test as well as the characteristics of the calls. Are they a mix of typical calls, or are they hardcoded to a single or just a few options?
  3. The response time metrics that are generated by BlazeMeter. Both for total test as well as identified trouble spots. Be able to share the following information:
    • Samples
    • Average Response Time
    • Average Hits/sec
    • 90th percentile response time
    • Min and max
    • Average bandwidth
    • Error Percentage
  4. Are all requests having issues, or does it look like the issues are coming on a certain cycle or is it random?

Knowing this information to this level should shed a strong light on where the issue may be coming from. In addition, this information will also provide insights into the systems that you have more access to.
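One way to pin down the specific windows of time to highlight for a provider is to bucket samples by time and flag the buckets whose average response time is out of range. The Python sketch below assumes you have (epoch seconds, response time in ms) pairs, for example extracted from your results file. The 60-second window, the 1000 ms threshold, and the sample data are all hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timezone


def slow_windows(samples, window_s=60, threshold_ms=1000):
    """Return (window start in UTC, average ms, sample count) for windows whose average exceeds threshold_ms."""
    buckets = defaultdict(list)
    for epoch_s, elapsed_ms in samples:
        # Align each sample to the start of its window.
        buckets[epoch_s - epoch_s % window_s].append(elapsed_ms)
    report = []
    for start in sorted(buckets):
        times = buckets[start]
        average = sum(times) / len(times)
        if average > threshold_ms:
            stamp = datetime.fromtimestamp(start, tz=timezone.utc).isoformat()
            report.append((stamp, average, len(times)))
    return report


# Hypothetical samples: (epoch seconds, response time in ms).
samples = [
    (1646990000, 200), (1646990030, 250),
    (1646990060, 1800), (1646990090, 2200),
    (1646990120, 300),
]
report = slow_windows(samples)
for stamp, average, count in report:
    print(f"{stamp}: average {average:.0f} ms over {count} calls")
```

Reporting the windows in UTC, with the timezone stated explicitly, helps make sure the provider is looking at exactly the same timeframe you are.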

Mock Services from BlazeMeter also provides a very elegant solution to test for unavailable services. You can virtualize parts of the system that are not under test, or not available (e.g., still in development), and get discrete insight into the quality and performance of what you’re testing. Mock Services realistically simulate the real-world behavior of a service. You can test your app under both good and difficult conditions with both happy paths and negative responses (slow response times, incomplete content, unexpected errors, or even chaotic behavior). For more information, see the Mocks tab under the product section of BlazeMeter.com.

Achieving Deeper System Observability

You can only go as deep into analysis as you have metrics to support the research. Therefore, to be able to dig deeper and understand why something may be taking longer than anticipated, we need more advanced metrics. These can come from APM tools and from additional metric sources, all of which can be imported into BlazeMeter reports.

APM Metrics

You can install an APM tool within your environment to monitor the detailed health of:

  • Web Servers
  • Application Servers
  • Database Servers
  • And further still, install a network monitoring tool to monitor the traffic between components.

Typically, using these types of tools requires a significant investment (both financial and time) to realize benefits. The investment is ongoing because one has to stay on top of it. The application will continue to grow and change over time (and acquire even more data that needs to be analyzed).

Key APM Metrics

When dealing with APM data, the biggest drivers of poor performance typically come from the following:

  • Servers
    • CPU usage over 80%
      • If the server or process is really busy, it is a challenge to take on the next request.
    • Memory paging 
      • If memory is overused and chunks of memory have to be committed to disk, this is a real speedbump.
    • Excessive Disk Usage
  • Java or .NET Processes
    • Garbage Collection
      • Excessive major garbage collection is a stop-the-world event.
    • Excessive CPU of the process itself
  • Databases
    • Locking / Blocked threads
    • Excessive Deadlocking
    • Table Scans (lack of effective indexes)

Current APM Integrations

To see which current APM integrations are supported, go to guide.blazemeter.com. Using the navigation pane, search for Integrations & Plugins > APM Integrations:

Alternatives to System Observability

If you are using an APM that does not have an out-of-the-box integration with BlazeMeter, then you can use a BlazeMeter API to import time series data into a BlazeMeter report.

If you do not have an APM, you can leverage data from other sources, such as Splunk, or even native operating system commands, to monitor your application environment. You can use the same API described below to import that data into a BlazeMeter report.

Importing External Metrics into BlazeMeter Reports for Deeper Analysis

Let’s say that you are trying to analyze data from multiple sources - and you do not have one of the configurable APMs that BlazeMeter supports. Then, you may want to import metrics from some other sources into your BlazeMeter report so you can document and analyze data in one place. You can conceivably import any number of metrics via an API that is described here.

Likely candidates for this type of activity are logs and native system monitoring tools.

Log Data

Splunk or DataDog are popular tools to monitor system and application logs for events as well as uptime. By looking at logs we can tell:

  • Startup and Shutdown
  • Event Information
    • Major garbage collection is an example of an event that can affect application performance.
    • Periodic batch jobs can affect application performance. Jobs like ‘accounting end of day’ or ‘purge abandoned shopping carts’ are examples of jobs that may increase the probability of locking or contention for resources.

Native System Monitoring

Operating system performance metrics can be obtained with native commands. Typical commands for this purpose include:

  • Windows system performance
    • perfmon
  • Unix performance monitor 
    • nmon
    • vmstat
    • iostat
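As an example of turning native command output into metrics you can analyze, the Python sketch below parses captured `vmstat` output into rows keyed by column name. It assumes the column layout printed by the Linux procps version of vmstat without the timestamp option. Other Unix variants use different columns, so treat the sample text and column names as assumptions to check against your own system.

```python
def parse_vmstat(text):
    """Parse captured `vmstat <interval>` output into a list of {column: int} rows."""
    rows = []
    columns = None
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("procs"):
            continue  # skip blank lines and the group header line
        if fields[0] == "r":
            columns = fields  # column-name header, which vmstat repeats periodically
            continue
        if columns and len(fields) == len(columns):
            rows.append(dict(zip(columns, map(int, fields))))
    return rows


# Hypothetical capture of two vmstat samples.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 803412  10240 512000    0    0     5    12  150  300  8  2 89  1  0
 6  1      0 201004  10240 498000    0    0   220   640  900 2100 71 14  9  6  0
"""

rows = parse_vmstat(SAMPLE)
# CPU busy percentage is everything that is not idle time.
busy = [100 - row["id"] for row in rows]
print(busy)
```

Once the rows are in this form, you can line them up against the timeline of your load test to see whether CPU, swap, or I/O activity coincides with slow responses.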

Conclusion and Next Steps

When it comes to the topic of ‘observability’, be mindful of where you get your data and how to interpret it.  You can only go as deep as you have metrics to support an analysis. Be aware that you have options to dig deeper with APM and other data sources.  When you are dealing with a system that you have no visibility into, make sure you provide the right information to your service provider to make it clear what you are expecting and what you are getting in return.
