Tag: Observability

  • How to Deploy Jaeger on AWS EC2: A Step-by-Step Guide

    How to Deploy Jaeger on AWS EC2: A Step-by-Step Guide

    Jaeger is an open-source distributed tracing system used to monitor and troubleshoot microservices-based architectures. Deploying Jaeger on AWS gives you visibility into how requests flow through your services and helps you troubleshoot performance issues.

    In this article, we will provide a step-by-step guide on how to deploy Jaeger on AWS.

    Step 1: Set up an AWS Account

    The first step in deploying Jaeger on AWS is to set up an AWS account. If you already have an AWS account, you can skip this step. Otherwise, you can sign up for a free AWS account at aws.amazon.com.

    Step 2: Launch an EC2 Instance

    The next step is to launch an EC2 instance on AWS. An EC2 instance is a virtual machine that runs on the AWS cloud. You can use any EC2 instance type, but we recommend using a t2.micro instance for testing purposes.

    To launch an EC2 instance, follow these steps:

    1. Go to the EC2 dashboard in the AWS Management Console.
    2. Click on the “Launch Instance” button.
    3. Choose the Amazon Linux 2 AMI.
    4. Select the t2.micro instance type.
    5. Configure the instance details and storage.
    6. Configure the security group to allow inbound traffic on port 22 for SSH access and port 16686 for the Jaeger UI. If your applications run outside this instance, also open the ports they will use to send spans (for example, 14268 for the collector's HTTP endpoint or 6831/6832 UDP for the agent).
    7. Launch the instance and create a new key pair.

    Step 3: Install Jaeger

    Once your EC2 instance is up and running, you can install Jaeger on it. Follow these steps:

    1. Connect to your EC2 instance using SSH.
    2. Update the system packages by running the command: sudo yum update -y
    3. Install Jaeger. Jaeger is not shipped in the default Amazon Linux 2 yum repositories, so download the release tarball from the Jaeger releases page on GitHub (a sketch is shown below); it contains the jaeger-agent, jaeger-collector, jaeger-query, and jaeger-all-in-one binaries. Alternatively, you can run the jaegertracing/all-in-one Docker image.
    4. Verify that Jaeger is installed by running one of the binaries with the --help flag, for example: ./jaeger-all-in-one --help
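
    As a rough sketch, downloading and unpacking a release tarball looks like the following. The version number here is only an example; check the Jaeger releases page for the current one.

      # Example only: substitute the latest release from
      # https://github.com/jaegertracing/jaeger/releases
      JAEGER_VERSION=1.45.0
      wget https://github.com/jaegertracing/jaeger/releases/download/v${JAEGER_VERSION}/jaeger-${JAEGER_VERSION}-linux-amd64.tar.gz
      tar -xzf jaeger-${JAEGER_VERSION}-linux-amd64.tar.gz
      cd jaeger-${JAEGER_VERSION}-linux-amd64
      ./jaeger-all-in-one --help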

    Step 4: Configure Jaeger

    After installing Jaeger, you need to configure it to work with your applications. Follow these steps:

    1. Open (or create) the Jaeger agent configuration file, for example: sudo vi /etc/jaeger/agent.yaml
    2. Edit the configuration file to specify the correct collector endpoint and sampling rate. For example, you can set the following values:
      collector:
        endpoint: "http://your-collector-endpoint:14268/api/traces"
      sampler:
        type: "const"
        param: 1
    

    3. Save the configuration file and exit.

    Step 5: Start the Jaeger Agent

    After configuring Jaeger, you need to start the Jaeger agent. The Jaeger agent is responsible for receiving trace data from your applications and forwarding it to the Jaeger collector.

    Follow these steps to start the Jaeger agent:

    1. Open a new terminal window and connect to your EC2 instance using SSH.
    2. Start the Jaeger agent by running the command: sudo systemctl start jaeger-agent (this assumes a systemd unit for the agent exists; a minimal example is sketched below).
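
    A systemd unit is not created automatically when Jaeger is installed from a release tarball, so this step assumes you have written one. A minimal unit file, saved for example as /etc/systemd/system/jaeger-agent.service, might look like the following; the binary path and user are assumptions, so adjust them to where you unpacked Jaeger.

      [Unit]
      Description=Jaeger agent
      After=network.target

      [Service]
      # Assumed location of the unpacked Jaeger binaries
      ExecStart=/home/ec2-user/jaeger/jaeger-agent
      Restart=on-failure
      User=ec2-user

      [Install]
      WantedBy=multi-user.target

    After creating the file, reload systemd before starting the service:

      sudo systemctl daemon-reload
      sudo systemctl start jaeger-agent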

    Step 6: Access the Jaeger UI

    Once the Jaeger agent is running, you can access the Jaeger UI to view your trace data. Follow these steps:

    1. Open a web browser and navigate to http://your-ec2-instance-public-ip:16686
    2. The Jaeger UI should load, and you can start exploring your trace data.

    Step 7: Integrate Jaeger with Your Applications

    Finally, you need to integrate Jaeger with your applications to start collecting trace data. To do this, you need to add the Jaeger client libraries to your application code and configure them to send trace data to the Jaeger agent.

    The exact process for integrating Jaeger with your applications will depend on the programming language and framework you are using. However, most Jaeger client libraries have similar APIs and can be integrated with minimal changes to your application code.

    For example, if you are using Node.js, you can install the Jaeger client library using npm:

    npm install --save jaeger-client
    

    Then, you can configure the Jaeger client by adding the following code to your application:

    const initJaegerTracer = require('jaeger-client').initTracer;
    
    const config = {
      serviceName: 'my-service',
      sampler: {
        type: 'const',
        param: 1,
      },
      reporter: {
        agentHost: 'localhost',
        agentPort: 6832,
      },
    };
    
    const options = {};
    
    const tracer = initJaegerTracer(config, options);
    

    This code initializes the Jaeger tracer with a sampler that always samples traces and a reporter that sends trace data to the Jaeger agent running on the local machine.
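
    With the tracer initialized, you can create spans around units of work in your handlers. The operation name, tag key, and userId variable below are placeholders for your own code:

      const span = tracer.startSpan('fetch-user');
      span.setTag('user.id', userId);
      // ... perform the work ...
      span.finish();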

    Once you have integrated Jaeger with your applications, you can start collecting and analyzing trace data to improve the performance and reliability of your microservices.

    Conclusion

    Deploying Jaeger on AWS can help you gain visibility into your microservices-based architectures and troubleshoot performance issues. In this article, we provided a step-by-step guide on how to deploy Jaeger on AWS and integrate it with your applications.

    By following these steps, you can set up a distributed tracing system that can help you improve the performance and reliability of your applications running on AWS.

  • A Crash Course in OpenTelemetry

    A Crash Course in OpenTelemetry

    In today’s world, monitoring your application is more important than ever before. As applications become more complex, it becomes increasingly challenging to identify bottlenecks, troubleshoot issues, and optimize performance. Fortunately, OpenTelemetry provides a powerful framework for collecting, exporting, and processing telemetry data, making it easier to gain insight into your application’s behavior. In this article, we’ll provide a crash course in OpenTelemetry, explaining what it is, how it works, and how you can use it to monitor your applications.

    What is OpenTelemetry?

    OpenTelemetry is an open-source framework that provides a standard way to collect, export, and process telemetry data for distributed systems. It supports various languages and platforms, making it easy to integrate into your existing applications. The framework consists of three main components: the SDK, the OpenTelemetry Collector, and the exporters.

    The SDK is responsible for instrumenting your application code and collecting telemetry data. It provides libraries for various languages, including Java, Python, Go, and .NET. The SDK also supports various metrics and trace APIs, allowing you to customize the telemetry data you collect.

    The OpenTelemetry Collector is responsible for receiving, processing, and exporting telemetry data. It provides a flexible way to ingest data from various sources, including the SDK, third-party agents, and other collectors. The Collector also provides various processing pipelines for transforming and enriching the telemetry data.

    Finally, the exporters are responsible for sending the telemetry data to various backends, such as Prometheus and Jaeger, where it can be stored and then visualized and analyzed with tools like Grafana.
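
    To make the Collector's role more concrete, here is a minimal, illustrative Collector configuration that receives OTLP data, batches it, and exports it; the exact set of available exporters depends on your Collector distribution and version:

      receivers:
        otlp:
          protocols:
            grpc:
            http:

      processors:
        batch:

      exporters:
        logging:

      service:
        pipelines:
          traces:
            receivers: [otlp]
            processors: [batch]
            exporters: [logging]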

    How does OpenTelemetry work?

    OpenTelemetry works by instrumenting your application code with the SDK, which collects telemetry data and sends it to the OpenTelemetry Collector. The Collector then processes the data and exports it to the backends specified by the exporters. This process allows you to gain insight into your application’s behavior, identify issues, and optimize performance.

    Let’s take a look at an example. Suppose we have a simple Python application that runs on a server and provides a REST API. We want to monitor the application’s performance, including the request latency, error rate, and throughput. We can use OpenTelemetry to collect this data and export it to Prometheus for visualization and analysis.

    First, we need to install the OpenTelemetry API and SDK for Python, along with the WSGI instrumentation and the Prometheus exporter:

    pip install opentelemetry-api opentelemetry-sdk opentelemetry-exporter-prometheus opentelemetry-instrumentation-wsgi prometheus-client
    

    Next, we need to instrument our application code with the SDK. With the current OpenTelemetry Python API (1.x), that looks roughly like the following; the Prometheus exporter is registered as a metric reader, and finished spans are simply printed to the console here for simplicity:

    from prometheus_client import start_http_server
    
    from opentelemetry import metrics, trace
    from opentelemetry.exporter.prometheus import PrometheusMetricReader
    from opentelemetry.instrumentation.wsgi import OpenTelemetryMiddleware
    from opentelemetry.sdk.metrics import MeterProvider
    from opentelemetry.sdk.trace import TracerProvider
    from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
    
    # Initialize the tracer provider and print finished spans to the console
    provider = TracerProvider()
    provider.add_span_processor(BatchSpanProcessor(ConsoleSpanExporter()))
    trace.set_tracer_provider(provider)
    
    # Expose a /metrics endpoint for Prometheus to scrape (port 8000 here)
    start_http_server(port=8000)
    
    # Register the Prometheus exporter as a metric reader on the meter provider
    metrics.set_meter_provider(MeterProvider(metric_readers=[PrometheusMetricReader()]))
    
    # Instrument the WSGI application with OpenTelemetryMiddleware
    app = OpenTelemetryMiddleware(app)
    

    This code initializes the tracer provider with a batch span processor, starts an HTTP server that exposes a /metrics endpoint, registers the Prometheus metric reader with the meter provider, and instruments the WSGI application with OpenTelemetryMiddleware. Now every request to our API will be traced, and the collected metrics will be exposed for Prometheus to scrape.

    Finally, we can use Prometheus to visualize and analyze the telemetry data. We configure Prometheus to scrape the application's /metrics endpoint and then use the Prometheus web UI to query the exported data. We can then create graphs, alerts, and dashboards to monitor our application's performance and identify issues.

    Why use OpenTelemetry?

    OpenTelemetry provides several benefits for monitoring your applications:

    1. Standardization: OpenTelemetry provides a standard way to collect, export, and process telemetry data, making it easier to integrate with various platforms and tools.
    2. Flexibility: OpenTelemetry supports various languages, platforms, and backends, making it easy to use with your existing infrastructure.
    3. Customization: OpenTelemetry provides various APIs for customizing the telemetry data you collect, allowing you to monitor specific aspects of your application’s behavior.
    4. Open-source: OpenTelemetry is open-source and community-driven, ensuring that it remains relevant and up-to-date with modern monitoring practices.
    5. Interoperability: OpenTelemetry integrates with various observability platforms, making it easy to share telemetry data across your organization.

    Conclusion

    Monitoring your applications is essential for identifying issues, optimizing performance, and ensuring a good user experience. OpenTelemetry provides a powerful framework for collecting, exporting, and processing telemetry data, making it easier to gain insight into your application’s behavior. By using OpenTelemetry, you can standardize your monitoring practices, customize the telemetry data you collect, and integrate with various observability platforms.

  • Semantic Conventions in OpenTelemetry

    Semantic Conventions in OpenTelemetry

    In this article, we’re going to learn about semantic conventions in OpenTelemetry and how they make data processing much easier. We’ll also discuss the different types of semantic conventions. Without further ado, let’s get started.

    What Are Semantic Conventions?

    Semantic conventions in general are the agreed-upon meaning of words and phrases within a particular language or culture. They help us communicate with each other by providing a shared understanding of our symbols.

    For example, the “thumbs up” gesture is a convention that means “good job” or “I agree” in many cultures.

    Without these conventions, communication would be much more difficult; we would constantly have to explain the meaning of every single word we use.

    What Do Semantic Conventions Mean In OpenTelemetry?

    Semantic conventions are important in OpenTelemetry because they help to define the meaning of resources and metrics. Semantic conventions provide a common language for all users of the system. This allows for a more accurate interpretation of data and helps to ensure that everyone is on the same page when it comes to resource usage and performance.

    In this article, we will take a quick look at the different kinds of semantic conventions provided by OpenTelemetry. Let’s start with metric semantic conventions.

    What Are Metric Semantic Conventions?

    The OpenTelemetry project has published a set of metric semantic conventions that can be used by any software that collects or displays metrics data.

    The metric semantic conventions define a set of core dimensions that should be used when recording metric data. These dimensions include name, description, unit, and type. In addition, the conventions define a set of recommended labels that can be used to further describe the data. By following these standards, it is possible to create easily understood metrics that can be effectively compared.

    A quick example of the naming conventions is limit, which means the known total amount of something: for example, system.memory.limit for the total amount of memory on a system.

    Similarly, utilization means the fraction of usage out of its limit and should be named entity.utilization; for example, system.memory.utilization for the fraction of memory in use. Utilization values are in the range [0, 1].
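
    As an illustrative sketch using the OpenTelemetry Python metrics API (psutil is just one way to read memory usage, and the callback and description are placeholders), a gauge following these naming conventions could be registered like this:

      import psutil
      from opentelemetry import metrics
      from opentelemetry.metrics import CallbackOptions, Observation

      meter = metrics.get_meter(__name__)

      def observe_memory_utilization(options: CallbackOptions):
          # Fraction of memory in use, in the range [0, 1]
          yield Observation(psutil.virtual_memory().percent / 100.0)

      meter.create_observable_gauge(
          name="system.memory.utilization",
          callbacks=[observe_memory_utilization],
          unit="1",
          description="Fraction of memory in use",
      )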

    For more information about metric semantic conventions, please check the official OpenTelemetry documentation.

    In addition to the metric semantic conventions, the OpenTelemetry team has also published standards for logging and tracing data. By using these standards, it is possible to create software that can generate consistent results regardless of the underlying implementation.

    Let’s take a closer look at semantic conventions for spans and traces.

    Semantic Conventions for Spans and Traces

    It’s recommended to use attributes to describe the dimensions of the telemetry data collected. For example, when dealing with network data, attributes might describe the source and destination IP address, the port numbers, etc. Attributes can also be used to describe metadata about the data itself. For example, when dealing with log data, attributes might describe the timestamp, the logging level, etc.

    Attributes are stored in so-called AttributeMaps. An AttributeMap is a map from attribute keys to attribute values. The keys are typically strings, but they can also be other data types. The values can be any data type that can be represented as a JSON value.

    One of the benefits of using attributes is that they provide a way to add additional information to the data without changing the data itself. This is especially useful when dealing with legacy systems that cannot be modified.

    Another benefit of attributes is that they can be used to filter and group data. For example, if one has a log file that contains messages from multiple sources, they can use attributes to filter out messages from certain sources. Or, if they want to group all messages with the same logging level, they can use attributes.

    Here is an example of manually defining resource attributes for a service:

    # Resource attributes that describe a service.
    service.namespace = Company.Shop
    service.name = shoppingcart

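    In the OpenTelemetry Python SDK, the same resource attributes can be attached to the tracer provider roughly like this:

      from opentelemetry import trace
      from opentelemetry.sdk.resources import Resource
      from opentelemetry.sdk.trace import TracerProvider

      resource = Resource.create({
          "service.namespace": "Company.Shop",
          "service.name": "shoppingcart",
      })
      trace.set_tracer_provider(TracerProvider(resource=resource))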

    Events

    Events are one of the core concepts in OpenTelemetry, providing a way to record significant moments or states in the system that can be used for monitoring and analysis. They can be generated manually by operators or automatically by the OpenTelemetry SDK. OpenTelemetry events contain metadata about the event and any relevant data that was collected at the time of the event.

    Events can be used to track the progress of a system through its lifecycle or to identify changes in state that may indicate an issue. They can also be used to record performance data, such as response times or resource utilization. By analyzing events, it is possible to understand how a system is functioning and where potential problems may lie.

    Here is an example of an event recorded in a span:

    {
      "name": "Hello",
      "context": {
        "trace_id": "0x5b8aa5a2d2c872e8321cf37308d69df2",
        "span_id": "0x051581bf3cb55c13"
      },
      "parent_id": null,
      "start_time": "2023-01-19T18:52:58.114201Z",
      "end_time": "2023-01-19T18:52:58.114687Z",
      "attributes": {
        "namespace": "Company.Shop",
        "service.name": "shoppingcart"
      },
      "events": [
        {
          "name": "Guten Tag!",
          "timestamp": "2023-01-19T18:52:58.114561Z",
          "attributes": {
            "event_attributes": 12
          }
        }
      ]
    }

    The event here has a name, timestamp and some attributes.
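
    An event like this can be added to the current span from application code. Here is a minimal sketch with the OpenTelemetry Python SDK; the span name, event name, and attribute are placeholders:

      from opentelemetry import trace

      tracer = trace.get_tracer(__name__)

      with tracer.start_as_current_span("Hello") as span:
          span.add_event("Guten Tag!", attributes={"event_attributes": 12})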

    Conclusion

    Semantic conventions are important in OpenTelemetry because they provide a common language for all users of the system. The OpenTelemetry project has published a set of metric semantic conventions that can be used by any software that collects or displays metrics data. In addition, the OpenTelemetry team has also published standards for logging and tracing data. By using these standards, it is possible to create software that can generate consistent results regardless of the underlying implementation.

    To learn more about semantic conventions, please check the official OpenTelemetry documentation.

  • Root Cause Analysis (RCA) Using Distributed tracing

    Root Cause Analysis (RCA) Using Distributed tracing

    Distributed tracing is a method of tracking the propagation of a single request as it’s handled by various services that make up an application. Tracing in that sense is “distributed” because in order to fulfill its function, a single request must often traverse process, machine and network boundaries.

    Once we have instrumented our application and exported our telemetry data to an observability backend (like Sumo Logic or New Relic), it’s time to use this data to debug our production system efficiently. In this article, we will explore debugging techniques applied to observability data and what separates them from traditional techniques used to debug production applications.

    To learn more about tracing and what it means to instrument an application and export telemetry data, please check this article.

    Before we start with how we can use traces and spans to debug our production applications during incidents, it’s important to take a brief look at how we used to do it using logs and metrics, the old way.

    Old way of debugging an application using logs and metrics

    Prior to distributed tracing, system and application debugging mostly occurred by building upon what you already knew about a system. This can be observed in the way the most senior members of an engineering team approach troubleshooting: it can seem magical when they know exactly the right question to ask and instinctively know the right place to look. That magic is born from intimate familiarity with the application.

    To pass this magic to other team members, managers usually ask senior engineers to write detailed runbooks in an attempt to identify and solve every possible problem (Root Cause) they might encounter. But that time spent creating runbooks and dashboards is largely wasted, because modern systems rarely fail in precisely the same way twice.

    Anyone who has ever written or used a runbook can tell you a story about just how woefully inadequate they are. Perhaps they work to temporarily address technical debt: there’s one recurring issue, and the runbook tells other engineers how to mitigate the problem until the upcoming sprint when it can finally be resolved. But more often, especially with distributed systems, a long thin tail of problems that almost never happen are responsible for cascading failures in production. Or, five seemingly impossible conditions will align just right to create a large-scale service failure in ways that might happen only once every few years.

    Yet engineers typically embrace that dynamic as just the way that troubleshooting is done—because that is how the act of debugging has worked for decades. First, you must intimately understand all parts of the system—whether through direct exposure and experience, documentation, or a runbook. Then you look at your dashboards and then you…intuit the answer? Or maybe you make a guess at the root cause, and then start looking through your dashboards for evidence to confirm your guess.

    Even after instrumenting your applications to emit observability data, you might still be debugging from known conditions. For example, you could take that stream of arbitrarily wide events and pipe it to tail -f and grep it for known strings, just as troubleshooting is done today with unstructured logs. Or you could take query results and stream them to a series of infinite dashboards, as troubleshooting is done today with metrics. You see a spike on one dashboard, and then you start flipping through dozens of other dashboards, visually pattern-matching for other similar shapes.

    But what happens when you don’t know what’s wrong or where to start looking, when the conditions you need to debug are completely unknown to you?

    The real power of observability is that you don’t have to know so much in advance of debugging an issue. You should be able to systematically and scientifically take one step after another, to methodically follow the clues to find the answer, even when you are unfamiliar (or less familiar) with the system. The magic of instantly jumping to the right conclusion by inferring an unspoken signal, relying on past scar tissue, or making some leap of familiar brilliance is instead replaced by methodical, repeatable, verifiable process.

    Debugging a production application using traces

    Debugging a production application using traces and spans is different. It doesn’t require much experience with the application itself; you just need to be curious to learn more about what’s actually happening with the application in the production environment. It simply works like this:

    1. Start with the overall view of what prompted your investigation: what did the customer or alert tell you?
    2. Then verify that what you know so far is true: is a notable change in performance happening somewhere in this system? Data visualizations can help you identify changes of behaviour as a change in a curve somewhere in the graph.
    3. Search for dimensions that might drive that change in performance. Approaches include: examining sample rows from the area that shows the change (are there any outliers in the columns that might give you a clue?); slicing those rows across various dimensions looking for patterns (do any of those views highlight distinct behaviour across one or more dimensions? Try an experimental group-by on commonly useful fields, like status_code); and filtering for particular dimensions or values within those rows to better expose potential outliers.
    4. Do you now know enough about what might be occurring? If so, you’re done! If not, filter your view to isolate this area of performance as your next starting point. Then return to step 3.

    You can use this loop as a brute-force method to cycle through all available dimensions to identify which ones explain or correlate with the outlier graph in question, with no prior knowledge or wisdom about the system required.

    Example

    For example, let’s say we have a spike in request latency for some APIs across different users. If we isolate those slow requests, we can easily see that the slow-performing events mostly originate from one particular availability zone (AZ) of our cloud infrastructure provider (assuming we have the AZ information in the spans). After digging deeper, we might notice that one particular virtual machine instance type appears to be more affected than others.

    This information has been tremendously helpful: we now know the conditions that appear to be triggering slow performance. A particular type of instance in one particular AZ is much more prone to very slow performance than other infrastructure we care about. In that situation, the glaring difference pointed to what turned out to be an underlying network issue with our cloud provider’s entire AZ.

    Another Example

    Here’s another example of root cause analysis using spans, to make sure it’s clear. Let’s assume that after deploying a new version of our application, we notice that some APIs are getting slower. To investigate this issue, we follow the debugging loop described above. We start by taking a deeper look at the slow APIs and searching for dimensions that might drive that change in performance. After diving deeper, we find that all those APIs call a payment_service. Looking at the spans related to payment_service, we find that it fetches data from a PostgreSQL database, specifically from a table called user_payments_history. Comparing those spans with similar spans from the same API calls before the deployment, we find that the queries to the user_payments_history table are new and are actually taking a noticeable amount of time to return the required data.

    The problem here might be a missing index that makes the query slow, or the user_payments_history table might simply have too many records. There is no way to be sure of the root cause from the trace alone, but at least we know that something is wrong with the user_payments_history table in the payment_service.

    Not all issues are as immediately obvious as this underlying infrastructure issue. Often you may need to look at other surfaced clues to triage code-related issues. The process remains the same, and you may need to slice and dice across dimensions until one clear signal emerges, similar to the preceding example.

    Conclusion

    With complex distributed systems, it has become really hard to figure out what is actually going on in a production application. That’s why metrics and logs alone are not enough to debug those apps and find the root cause of an incident.

    Traces and spans can help in that situation. With high-cardinality events, we can collect lots of information about our system that will be really handy when dealing with incidents under time pressure, and we have a systematic approach for finding the root cause of incidents, assuming we are collecting enough information (dimensions) in our spans.

    To learn more about observability, please check:


  • Sampling Traces In OpenTelemetry

    Sampling Traces In OpenTelemetry

    At scale, the cost to collect, process, and store traces can dramatically outweigh the benefits, because many of these events are virtually identical and successful. The point of debugging is to search for patterns or examine failed events during an outage. That’s why it’s wasteful to transmit 100% of all events to the observability backend.

    To debug effectively, we just need a representative sample of successful events which can be compared to bad events.

    We can sample events by using the strategies outlined in this article and still provide granular visibility into system state. Unlike pre-aggregated metrics that collapse all events into one coarse representation of system state over a given period of time, sampling allows us to make informed decisions about which events can help us surface unusual behaviour, while still optimizing for resource constraints. The difference between sampled events and aggregated metrics is that full cardinality is preserved on each dimension included in the representative event.

    In OpenTelemetry, there are two approaches to sampling: head-based sampling and tail-based sampling. Let’s review both approaches and see when to use each.

    Head-based sampling

    As the name suggests, head-based sampling means to make the decision to sample or not at the beginning of the trace.

    This is the most common way of doing sampling today because of its simplicity, but since we don’t know everything in advance, we’re forced to make arbitrary decisions (like sampling a random percentage of all spans) that may limit our ability to understand everything.

    A disadvantage of head-based sampling is that you can’t decide to sample only spans with errors, because the sampling decision is made before any error happens.

    Built-in samplers include the AlwaysOn, AlwaysOff, TraceIDRatioBased, and ParentBased samplers.

    “AlwaysOn” (AlwaysSample) sampler

    As the name suggests, it samples all events and takes 100% of the spans. In a perfect world, we would use only this sampler, without any cost considerations.

    “AlwaysOff” (NeverSample) sampler

    Also as the name suggests, the AlwaysOff sampler samples 0% of the spans. This means that no data will be collected whatsoever. You probably won’t be using this one much, but it could be useful in certain cases. For example, when you run load tests and don’t want to store the traces created by them.

    ParentBased Sampler

    This is the most popular sampler and is the one recommended by the official OpenTelemetry documentation. When a trace begins, we make a decision whether to sample it or not, and whatever the decision is, the child spans will follow it.

    The main advantage of the ParentBased sampler is that you always get the complete picture: a trace is either sampled in full or not at all.

    How does this work? For the root span, we decide whether it will be sampled or not. The decision is sent to the rest of the child spans in this trace via context propagation, making each child know whether it needs to be sampled.

    It is important to understand that this is a composite sampler, which means it does not live on its own but it lets us define how to sample for each use case. For example, we can define what to do when we have no parent by using the root sampler.

    ParentBased(root=TraceIDRatioBased)

    It’s recommended to use the parent-based sampler with TraceIDRatioBased sampler as the root sampler.

    The TraceIDRatioBased sampler uses the trace ID to decide whether the trace should be sampled, with respect to the sample rate we choose.
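
    As a minimal sketch, configuring this combination in the OpenTelemetry Python SDK looks like the following; the 10% ratio is only an example:

      from opentelemetry import trace
      from opentelemetry.sdk.trace import TracerProvider
      from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

      # Sample roughly 10% of root traces; child spans follow the parent's decision
      sampler = ParentBased(root=TraceIdRatioBased(0.1))
      trace.set_tracer_provider(TracerProvider(sampler=sampler))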

    Tail-based sampling

    Contrary to head-based sampling, in tail-based sampling we make the decision to sample or not at the Collector level, after the whole trace has been received. This is useful when the decision depends on information only known at the end, such as overall latency, which requires the exact start and end times of the trace and therefore cannot be determined in advance.

    Also, what was a disadvantage of head-based sampling becomes an advantage for tail-based sampling: being able to sample only traces that contain errors.
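
    As a rough sketch, the tail_sampling processor in the OpenTelemetry Collector (contrib distribution) can express an errors-only policy along these lines; check the processor’s documentation for the exact fields supported by your Collector version:

      processors:
        tail_sampling:
          decision_wait: 10s
          policies:
            - name: errors-only
              type: status_code
              status_code:
                status_codes: [ERROR]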

    So where should sampling be implemented?

    Well, that depends on your specific use case so there is no one solution that fits all.

    If you choose to do it at the SDK level (head-based sampling), you remove redundant data at the source, never needing to worry about it again. You also minimize the data transported over the network. However, when you need to update the sample rate, you have to redeploy your services each time.

    If you implement it in the Collector, you have a centralized place that controls sampling, so you don’t need to redeploy your services when you change your sample rate. However, making the sampling decision requires buffering the data until a decision can be made, which adds overhead.

    Conclusion

    Sampling traces and spans is almost always a good idea, since it will save lots of money and most likely won’t affect the process of debugging production issues with spans and traces. There are different approaches to implementing it in OpenTelemetry: head-based sampling is simpler to implement, but changing the sample rate requires redeploying your services; tail-based sampling is a little harder to implement, but it gives us the ability to sample only traces with errors.

  • Observability vs Monitoring

    Observability vs Monitoring

    Observability is a measure of how well we can understand and explain any state our system can get into, no matter how weird it is. We must be able to  debug that strange state across all dimensions of system state data, and combinations of dimensions, in an ad hoc iterative investigation, without being required to define or predict those debugging needs in advance. If we can understand any bizarre or novel state without needing to ship new code, we have observability.

    Observability alone is not the entire solution to all software engineering problems. But it does help us clearly see what’s happening in all the corners of our software, where we are otherwise typically stumbling around in the dark, trying to understand things.

    A production software system is observable if we can understand new internal system states without having to make random guesses, predict those failure modes in advance, or ship new code to understand that state.

    Why Are Metrics and Monitoring Not Enough?

    Monitoring and metrics-based tools were built with certain assumptions about the architecture and organisation, assumptions that served in practice as a cap on complexity. These assumptions are usually invisible until we exceed them, at which point they cease to be hidden and become the bane of our ability to understand what’s happening. Some of these assumptions might be as follows:

    • Our application is a monolith.
    • There is one stateful data store (“the database”), which we run.
    • Many low-level system metrics are available (e.g., resident memory, CPU load average).
    • The application runs on containers, virtual machines (VMs), or bare metal, which we control.
    • System metrics and instrumentation metrics are the primary source of information for debugging code.
    • We have a fairly static and long-running set of nodes, containers, or hosts to monitor.
    • Engineers examine systems for problems only after problems occur.
    • Dashboards and telemetry exist to serve the needs of operations engineers.
    • Monitoring examines “black-box” applications in much the same way as local applications.
    • The focus of monitoring is uptime and failure prevention.
    • Examination of correlation occurs across a limited (or small) number of dimensions.

    When compared to the reality of modern systems, it becomes clear that traditional monitoring approaches fall short in several ways. The reality of modern systems is as follows:

    • The application has many services.
    • There is polyglot persistence (i.e., different databases and storage systems).
    • Infrastructure is extremely dynamic, with capacity flickering in and out of existence elastically.
    • Many far-flung and loosely coupled services are managed, many of which are not directly under our control.
    • Engineers actively check to see how changes to production code behave, in order to catch tiny issues early, before they create user impact.
    • Automatic instrumentation is insufficient for understanding what is happening in complex systems.
    • Software engineers own their own code in production and are incentivized to proactively instrument their code and inspect the performance of new changes as they’re deployed.
    • The focus of reliability is on how to tolerate constant and continuous degradation, while building resiliency to user-impacting failures by utilizing constructs like error budget, quality of service, and user experience.
    • Examination of correlation occurs across a virtually unlimited number of dimensions.

    The last point is important, because it describes the breakdown that occurs between the limits of correlated knowledge that one human can be reasonably expected to think about and the reality of modern system architectures. So many possible dimensions are involved in discovering the underlying correlations behind performance issues that no human brain, and in fact no schema, can possibly contain them.

    With observability, comparing high-dimensionality and high-cardinality data becomes a critical component of being able to discover otherwise hidden issues buried in complex system architectures.

    Distributed Tracing and Why It Matters

    Distributed tracing is a method of tracking the propagation of a single request – called a trace – as it’s handled by various services that make up an application. Tracing in that sense is “distributed” because in order to fulfill its function, a single request must often traverse process, machine and network boundaries.

    Traces help understand system interdependencies. Those interdependencies can obscure problems and make them particularly difficult to debug unless the relationships between them are clearly understood. For example, if a database service experiences performance bottlenecks, that latency can cumulatively stack up. By the time that latency is detected three or four layers upstream, identifying which component of the system is the root of the problem can be incredibly difficult because now that same latency is being seen in dozens of other services.

    Instrumentation with OpenTelemetry

    OpenTelemetry is an open-source CNCF (Cloud Native Computing Foundation) project formed from the merger of the OpenCensus and OpenTracing projects. It provides a collection of tools, APIs, and SDKs for capturing metrics, distributed traces and logs from applications.

    With OTel (short for OpenTelemetry), we can instrument our application code only once and send our telemetry data to any backend system of our choice (like Jaeger).

    Automatic instrumentation

    To minimize the time to first value for users, OTel includes automatic instrumentation. Because OTel’s charter is to ease adoption in the cloud native ecosystem and microservices, it supports the most common frameworks for interactions between services. For example, OTel automatically generates trace spans for incoming and outgoing gRPC, HTTP, and database/cache calls from instrumented services. This provides us with at least the skeleton of who calls whom in the tangled web of microservices and downstream dependencies.

    To implement that automatic instrumentation of request properties and timings, the framework needs to call OTel before and after handling each request. Thus, common frameworks often support wrappers, interceptors, or middleware that OTel can hook into in order to automatically read context propagation metadata and create spans for each request.

    Custom instrumentation

    Once we have automatic instrumentation, we have a solid foundation for making an investment in custom instrumentation specific to our business logic. We can attach fields and rich values, such as user IDs, brands, platforms, errors, and more to the auto-instrumented spans inside our code. These annotations make it easier in the future to understand what’s happening at each layer.

    By adding custom spans within our application for particularly expensive, time-consuming steps internal to our process, we can go beyond the automatically instrumented spans for outbound calls to dependencies and get visibility into all areas of our code. This type of custom instrumentation is what will help you practice observability-driven development, where we create instrumentation alongside new features in our code so that we can verify it operates as we expect in production in real time as it is being released.

    Adding custom instrumentation to our code helps us work proactively to make future problems easier to debug by providing full context – that includes business logic – around a particular code execution path.
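
    Here is a small sketch of what custom instrumentation can look like with the OpenTelemetry Python API; the span name, attribute keys, and the user_id and cart variables are placeholders for your own business logic:

      from opentelemetry import trace

      tracer = trace.get_tracer(__name__)

      def checkout(user_id, cart):
          # Custom span around an expensive, business-critical step
          with tracer.start_as_current_span("checkout.process-payment") as span:
              span.set_attribute("user.id", user_id)
              span.set_attribute("cart.item_count", len(cart))
              # ... payment logic goes here ...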

    Exporting telemetry data to a backend system

    After creating telemetry data by using the preceding methods, we’ll want to send it somewhere. OTel supports two primary methods for exporting data from our process to an analysis backend: we can proxy it through the OpenTelemetry Collector, or we can export it directly from our process to the backend.

    Exporting directly from our process requires us to import, depend on and instantiate one or more exporters. Exporters are libraries that translate OTel’s in-memory span and metric objects into the appropriate format for various telemetry analysis tools.

    Exporters are instantiated once, on program start-up, usually in the main function.

    Typically, we’ll need to emit telemetry to only one specific backend. However, OTel allows us to arbitrarily instantiate and configure many exporters, allowing our system to emit the same telemetry to more than one telemetry sink at the same time. One possible use case for exporting  to multiple telemetry sinks might be to ensure uninterrupted access to our current production observability tool, while using the same telemetry data to test the capabilities of a different observability tool we’re evaluating.
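
    For the direct-export path, a minimal sketch with the OTLP gRPC exporter in Python looks like this; the endpoint is an assumption, so point it at your own Collector or backend:

      from opentelemetry import trace
      from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
      from opentelemetry.sdk.trace import TracerProvider
      from opentelemetry.sdk.trace.export import BatchSpanProcessor

      # Instantiate the exporter once, at program start-up
      exporter = OTLPSpanExporter(endpoint="http://localhost:4317", insecure=True)

      provider = TracerProvider()
      provider.add_span_processor(BatchSpanProcessor(exporter))
      trace.set_tracer_provider(provider)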

    Conclusion

    Monitoring is best suited to evaluate the health of your systems. Observability is best suited to evaluate the health of your software.

    OTel is an open source standard that enables you to send telemetry data to any number of backend data stores you choose. OTel is a new vendor-neutral approach to ensure that you can instrument your application to emit telemetry data regardless of which observability system you choose.
