In the realm of microservices, distributed logging is king!

Overview

Microservices are everywhere. A large application is split up into smaller deployable units that can be scaled individually. Cool!

But is it really?

As soon as the interplay between these microservice applications goes haywire, the other side of the coin reveals itself. What’s the problem and how is this different from a monolithic application?

As soon as there is an issue with our application, we go and have a look at the log files to figure out what went wrong. In a monolithic application, this is easy (compared to a microservice application) since there is only a single log file that contains all our log statements.
When analyzing a problem in a microservice architecture, it looks a bit like this:

As soon as a problem is detected, we have a couple of tedious steps to go through:

  • We first need to collect all logs from the involved microservices (not necessarily a trivial thing to do in a complex system)
  • Afterwards, we need to filter these logs down to the ones related to the problem case. Especially in a large system with many other users (who hopefully don’t encounter the issue you are investigating), the logs also contain events from all these other interactions.
  • Next up, our analysis can start. But if we are unlucky, it’s not too clear what is going wrong and we need to add some log statements and redeploy. We are already in a tricky situation here because this implies that we need to be able to reproduce the problem.
  • We keep repeating this process until we have identified the problem, fixed it and can carry on with our actual work.

Damn. Feels like we should be able to do better. In particular, we should be able to debug a microservice system more efficiently.

What if we could introduce some tooling to improve the workflow to the following:

Optimized workflow when debugging a microservice system

Distributed logging to the rescue

To tackle these problems, we’ll have a look at the different components needed to set up a workflow like the one above.

Bear in mind that the number of tools that can help us out is enormous. We will only introduce the concepts and pseudorandomly pick a couple of tools to prove the point.

First things first, we need a demo application made up of microservices. We’ll borrow an extended version of a microservice application from one of our previous blogs: the cookie app.

The diagram above shows some of the interactions between the different microservices, but it’s not that important to know exactly how they cooperate. Actually, that’s even the point! We’ll see how our logging tools can help us understand how these applications interact.

In this blog, we’ll focus on distributed logging and tracing. There are a lot of other metrics that come almost for free with the logging tools we’ll introduce, but we’ll ignore them for now.

Collecting logs

A first important part of the solution will be to collect all logs from our applications in a central place. For this demo, we used Datadog as a logging backend and interface. This is what the log collection dashboard looks like:

Collecting logs in a central place already solves our first problem by automatically bundling all log files into a single searchable log. But it’s still difficult to know which log entries are linked to one another.

That’s where distributed tracing shines!

Distributed traces

Traces record the path that a request follows throughout the different microservices involved. Alongside this flow, the trace can also be linked to the logs that are recorded during this request.

Under a trace, there are multiple spans. A span is a piece of the workflow that reflects one subpart of the total interaction caused by the request.
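
To make the terminology a bit more tangible, here is a tiny Scala sketch of the concept. This is a made-up model purely for illustration, not the data format any of the tools below actually use:

// A deliberately simplified, hypothetical model of the trace/span relationship.
final case class Span(
  traceId: String,              // shared by every span of the same request
  spanId: String,               // unique per sub-operation
  parentSpanId: Option[String], // links a span to the operation that triggered it
  service: String,              // e.g. "cookie-service"
  operation: String,            // e.g. "GET /api/cookies"
  startMicros: Long,
  durationMicros: Long
)

final case class Trace(traceId: String, spans: Seq[Span])

Every span carries the id of the trace it belongs to, which is exactly what the tooling uses to group everything that happened for a single request.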

A trace for a single API request on the monster-service is visualized at the top of the Datadog dashboard, while the multiple spans it consists of are linked underneath:

For our API call on the monster-service to fetch the amounts of all cookies present, we can see how this service uses the other services to fulfill the request. First, it fetches all cookies that are configured in the cookie-service (green span). Then it requests the amount per cookie type from the inventory-service (red spans) in parallel.
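
To make that fan-out concrete, here is a rough Scala sketch of what such an endpoint could look like. The service interfaces and method names are invented for illustration (the real cookie app wires Lagom service clients instead), but the Future composition is what produces one green span followed by the parallel red spans:

import scala.concurrent.{ExecutionContext, Future}

// Hypothetical client interfaces for the demo services; names and signatures
// are made up for illustration and are not the actual cookie-app code.
trait CookieService    { def getCookies(): Future[Seq[String]] }
trait InventoryService { def getAmount(cookieType: String): Future[Int] }

class MonsterService(cookies: CookieService, inventory: InventoryService)(implicit ec: ExecutionContext) {

  // Fetch all configured cookies, then ask the inventory for the amount of each type.
  // The first call corresponds to the single green span; the per-type calls run
  // concurrently and correspond to the parallel red spans under the same trace.
  def totalCookieAmount(): Future[Int] =
    for {
      cookieTypes <- cookies.getCookies()
      amounts     <- Future.traverse(cookieTypes)(t => inventory.getAmount(t))
    } yield amounts.sum
}

Since the per-type futures are started together, they show up as overlapping sibling spans, which is exactly what the red bars above reflect.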

Additionally, the relevant logs that are correlated with our trace are bundled within this trace (in the bottom part of the UI). That means we no longer need to manually filter the logs for the issue we are investigating. Sweet!

Now let’s see how this could benefit us when some interaction between services goes south when there is a cookie of type blaster:

Without adding any specific logging to the application, we can identify which of the parallel calls caused the cookie-monster API to return an internal server error as well as the logs that are relevant in this context.

Time to get our hands dirty

That’s cool and all, but how does this affect my codebase, you ask. Let’s have a look at the different steps that are needed to actually set this all up.

Collecting logs

Let’s start with the easy part. Logs don’t have to be linked to traces and collecting the logs in a central place can already improve our debugging process. When using Kubernetes, conceptually it would work like this:

An agent collects all the logs that are also accessible through Kubernetes itself and sends them to a log collection backend. In our example, a Datadog agent sends the logs to the managed Datadog service.

Setting up the Datadog agent in your cluster is very well documented.
For example, when using the Helm deployment, the following configuration is needed to collect logs:

  ## Enable logs agent and provide custom configs
  logs:
    enabled: true
    containerCollectAll: true
    containerCollectUsingFiles: true

Distributed traces

Distributed tracing also does not require any manual code to be written, but it does need some configuration in your application.

All involved components are shown in the diagram:

First of all, we need to instrument our code. Its main responsibility is to create your traces. The instrumentation is specific to your language and potentially your framework of choice, which is exactly what allows it to record the interactions with other components in the system, for example an API call that is made, a future that is created, … .
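
To give a feeling for what such a library automates, this is roughly what creating a span by hand looks like with the OpenTracing API. The operation name and wrapper are illustrative; with the instrumentation in place you don’t write any of this yourself:

import io.opentracing.util.GlobalTracer

// A rough sketch of manual span creation with the OpenTracing API; an instrumentation
// library does the equivalent of this for you around HTTP calls, futures, etc.
object ManualTracingSketch {
  def fetchCookies(): Seq[String] = {
    val tracer = GlobalTracer.get()
    val span   = tracer.buildSpan("cookie-service.getCookies").start() // operation name is illustrative
    val scope  = tracer.activateSpan(span) // make the span current for this unit of work
    try {
      // ... perform the actual call here and return its result ...
      Seq.empty
    } finally {
      scope.close()
      span.finish() // records the duration and reports the span to the configured tracer
    }
  }
}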

Since our demo application is written in Scala with the Lagom framework, we have opted to use Lightbend Telemetry (shown in the diagram by its codename, Cinnamon). This library is tailored to instrument all aspects of our Lagom application and also opens up a lot more insights than traces alone.

Adding the library to our codebase is as easy as adding the sbt plugin:

addSbtPlugin("com.lightbend.cinnamon" % "sbt-cinnamon" % "2.15.0")

Enabling the plugin and adding the components you want:

lazy val `cookie-impl` = (project in file("cookie-impl"))
  .enablePlugins(LagomScala, Cinnamon)
  .settings(
    addCompilerPlugin(scalafixSemanticdb),
    libraryDependencies ++= Seq(
      lagomScaladslPersistenceJdbc,
      lagomScaladslKafkaBroker,
      lagomScaladslTestKit,
      macwire,
      scalaTest,
      scalaLogging,
      logback,
      postgresDriver,
      h2Driver,
      lagomScaladslAkkaDiscovery,
      akkaDiscoveryKubernetesApi,
      Cinnamon.library.cinnamonLagom,
      Cinnamon.library.cinnamonOpenTracing,
      Cinnamon.library.cinnamonOpenTracingDatadog,
      Cinnamon.library.cinnamonSlf4jMdc,
      Cinnamon.library.cinnamonCHMetrics,
      playJsonJoda))

And lastly, configuring the exact measurements you want to be instrumented by the library:

cinnamon.lagom.http.servers {
  "*:*" {
    paths {
      "*" {
        metrics = on
        traceable = on
      }
    }
  }
}

cinnamon.lagom.http.clients {
  "*:*" {
    paths {
      "*" {
        metrics = on
        traceable = on
      }
    }
  }
}

cinnamon {
  application = cookie-app
}

cinnamon.opentracing.datadog {
  service.name = cookie-app
}

cinnamon.akka {
  persistence.entities {
    "/user/*" {
      report-by = group
      command-type = on
      event-type = on
      traceable = on
    }
    "/system/sharding/?/*" {
      report-by = group
      traceable = on
    }
  }
}

cinnamon.scala {
  future.instrumentation = on
}

Log correlation for traces

While traces are very useful on their own, correlating the logs within the context of your trace is very powerful for capturing additional information. The instrumentation library can inject the constructed trace id into the log context (via the MDC) to enable the log correlation we saw before.
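
The nice part of the MDC route is that existing log statements stay untouched. A small Scala sketch of the idea, where we put a dummy trace id into the MDC by hand purely to illustrate the mechanism (in the real setup Cinnamon populates it for us):

import org.slf4j.{LoggerFactory, MDC}

// Once a trace id sits in the MDC, a plain log statement picks it up automatically
// via the %X{dd.trace_id} placeholder in the logback pattern shown further below.
object LogCorrelationSketch {
  private val log = LoggerFactory.getLogger(getClass)

  def handleRequest(): Unit = {
    MDC.put("dd.trace_id", "1234567890") // dummy id, normally injected by the instrumentation
    try log.info("Fetching cookie amounts")
    finally MDC.remove("dd.trace_id")
  }
}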

As you could see above, we need to activate the responsible component in the build.sbt configuration:

Cinnamon.library.cinnamonSlf4jMdc

And configure it to include the trace ids:

cinnamon.slf4j.mdc {
  log-correlation += opentracing-trace-id
  opentracing-trace-id {
    name = "dd.trace_id"
  }
}

Additionally, it’s important to configure your logging framework to include the trace ids (made accessible via the MDC) in the log messages, so that the logging backend can correlate logs with traces:

<appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
        <Pattern>%d{yyyy-MM-dd HH:mm:ss} %-5p %c{1}:%L - %X{dd.trace_id:-0} %X{dd.span_id:-0} - %m%n</Pattern>
    </encoder>
</appender>

Integrate with your tracing agent

The instrumentation library has to be configured to contact your collection agent. This agent knows how to send the traces to your monitoring backend, where they can be visualized. We’ll show how to integrate with Datadog. However, choosing another monitoring backend is as easy as reconfiguring the instrumentation library to contact a different agent. This is possible because of the OpenTracing standard that is used in between. We visualized this in the diagram above with the integration of another tracing backend, Jaeger: the Jaeger agent runs as a sidecar in the Kubernetes pod, while the Datadog agent runs by default as a stand-alone pod on each node in your cluster.
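
Conceptually, this works because the application (or rather, the instrumentation library on its behalf) only talks to the OpenTracing Tracer interface; which concrete tracer implementation ends up registered is a wiring detail. A minimal sketch of that idea (with Lightbend Telemetry this registration happens through configuration, you don’t write it yourself):

import io.opentracing.Tracer
import io.opentracing.util.GlobalTracer

// All span-producing code depends only on the OpenTracing Tracer interface.
// Whether a Datadog or a Jaeger tracer gets registered here is decided by
// configuration/wiring, so swapping backends never touches application code.
object TracerWiringSketch {
  def install(tracer: Tracer): Unit = {
    GlobalTracer.registerIfAbsent(tracer)
  }
}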

To integrate with Datadog, we configure the instrumentation with the tracing agent’s location and some global tags:

cinnamon.opentracing {
  datadog {
    trace.agent.url = ${?DD_AGENT_HOST}
    trace.global.tags = "env:prod,span.kind:server,span.type:web"
  }
}

And the host environment variable should be set, alongside some tagging of your application container:

...
template:
  metadata:
    annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port: '9001'
      "sidecar.jaegertracing.io/inject": "simplest"
      ad.datadoghq.com/cookie-app.logs: '[{"source":"java","service":"cookie-app"}]'
    labels:
      app: cookie-app
      actorSystemName: cookie-app
      tags.datadoghq.com/env: "PROD"
      tags.datadoghq.com/service: "cookie-app"
      tags.datadoghq.com/version: "1.0.0"
...
- name: DD_ENV
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['tags.datadoghq.com/env']
- name: DD_SERVICE
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['tags.datadoghq.com/service']
- name: DD_VERSION
  valueFrom:
    fieldRef:
      fieldPath: metadata.labels['tags.datadoghq.com/version']
- name: DD_PROFILING_ENABLED
  value: "true"
- name: DD_LOGS_INJECTION
  value: "true"

Note that we need to set the source to java for Datadog to be able to interpret the log format and collect the trace ids automatically. An alternative is to configure a custom log parser in Datadog to define your own format, but by setting the source to java, the parsing works out of the box.

It’s nice to add labels like version, env and service to identify your deployments in the dashboards.

The result: traces in combination with their logs in a dashboard to explore.
A dream come true.

Conclusion

Et voilà: without writing any custom code, we have demonstrated how to configure Lightbend Telemetry and Datadog as a monitoring backend for our microservice demo application.

In general, these monitoring tools can help us break the tedious debugging workflow of collecting and filtering logs in a microservice system. They allow us to move straight to analyzing the problem at hand within its context, and so reduce the time between noticing a problem and fixing it. A win-win for you and your “angry” manager.