Applying Observability with OpenTelemetry - Part 1 - Distributed Tracing

Gopher Tracing, by Lev Polyakov (www.polyakovproductions.com)

When I had to investigate a production issue in the past, one of the first steps was correlating log entries from multiple servers to reconstruct the full journey of a request. It was a real pain in the neck. In many instances, I had to add an extra log entry, redeploy the app and check again the next day. The whole process was painful and unproductive.

To make matters worse, most modern software systems no longer run on a single server but across multiple machines, databases, SaaS products and cloud vendors. They are implemented in different programming languages, leveraging a variety of architecture patterns and communication protocols. As a result, measuring the accuracy, latency, correctness and consistency of a distributed system is a genuinely hard problem.

Observability (a term borrowed from control theory) is the quality of a software system that generates enough data points to reason about the system during its operation. Fast, high-quality feedback from production (without having to add extra code) is crucial for engineers to improve and automate performant, reliable systems at scale. For years, OSS libraries and SaaS vendors have offered observability capabilities to export logs, metrics and traces out of software systems. In May 2019, OpenTelemetry (OTel) was announced as a CNCF sandbox project with a focus on standardising the telemetry data model, architecture and implementation for observable software. At the time of writing, the OTel SDKs are in beta and GA releases for the major languages are expected in 2021.

To gain a comprehensive view of OTel, I have assembled a project in Golang (called ShortenIt) to play with its key features. Please bear in mind that this custom solution is not an example of a production-ready one.

This is the first part in a series about OpenTelemetry:

  • Part 1 - Distributed Tracing (this post)
  • Part 2 - Logs and Metrics
  • Part 3 - Cloud-native integration with AWS


Explore OTel examples with ShortenIt

The ShortenIt solution provides functionality similar to the tinyurl.com website: we can generate a short URL from a very long one. Short URLs save space when sharing, displaying or printing. In the following design, the two main services api-shortenit-v1 and grpc-alias-provider-v1 are instrumented with OTel to send their telemetry data (metrics, traces) to backend collectors (Prometheus and Jaeger).

Shortenit Design

The two use cases we want to add distributed tracing to are:

Creating a short URL via POST /shortenit

A client sends a long URL to api-shortenit-v1. This service calls grpc-alias-provider-v1 to get a short key, and the URL mapping is persisted in MongoDB. If the request contains an authenticated user, we also need to save the mapping to the user's URL list. We want to trace the request from the client through api-shortenit-v1, grpc-alias-provider-v1, MongoDB and the Redis cache (alias store).

Get the original URL for an alias via GET /shortenit/{alias}

When the client wants to get the original URL back, api-shortenit-v1 checks the mapping in MongoDB. If found, the long URL is returned for the provided key. For analytics purposes, the API also publishes successful responses to a Kafka topic.

If you are keen on running the example locally, please follow the instructions outlined at https://github.com/thanhnamit/shortenit

Distributed tracing in action

Now let's have a look at an end-to-end trace of the request POST /shortenit. The client app can extract the initial TraceId and SpanId from the Traceparent response header. For example, 8d63ac3f01d517d71c5a803f3e37e981 is my unique TraceId. To view its trace timeline, enter the Id in Jaeger at http://localhost:16686/search.
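For reference, the Traceparent value follows the W3C Trace Context format (version-traceid-parentid-flags). Below is a minimal sketch of how a client could pull the TraceId out of the response header; the endpoint, port and request payload are my assumptions for a local run, not part of the ShortenIt code.

package main

import (
	"fmt"
	"net/http"
	"strings"
)

// traceIDFromTraceparent extracts the 32-hex-char trace-id from a W3C
// Traceparent value: "version-traceid-parentid-flags".
func traceIDFromTraceparent(h string) (string, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 {
		return "", fmt.Errorf("malformed traceparent: %q", h)
	}
	return parts[1], nil
}

func main() {
	// Hypothetical call to a locally running api-shortenit-v1.
	resp, err := http.Post("http://localhost:8080/shortenit", "application/json",
		strings.NewReader(`{"url":"https://example.com/a/very/long/path"}`))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	traceID, err := traceIDFromTraceparent(resp.Header.Get("Traceparent"))
	if err != nil {
		panic(err)
	}
	fmt.Println("search this id in Jaeger:", traceID)
}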

Trace Timeline

What the trace timeline tells us:

  • Number of participants: 2
  • The latency of the request: 7.33ms
  • Number of operations (spans): 17, along with the execution time of each
  • Causality relationships between spans

This single end-to-end view of a request is extremely valuable for engineers to analyse and detect anomalies when service quality degrades. Additionally, each span carries information collected at the time the operation executed, for example:

Span Details

This span is instrumented by otelgrpc, the auto-instrumentation library for gRPC. It collects the gRPC status code, service name and method, and host information. Spans can also be customised in application code to capture operation details if more visibility of the trace is required.

Let's look at distributed tracing from the implementation perspective. The language I use here is Golang, but a similar approach should be available in other languages since they all conform to the same OTel specification.

Configuring Exporter, Tracer, Sampler and Propagator

The first step is to configure the necessary components for tracing. The OTel Tracing API standardises the following concepts:

  • Exporter: a protocol-specific, pluggable component that implements OTel's SpanExporter interface to send telemetry data to collectors (e.g. Jaeger)
  • TraceProvider: creates Tracers
  • Tracer: creates Spans and links them together
  • Sampler: controls whether a span is recorded
  • Propagator: reads and writes context data across process boundaries

The following code configures a full pipeline backed by Jaeger's collector endpoint, normally available at http://127.0.0.1:14268/api/traces. The TraceProvider, TextMapPropagator and AlwaysSample sampler are configured and registered globally.

func InitTracer(serviceName string, traceCollector string) func() {
	// Install a Jaeger export pipeline; flush drains buffered spans on shutdown.
	flush, err := jaeger.InstallNewPipeline(
		jaeger.WithCollectorEndpoint(traceCollector),
		...
		jaeger.WithSDK(&sdktrace.Config{DefaultSampler: sdktrace.AlwaysSample()}),
	)
	...
	// Register W3C TraceContext and Baggage as the global propagators.
	otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(propagation.TraceContext{}, propagation.Baggage{}))
	return flush
}
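In the service's main function, the returned flush function is typically deferred so buffered spans are exported before shutdown. A minimal usage sketch (the service name and collector endpoint are assumptions based on the local setup described above):

func main() {
	// Flush any buffered spans before the process exits.
	flush := InitTracer("api-shortenit-v1", "http://127.0.0.1:14268/api/traces")
	defer flush()

	// ... wire up handlers and start the server
}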

With that configuration, let’s look at several instrumentation use cases:

Custom span and in-process context passing

Tracing an individual function that contains blocks of logic can give us a meaningful measurement of how the main logic performs. An application should be able to propagate in-process context through its call stack so that an existing span can be extracted for linking and a new span can be inserted. Depending on the language, the context is available explicitly (e.g. Golang) or implicitly via thread-local variables (e.g. Java). In this instance, I created a custom span called service.GetNewAlias that comprises multiple child spans created as the main service function calls other functions to fulfil its logic.

Custom Span

To create a new span in Go, we obtain an instance of the tracer. If the function has access to an existing context, it can pass that context to the tracer's Start function to create a new child span.

// main function
func (d DefaultService) GetNewAlias(ctx context.Context, request core.ShortenURLRequest) (core.ShortenURLResponse, error) {
	tr := otel.Tracer(d.cfg.TracerName)
	ctx, span := tr.Start(ctx, "service.GetNewAlias")
	defer span.End()
	...

// child function
func (r *UserRepo) GetUserByEmail(ctx context.Context, email string) (*core.User, error) {
	tr := otel.Tracer(r.cfg.TracerName)
	ctx, span := tr.Start(ctx, "repository.user.GetUserByEmail")
	defer span.End()
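The spans above only carry a name. To capture operation details (as mentioned earlier), a span can also be enriched with attributes and events. Here is a minimal sketch: the attribute keys are made up for illustration, and it assumes a recent SDK where the attribute package is available (pre-1.0 releases used the label package instead).

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
)

func recordAliasCreated(ctx context.Context, alias string, longURL string) {
	_, span := otel.Tracer("api-shortenit-v1").Start(ctx, "service.recordAliasCreated")
	defer span.End()

	// Attach business attributes to the span (keys are illustrative).
	span.SetAttributes(
		attribute.String("shortenit.alias", alias),
		attribute.Int("shortenit.url.length", len(longURL)),
	)

	// Mark a point-in-time event on the span's timeline.
	span.AddEvent("alias.persisted")
}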

In-process context passing is not new as it has been implemented in many frameworks. Inter-process context passing, however, requires a different strategy for interoperability.

Tracing remote calls and propagating inter-process context

In many situations, our service needs to call an upstream service, so this is an ideal point for instrumentation. In ShortenIt's case, api-shortenit-v1 calls grpc-alias-provider-v1 over gRPC. As you might guess, two spans are created: one on the caller's side and one on the receiver's side.

Cross boundaries

As shown above, all we need to do is add the dependency and configure it on both the client and the server.

go get -u go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc@v0.15.0

The gRPC client needs to be configured with an interceptor:

func NewAliasClient(cfg *config.Config) *AliasClient {
	conn, err := grpc.DialContext(
		context.Background(),
		cfg.AliasCon,
		grpc.WithInsecure(),
		grpc.WithBlock(),
		grpc.WithUnaryInterceptor(otelgrpc.UnaryClientInterceptor()),
	)
	//...
}

And an interceptor on the server side:

s := grpc.NewServer(
    grpc.UnaryInterceptor(otelgrpc.UnaryServerInterceptor()),
)

It sounds like magic, but if we look at the interceptor code, all it does is interact with the OTel SDK's propagators to inject metadata into an outgoing request and extract it from an incoming request. Currently, the OTel SDKs bundle the W3C Trace Context implementation as the default.
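To demystify those interceptors, the sketch below hand-rolls the same inject/extract round trip with a toy carrier. The real interceptors use a carrier backed by gRPC metadata rather than a plain map, and this assumes a recent SDK version of the TextMapCarrier interface (Get/Set/Keys).

import (
	"context"

	"go.opentelemetry.io/otel"
)

// mapCarrier is a toy TextMapCarrier backed by a plain map, standing in
// for gRPC metadata or HTTP headers.
type mapCarrier map[string]string

func (c mapCarrier) Get(key string) string { return c[key] }
func (c mapCarrier) Set(key, value string) { c[key] = value }
func (c mapCarrier) Keys() []string {
	keys := make([]string, 0, len(c))
	for k := range c {
		keys = append(keys, k)
	}
	return keys
}

// Caller side: write the current span context into the carrier. With the
// W3C propagator this produces a "traceparent" entry.
func injectContext(ctx context.Context) mapCarrier {
	carrier := mapCarrier{}
	otel.GetTextMapPropagator().Inject(ctx, carrier)
	return carrier
}

// Receiver side: rebuild a context so that spans started from it become
// children of the caller's span.
func extractContext(ctx context.Context, carrier mapCarrier) context.Context {
	return otel.GetTextMapPropagator().Extract(ctx, carrier)
}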

Instrumenting HTTP listener

Instrumenting at the listener level provides valuable information about how a client interacts with a REST service. We can record HTTP-related attributes in span tags (method, path, user_agent, status code, ...).

Http Listener

The instrumentation library is otelhttp, which is not part of the OTel spec but provided by the community or a vendor. We need to add this dependency and register it as middleware in the HTTP pipeline.

go get -u go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp@v0.15.0

func NewGlobalHandler(handler http.Handler, operation string) func(w http.ResponseWriter, r *http.Request) {
	return toHandlerFunc(withCORS(withAPIKey(otelhttp.NewHandler(handler, operation))))
}
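Wiring the instrumented handler into a router could then look like the sketch below. The route paths, port and placeholder handlers are hypothetical; the point is that the operation name passed in becomes the server span's name.

import (
	"log"
	"net/http"
)

func main() {
	// Hypothetical handlers standing in for the real ShortenIt ones.
	shortenitHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {})
	aliasHandler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {})

	mux := http.NewServeMux()
	// The operation name passed here becomes the server span's name.
	mux.HandleFunc("/shortenit", NewGlobalHandler(shortenitHandler, "POST /shortenit"))
	mux.HandleFunc("/shortenit/", NewGlobalHandler(aliasHandler, "GET /shortenit/{alias}"))

	log.Fatal(http.ListenAndServe(":8080", mux))
}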

Instrumenting storage clients

As a lot of performance issues are database-related, instrumenting database interactions is a crucial part of analysing and troubleshooting them. Commercial and OSS database vendors have started adopting OTel instrumentation by providing their own versions of the library. The span below, created by otelmongo, exposes details about the query execution and the database instance our API interacts with.

Mongo

go get -u go.opentelemetry.io/contrib/instrumentation/go.mongodb.org/mongo-driver/mongo/otelmongo

Setting up a new Mongo connection with an additional monitor:

opts := options.Client()
// otelmongo's command monitor emits a span for every MongoDB command.
opts.Monitor = otelmongo.NewMonitor(cfg.AppName)
opts.ApplyURI(cfg.MongoCon)
db, err := mongo.NewClient(opts)
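One thing to keep in mind: otelmongo only links its spans to the request trace if the instrumented context is passed into each driver call. A rough sketch follows; the database name, collection name and mapping struct are illustrative guesses, not the actual ShortenIt code.

import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// URLMapping is an illustrative document shape for a stored alias.
type URLMapping struct {
	Alias string `bson:"alias"`
	URL   string `bson:"url"`
}

// FindAlias passes the incoming ctx (which carries the active span) into the
// driver call, so the span created by otelmongo joins the request's trace.
func FindAlias(ctx context.Context, client *mongo.Client, alias string) (*URLMapping, error) {
	var mapping URLMapping
	err := client.Database("shortenit").Collection("urls").
		FindOne(ctx, bson.M{"alias": alias}).
		Decode(&mapping)
	if err != nil {
		return nil, err
	}
	return &mapping, nil
}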

Instrumentation for messaging

Instrumenting messaging or asynchronous calls is not always easy, as there is a message queue/topic in between and context has to be passed along with the messages (which may require a change to the message structure). Shopify's engineering team has contributed the otelsarama library to support Kafka client/server instrumentation.

Kafka

go get -u "go.opentelemetry.io/contrib/instrumentation/github.com/Shopify/sarama/otelsarama"

Configuring Sarama's producer:

...
producer, err := sarama.NewAsyncProducer(brokerList, cfg)
...
// Wrap the producer so a span is created for every published message.
producer = otelsarama.WrapAsyncProducer(cfg, producer)
...

With that in place, we explicitly inject the trace metadata using the configured propagator before sending a message to the topic:

...
// Write the current span context into the message headers via the global propagator.
otel.GetTextMapPropagator().Inject(ctx, otelsarama.NewProducerMessageCarrier(&msg))
ep.producer.Input() <- &msg
...
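On the consumer side, the trace context carried in the message headers has to be extracted again so that processing spans become children of the producer's span. A sketch of a manual extract using otelsarama's consumer message carrier (the handler and tracer names are illustrative):

import (
	"context"

	"github.com/Shopify/sarama"
	"go.opentelemetry.io/contrib/instrumentation/github.com/Shopify/sarama/otelsarama"
	"go.opentelemetry.io/otel"
)

// handleMessage rebuilds the trace context from the Kafka message headers so
// spans created here join the same trace as the producer.
func handleMessage(msg *sarama.ConsumerMessage) {
	ctx := otel.GetTextMapPropagator().Extract(
		context.Background(),
		otelsarama.NewConsumerMessageCarrier(msg),
	)

	tr := otel.Tracer("analytics-consumer") // tracer name is illustrative
	_, span := tr.Start(ctx, "kafka.consume.shortenit-events")
	defer span.End()

	// ... process the analytics event
}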

Web and mobile instrumentation

Most frontend apps today interact with REST APIs over HTTP, so this scenario is similar to the remote call instrumentation. However, since frontend apps are deployed closer to the end user, a trust layer needs to be established to make sure that tracing data is valid and correctly adjusted before traces are assembled.

Summary

Distributed tracing is the first pillar of observability; its main goal is to enable engineers to understand and debug failures in a distributed environment. In the next post, we will look at how OTel handles logs and metrics. Stay tuned!
