Maximizing Observability and Monitoring with Datadog

Alex Alves
5 min read · Oct 30, 2023

When we talk about observability and monitoring, we are talking about the ability to understand the internal state of a system: to observe, collect, and interpret how apps, services, and infrastructure behave, based on metrics, logs, and traces. These are the three pillars of observability:

  • Metrics: quantitative values that describe the state and performance of a system.
  • Logs: detailed records of events and actions that occur in a system.
  • Traces: detailed information about the journey of a request as it travels between different components and services.

To do this in an IT environment, we can use tools like Datadog.

What is Datadog?

It’s an observability and monitoring platform for infrastructure and applications, which we can use to identify problems faster and optimize performance. Its main features include:

  1. Data collection
  2. Real-time metrics
  3. Logs
  4. Traces
  5. Alerts and Notifications
  6. Custom views

Monitoring with Datadog

Monitoring allows IT teams to follow the performance and health of systems and apps in real time, which helps us identify problems and act on them immediately. Here are some examples of metrics and data that can be monitored:

  1. CPU and Memory usage
  2. Network latency
  3. Error rate
  4. Network traffic
  5. Availability of services
  6. Requests count
  7. Database performance
  8. Error logs
  9. Cloud resource usage
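
As a rough illustration of how an application might report a couple of these values as custom metrics, the sketch below builds datagrams in the DogStatsD text format and defines a helper that would send them over UDP to a local Datadog Agent. The metric names, the tags, and the Agent listening on 127.0.0.1:8125 are assumptions made for the example; in a real project the official client library would normally take care of this.

```python
import socket


def format_metric(name, value, metric_type, tags=None):
    """Build one DogStatsD datagram: <name>:<value>|<type>|#<tag1>,<tag2>."""
    datagram = f"{name}:{value}|{metric_type}"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Send a datagram over UDP (fire-and-forget; assumes a local Agent)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Hypothetical metric names and tags, chosen only for illustration.
# "c" is a counter (requests count), "g" is a gauge (CPU usage).
requests_count = format_metric(
    "shop.requests.count", 1, "c", ["service:checkout", "env:dev"]
)
cpu_usage = format_metric("shop.cpu.usage", 42.5, "g", ["service:checkout"])

print(requests_count)
print(cpu_usage)

# Uncomment when an Agent is running locally:
# send_metric(requests_count)
# send_metric(cpu_usage)
```

Keeping the formatting separate from the sending makes the datagram easy to inspect before anything goes over the network.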

Log Tracking and Analysis

Log tracking is a very important function for detecting problems. Datadog supports logs collected from many sources, including servers, apps, and databases. It’s also possible to configure agents or integrations to send logs to the platform automatically. All this information is centralized in a single place, which makes access and search easier, and we can configure how long it is retained.

For analysis, we can use search expressions like “service:[your service name] AND *[LOG CONTENT]*”. In addition, the tool correlates logs through default tags like “ddd_trace” and/or “ddd_id”, for example.
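
To make a search like the one above work well, it helps if the application writes structured logs. The sketch below is one possible way to do it with Python's standard logging module: each record becomes a JSON line with a service attribute and, optionally, a correlation ID. The service name "checkout" is hypothetical, and the `dd.trace_id` attribute name is an assumption based on the convention used by Datadog's tracing libraries; adapt it to whatever tag your setup uses.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        payload = {
            "message": record.getMessage(),
            "status": record.levelname.lower(),
            "service": self.service,
        }
        # Optional correlation ID passed via `extra=`.
        # The attribute name below is an assumption, not taken from the article.
        trace_id = getattr(record, "trace_id", None)
        if trace_id is not None:
            payload["dd.trace_id"] = trace_id
        return json.dumps(payload)


formatter = JsonFormatter(service="checkout")  # hypothetical service name
handler = logging.StreamHandler()
handler.setFormatter(formatter)

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order created", extra={"trace_id": "1234567890"})
```

With logs in this shape, a query such as service:checkout AND *order* would match the line emitted above.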

Custom View and Dashboards

We can build dashboards containing engineering information and/or product information (yes, we can do that). They can include widgets such as query values, graphs, and top lists. With the data aggregated, we see metrics and data in a single place, which makes them easier to understand and evaluate. Some advantages of dashboards:

  1. Anomaly detection
  2. KPIs monitoring
  3. Product metrics
  4. Collaboration and sharing

So, let’s see some real examples of dashboards:

An example of services observability

For a product view, we can extract these metrics from logs, where each entry carries important, descriptive information. For example, for a log that contains the value of an order, we can capture that value and sum it.
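
In Datadog this aggregation happens inside the platform, but the idea can be sketched in a few lines of Python. The example below assumes JSON logs with a hypothetical `order_value` attribute and simply sums it across the entries that carry it; the log lines themselves are made up for illustration.

```python
import json

# Hypothetical log lines; `order_value` is an assumed attribute name.
log_lines = [
    '{"service": "checkout", "message": "order created", "order_value": 120.50}',
    '{"service": "checkout", "message": "order created", "order_value": 79.50}',
    '{"service": "checkout", "message": "payment retried"}',
]


def total_order_value(lines, field="order_value"):
    """Sum a numeric attribute across JSON log lines, skipping lines without it."""
    total = 0.0
    for line in lines:
        event = json.loads(line)
        value = event.get(field)
        if isinstance(value, (int, float)):
            total += value
    return total


print(total_order_value(log_lines))  # prints 200.0
```

Logs without the attribute are ignored, so unrelated entries from the same service do not distort the product metric.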

Notifications and Alerts

When something is wrong in your system, the tool triggers an alert, and we can then send a notification to warn the team. This is very important when no one is looking at Datadog, outside working hours, for example.

To configure alerts, we use Monitors:

There are many types of monitors:

A monitor is configured in three steps:

  1. First, we define the query used to collect data, and some rules for evaluating the monitor:

  2. Then we define the conditions that trigger the alert:

  3. Finally, we define a structured message and where it will be sent:
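
The three steps above can also be pictured as a single monitor definition. The sketch below builds one as a Python dictionary, in a shape similar to what I understand Datadog's monitor API accepts: a query, thresholds, and a message. The metric name, the threshold values, and the `@webhook-gchat-alerts` handle are all hypothetical, and nothing is sent to Datadog here.

```python
import json


def build_monitor(name, query, critical, warning, message):
    """Assemble a monitor definition: query, trigger conditions, and message."""
    return {
        "name": name,
        "type": "metric alert",
        # Step 1: the query that collects and evaluates the data.
        "query": query,
        # Step 2: the conditions that trigger the alert.
        "options": {"thresholds": {"critical": critical, "warning": warning}},
        # Step 3: the message and who receives it (via @-handles).
        "message": message,
    }


# Hypothetical metric, thresholds, and notification handle.
monitor = build_monitor(
    name="High error count on checkout",
    query="sum(last_5m):sum:shop.requests.errors{service:checkout}.as_count() > 50",
    critical=50,
    warning=30,
    message="Error count is above the threshold on checkout. @webhook-gchat-alerts",
)

print(json.dumps(monitor, indent=2))
```

The critical threshold repeats the value used in the query, so the two need to be kept in sync when either changes.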

Integrating with Google Chat

For more productivity, we can connect Datadog to our chat tool, like Microsoft Teams or Google Chat. In this case, we need to create webhooks:

To configure it in Google Chat, we need to get the webhook URL for the chat or space:

Then we create a new webhook in Datadog, providing the chat URL and a payload:

The payload can define a custom card, like this:

> We can use Cards V2 for Google Chat together with the webhook variables
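
As a minimal sketch of what such a payload might look like, the example below builds a Cards V2 message as a Python dictionary and prints the JSON that would be pasted into the webhook's payload field. The card ID and button label are hypothetical, and the `$EVENT_TITLE`, `$ALERT_TRANSITION`, `$EVENT_MSG`, and `$LINK` placeholders are webhook variables that, as far as I know, Datadog substitutes when the alert fires; check the variable list in your own webhook settings before relying on them.

```python
import json


def build_gchat_payload():
    """Build a Google Chat Cards V2 message using webhook placeholders."""
    return {
        "cardsV2": [
            {
                "cardId": "datadog-alert",  # hypothetical identifier
                "card": {
                    "header": {
                        "title": "$EVENT_TITLE",
                        "subtitle": "Status: $ALERT_TRANSITION",
                    },
                    "sections": [
                        {
                            "widgets": [
                                # Body of the alert message.
                                {"textParagraph": {"text": "$EVENT_MSG"}},
                                # Button that opens the alert.
                                {
                                    "buttonList": {
                                        "buttons": [
                                            {
                                                "text": "Open in Datadog",
                                                "onClick": {
                                                    "openLink": {"url": "$LINK"}
                                                },
                                            }
                                        ]
                                    }
                                },
                            ]
                        }
                    ],
                },
            }
        ]
    }


payload = build_gchat_payload()
print(json.dumps(payload, indent=2))
```

Building the card in code first makes it easier to validate the structure before copying the JSON into the webhook configuration.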

Use Cases

For alerts

  1. Dead-letter queues
  2. HTTP client communication
  3. Product validation, such as no operations within a given period

For custom views

  1. All apps of a team
  2. A view of the whole company
  3. A view of the whole infrastructure

Conclusion

This platform can be an essential tool for observability and monitoring, providing resources to collect, analyze, and visualize data. With that in mind, let’s highlight some points:

Pros

  • Real-time monitoring
  • Easy customization
  • Easy tracing and analysis
  • Cloud service integrations

Cons

  • Initial complexity
  • Cost (may be expensive)
  • Time spent learning

Alex Alves

Bachelor in Computer Science, MBA in Software Architecture and .NET Developer.