Observability - Scope and Design with Elastic

Praveen Manvi
Jun 14, 2021


Software generates business value when it runs, not when it is written. We need to know what our software does while it is running: unless we measure it well, we cannot know. Observability is about building a capable ecosystem that can "measure what is measurable and count what is countable" in a self-service manner.

Observability is a key capability; it is gaining traction in different applicability contexts and is rightly moving up the value chain as a concept.

This document discusses the ideas and techniques of observability, how they help keep software systems running in conformance with SLAs, and how they provide insights that enhance a team's ability to deliver sustainable software. The examples assume a JVM/Spring Boot development ecosystem for building RESTful APIs, with AWS as the cloud vendor.

The ideas expressed here are general, but the Elastic Stack is used as the basis. The rationale for Elastic is our current investment in it (for logs and alert processing) and its steady improvement over the years, especially in the APM space with ML/AI. It completely covers the three pillars of observability, namely logs, traces, and metrics; its recent recognition by Gartner is a testimony to this fact.

Let us try to understand how observability differs from monitoring. Observability is more complex than monitoring: its stakeholders go beyond engineers, and it produces queryable metrics that open up new use-case possibilities (the sales pitch). Guessing from fixed dashboards is limiting; it misses new possibilities for features and, more importantly, the chance to clean up waste (unused resources and features). The microservices architecture style is making observability, not just monitoring, an essential business capability rather than a purely operational one. In simple terms, microservices break a single big installation into a series of independently deployable, scalable, single-purpose services that are easy to manage. With this inherently distributed nature, a request has to pass through multiple nodes, which makes tracing, debugging, and monitoring a lot more difficult.

Following domain-driven design concepts, observability offers different views of the data to different stakeholders (namely the product, engineering, and operations teams). The dashboards serve as the bounded contexts.

Goals with quantifiable metrics:
- Provide a single window to check the health of systems, giving unified visibility across a suite of services/products
- Reduce the labor cost of operations monitoring
- Enable teams to recover upon detecting anomalies (e.g. service downtime, errors, slow responses) without a run-book
- Save engineering effort in development, testing, and debugging
- Provide alerts with actionable insights and analytics

Faster recovery, shorter time to value, and better forecasting abilities:
- Reduce the time to identify root causes and avoid repetitive errors
- Capture key product and engineering knowledge as rules
- Define KPIs (Key Performance Indicators) and SLOs/SLIs for all services (a worked example follows this list)
- Find new use cases and opportunity-cost benefits, and use machine learning/regression to reduce future issues and limit false positives
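To make the SLO arithmetic concrete, here is a minimal sketch in Java; the 99.9% target, window, and request counts are illustrative, not taken from any real service.

// Sketch: availability SLI and error budget for a 30-day window (illustrative numbers).
public class SloMath {
    public static void main(String[] args) {
        double slo = 0.999;                   // 99.9% availability target
        double windowMinutes = 30 * 24 * 60;  // 43,200 minutes in 30 days
        double errorBudget = (1 - slo) * windowMinutes;
        System.out.printf("Error budget: %.1f minutes of downtime%n", errorBudget); // 43.2

        long total = 1_000_000, failed = 700; // requests observed in the window
        double sli = 1.0 - (double) failed / total;
        System.out.printf("Availability SLI: %.4f%%%n", sli * 100); // 99.9300, within budget
    }
}

While the SLI stays above the SLO, the remaining error budget can be spent on releases and experiments; once it is exhausted, reliability work takes priority.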

KPIs for Architecture
Ensure uptime and alerts for:
1. Microservices, with a single dashboard to check/verify
2. Datastores (e.g. Atlas, RDS, Elasticsearch, ElastiCache), tied to each service and merged into the service dashboard to show connectivity
3. Auxiliary/platform services, e.g. Kafka connectors/brokers/Schema Registry, ZooKeeper, gateways
Ensure QoS at the service level:
1. Application error trends from services, such as HTTP 5xx error codes and exceptions
2. Data synchronisation within SLA, and API latency that is acceptable to (and delights) customers (a metrics sketch follows this list)
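As a sketch of how a service can expose such QoS numbers: Spring Boot's built-in http.server.requests timer already tags every request with its status code, so 5xx trends come for free, while custom indicators such as synchronisation lag can be published through Micrometer as below (the metric name and class are illustrative).

import io.micrometer.core.instrument.MeterRegistry;
import java.util.concurrent.atomic.AtomicLong;
import org.springframework.stereotype.Component;

@Component
public class SyncLagMetrics {
    private final AtomicLong lagSeconds = new AtomicLong();

    public SyncLagMetrics(MeterRegistry registry) {
        // The gauge samples this AtomicLong every time metrics are published.
        registry.gauge("data.sync.lag.seconds", lagSeconds);
    }

    // Called by the synchronisation job after each run with the measured lag.
    public void update(long seconds) {
        lagSeconds.set(seconds);
    }
}

An alert rule on data.sync.lag.seconds then asserts the synchronisation SLA directly.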

Observability concepts are novel, and they are a paradigm shift for product managers, developers, and operators. It may seem counterintuitive to allow and plan for failure, but maintaining 100% reliability is prohibitively expensive and ultimately slows progress toward business objectives. SLOs and SLIs eliminate the often fuzzy definitions of reliability and make it assertable.

As the SRE book's chapter on Service Level Objectives (SLOs) observes, most incidents fall into a few recurring shapes:

{Network, Machine, Service} Down
{Database, Service, UI} Slow
{Memory, Sockets, threads, Disk, Network} Full
To recap, architecture SLOs should make the remedy obvious, which usually means resizing for a quick turnaround. For CPU- or memory-bound issues, increase the VM size (e.g. the EC2 instance type) if the service is the bottleneck, or the database instance size if the database is. Depending on the stress, if requests are piling up we add a new service instance; if the database is the constraint, we increase the number of database instances, and so on…

Transaction success rate (for both positive and negative flows)
UX-visible latency (end-user complaints)
Session depth (ROI on features). This is the kind of data the product team (the feature development team) should be given to take full advantage of observability, and to drive feature-flag decisions based on feedback.

SpringBoot
Probably the most popular JVM framework for service development, Spring Boot comes with ready-made libraries such as Actuator, which exposes operational information about the running application: health, metrics, info, thread dumps, environment, and more.

Spring Boot Actuator uses the Micrometer framework as its metrics facade. This makes it possible to push all of this metrics data to Elasticsearch without any changes to the Spring Boot application (support for Graphite also exists).

Including the micrometer-registry-elastic dependency in a service and adding the following property to the application.properties file should do the trick:

management.metrics.export.elastic.host=http://<elastic-host>:<elastic-port>
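Two more knobs worth knowing, a sketch assuming Spring Boot 2.x property names (the values shown are the defaults; verify against your Boot version's reference documentation):

# Index the metrics are shipped to, and how often they are published
management.metrics.export.elastic.index=metrics
management.metrics.export.elastic.step=1m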

APM
Elastic APM is an application performance monitoring system built on the Elastic Stack. It allows you to monitor software services and applications in real-time, by collecting detailed performance information on response time for incoming requests, database queries, calls to caches, external HTTP requests, and more. This makes it easy to pinpoint and fix performance problems quickly. Elastic APM also automatically collects and links unhandled errors and exceptions. Errors are grouped based primarily on the stack trace, so you can identify new errors as they appear and keep an eye on how many times specific errors happen.

APM modules/plugins — technology specific
MySQL (RDS), Mongo (Atlas), Redis (ElastiCache), Elasticsearch, and other data stores. Data powers everything we do, and we believe the selection of the data store has the highest impact on the success of our APIs. The success or failure of an application depends heavily on the database selection, or more importantly on the use case (query pattern, data volume, data growth over time). The data metrics from observability help assert the decision to go with one store over another. Each database has its own sweet spots, and microservices have limited the cost of a wrong selection. Some of those sweet spots, which can be asserted with performance data and actual usage:
Elasticsearch
Low latency for indexed read (return set of documents)
Low latency for random search (return set of documents)
Low latency for full text and aggregates queries (return facets)
MySQL/Mongo
Primary data store — General purpose
Low/Medium latency indexed read
High latency random search
Dynamic schema — Mongo
Controlled schema and transactions — MySQL
Redshift/Vertica/HBase
High write throughput
Low latency aggregation queries
S3/EFS/EBS
High read throughput (MB/sec)
Moderate latency single document Read
Redis
In-memory data grid for low-latency key-value lookups, scaling to a large number of clients

APM Agent

Elastic provides a good set of ready-to-use agents and an easy way to implement newer plugins. Watch this space for newer plugins for HashiCorp Vault, ZooKeeper, Schema Registry, and other services we run.

For example, to configure slow- and fast-query thresholds with JDBC, the following startup parameters to a Spring Boot application will push the metrics data.

-Dio.opentracing.contrib.jdbc.slowQueryThresholdMs=100 tracks all queries exceeding 100 milliseconds.
Spans that complete faster than the optional excludeFastQueryThresholdMs flag will not be reported: -Dio.opentracing.contrib.jdbc.excludeFastQueryThresholdMs=10
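Since these are plain JVM system properties, they are simply appended to the service's startup line alongside the agent; a sketch (the paths, service name, and package below are illustrative):

java -javaagent:/opt/elastic-apm-agent.jar \
  -Delastic.apm.service_name=orders-service \
  -Delastic.apm.server_urls=http://<apm-server>:8200 \
  -Delastic.apm.application_packages=com.example \
  -Dio.opentracing.contrib.jdbc.slowQueryThresholdMs=100 \
  -Dio.opentracing.contrib.jdbc.excludeFastQueryThresholdMs=10 \
  -jar orders-service.jar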

Just dropping in the agent jar is sufficient to capture traces; the example below shows Redis, RestTemplate (library), and JDBC. We can pinpoint the culprit (class, method), its facets, and its latency. Taking this data into an investigation ties perfectly into Filebeat (logs) and Metricbeat (CPU, network, memory) data, and an ML job can predict latency stress occurrences. In my opinion this is the killer feature compared to other competitors: all pillars, i.e. logs, metrics, and traces, in one place, with seamless navigation among them.
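Where auto-instrumentation does not cover a library, the agent's public API (the co.elastic.apm:apm-agent-api dependency) lets you add spans by hand. A minimal sketch; the span name and the Redis call are illustrative:

import co.elastic.apm.api.ElasticApm;
import co.elastic.apm.api.Span;

// Create a child span under the transaction the agent already started for this request.
Span span = ElasticApm.currentTransaction().startSpan("db", "redis", "query");
try {
    span.setName("GET user:42");
    // ... perform the Redis call here ...
} catch (Exception e) {
    span.captureException(e); // the error is linked to the trace in Kibana
    throw e;
} finally {
    span.end(); // always close the span
}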

Beats, Logstash, Kibana, Cloud

Beats provide ingestion of metrics and logs. Please note that Beats are not a replacement for APM data; they are a complement. Together, metrics, logs, and APM data (traces) give the full picture. Elastic's Metricbeat modules (each technology/framework is a module) cover a huge number of collections, and each module contains one or more metric sets. One big advantage of these tools is that the metric sets worth monitoring are curated by experts in the field, which is a huge savings in learning.
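For instance, enabling modules and loading their expert-curated Kibana dashboards takes two commands (the module names here are illustrative; see the Metricbeat docs for the full list):

metricbeat modules enable system mysql redis
metricbeat setup --dashboards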

Metrics segregation and governing rules for alert thresholds and contingency management

Alerts

Alerting works by running checks on a schedule to detect conditions defined by a rule. When a condition is met, the rule tracks it as an alert and responds by triggering one or more actions. The latest Elastic Stack makes adding alerts as simple as a few clicks in the dashboard, as well as writing complex predicates as rules. One key aspect to remember is that alerts can announce happy events too, such as "spike in the number of users", "ad campaign is exceeding expectations", or "no errors (500+ status codes) from new services in 24 hours", and so on :). Of course, the main operational alerts such as "service down", "high CPU", and "application errors", with all context information, can be sent to various notification options. We integrated with mail and PagerDuty.
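As a sketch of a complex predicate written as a rule, here is a Watcher-style watch via the Elasticsearch API (the index pattern, field names, and thresholds are illustrative and assume ECS-mapped Filebeat data; the same rule can be built in the Kibana alerting UI):

PUT _watcher/watch/http_5xx_spike
{
  "trigger": { "schedule": { "interval": "5m" } },
  "input": {
    "search": {
      "request": {
        "indices": [ "filebeat-*" ],
        "body": {
          "query": { "bool": { "filter": [
            { "range": { "http.response.status_code": { "gte": 500 } } },
            { "range": { "@timestamp": { "gte": "now-5m" } } }
          ] } }
        }
      }
    }
  },
  "condition": { "compare": { "ctx.payload.hits.total": { "gt": 10 } } },
  "actions": {
    "page_oncall": {
      "pagerduty": { "description": "HTTP 5xx spike: {{ctx.payload.hits.total}} errors in the last 5m" }
    }
  }
}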

Summary
In my opinion, good observability is a labor of love. Only that passion can extract insights (business and technical) and democratise the ability to ask questions. It is said that Netflix is a logging company that happens to stream data; their problems are different, but best practices and efficacy of engineering execution are domain- and context-independent attributes.
Here are the conclusions:
- Embrace automation everywhere, fanatically, including for running effective cloud operations
- Assure security, compliance, and organisational resource consistency and visibility through observability at all layers, including development
- Eliminate cost leakage in the cloud; the product/engineering team can assert the ROI (cloud and maintainability cost) of features

FAQ
Does Elastic moving away from open source pose a risk?
I am not a lawyer, but I think Elastic developed a legal weapon to fight Amazon, believing (perhaps legitimately) that their pie was being eaten by a cloud vendor in an unfair manner. The main intention is deterrence (like a nuclear arsenal, not for actual use) against the cloud vendors (AWS, GCP, Azure). It is safe to assume other companies will never face any kind of lawsuit. There is a moral backdrop (Elastic relies on selfless engineers' contributions and the open source search engine Lucene), but businesses cannot run on moral high ground alone. Shay Banon and his ilk are well-intentioned tech giants with open source DNA. I do not see any problem here.
https://www.elastic.co/blog/elasticsearch-free-open-limitless

Why OpenTelemetry?
OpenTelemetry is a Cloud Native Computing Foundation (CNCF) sandbox project that provides vendor-neutral, language-specific agents, SDKs, and APIs that you can use to collect distributed traces, metrics, and log data from all your monitored applications. Elastic APM has built-in support for OpenTracing, and Elastic has actively participated as a member of the OpenTelemetry project. It is nice to have.
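The vendor-neutral API looks like the sketch below (a minimal hand-instrumented span using the OpenTelemetry Java API; the tracer and span names are illustrative), and an OpenTelemetry-to-Elastic integration can ship such spans to APM:

import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.context.Scope;

Tracer tracer = GlobalOpenTelemetry.getTracer("orders-service");
Span span = tracer.spanBuilder("sync-inventory").startSpan();
try (Scope scope = span.makeCurrent()) {
    // ... do the work that should be traced ...
} finally {
    span.end();
}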

Why not Datadog, New Relic, or AppDynamics? They are leaders in APM with rich experience.
Elastic now covers the APM space well, with ML/AI support too. Having a single framework/tool covering all pillars (metrics, logs, and traces) is a big win. Recent Gartner reports push Elastic into the 'visionary' quadrant. In a microservices world with hundreds of nodes, Elastic also works out better cost-wise than its competitors. With the continuous addition of technology-specific plugins, it is just a matter of time before it closes the gap in the APM (distributed tracing and auto-investigation) space as well.

How does 'pets vs. cattle' influence observability?
Microservices are inherently designed for scale-out (horizontal scaling with stateless services) on cloud-native technology; treating a service as a pet (scaled up and requiring non-automated personal care) is simply impossible. Spinning up new services has to be as easy as spinning them down: when an environment is not in use, drop it, and more importantly do so in an automated fashion. Observability has to be baked into the services.

References
https://medium.com/sipios/how-did-we-reduce-a-request-by-133-times-with-tracing-and-elastic-apm-17a3456c114e
https://faun.pub/elastic-apm-and-opentelemetry-integration-49abaaccdad9 — APM/Open Telemetry

https://github.com/opentracing-contrib

https://github.com/michaelhyatt/elastic-apm-mule4-agent Example of writing new agent.

https://www.elastic.co/virtual-events/ci-cd-observability-and-opentelemetry CI/CD with Jenkins
