Architecture

Observability stack for an operator: see the system before the customer complains

Metrics, logs, traces — the three pillars without which the operator flies blind. Observability architecture for a multi-system telecom landscape.

Why an operator needs unified observability

At a typical operator, observability is fragmented: the network team has its NMS, IT runs Zabbix or Nagios, the application team has its own APM, and billing writes logs to files. When an incident crosses team boundaries, nobody sees the full picture.

The result: the first signal of a problem comes from the contact centre (“customers are complaining”), not from the monitoring system. By the time it is detected, the impact is already at scale.

Unified observability is the architectural approach in which three classes of data (metrics, logs, traces) are collected and correlated on a single platform, accessible to all engineering teams.

Three pillars

Metrics. Numeric indicators aggregated over time: CPU, memory, request rate, latency p50/p95/p99, error rate. Stored in a time-series database (Prometheus, InfluxDB, or a cloud-native equivalent). Cheap to store and fast to query, but limited by cardinality.
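The latency percentiles mentioned above are just quantiles over raw samples. A minimal stdlib sketch on synthetic data (the latency distribution here is an illustrative assumption, not production numbers):

```python
import random
import statistics

# Synthetic request latencies in seconds (assumption: ~120 ms mean, exponential)
random.seed(42)
latencies = [random.expovariate(1 / 0.120) for _ in range(10_000)]

# statistics.quantiles with n=100 returns the 1st..99th percentile cut points
q = statistics.quantiles(latencies, n=100)
p50, p95, p99 = q[49], q[94], q[98]

print(f"p50={p50*1000:.0f}ms p95={p95*1000:.0f}ms p99={p99*1000:.0f}ms")
```

In practice a metrics backend computes these from histogram buckets rather than raw samples, which is exactly the cardinality trade-off noted above.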

Logs. Structured events with context. Detailed, full-text searchable. Log aggregation (Elastic, Loki, Splunk). Costlier than metrics but with more detail.
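Structured logging is what makes logs correlatable at all. A sketch using only the Python stdlib, emitting one JSON object per line with a correlation id (the field names and the `billing` service name are illustrative assumptions):

```python
import json
import logging
import sys
import uuid

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line so an aggregator can parse fields."""
    def format(self, record):
        return json.dumps({
            "ts": self.formatTime(record),
            "level": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),  # correlation id
            "msg": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("billing")
log.addHandler(handler)
log.setLevel(logging.INFO)

trace_id = uuid.uuid4().hex
log.info("charge rejected", extra={"service": "billing", "trace_id": trace_id})
```

With every team emitting the same envelope, a single query over `trace_id` pulls together events from all systems.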

Traces. The end-to-end path of a request across multiple systems, giving an understanding of the distributed system as a whole. Collected via distributed tracing (Jaeger, Zipkin, Tempo). Especially important in a microservice architecture.
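The core idea behind tracing is small enough to show in a few lines: every span records which trace it belongs to and who its parent is. This is a toy sketch, not OpenTelemetry; real deployments use the OpenTelemetry SDK, and the service names here are illustrative assumptions:

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # a real tracer exports these to a backend instead

@contextmanager
def span(name, trace_id, parent=None):
    """Record one unit of work, linked to its trace and parent span."""
    span_id = uuid.uuid4().hex[:8]
    start = time.monotonic()
    yield span_id
    spans.append({"trace": trace_id, "span": span_id, "parent": parent,
                  "name": name, "ms": (time.monotonic() - start) * 1000})

# One request crossing three systems, all stamped with the same trace_id
trace_id = uuid.uuid4().hex
with span("api-gateway", trace_id) as gw:
    with span("crm.lookup", trace_id, parent=gw):
        pass
    with span("billing.charge", trace_id, parent=gw):
        pass
```

The shared `trace_id` is what lets the query layer jump from a slow gateway metric to the exact downstream call that caused it.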

Without all three you get a blind spot. Metrics show “something is wrong”, logs explain “what exactly”, traces — “where specifically”.

Structural elements of the platform

Collectors. Agents on each system that send metrics/logs/traces to a centralised stack. Standard: OpenTelemetry collectors.

Pipelines. Stream processing (filtering, enrichment, sampling) before writing to storage.

Storage. Tiers: hot (latest hours, fast query) → warm (days, slower) → cold (long retention for compliance).

Query layer. Unified UI for exploration across metrics, logs, and traces, with the ability to quickly jump from a metric to the relevant logs and traces.

Alerting. Threshold and anomaly-based alerts. Routing determines who receives an alert for which component. Alert fatigue is managed through suppression, deduplication, and escalation.
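The deduplication part is simple in principle: identical alerts within a window should reach a human only once. A minimal sketch (window length and key fields are illustrative assumptions; tools like Alertmanager implement this properly):

```python
import time

WINDOW_S = 300        # assumption: suppress repeats for 5 minutes
_last_fired = {}      # (service, alertname) -> last notification time

def should_notify(service, alertname, now=None):
    """Return True if this alert should page someone, False if it is a repeat
    of the same (service, alertname) within the suppression window."""
    now = time.time() if now is None else now
    key = (service, alertname)
    last = _last_fired.get(key)
    if last is not None and now - last < WINDOW_S:
        return False  # duplicate: swallow it
    _last_fired[key] = now
    return True
```

Even this crude version turns "200 alerts per day" into a handful of distinct signals, which is the whole point of fatigue management.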

Dashboards. Per-system, per-service, per-business-flow. Not “one big dashboard” but a layered set.

Where it usually breaks

Every team has its own stack. Cross-system incident correlation is impossible without a human who collects data manually.

Sampling is too aggressive — during an incident there are no relevant traces to investigate.
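A common mitigation is tail-based sampling: the keep/drop decision is made after the trace completes, so error traces are never thrown away. A sketch of the decision rule (the 1% base rate is an illustrative assumption; the OpenTelemetry Collector ships a production implementation):

```python
import random

BASE_RATE = 0.01  # assumption: keep 1% of healthy traffic

def keep_trace(trace):
    """Tail-based decision: error traces are always kept, so an incident
    investigation never comes up empty; healthy traffic is sampled down."""
    if any(span.get("error") for span in trace):
        return True
    return random.random() < BASE_RATE
```

Head-based sampling, by contrast, decides before the outcome is known, which is exactly how the relevant traces go missing.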

Alerts are not tuned — 200 alerts per day arrive, the team ignores them. A critical alert drowns in the noise.

Logs are not structured — every team logs in its own format, correlation is impossible.

Retention is too short — during a post-mortem you need a week of data and only two days exist.

Cost is out of control — observability ends up costing more than the applications it monitors. Without discipline, the spend grows unchecked.

SLI / SLO discipline

Without SLIs (service level indicators) and SLOs (objectives), observability is just dashboards. With SLOs, business meaning appears.

Per service:

  • SLI: what we measure (availability, latency, error rate)
  • SLO: target level (e.g. 99.9% availability per quarter)
  • Error budget: tolerable deviation

When the error budget is exhausted, releases are paused and the team focuses on reliability. This is an operating discipline, not just a metric.
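The error budget arithmetic is worth seeing concretely. A sketch, assuming a 90-day quarterly window (the window length and the example outage are illustrative assumptions):

```python
PERIOD_H = 24 * 90  # assumption: a 90-day (quarterly) SLO window, in hours

def error_budget_hours(slo: float) -> float:
    """Hours of downtime the SLO allows over the window."""
    return PERIOD_H * (1 - slo)

def budget_remaining(slo: float, downtime_h: float) -> float:
    """Fraction of the error budget still unspent; at or below zero,
    releases pause per the policy above."""
    budget = error_budget_hours(slo)
    return (budget - downtime_h) / budget

print(f"99.9% over a quarter allows {error_budget_hours(0.999):.2f} h of downtime")
print(f"after a 1.5 h outage, {budget_remaining(0.999, 1.5):.0%} of the budget remains")
```

A 99.9% quarterly SLO allows only about 2.2 hours of downtime, which is why a single badly handled incident can consume most of a quarter's budget.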

Operating model

Owner — Head of SRE / Platform Reliability: a central function that sets the standards, which the other teams follow, rather than every team running its own stack.

Teams:

  • Platform engineering (collectors, pipelines, storage, query, alerting)
  • SRE per critical service (defines SLI/SLO, on-call)
  • Service owners (use the platform, accountable for their SLO)

Routine — weekly SLO review, quarterly reliability review, post-mortem on each significant incident.

What is measured

MTTD (mean time to detect) — time from incident to alert.

MTTR (mean time to resolve) — time from alert to resolution.

Alert noise ratio — the share of alerts that turned out not to be actionable.

SLO compliance — what share of services is in the green zone.

Cost per metric / log / trace — observability economics.
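MTTD and MTTR fall straight out of incident timestamps once those are recorded consistently. A sketch over hypothetical incident records (the timestamps are invented for illustration):

```python
from datetime import datetime
from statistics import mean

# Hypothetical incident records: when impact started, when the alert fired,
# when the incident was resolved
incidents = [
    {"start": datetime(2024, 5, 1, 10, 0), "alert": datetime(2024, 5, 1, 10, 7),
     "resolved": datetime(2024, 5, 1, 11, 30)},
    {"start": datetime(2024, 5, 9, 2, 15), "alert": datetime(2024, 5, 9, 2, 18),
     "resolved": datetime(2024, 5, 9, 3, 0)},
]

# MTTD: incident start -> alert; MTTR: alert -> resolution (minutes)
mttd = mean((i["alert"] - i["start"]).total_seconds() for i in incidents) / 60
mttr = mean((i["resolved"] - i["alert"]).total_seconds() for i in incidents) / 60

print(f"MTTD={mttd:.0f} min  MTTR={mttr:.0f} min")
```

The hard part is not the arithmetic but the discipline of recording "impact started" honestly, which is usually earlier than the first alert.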

How SamaraliSoft engages

Observability Blueprint — 6-8 weeks. Inventory of current observability fragments, target stack design, SLI/SLO framework for critical services, governance, and platform choice (open-source: Prometheus + Grafana + Loki + Tempo; managed: Datadog, New Relic; cloud-native). Pilot — usually one critical service end-to-end.


Ready to discuss your challenge?

Tell me what's not working or what needs to be built. First conversation — no obligations.
