Observability stack for an operator: see the system before the customer complains
Metrics, logs, traces — the three pillars without which the operator flies blind. Observability architecture for a multi-system telecom landscape.
Why an operator needs unified observability
In a typical operator, observability is fragmented: the network team has its NMS, IT has Zabbix or Nagios, the application team has its own APM, and billing keeps logs in files. When an incident crosses team boundaries, nobody sees the full picture.
The result: the first signal of a problem comes from the contact centre (“customers are complaining”), not from the monitoring system. By the time it is detected, the impact is already at scale.
Unified observability is the architectural approach in which three classes of data (metrics, logs, traces) are collected and correlated on a single platform, accessible to all engineering teams.
Three pillars
Metrics. Numeric indicators aggregated over time: CPU, memory, request rate, latency p50/p95/p99, error rate. Stored in a time-series database (Prometheus, InfluxDB, cloud-native). Cheap and fast to query, but limited by cardinality.
Logs. Structured events with context. Detailed and full-text searchable. Collected in a log aggregation system (Elastic, Loki, Splunk). Costlier than metrics, but with far more detail.
Traces. The end-to-end path of a request across multiple systems; the way to understand how a distributed system actually behaves. Collected with distributed tracing (Jaeger, Zipkin, Tempo). Especially important in a microservice architecture.
Without all three there is a blind spot. Metrics show "something is wrong", logs explain "what exactly", and traces show "where specifically".
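To make the three signals concrete, here is a minimal sketch of instrumenting one request handler with the OpenTelemetry Python API. The service, metric, and attribute names are illustrative assumptions, and a configured SDK and collector are still needed to actually export anything.

```python
import logging
from opentelemetry import trace, metrics

# Illustrative names; nothing here is prescribed by the platform itself.
tracer = trace.get_tracer("charging-gateway")
meter = metrics.get_meter("charging-gateway")
requests_total = meter.create_counter("charge_requests_total")
log = logging.getLogger("charging-gateway")

def handle_charge(subscriber_id: str, amount: float) -> None:
    # Trace: the end-to-end path ("where specifically").
    with tracer.start_as_current_span("charge") as span:
        span.set_attribute("subscriber.id", subscriber_id)
        # Metric: cheap aggregate ("something is wrong").
        requests_total.add(1, {"operation": "charge"})
        # Log: detailed event ("what exactly"), correlated via the trace id.
        ctx = span.get_span_context()
        log.info("charge accepted amount=%s trace_id=%032x", amount, ctx.trace_id)
```

With the trace id written into every log line, the jump from a metric spike to the relevant logs and trace becomes a query rather than an archaeology exercise.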
Structural elements of the platform
Collectors. Agents on each system that send metrics/logs/traces to a centralised stack. Standard: OpenTelemetry collectors.
Pipelines. Stream processing (filtering, enrichment, sampling) before writing to storage.
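In practice this stage is usually configured in the OpenTelemetry Collector (filter, attributes, and sampling processors) rather than hand-written, but a short Python sketch shows the idea; the field names and enrichment values are assumptions.

```python
import random

# Static enrichment map: which team and tier owns which service (assumed values).
ENRICHMENT = {"billing": {"team": "bss", "tier": "critical"}}

def pipeline(events, sample_rate=0.1):
    for event in events:
        # Filter: drop noise before it reaches (and is billed by) storage.
        if event.get("severity") == "DEBUG":
            continue
        # Enrich: attach ownership metadata used later for routing and dashboards.
        event.update(ENRICHMENT.get(event.get("service"), {}))
        # Sample: keep every error, a fraction of everything else.
        if event.get("severity") == "ERROR" or random.random() < sample_rate:
            yield event
```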
Storage. Tiers: hot (latest hours, fast query) → warm (days, slower) → cold (long retention for compliance).
Query layer. A unified UI for exploration across metrics, logs, and traces, with the ability to jump quickly from a metric to the relevant logs and traces.
Alerting. Threshold and anomaly-based alerts. Routing — who receives an alert for which component. Alert fatigue management — suppression, dedup, escalation.
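A toy sketch of the dedup-plus-routing idea follows; the component names, routing table, and 30-minute suppression window are all assumptions, not a specific product's behaviour.

```python
from datetime import datetime, timedelta, timezone

ROUTING = {"charging": "bss-oncall", "ran": "network-oncall", "crm": "it-oncall"}
SUPPRESS_FOR = timedelta(minutes=30)
_last_sent: dict[tuple, datetime] = {}

def route_alert(alert: dict) -> str | None:
    """Return the on-call channel for an alert, or None if it is suppressed."""
    key = (alert["component"], alert["rule"])      # dedup key
    now = datetime.now(timezone.utc)
    last = _last_sent.get(key)
    if last is not None and now - last < SUPPRESS_FOR:
        return None                                # same alert already paged recently
    _last_sent[key] = now
    return ROUTING.get(alert["component"], "platform-oncall")
```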
Dashboards. Per-system, per-service, per-business-flow. Not “one big dashboard” but a layered set.
Where it usually breaks
Every team has its own stack. Cross-system incident correlation is impossible without a human who collects data manually.
Sampling is too aggressive — during an incident there are no relevant traces to investigate.
Alerts are not tuned — 200 alerts per day arrive, the team ignores them. A critical alert drowns in the noise.
Logs are not structured — every team logs in its own format, so cross-system correlation is impossible (see the structured-log sketch after this list).
Retention is too short — during a post-mortem you need a week of data and only two days exist.
Cost is out of control — observability ends up costing more than the applications it monitors. Without discipline this cost only grows.
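To make the "logs are not structured" failure mode concrete: a minimal sketch of one shared JSON format with a correlation field, using the standard Python logging module. The field names are an assumption, not a standard; the point is that every team emits the same fields.

```python
import json
import logging
import sys

class JsonFormatter(logging.Formatter):
    """Every service emits the same fields, so correlation becomes a query."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "ts": self.formatTime(record),
            "severity": record.levelname,
            "service": getattr(record, "service", "unknown"),
            "trace_id": getattr(record, "trace_id", None),  # correlation key across systems
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("crm-adapter")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("order activated", extra={"service": "crm-adapter", "trace_id": "0af7651916cd43dd8448eb211c80319c"})
```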
SLI / SLO discipline
Without SLIs (service level indicators) and SLOs (service level objectives), observability is just dashboards. SLOs give the data business meaning.
Per service:
- SLI: what we measure (availability, latency, error rate)
- SLO: target level (e.g. 99.9% availability per quarter)
- Error budget: tolerable deviation
When the error budget is exhausted, releases are paused and the team focuses on reliability. This is an operating discipline, not just a metric.
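The arithmetic behind that policy, using the 99.9%-per-quarter example from above; the numbers illustrate the mechanics, not a recommended target.

```python
# Error budget for an availability SLO of 99.9% per quarter.
slo = 0.999
quarter_minutes = 91 * 24 * 60          # ~131,040 minutes in a quarter
error_budget = (1 - slo) * quarter_minutes

print(f"allowed downtime per quarter: {error_budget:.0f} minutes")  # ~131 minutes

def budget_left(downtime_minutes: float) -> float:
    """Share of the quarterly error budget still available (negative = exhausted)."""
    return 1 - downtime_minutes / error_budget

# If 100 minutes have already been burned, roughly 24% of the budget remains;
# once the value goes negative, releases are paused per the policy above.
print(f"budget remaining: {budget_left(100):.0%}")
```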
Operating model
Owner — Head of SRE / Platform Reliability. Not every team on its own: a central function sets the standards, the other teams follow them.
Teams:
- Platform engineering (collectors, pipelines, storage, query, alerting)
- SRE per critical service (defines SLI/SLO, on-call)
- Service owners (use the platform, accountable for their SLO)
Routine — weekly SLO review, quarterly reliability review, post-mortem on each significant incident.
What is measured
MTTD (mean time to detect) — time from incident to alert.
MTTR (mean time to resolve) — time from alert to resolution.
Alert noise ratio — what share of alerts turned out not to be actionable.
SLO compliance — what share of services is in the green zone.
Cost per metric / log / trace — observability economics.
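MTTD and MTTR are simple to compute once incident timestamps are recorded consistently. A sketch with assumed field names and made-up incidents, purely to show the calculation:

```python
from datetime import datetime
from statistics import mean

# Illustrative incident records; "started", "detected", "resolved" are assumed fields.
incidents = [
    {"started": datetime(2024, 5, 3, 10, 0), "detected": datetime(2024, 5, 3, 10, 18),
     "resolved": datetime(2024, 5, 3, 11, 40)},
    {"started": datetime(2024, 5, 9, 2, 5), "detected": datetime(2024, 5, 9, 2, 9),
     "resolved": datetime(2024, 5, 9, 3, 0)},
]

mttd = mean((i["detected"] - i["started"]).total_seconds() / 60 for i in incidents)
mttr = mean((i["resolved"] - i["detected"]).total_seconds() / 60 for i in incidents)
print(f"MTTD: {mttd:.0f} min, MTTR: {mttr:.1f} min")
```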
How SamaraliSoft engages
Observability Blueprint — 6-8 weeks. Inventory of the current observability fragments, target stack design, an SLI/SLO framework for critical services, governance, and platform choice (open source: Prometheus + Grafana + Loki + Tempo; managed: Datadog, New Relic; cloud-native). Pilot — usually one critical service end-to-end.
Related
- /en/architecture/telecom-event-bus-architecture/ — event bus monitoring
- /en/insights/telecom-sre-discipline/ — SRE discipline
- /en/architecture/telecom-mlops-architecture/ — MLOps observability
- /en/insights/telecom-incident-management/ — incident management