System Observability — Logging, Metrics & Traces
Three-pillar observability framework unifying scattered monitoring decisions into a cohesive OpenTelemetry-based architecture
Table of Contents
- Problem Statement
- Architecture Overview
- Technology Stack
- Instrumentation Strategy
- Metrics Catalog
- Distributed Tracing Architecture
- SLO/SLI Framework
- Alerting Architecture
- Grafana Dashboards
- Repository Structure
- Phased Delivery
- Migration from Existing Patterns
- Configuration
- Related Documents
Problem Statement
MetaForge has observability decisions scattered across multiple documents but no unified strategy. The table below maps each existing piece to its current location and identifies gaps:
| Existing Piece | Current Location | What It Provides | Gap |
|---|---|---|---|
| `structlog` | Architecture index — Gateway dependencies | Structured JSON logging | No trace correlation (`trace_id`/`span_id`), no centralized log aggregation |
| `.forge/traces/` JSONL | Architecture index — Workspace structure | Per-session execution traces | Local-only, no distributed tracing, no span hierarchy |
| `GET /health` | Architecture index — API endpoints | Basic liveness check | No readiness probe, no dependency health (Neo4j, Kafka, pgvector) |
| Performance targets | Architecture index — Section 9.1 | Response time and resource limits | No SLO/SLI framework, no error budgets, no alerting |
| Dead letter queue | Event Sourcing — Multi-agent coordination | Failed event retry and investigation | No DLQ depth metrics, no consumer lag monitoring |
| Oscillation detection | Event Sourcing — Propagation control | Detects agent feedback loops | No oscillation metrics or alerting rules |
| OPA audit logging | Orchestrator Technical — Governance | Policy decision audit trail | Not integrated with centralized logging or metrics |
| `ComponentHealth` types | Dashboard types | Frontend health display models | No backend metrics feeding these types |
| WebSocket events | Architecture index — Gateway service | Real-time UI updates | No connection metrics, no message rate tracking |
What is missing:
- No OpenTelemetry instrumentation (the industry standard for vendor-neutral telemetry)
- No metrics collection (Prometheus) or operational dashboarding (Grafana)
- No distributed tracing across agent → skill → tool → Neo4j call chains
- No log correlation (`trace_id`/`span_id` propagation across services)
- No alerting or on-call architecture
- No SLO/SLI/error budget framework
- No agent-specific metrics (LLM token usage, cost, latency per agent/skill)
- No Kafka consumer lag, Neo4j query, or pgvector search monitoring
Architecture Overview
Signal Flow
All telemetry signals (logs, metrics, traces) flow through a unified OpenTelemetry pipeline:
```mermaid
flowchart TD
    subgraph SOURCES["Signal Sources"]
        direction LR
        GW["Gateway<br/>FastAPI"]
        AGT["Domain Agents<br/>Python"]
        SKL["Skills<br/>Python"]
        MCP_S["MCP Layer<br/>Tool Adapters"]
        KFK["Kafka<br/>Consumers/Producers"]
        NEO["Neo4j<br/>Graph Queries"]
        PGV["pgvector<br/>Semantic Search"]
    end
    subgraph OTEL_SDK["OpenTelemetry SDK (in-process)"]
        direction LR
        TRACES_SDK["TracerProvider"]
        METRICS_SDK["MeterProvider"]
        LOGS_SDK["LoggerProvider<br/>structlog bridge"]
    end
    subgraph COLLECTOR["OpenTelemetry Collector"]
        direction LR
        RECV["OTLP Receiver"]
        PROC["Processors<br/>batch, filter, attributes"]
        EXP["Exporters"]
    end
    subgraph BACKENDS["Storage Backends"]
        direction LR
        PROM["Prometheus<br/>Metrics"]
        TEMPO["Grafana Tempo<br/>Traces"]
        LOKI["Grafana Loki<br/>Logs"]
    end
    SOURCES --> OTEL_SDK
    OTEL_SDK -->|"OTLP/gRPC"| COLLECTOR
    COLLECTOR --> BACKENDS
    BACKENDS --> GRAFANA["Grafana<br/>Dashboards + Alerts"]
    GRAFANA --> AM["Alertmanager<br/>Notification Routing"]
    style SOURCES fill:#3498db,color:#fff
    style OTEL_SDK fill:#9b59b6,color:#fff
    style COLLECTOR fill:#E67E22,color:#fff
    style BACKENDS fill:#2C3E50,color:#fff
    style GRAFANA fill:#27ae60,color:#fff
```
Distributed Trace: Full Agent Execution
A single user request generates a trace that spans the entire execution chain:
```mermaid
sequenceDiagram
    participant CLI as CLI / Dashboard
    participant GW as Gateway (FastAPI)
    participant ORCH as Orchestrator
    participant AGT as Domain Agent
    participant SKL as Skill
    participant MCP as MCP Bridge
    participant TOOL as External Tool
    participant NEO as Neo4j
    participant KFK as Kafka
    CLI->>GW: POST /api/v1/agent/run [trace_id generated]
    activate GW
    GW->>ORCH: create_session [child span]
    activate ORCH
    ORCH->>AGT: dispatch_task [child span]
    activate AGT
    AGT->>SKL: skill.execute [child span]
    activate SKL
    SKL->>MCP: mcp.tool_call [child span]
    activate MCP
    MCP->>TOOL: execute [child span]
    TOOL-->>MCP: result
    deactivate MCP
    SKL-->>AGT: skill output
    deactivate SKL
    AGT->>NEO: graph query [child span]
    activate NEO
    NEO-->>AGT: query result
    deactivate NEO
    AGT->>KFK: publish event [child span, trace context in headers]
    AGT-->>ORCH: agent result
    deactivate AGT
    ORCH-->>GW: session result
    deactivate ORCH
    GW-->>CLI: response [trace_id in header]
    deactivate GW
```
Technology Stack
| Component | Technology | Rationale |
|---|---|---|
| Instrumentation | `opentelemetry-sdk` (Python) + `@opentelemetry/sdk-node` (TS) | Industry standard, vendor-neutral; auto-instrumentation for FastAPI, httpx, kafka-python |
| Log backend | Grafana Loki | Label-indexed, lightweight, Grafana-native; no full-text indexing overhead |
| Metrics backend | Prometheus | Pull-based, PromQL for SLO queries; distinct from InfluxDB (reserved for device telemetry L3) |
| Trace backend | Grafana Tempo | Object storage backend reuses MinIO; lighter than Jaeger + Elasticsearch |
| Dashboarding | Grafana | Unified UI for all three pillars; alerting built-in |
| Alerting | Prometheus Alertmanager | Rule-based alerting on metrics, traces, and logs |
| Collector | OpenTelemetry Collector | Vendor-neutral pipeline; receives OTLP, exports to all backends |
| Structured logging | `structlog` (existing) + OTel logging bridge | Zero migration from existing logging; adds `trace_id`/`span_id` to every log line |
Why Prometheus instead of InfluxDB for metrics? InfluxDB is reserved for device telemetry (L3 — high-frequency sensor data). Prometheus serves operational metrics with pull-based scraping, native PromQL for SLO calculations, and Grafana integration. Keeping them separate avoids mixing operational concerns with product telemetry.
Why Tempo instead of Jaeger? Tempo uses object storage (MinIO, already in the stack) instead of Elasticsearch, reducing operational complexity. It integrates natively with Grafana for trace visualization and supports TraceQL for querying.
Instrumentation Strategy
Layer 1: Auto-Instrumentation (Zero Code Changes)
OpenTelemetry provides automatic instrumentation for common libraries. These require only SDK initialization — no code changes in application logic:
| Library | Auto-Instrumentation Package | What It Captures |
|---|---|---|
| FastAPI | `opentelemetry-instrumentation-fastapi` | HTTP request spans, status codes, latency |
| httpx | `opentelemetry-instrumentation-httpx` | Outbound HTTP calls (LLM API calls, supplier APIs) |
| kafka-python | `opentelemetry-instrumentation-kafka-python` | Producer/consumer spans, topic metadata |
| psycopg2 | `opentelemetry-instrumentation-psycopg2` | PostgreSQL/pgvector query spans |
| requests | `opentelemetry-instrumentation-requests` | Outbound HTTP (fallback for non-httpx clients) |
Layer 2: Manual Spans (Domain-Specific Instrumentation)
Custom spans for MetaForge-specific operations that auto-instrumentation cannot capture:
```python
from opentelemetry import trace

tracer = trace.get_tracer("metaforge.agents")

async def execute_agent(agent_code: str, session_id: str, input_data: dict):
    with tracer.start_as_current_span(
        "agent.execute",
        attributes={
            "agent.code": agent_code,
            "session.id": session_id,
            "agent.input_size": len(str(input_data)),
        },
    ) as span:
        result = await agent.run(input_data)
        span.set_attribute("agent.output_artifacts", len(result.artifacts))
        span.set_attribute("agent.llm_tokens_total", result.token_usage.total)
        span.set_attribute("agent.llm_cost_usd", result.cost_usd)
        return result
```
Manual span catalog:
| Span Name | Attributes | Description |
|---|---|---|
| `agent.execute` | `agent.code`, `session.id`, `agent.llm_tokens_total`, `agent.llm_cost_usd` | Full agent execution lifecycle |
| `skill.execute` | `skill.name`, `skill.domain`, `skill.input_size` | Individual skill invocation |
| `mcp.tool_call` | `tool.name`, `tool.adapter`, `tool.timeout_ms` | MCP protocol tool execution |
| `neo4j.query` | `db.statement`, `db.operation`, `neo4j.result_count` | Graph database query |
| `pgvector.search` | `pgvector.query_embedding_dim`, `pgvector.top_k`, `pgvector.similarity_threshold` | Semantic similarity search |
| `kafka.produce` | `messaging.destination`, `messaging.message_id` | Event publication |
| `kafka.consume` | `messaging.destination`, `messaging.consumer_group` | Event consumption |
| `llm.completion` | `llm.provider`, `llm.model`, `llm.tokens.prompt`, `llm.tokens.completion`, `llm.cost_usd` | LLM API call |
| `constraint.evaluate` | `constraint.rule_id`, `constraint.domain`, `constraint.result` | Constraint engine evaluation |
| `opa.decision` | `opa.policy`, `opa.result`, `opa.input_size` | OPA policy evaluation |
Layer 3: Log Correlation
A structlog processor injects OpenTelemetry trace context into every log line, enabling log-to-trace correlation in Grafana:
```python
import structlog
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    """structlog processor that injects trace_id and span_id."""
    span = trace.get_current_span()
    if span.is_recording():
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
        event_dict["trace_flags"] = ctx.trace_flags
    return event_dict

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        add_trace_context,  # Inject OTel context
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
```
Result: Every log line includes trace_id and span_id, enabling Grafana to jump from a log entry directly to its parent trace.
Metrics Catalog
Approximately 30 metrics organized by subsystem. All metrics use Prometheus naming conventions (snake_case, unit suffix).
Gateway Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `metaforge_gateway_request_total` | Counter | `method`, `endpoint`, `status_code` | Total HTTP requests |
| `metaforge_gateway_request_duration_seconds` | Histogram | `method`, `endpoint` | Request latency distribution |
| `metaforge_gateway_websocket_connections` | Gauge | `state` | Active WebSocket connections |
| `metaforge_gateway_active_sessions` | Gauge | `status` | Sessions by status (running, pending_approval, etc.) |
Agent Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `metaforge_agent_execution_duration_seconds` | Histogram | `agent_code`, `status` | Agent execution time |
| `metaforge_agent_execution_total` | Counter | `agent_code`, `status` | Total agent executions (success/failure) |
| `metaforge_agent_llm_tokens_total` | Counter | `agent_code`, `llm_provider`, `llm_model`, `token_type` | LLM tokens consumed (prompt/completion) |
| `metaforge_agent_llm_cost_usd_total` | Counter | `agent_code`, `llm_provider`, `llm_model` | Cumulative LLM API cost |
| `metaforge_agent_llm_request_duration_seconds` | Histogram | `agent_code`, `llm_provider`, `llm_model` | LLM API call latency |
| `metaforge_skill_execution_duration_seconds` | Histogram | `skill_name`, `domain` | Skill execution time |
| `metaforge_skill_execution_total` | Counter | `skill_name`, `domain`, `status` | Total skill executions |
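The LLM cost counter is incremented per call from the token counts already captured on the `llm.completion` span. A sketch of that accounting, assuming the usual per-1K-token pricing model (the prices below are placeholders, not any provider's actual rates):

```python
def llm_cost_usd(prompt_tokens: int, completion_tokens: int,
                 prompt_usd_per_1k: float, completion_usd_per_1k: float) -> float:
    """Cost of one LLM call: tokens scaled by per-1K-token prices for each token type."""
    return (prompt_tokens / 1000.0) * prompt_usd_per_1k \
         + (completion_tokens / 1000.0) * completion_usd_per_1k
```

Each agent execution would add this value to `metaforge_agent_llm_cost_usd_total` with the matching `agent_code`/`llm_provider`/`llm_model` labels.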
Tool Adapter Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `metaforge_mcp_call_duration_seconds` | Histogram | `tool_name`, `adapter` | MCP tool call latency |
| `metaforge_mcp_call_total` | Counter | `tool_name`, `adapter`, `status` | Total MCP tool calls |
| `metaforge_mcp_call_errors_total` | Counter | `tool_name`, `adapter`, `error_type` | MCP tool call failures |
Data Store Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `metaforge_neo4j_query_duration_seconds` | Histogram | `operation`, `node_type` | Neo4j query latency |
| `metaforge_neo4j_active_connections` | Gauge | — | Neo4j connection pool usage |
| `metaforge_neo4j_query_total` | Counter | `operation`, `status` | Total Neo4j queries |
| `metaforge_pgvector_search_duration_seconds` | Histogram | `knowledge_type` | pgvector similarity search latency |
| `metaforge_pgvector_search_total` | Counter | `knowledge_type`, `status` | Total pgvector searches |
| `metaforge_minio_operation_duration_seconds` | Histogram | `operation` | MinIO read/write latency |
| `metaforge_minio_operation_total` | Counter | `operation`, `status` | Total MinIO operations |
Kafka Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `metaforge_kafka_consumer_lag` | Gauge | `consumer_group`, `topic`, `partition` | Consumer lag (messages behind) |
| `metaforge_kafka_messages_produced_total` | Counter | `topic` | Messages published |
| `metaforge_kafka_messages_consumed_total` | Counter | `topic`, `consumer_group` | Messages consumed |
| `metaforge_kafka_dead_letters_total` | Counter | `topic`, `consumer_group` | Messages sent to dead letter queue |
| `metaforge_kafka_rebalance_total` | Counter | `consumer_group` | Consumer group rebalances |
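The lag gauge is derived from two offsets per partition: the broker's log-end offset and the group's committed offset. A minimal sketch of that derivation (the offset maps here are hypothetical inputs, e.g. fetched via the consumer client's end-offset and committed-offset APIs):

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Messages each (topic, partition) is behind: log-end offset minus committed offset.

    A partition with no committed offset is treated as lagging by its full log.
    """
    return {tp: end_offsets[tp] - committed.get(tp, 0) for tp in end_offsets}
```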
Constraint & Policy Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `metaforge_constraint_evaluation_total` | Counter | `domain`, `result` | Constraint evaluations (pass/fail/error) |
| `metaforge_constraint_evaluation_duration_seconds` | Histogram | `domain` | Constraint evaluation latency |
| `metaforge_opa_decision_total` | Counter | `policy`, `result` | OPA policy decisions |
| `metaforge_oscillation_detected_total` | Counter | `node_type` | Oscillation detection events |
Distributed Tracing Architecture
Trace Waterfall Example
A typical agent execution trace spans multiple services and produces a waterfall like this:
```mermaid
gantt
    title Trace Waterfall: Agent Execution
    dateFormat X
    axisFormat %s
    section Gateway
    HTTP POST /api/v1/agent/run :gw, 0, 30000
    section Orchestrator
    create_session :orch, 100, 500
    dispatch_task :orch2, 600, 29000
    section Agent
    agent.execute (EE) :agt, 700, 28000
    llm.completion :llm1, 800, 5000
    skill.execute (run_erc) :skl, 6000, 8000
    mcp.tool_call (kicad) :mcp, 6100, 7500
    llm.completion :llm2, 15000, 5000
    neo4j.query (write artifacts) :neo, 21000, 500
    kafka.produce (graph.mutations) :kfk, 22000, 100
```
W3C TraceContext Propagation
MetaForge propagates trace context using the W3C TraceContext standard across all communication boundaries:
| Boundary | Propagation Mechanism | Format |
|---|---|---|
| HTTP (Gateway ↔ CLI/Dashboard) | `traceparent` + `tracestate` headers | W3C TraceContext |
| Kafka messages | Message headers: `traceparent` | W3C TraceContext |
| WebSocket | Initial handshake headers + message metadata field | W3C TraceContext |
| MCP tool calls | Tool call context metadata | W3C TraceContext |
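Per the W3C spec, a `traceparent` value is four dash-separated lowercase-hex fields: version, 128-bit trace ID, 64-bit parent span ID, and trace flags. A minimal builder (the function name is illustrative, not part of any MetaForge module):

```python
def make_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Format a W3C TraceContext traceparent header: version-traceid-spanid-flags."""
    flags = "01" if sampled else "00"  # bit 0 of trace-flags = sampled
    return f"00-{trace_id:032x}-{span_id:016x}-{flags}"
```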
Kafka header propagation:
```python
from opentelemetry.propagate import inject

def produce_with_context(producer, topic: str, value: bytes):
    """Inject W3C trace context (traceparent/tracestate) into Kafka message headers."""
    carrier: dict = {}
    inject(carrier)  # default setter writes header names/values into the dict
    headers = [(key, val.encode("utf-8")) for key, val in carrier.items()]
    producer.send(topic, value=value, headers=headers)
```
Backward-Compatible .forge/traces/ Enrichment
Existing .forge/traces/ JSONL files are enriched with trace context fields. This is additive — existing consumers that ignore these fields continue to work:
```json
{
  "timestamp": "2026-02-28T10:30:00Z",
  "session_id": "abc-123",
  "agent": "electronics",
  "action": "run_erc",
  "level": "info",
  "data": {"components_checked": 42},
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "parent_span_id": "a1b2c3d4e5f60718"
}
```
SLO/SLI Framework
This section formalizes the existing performance targets (Architecture index — Section 9.1) into proper SLOs with error budgets.
Service Level Indicators (SLIs) and Objectives (SLOs)
| Service | SLI | SLO | Error Budget (30d) | Measurement |
|---|---|---|---|---|
| Gateway | Availability (non-5xx responses / total) | 99.9% | 43.2 min downtime | metaforge_gateway_request_total |
| Gateway | Latency p99 (non-agent requests) | < 100ms | 0.1% of requests may exceed | metaforge_gateway_request_duration_seconds |
| Agent execution | Success rate | > 95% | 5% of executions may fail | metaforge_agent_execution_total |
| Agent execution | Latency p95 | < 30s | 5% of executions may exceed | metaforge_agent_execution_duration_seconds |
| Neo4j reads | Latency p99 | < 50ms | 1% of queries may exceed | metaforge_neo4j_query_duration_seconds |
| Neo4j writes | Latency p99 | < 200ms | 1% of queries may exceed | metaforge_neo4j_query_duration_seconds |
| pgvector search | Latency p99 | < 100ms | 1% of searches may exceed | metaforge_pgvector_search_duration_seconds |
| Kafka consumer lag | Max lag per partition | < 1000 messages | Sustained violation triggers alert | metaforge_kafka_consumer_lag |
| Kafka DLQ | DLQ rate | < 0.1% of consumed messages | 0.1% may go to DLQ | metaforge_kafka_dead_letters_total |
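Each error-budget figure above follows directly from the SLO target: budget = (1 − SLO) × window. For 99.9% availability over 30 days that is 0.001 × 30 × 24 × 60 = 43.2 minutes. As an illustrative helper:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window for an availability SLO (e.g. 0.999)."""
    return (1.0 - slo) * window_days * 24 * 60
```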
PromQL Examples for SLO Calculation
Gateway availability (30-day window):
```promql
# Gateway availability SLO: 99.9%
1 - (
  sum(rate(metaforge_gateway_request_total{status_code=~"5.."}[30d]))
  /
  sum(rate(metaforge_gateway_request_total[30d]))
)
```
Agent success rate:
```promql
# Agent success rate SLO: > 95%
sum(rate(metaforge_agent_execution_total{status="success"}[30d]))
/
sum(rate(metaforge_agent_execution_total[30d]))
```
Error budget burn rate (alerting on fast burn):
```promql
# Alert if burning error budget 14.4x faster than sustainable (2% budget consumed in 1h)
(
  sum(rate(metaforge_gateway_request_total{status_code=~"5.."}[1h]))
  /
  sum(rate(metaforge_gateway_request_total[1h]))
) > (14.4 * 0.001)
```
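The 14.4 multiplier is the standard fast-burn threshold: at 14.4× the sustainable error rate, 2% of a 30-day budget is consumed per hour, so the budget would be exhausted in 720 / 14.4 = 50 hours. The arithmetic, as a sketch (function names are illustrative):

```python
def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return observed_error_ratio / (1.0 - slo)

def hours_to_exhaustion(burn: float, window_days: int = 30) -> float:
    """Hours until the budget runs out at a constant burn rate."""
    return window_days * 24 / burn
```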
Neo4j read latency p99:
```promql
# Neo4j read latency p99 < 50ms
histogram_quantile(0.99,
  sum(rate(metaforge_neo4j_query_duration_seconds_bucket{operation="read"}[5m])) by (le)
)
```
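`histogram_quantile` assumes observations are spread uniformly within each bucket and interpolates linearly to the target rank. A simplified model of that calculation (ignoring the `+Inf` bucket and edge cases handled by the real implementation):

```python
def histogram_quantile(q: float, buckets) -> float:
    """buckets: list of (upper_bound, cumulative_count), sorted by bound ascending."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            # linear interpolation within the bucket that contains the target rank
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]
```

This is why bucket boundaries matter: a p99 target of 50ms needs a bucket edge near 50ms, or the interpolated estimate is coarse.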
Alerting Architecture
Severity Levels
| Severity | Response Time | Notification Channel | Description |
|---|---|---|---|
| Critical | < 15 min | PagerDuty / Slack `#alerts-critical` | Service down, data loss risk, SLO breach imminent |
| Warning | < 1 hour | Slack `#alerts-warning` | Degraded performance, approaching thresholds |
| Info | Next business day | Slack `#alerts-info` | Capacity planning, trend notifications |
Alerting Rules
```yaml
# observability/alerting/rules.yaml
groups:
  - name: metaforge-critical
    rules:
      - alert: GatewayDown
        expr: up{job="metaforge-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MetaForge Gateway is down"
          description: "Gateway has been unreachable for more than 1 minute."

      - alert: KafkaConsumerStopped
        expr: |
          rate(metaforge_kafka_messages_consumed_total[5m]) == 0
          and on(consumer_group) metaforge_kafka_consumer_lag > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka consumer has stopped"
          description: "Consumer has non-zero lag but zero consumption rate for 5 minutes."

      - alert: Neo4jUnreachable
        expr: metaforge_neo4j_active_connections == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Neo4j connection pool is empty"
          description: "No active Neo4j connections — all graph operations will fail."

  - name: metaforge-warning
    rules:
      - alert: HighAgentFailureRate
        expr: |
          sum(rate(metaforge_agent_execution_total{status="failure"}[15m]))
          /
          sum(rate(metaforge_agent_execution_total[15m]))
          > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agent failure rate exceeds 10%"
          description: "{{ $value | humanizePercentage }} of agent executions are failing."

      - alert: KafkaConsumerLagHigh
        expr: metaforge_kafka_consumer_lag > 5000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer lag is high"
          description: "Consumer group {{ $labels.consumer_group }} on {{ $labels.topic }} has lag of {{ $value }} messages."

      - alert: ErrorBudgetBurnRate
        expr: |
          (
            sum(rate(metaforge_gateway_request_total{status_code=~"5.."}[1h]))
            /
            sum(rate(metaforge_gateway_request_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway error budget burning too fast"
          description: "Gateway is consuming its 30-day error budget at >=14.4x the sustainable rate (>=2% of budget per hour)."

      - alert: LLMCostSpike
        expr: |
          sum(rate(metaforge_agent_llm_cost_usd_total[1h])) * 3600
          > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "LLM cost exceeding $10/hour"
          description: "Current hourly LLM spend: ${{ $value }}."

      - alert: OscillationDetected
        expr: rate(metaforge_oscillation_detected_total[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Agent oscillation detected"
          description: "Oscillation detected on {{ $labels.node_type }} nodes."
```
Notification Routing
```yaml
# observability/alerting/routes.yaml
route:
  receiver: default-slack
  group_by: [alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: slack-warning
      repeat_interval: 1h
    - match:
        severity: info
      receiver: slack-info
      repeat_interval: 24h

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_ROUTING_KEY}"
  - name: slack-warning
    slack_configs:
      - channel: "#alerts-warning"
        api_url: "${SLACK_WEBHOOK_URL}"
  - name: slack-info
    slack_configs:
      - channel: "#alerts-info"
        api_url: "${SLACK_WEBHOOK_URL}"
  - name: default-slack
    slack_configs:
      - channel: "#alerts-general"
        api_url: "${SLACK_WEBHOOK_URL}"
```
Grafana Dashboards
Four dashboards provide complete operational visibility:
1. System Overview Dashboard
Purpose: High-level health at a glance for all MetaForge components.
| Panel | Visualization | Data Source | Query |
|---|---|---|---|
| Gateway Status | Stat (up/down) | Prometheus | up{job="metaforge-gateway"} |
| Request Rate | Time series | Prometheus | rate(metaforge_gateway_request_total[5m]) |
| Error Rate | Time series | Prometheus | rate(metaforge_gateway_request_total{status_code=~"5.."}[5m]) |
| Active Sessions | Stat | Prometheus | metaforge_gateway_active_sessions |
| WebSocket Connections | Gauge | Prometheus | metaforge_gateway_websocket_connections |
| Kafka Consumer Lag | Time series | Prometheus | metaforge_kafka_consumer_lag |
| Neo4j Connection Pool | Gauge | Prometheus | metaforge_neo4j_active_connections |
| Recent Alerts | Table | Alertmanager | Active alerts by severity |
2. Agent Performance Dashboard
Purpose: Per-agent execution metrics, LLM costs, and skill breakdowns.
| Panel | Visualization | Data Source | Query |
|---|---|---|---|
| Agent Success Rate | Stat per agent | Prometheus | Success/total by agent_code |
| Execution Duration (p50/p95/p99) | Heatmap | Prometheus | metaforge_agent_execution_duration_seconds |
| LLM Token Usage | Stacked bar | Prometheus | metaforge_agent_llm_tokens_total by model |
| LLM Cost Over Time | Time series | Prometheus | rate(metaforge_agent_llm_cost_usd_total[1h]) |
| Skill Execution Breakdown | Table | Prometheus | metaforge_skill_execution_duration_seconds by skill |
| MCP Tool Call Latency | Heatmap | Prometheus | metaforge_mcp_call_duration_seconds by tool |
| Agent Trace Explorer | Trace panel | Tempo | Traces filtered by agent.code |
3. Data Stores Dashboard
Purpose: Neo4j, pgvector, Kafka, and MinIO operational health.
| Panel | Visualization | Data Source | Query |
|---|---|---|---|
| Neo4j Query Latency | Histogram | Prometheus | metaforge_neo4j_query_duration_seconds |
| Neo4j Operations | Time series | Prometheus | rate(metaforge_neo4j_query_total[5m]) by operation |
| pgvector Search Latency | Histogram | Prometheus | metaforge_pgvector_search_duration_seconds |
| Kafka Consumer Lag by Topic | Time series | Prometheus | metaforge_kafka_consumer_lag by topic |
| Kafka Throughput | Time series | Prometheus | rate(metaforge_kafka_messages_produced_total[5m]) |
| Dead Letter Queue Depth | Stat | Prometheus | metaforge_kafka_dead_letters_total |
| MinIO Operations | Time series | Prometheus | rate(metaforge_minio_operation_total[5m]) |
4. SLO Overview Dashboard
Purpose: Error budget tracking and SLO compliance for all defined objectives.
| Panel | Visualization | Data Source | Query |
|---|---|---|---|
| Gateway Availability (30d) | Stat with threshold | Prometheus | Availability PromQL (see above) |
| Gateway Latency p99 | Stat with threshold | Prometheus | histogram_quantile(0.99, ...) |
| Agent Success Rate (30d) | Stat with threshold | Prometheus | Agent success PromQL |
| Error Budget Remaining | Gauge per SLO | Prometheus | Budget consumed vs total |
| Error Budget Burn Rate | Time series | Prometheus | Burn rate over sliding windows |
| SLO Compliance History | Table | Prometheus | Weekly SLO compliance |
Repository Structure
The observability/ directory contains all observability infrastructure code:
```
observability/
├── bootstrap.py                 # OTel SDK initialization (TracerProvider, MeterProvider, LoggerProvider)
├── logging.py                   # structlog → OTel logging bridge (trace_id/span_id injection)
├── tracing.py                   # Custom span helpers for agent/skill/tool instrumentation
├── metrics.py                   # Prometheus metric definitions (all ~30 metrics)
├── middleware.py                # FastAPI middleware for request tracing and metrics
├── config.py                    # ObservabilityConfig Pydantic model
├── slo/
│   ├── definitions.py           # SLO/SLI definitions as code
│   └── calculator.py            # Error budget calculation utilities
├── alerting/
│   ├── rules.yaml               # Prometheus alerting rules
│   └── routes.yaml              # Alertmanager notification routing
└── dashboards/
    ├── system-overview.json     # Grafana dashboard: system health
    ├── agent-performance.json   # Grafana dashboard: agent metrics
    ├── data-stores.json         # Grafana dashboard: Neo4j/Kafka/pgvector
    └── slo-overview.json        # Grafana dashboard: SLO tracking
```
Phased Delivery
Observability delivery aligns with the existing phased timeline:
| Phase | Deliverable | Timeline | Details |
|---|---|---|---|
| P1 | `structlog` with `trace_id`, basic Prometheus metrics, `/health` enhancement, system Grafana dashboard | Months 1-6 | Add `trace_id` to structlog output; instrument Gateway request count/latency; enhance `/health` with dependency checks (Neo4j, Kafka); create system overview dashboard |
| P1.5 | OTel SDK, distributed tracing (Gateway → Agent → Neo4j), agent LLM metrics, Kafka consumer lag | Months 5-6 | Initialize opentelemetry-sdk; add manual spans for agent.execute, llm.completion, neo4j.query; add LLM token/cost counters; monitor Kafka consumer lag |
| P2 | Full three-pillar stack (Loki, Tempo, Prometheus, Grafana), SLO dashboards, alerting rules, MCP + pgvector metrics | Months 7-12 | Deploy OTel Collector, Loki, Tempo; add all alerting rules; instrument MCP tool calls and pgvector searches; create all four Grafana dashboards; define SLO error budgets |
| P3 | Device telemetry observability, fleet health dashboards, anomaly alerts, incident runbooks | Months 13-18 | Add MQTT/telemetry pipeline metrics; fleet health Grafana dashboard; anomaly detection alerts; on-call runbooks for each critical alert |
| P4 | Simulation performance tracking, multi-tenant observability, cost attribution, enterprise audit export | Months 21-24+ | Per-tenant metric isolation; simulation run performance tracking; LLM cost attribution per project/team; audit log export for compliance |
Phase Dependencies
```mermaid
flowchart LR
    P1["P1: structlog + basic metrics<br/>Months 1-6"] --> P15["P1.5: OTel SDK + tracing<br/>Months 5-6"]
    P15 --> P2["P2: Full three-pillar stack<br/>Months 7-12"]
    P2 --> P3["P3: Device telemetry observability<br/>Months 13-18"]
    P3 --> P4["P4: Multi-tenant + audit<br/>Months 21-24+"]
    style P1 fill:#27ae60,color:#fff
    style P15 fill:#2ecc71,color:#fff
    style P2 fill:#3498db,color:#fff
    style P3 fill:#9b59b6,color:#fff
    style P4 fill:#E67E22,color:#fff
```
Migration from Existing Patterns
All changes are additive — no existing functionality is removed or broken.
| Existing Piece | Migration Path | Breaking Changes |
|---|---|---|
| `structlog` (Gateway logging) | Add `add_trace_context` processor to existing `structlog.configure()` call | None — existing log fields preserved; `trace_id`/`span_id` added as new fields |
| `.forge/traces/` JSONL | Enrich existing JSONL entries with `trace_id`, `span_id`, `parent_span_id` fields | None — new fields are additive; existing consumers ignore unknown fields |
| `GET /health` endpoint | Extend response to include dependency health checks (Neo4j, Kafka, pgvector connectivity) | None — existing 200 OK response preserved; new fields added to response body |
| Performance targets (index.md §9.1) | Formalized into SLO/SLI definitions with PromQL queries and error budgets | None — targets unchanged; SLO framework adds measurement and alerting |
| Dead letter queue (event-sourcing) | Add `metaforge_kafka_dead_letters_total` counter; DLQ depth visible in Grafana | None — existing DLQ behavior unchanged; metrics are new instrumentation |
| Oscillation detection (event-sourcing) | Add `metaforge_oscillation_detected_total` counter; alerting rule triggers on detection | None — existing detection logic unchanged; metric emission added |
| OPA audit logging (orchestrator-technical) | Route OPA decision logs through structlog (with trace context) to Loki | None — existing OPA audit trail preserved; logs gain correlation IDs |
| `ComponentHealth` types (dashboard) | Backend `/health` endpoint populates these types with real metrics from Prometheus | None — existing types unchanged; backed by real data instead of placeholders |
| WebSocket events (gateway) | Add connection/message metrics; propagate trace context in WebSocket handshake | None — existing WebSocket protocol unchanged; metrics are new |
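The extended `/health` response could aggregate per-dependency probes as follows. This is a minimal sketch: the probe callables, timeout, and response field names are assumptions for illustration, not the existing endpoint's contract.

```python
import asyncio

async def check_dependency(name, probe, timeout_s: float = 1.0):
    """Run one dependency probe (e.g. a ping query) with a timeout; never raise."""
    try:
        await asyncio.wait_for(probe(), timeout_s)
        return name, "healthy"
    except Exception:
        return name, "unhealthy"

async def health(probes):
    """Aggregate probes for Neo4j, Kafka, pgvector, etc. into one health response."""
    checks = await asyncio.gather(*(check_dependency(n, p) for n, p in probes.items()))
    deps = dict(checks)
    status = "ok" if all(v == "healthy" for v in deps.values()) else "degraded"
    return {"status": status, "dependencies": deps}
```

Returning 200 with a `degraded` status (rather than a 5xx) preserves the existing liveness contract while exposing readiness detail.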
Configuration
ObservabilityConfig
```python
from pydantic import BaseModel
from typing import Optional

class OTelExporterConfig(BaseModel):
    endpoint: str = "http://localhost:4317"  # OTel Collector OTLP/gRPC
    insecure: bool = True                    # TLS disabled for local dev
    timeout_ms: int = 5000

class PrometheusConfig(BaseModel):
    port: int = 9090
    scrape_interval: str = "15s"

class GrafanaConfig(BaseModel):
    url: str = "http://localhost:3000"

class ObservabilityConfig(BaseModel):
    """Central configuration for all observability components."""
    enabled: bool = True
    service_name: str = "metaforge"
    environment: str = "development"  # development | staging | production

    # Exporters
    otlp: OTelExporterConfig = OTelExporterConfig()
    prometheus: PrometheusConfig = PrometheusConfig()
    grafana: GrafanaConfig = GrafanaConfig()

    # Sampling
    trace_sample_rate: float = 1.0  # 1.0 = sample everything (dev); 0.1 = 10% (prod)
    log_level: str = "INFO"

    # Feature flags
    enable_traces: bool = True
    enable_metrics: bool = True
    enable_logs: bool = True

    # Backends (for docker-compose generation)
    loki_url: Optional[str] = "http://localhost:3100"
    tempo_url: Optional[str] = "http://localhost:3200"
```
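A `trace_sample_rate` like this is typically enforced by a trace-ID-ratio sampler, which makes a deterministic keep/drop decision from the trace ID itself so every service in a trace agrees. A simplified sketch of that decision, modeled loosely on OTel's ratio-based head sampling (not the SDK's exact implementation):

```python
def should_sample(trace_id: int, sample_rate: float) -> bool:
    """Deterministic head sampling: compare the low 63 bits of the trace ID
    against a threshold derived from the configured ratio."""
    threshold = int(sample_rate * (1 << 63))
    return (trace_id & ((1 << 63) - 1)) < threshold
```

Because the decision depends only on the trace ID, a trace is either fully sampled or fully dropped across the Gateway, agents, and consumers.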
Docker Compose: Observability Stack
```yaml
# docker-compose.observability.yml
version: "3.9"

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./observability/otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
      - "8889:8889"  # Prometheus metrics exporter
    depends_on:
      - tempo
      - loki

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./observability/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./observability/alerting/rules.yaml:/etc/prometheus/rules.yaml
    ports:
      - "9090:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./observability/dashboards:/var/lib/grafana/dashboards
      - ./observability/grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
      - ./observability/grafana-dashboards.yml:/etc/grafana/provisioning/dashboards/dashboards.yml
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD:-admin}"
    depends_on:
      - prometheus
      - tempo
      - loki

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./observability/tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200"  # Tempo API
      - "4317"       # OTLP gRPC (internal)

  loki:
    image: grafana/loki:latest
    command: ["-config.file=/etc/loki/local-config.yaml"]
    ports:
      - "3100:3100"

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./observability/alerting/routes.yaml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
```
Related Documents
| Document | Relationship |
|---|---|
| Architecture Overview | Parent architecture; Gateway dependencies and performance targets formalized here |
| Event Sourcing | DLQ, oscillation detection, and Kafka consumer patterns instrumented by observability metrics |
| Digital Twin Evolution | Technology stack extended with observability layer; phased delivery aligned |
| Repository Structure | observability/ directory added to repository layout |
| AI Memory & Knowledge | pgvector search metrics and knowledge pipeline monitoring |
| Constraint Engine | Constraint evaluation metrics and alerting on failures |
| Orchestrator Technical | OPA audit logging integrated with centralized observability |
Document Version: v1.0 · Last Updated: 2026-02-28 · Status: Technical Architecture Document
| ← AI Memory & Knowledge | Architecture Home → |