System Observability — Logging, Metrics & Traces
Three-pillar observability framework unifying scattered monitoring decisions into a cohesive OpenTelemetry-based architecture
Table of Contents
- Problem Statement
- Architecture Overview
- Technology Stack
- Instrumentation Strategy
- Metrics Catalog
- Distributed Tracing Architecture
- SLO/SLI Framework
- Alerting Architecture
- Grafana Dashboards
- Repository Structure
- Phased Delivery
- Migration from Existing Patterns
- Configuration
- Related Documents
Problem Statement
MetaForge has observability decisions scattered across multiple documents but no unified strategy. The table below maps each existing piece to its current location and identifies gaps:
| Existing Piece | Current Location | What It Provides | Gap |
|---|---|---|---|
| `structlog` | Architecture index — Gateway dependencies | Structured JSON logging | No trace correlation (`trace_id`/`span_id`), no centralized log aggregation |
| `.forge/traces/` JSONL | Architecture index — Workspace structure | Per-session execution traces | Local-only, no distributed tracing, no span hierarchy |
| `GET /health` | Architecture index — API endpoints | Basic liveness check | No readiness probe, no dependency health (Neo4j, Kafka, pgvector) |
| Performance targets | Architecture index — Section 9.1 | Response time and resource limits | No SLO/SLI framework, no error budgets, no alerting |
| Dead letter queue | Event Sourcing — Multi-agent coordination | Failed event retry and investigation | No DLQ depth metrics, no consumer lag monitoring |
| Oscillation detection | Event Sourcing — Propagation control | Detects agent feedback loops | No oscillation metrics or alerting rules |
| OPA audit logging | Orchestrator Technical — Governance | Policy decision audit trail | Not integrated with centralized logging or metrics |
| `ComponentHealth` types | Dashboard types | Frontend health display models | No backend metrics feeding these types |
| WebSocket events | Architecture index — Gateway service | Real-time UI updates | No connection metrics, no message rate tracking |
What is missing:
- No OpenTelemetry instrumentation (the industry standard for vendor-neutral telemetry)
- No metrics collection (Prometheus) or operational dashboarding (Grafana)
- No distributed tracing across agent → skill → tool → Neo4j call chains
- No log correlation (`trace_id`/`span_id` propagation across services)
- No alerting or on-call architecture
- No SLO/SLI/error budget framework
- No agent-specific metrics (LLM token usage, cost, latency per agent/skill)
- No Kafka consumer lag, Neo4j query, or pgvector search monitoring
Architecture Overview
Signal Flow
All telemetry signals (logs, metrics, traces) flow through a unified OpenTelemetry pipeline:
```mermaid
flowchart TD
    subgraph SOURCES["Signal Sources"]
        direction LR
        GW["Gateway<br/>FastAPI"]
        AGT["Domain Agents<br/>Python"]
        SKL["Skills<br/>Python"]
        MCP_S["MCP Layer<br/>Tool Adapters"]
        KFK["Kafka<br/>Consumers/Producers"]
        NEO["Neo4j<br/>Graph Queries"]
        PGV["pgvector<br/>Semantic Search"]
    end
    subgraph OTEL_SDK["OpenTelemetry SDK (in-process)"]
        direction LR
        TRACES_SDK["TracerProvider"]
        METRICS_SDK["MeterProvider"]
        LOGS_SDK["LoggerProvider<br/>structlog bridge"]
    end
    subgraph COLLECTOR["OpenTelemetry Collector"]
        direction LR
        RECV["OTLP Receiver"]
        PROC["Processors<br/>batch, filter, attributes"]
        EXP["Exporters"]
    end
    subgraph BACKENDS["Storage Backends"]
        direction LR
        PROM["Prometheus<br/>Metrics"]
        TEMPO["Grafana Tempo<br/>Traces"]
        LOKI["Grafana Loki<br/>Logs"]
    end
    SOURCES --> OTEL_SDK
    OTEL_SDK -->|"OTLP/gRPC"| COLLECTOR
    COLLECTOR --> BACKENDS
    BACKENDS --> GRAFANA["Grafana<br/>Dashboards + Alerts"]
    GRAFANA --> AM["Alertmanager<br/>Notification Routing"]
    style SOURCES fill:#3498db,color:#fff
    style OTEL_SDK fill:#9b59b6,color:#fff
    style COLLECTOR fill:#E67E22,color:#fff
    style BACKENDS fill:#2C3E50,color:#fff
    style GRAFANA fill:#27ae60,color:#fff
```
Distributed Trace: Full Agent Execution
A single user request generates a trace that spans the entire execution chain:
```mermaid
sequenceDiagram
    participant CLI as CLI / Dashboard
    participant GW as Gateway (FastAPI)
    participant ORCH as Orchestrator
    participant AGT as Domain Agent
    participant SKL as Skill
    participant MCP as MCP Bridge
    participant TOOL as External Tool
    participant NEO as Neo4j
    participant KFK as Kafka
    CLI->>GW: POST /api/v1/agent/run [trace_id generated]
    activate GW
    GW->>ORCH: create_session [child span]
    activate ORCH
    ORCH->>AGT: dispatch_task [child span]
    activate AGT
    AGT->>SKL: skill.execute [child span]
    activate SKL
    SKL->>MCP: mcp.tool_call [child span]
    activate MCP
    MCP->>TOOL: execute [child span]
    TOOL-->>MCP: result
    deactivate MCP
    SKL-->>AGT: skill output
    deactivate SKL
    AGT->>NEO: graph query [child span]
    activate NEO
    NEO-->>AGT: query result
    deactivate NEO
    AGT->>KFK: publish event [child span, trace context in headers]
    AGT-->>ORCH: agent result
    deactivate AGT
    ORCH-->>GW: session result
    deactivate ORCH
    GW-->>CLI: response [trace_id in header]
    deactivate GW
```
Technology Stack
| Component | Technology | Rationale |
|---|---|---|
| Instrumentation | `opentelemetry-sdk` (Python) + `@opentelemetry/sdk-node` (TS) | Industry standard, vendor-neutral; auto-instrumentation for FastAPI, httpx, kafka-python |
| Log backend | Grafana Loki | Label-indexed, lightweight, Grafana-native; no full-text indexing overhead |
| Metrics backend | Prometheus | Pull-based, PromQL for SLO queries; distinct from InfluxDB (reserved for device telemetry L3) |
| Trace backend | Grafana Tempo | Object storage backend reuses MinIO; lighter than Jaeger + Elasticsearch |
| Dashboarding | Grafana | Unified UI for all three pillars; alerting built-in |
| Alerting | Prometheus Alertmanager | Rule-based alerting on metrics, traces, and logs |
| Collector | OpenTelemetry Collector | Vendor-neutral pipeline; receives OTLP, exports to all backends |
| Structured logging | `structlog` (existing) + OTel logging bridge | Zero migration from existing logging; adds `trace_id`/`span_id` to every log line |
Why Prometheus instead of InfluxDB for metrics? InfluxDB is reserved for device telemetry (L3 — high-frequency sensor data). Prometheus serves operational metrics with pull-based scraping, native PromQL for SLO calculations, and Grafana integration. Keeping them separate avoids mixing operational concerns with product telemetry.
Why Tempo instead of Jaeger? Tempo uses object storage (MinIO, already in the stack) instead of Elasticsearch, reducing operational complexity. It integrates natively with Grafana for trace visualization and supports TraceQL for querying.
Instrumentation Strategy
Layer 1: Auto-Instrumentation (Zero Code Changes)
OpenTelemetry provides automatic instrumentation for common libraries. These require only SDK initialization — no code changes in application logic:
| Library | Auto-Instrumentation Package | What It Captures |
|---|---|---|
| FastAPI | `opentelemetry-instrumentation-fastapi` | HTTP request spans, status codes, latency |
| httpx | `opentelemetry-instrumentation-httpx` | Outbound HTTP calls (LLM API calls, supplier APIs) |
| kafka-python | `opentelemetry-instrumentation-kafka-python` | Producer/consumer spans, topic metadata |
| psycopg2 | `opentelemetry-instrumentation-psycopg2` | PostgreSQL/pgvector query spans |
| requests | `opentelemetry-instrumentation-requests` | Outbound HTTP (fallback for non-httpx clients) |
Layer 2: Manual Spans (Domain-Specific Instrumentation)
Custom spans for MetaForge-specific operations that auto-instrumentation cannot capture:
```python
from opentelemetry import trace

tracer = trace.get_tracer("metaforge.agents")

async def execute_agent(agent_code: str, session_id: str, input_data: dict):
    with tracer.start_as_current_span(
        "agent.execute",
        attributes={
            "agent.code": agent_code,
            "session.id": session_id,
            "agent.input_size": len(str(input_data)),
        },
    ) as span:
        result = await agent.run(input_data)
        span.set_attribute("agent.output_artifacts", len(result.artifacts))
        span.set_attribute("agent.llm_tokens_total", result.token_usage.total)
        span.set_attribute("agent.llm_cost_usd", result.cost_usd)
        return result
```
Manual span catalog:
| Span Name | Attributes | Description |
|---|---|---|
| `agent.execute` | `agent.code`, `session.id`, `agent.llm_tokens_total`, `agent.llm_cost_usd` | Full agent execution lifecycle |
| `skill.execute` | `skill.name`, `skill.domain`, `skill.input_size` | Individual skill invocation |
| `mcp.tool_call` | `tool.name`, `tool.adapter`, `tool.timeout_ms` | MCP protocol tool execution |
| `neo4j.query` | `db.statement`, `db.operation`, `neo4j.result_count` | Graph database query |
| `pgvector.search` | `pgvector.query_embedding_dim`, `pgvector.top_k`, `pgvector.similarity_threshold` | Semantic similarity search |
| `kafka.produce` | `messaging.destination`, `messaging.message_id` | Event publication |
| `kafka.consume` | `messaging.destination`, `messaging.consumer_group` | Event consumption |
| `llm.completion` | `llm.provider`, `llm.model`, `llm.tokens.prompt`, `llm.tokens.completion`, `llm.cost_usd` | LLM API call |
| `constraint.evaluate` | `constraint.rule_id`, `constraint.domain`, `constraint.result` | Constraint engine evaluation |
| `opa.decision` | `opa.policy`, `opa.result`, `opa.input_size` | OPA policy evaluation |
Layer 3: Log Correlation
A structlog processor injects OpenTelemetry trace context into every log line, enabling log-to-trace correlation in Grafana:
```python
import structlog
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    """structlog processor that injects trace_id and span_id."""
    span = trace.get_current_span()
    if span.is_recording():
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
        event_dict["trace_flags"] = ctx.trace_flags
    return event_dict

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        add_trace_context,  # Inject OTel context
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)
```
Result: Every log line includes trace_id and span_id, enabling Grafana to jump from a log entry directly to its parent trace.
Metrics Catalog
Approximately 30 metrics organized by subsystem. All metrics use Prometheus naming conventions (snake_case, unit suffix).
Gateway Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `metaforge_gateway_request_total` | Counter | `method`, `endpoint`, `status_code` | Total HTTP requests |
| `metaforge_gateway_request_duration_seconds` | Histogram | `method`, `endpoint` | Request latency distribution |
| `metaforge_gateway_websocket_connections` | Gauge | `state` | Active WebSocket connections |
| `metaforge_gateway_active_sessions` | Gauge | `status` | Sessions by status (running, pending_approval, etc.) |
Agent Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `metaforge_agent_execution_duration_seconds` | Histogram | `agent_code`, `status` | Agent execution time |
| `metaforge_agent_execution_total` | Counter | `agent_code`, `status` | Total agent executions (success/failure) |
| `metaforge_agent_llm_tokens_total` | Counter | `agent_code`, `llm_provider`, `llm_model`, `token_type` | LLM tokens consumed (prompt/completion) |
| `metaforge_agent_llm_cost_usd_total` | Counter | `agent_code`, `llm_provider`, `llm_model` | Cumulative LLM API cost |
| `metaforge_agent_llm_request_duration_seconds` | Histogram | `agent_code`, `llm_provider`, `llm_model` | LLM API call latency |
| `metaforge_skill_execution_duration_seconds` | Histogram | `skill_name`, `domain` | Skill execution time |
| `metaforge_skill_execution_total` | Counter | `skill_name`, `domain`, `status` | Total skill executions |
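The LLM cost counter is incremented per call from the token counts already captured on the `llm.completion` span. A sketch of that accounting, assuming the usual per-1K-token pricing model (the prices below are placeholders, not any provider's actual rates):

```python
def llm_cost_usd(prompt_tokens: int, completion_tokens: int,
                 prompt_usd_per_1k: float, completion_usd_per_1k: float) -> float:
    """Cost of one LLM call: tokens scaled by per-1K-token prices for each token type."""
    return (prompt_tokens / 1000.0) * prompt_usd_per_1k \
         + (completion_tokens / 1000.0) * completion_usd_per_1k
```

Each agent execution would add this value to `metaforge_agent_llm_cost_usd_total` with the matching `agent_code`/`llm_provider`/`llm_model` labels.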
Tool Adapter Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `metaforge_mcp_call_duration_seconds` | Histogram | `tool_name`, `adapter` | MCP tool call latency |
| `metaforge_mcp_call_total` | Counter | `tool_name`, `adapter`, `status` | Total MCP tool calls |
| `metaforge_mcp_call_errors_total` | Counter | `tool_name`, `adapter`, `error_type` | MCP tool call failures |
Data Store Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `metaforge_neo4j_query_duration_seconds` | Histogram | `operation`, `node_type` | Neo4j query latency |
| `metaforge_neo4j_active_connections` | Gauge | — | Neo4j connection pool usage |
| `metaforge_neo4j_query_total` | Counter | `operation`, `status` | Total Neo4j queries |
| `metaforge_pgvector_search_duration_seconds` | Histogram | `knowledge_type` | pgvector similarity search latency |
| `metaforge_pgvector_search_total` | Counter | `knowledge_type`, `status` | Total pgvector searches |
| `metaforge_minio_operation_duration_seconds` | Histogram | `operation` | MinIO read/write latency |
| `metaforge_minio_operation_total` | Counter | `operation`, `status` | Total MinIO operations |
Kafka Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `metaforge_kafka_consumer_lag` | Gauge | `consumer_group`, `topic`, `partition` | Consumer lag (messages behind) |
| `metaforge_kafka_messages_produced_total` | Counter | `topic` | Messages published |
| `metaforge_kafka_messages_consumed_total` | Counter | `topic`, `consumer_group` | Messages consumed |
| `metaforge_kafka_dead_letters_total` | Counter | `topic`, `consumer_group` | Messages sent to dead letter queue |
| `metaforge_kafka_rebalance_total` | Counter | `consumer_group` | Consumer group rebalances |
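The lag gauge is derived from two offsets per partition: the broker's log-end offset and the group's committed offset. A minimal sketch of that derivation (the offset maps here are hypothetical inputs, e.g. fetched via the consumer client's end-offset and committed-offset APIs):

```python
def consumer_lag(end_offsets: dict, committed: dict) -> dict:
    """Messages each (topic, partition) is behind: log-end offset minus committed offset.

    A partition with no committed offset is treated as lagging by its full log.
    """
    return {tp: end_offsets[tp] - committed.get(tp, 0) for tp in end_offsets}
```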
Constraint & Policy Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| `metaforge_constraint_evaluation_total` | Counter | `domain`, `result` | Constraint evaluations (pass/fail/error) |
| `metaforge_constraint_evaluation_duration_seconds` | Histogram | `domain` | Constraint evaluation latency |
| `metaforge_opa_decision_total` | Counter | `policy`, `result` | OPA policy decisions |
| `metaforge_oscillation_detected_total` | Counter | `node_type` | Oscillation detection events |
Distributed Tracing Architecture
Trace Waterfall Example
A typical agent execution trace spans multiple services and produces a waterfall like this:
```mermaid
gantt
    title Trace Waterfall: Agent Execution
    dateFormat X
    axisFormat %s
    section Gateway
    HTTP POST /api/v1/agent/run :gw, 0, 30000
    section Orchestrator
    create_session :orch, 100, 500
    dispatch_task :orch2, 600, 29000
    section Agent
    agent.execute (EE) :agt, 700, 28000
    llm.completion :llm1, 800, 5000
    skill.execute (run_erc) :skl, 6000, 8000
    mcp.tool_call (kicad) :mcp, 6100, 7500
    llm.completion :llm2, 15000, 5000
    neo4j.query (write artifacts) :neo, 21000, 500
    kafka.produce (graph.mutations) :kfk, 22000, 100
```
W3C TraceContext Propagation
MetaForge propagates trace context using the W3C TraceContext standard across all communication boundaries:
| Boundary | Propagation Mechanism | Format |
|---|---|---|
| HTTP (Gateway ↔ CLI/Dashboard) | `traceparent` + `tracestate` headers | W3C TraceContext |
| Kafka messages | Message headers: `traceparent` | W3C TraceContext |
| WebSocket | Initial handshake headers + message metadata field | W3C TraceContext |
| MCP tool calls | Tool call context metadata | W3C TraceContext |
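Per the W3C spec, a `traceparent` value is four dash-separated lowercase-hex fields: version, 128-bit trace ID, 64-bit parent span ID, and trace flags. A minimal builder (the function name is illustrative, not part of any MetaForge module):

```python
def make_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Format a W3C TraceContext traceparent header: version-traceid-spanid-flags."""
    flags = "01" if sampled else "00"  # bit 0 of trace-flags = sampled
    return f"00-{trace_id:032x}-{span_id:016x}-{flags}"
```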
Kafka header propagation:
```python
from opentelemetry.propagate import inject

def produce_with_context(producer, topic: str, value: bytes):
    """Inject W3C trace context (traceparent/tracestate) into Kafka message headers."""
    carrier: dict = {}
    inject(carrier)  # default setter writes header names/values into the dict
    headers = [(key, val.encode("utf-8")) for key, val in carrier.items()]
    producer.send(topic, value=value, headers=headers)
```
Backward-Compatible .forge/traces/ Enrichment
Existing .forge/traces/ JSONL files are enriched with trace context fields. This is additive — existing consumers that ignore these fields continue to work:
```json
{
  "timestamp": "2026-02-28T10:30:00Z",
  "session_id": "abc-123",
  "agent": "electronics",
  "action": "run_erc",
  "level": "info",
  "data": {"components_checked": 42},
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "parent_span_id": "a1b2c3d4e5f60718"
}
```
SLO/SLI Framework
This section formalizes the existing performance targets (Architecture index — Section 9.1) into proper SLOs with error budgets.
Service Level Indicators (SLIs) and Objectives (SLOs)
| Service | SLI | SLO | Error Budget (30d) | Measurement |
|---|---|---|---|---|
| Gateway | Availability (non-5xx responses / total) | 99.9% | 43.2 min downtime | metaforge_gateway_request_total |
| Gateway | Latency p99 (non-agent requests) | < 100ms | 0.1% of requests may exceed | metaforge_gateway_request_duration_seconds |
| Agent execution | Success rate | > 95% | 5% of executions may fail | metaforge_agent_execution_total |
| Agent execution | Latency p95 | < 30s | 5% of executions may exceed | metaforge_agent_execution_duration_seconds |
| Neo4j reads | Latency p99 | < 50ms | 1% of queries may exceed | metaforge_neo4j_query_duration_seconds |
| Neo4j writes | Latency p99 | < 200ms | 1% of queries may exceed | metaforge_neo4j_query_duration_seconds |
| pgvector search | Latency p99 | < 100ms | 1% of searches may exceed | metaforge_pgvector_search_duration_seconds |
| Kafka consumer lag | Max lag per partition | < 1000 messages | Sustained violation triggers alert | metaforge_kafka_consumer_lag |
| Kafka DLQ | DLQ rate | < 0.1% of consumed messages | 0.1% may go to DLQ | metaforge_kafka_dead_letters_total |
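Each error-budget figure above follows directly from the SLO target: budget = (1 − SLO) × window. For 99.9% availability over 30 days that is 0.001 × 30 × 24 × 60 = 43.2 minutes. As an illustrative helper:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed 'bad' minutes in the window for an availability SLO (e.g. 0.999)."""
    return (1.0 - slo) * window_days * 24 * 60
```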
PromQL Examples for SLO Calculation
Gateway availability (30-day window):
```promql
# Gateway availability SLO: 99.9%
1 - (
  sum(rate(metaforge_gateway_request_total{status_code=~"5.."}[30d]))
  /
  sum(rate(metaforge_gateway_request_total[30d]))
)
```
Agent success rate:
```promql
# Agent success rate SLO: > 95%
sum(rate(metaforge_agent_execution_total{status="success"}[30d]))
/
sum(rate(metaforge_agent_execution_total[30d]))
```
Error budget burn rate (alerting on fast burn):
```promql
# Alert if burning error budget 14.4x faster than sustainable (2% budget consumed in 1h)
(
  sum(rate(metaforge_gateway_request_total{status_code=~"5.."}[1h]))
  /
  sum(rate(metaforge_gateway_request_total[1h]))
) > (14.4 * 0.001)
```
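The 14.4 multiplier is the standard fast-burn threshold: at 14.4× the sustainable error rate, 2% of a 30-day budget is consumed per hour, so the budget would be exhausted in 720 / 14.4 = 50 hours. The arithmetic, as a sketch (function names are illustrative):

```python
def burn_rate(observed_error_ratio: float, slo: float) -> float:
    """How many times faster than sustainable the error budget is being spent."""
    return observed_error_ratio / (1.0 - slo)

def hours_to_exhaustion(burn: float, window_days: int = 30) -> float:
    """Hours until the budget runs out at a constant burn rate."""
    return window_days * 24 / burn
```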
Neo4j read latency p99:
```promql
# Neo4j read latency p99 < 50ms
histogram_quantile(0.99,
  sum(rate(metaforge_neo4j_query_duration_seconds_bucket{operation="read"}[5m])) by (le)
)
```
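`histogram_quantile` assumes observations are spread uniformly within each bucket and interpolates linearly to the target rank. A simplified model of that calculation (ignoring the `+Inf` bucket and edge cases handled by the real implementation):

```python
def histogram_quantile(q: float, buckets) -> float:
    """buckets: list of (upper_bound, cumulative_count), sorted by bound ascending."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            # linear interpolation within the bucket that contains the target rank
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]
```

This is why bucket boundaries matter: a p99 target of 50ms needs a bucket edge near 50ms, or the interpolated estimate is coarse.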
Alerting Architecture
Severity Levels
| Severity | Response Time | Notification Channel | Description |
|---|---|---|---|
| Critical | < 15 min | PagerDuty / Slack `#alerts-critical` | Service down, data loss risk, SLO breach imminent |
| Warning | < 1 hour | Slack `#alerts-warning` | Degraded performance, approaching thresholds |
| Info | Next business day | Slack `#alerts-info` | Capacity planning, trend notifications |
Alerting Rules
```yaml
# observability/alerting/rules.yaml
groups:
  - name: metaforge-critical
    rules:
      - alert: GatewayDown
        expr: up{job="metaforge-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MetaForge Gateway is down"
          description: "Gateway has been unreachable for more than 1 minute."

      - alert: KafkaConsumerStopped
        expr: |
          rate(metaforge_kafka_messages_consumed_total[5m]) == 0
          and on(consumer_group) metaforge_kafka_consumer_lag > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka consumer has stopped"
          description: "Consumer has non-zero lag but zero consumption rate for 5 minutes."

      - alert: Neo4jUnreachable
        expr: metaforge_neo4j_active_connections == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Neo4j connection pool is empty"
          description: "No active Neo4j connections — all graph operations will fail."

  - name: metaforge-warning
    rules:
      - alert: HighAgentFailureRate
        expr: |
          sum(rate(metaforge_agent_execution_total{status="failure"}[15m]))
          /
          sum(rate(metaforge_agent_execution_total[15m]))
          > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agent failure rate exceeds 10%"
          description: "{{ $value | humanizePercentage }} of agent executions are failing."

      - alert: KafkaConsumerLagHigh
        expr: metaforge_kafka_consumer_lag > 5000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer lag is high"
          description: "Consumer group {{ $labels.consumer_group }} on {{ $labels.topic }} has lag of {{ $value }} messages."

      - alert: ErrorBudgetBurnRate
        expr: |
          (
            sum(rate(metaforge_gateway_request_total{status_code=~"5.."}[1h]))
            /
            sum(rate(metaforge_gateway_request_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway error budget burning too fast"
          description: "Gateway is consuming its 30-day error budget at >=14.4x the sustainable rate (>=2% of budget per hour)."

      - alert: LLMCostSpike
        expr: |
          sum(rate(metaforge_agent_llm_cost_usd_total[1h])) * 3600
          > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "LLM cost exceeding $10/hour"
          description: "Current hourly LLM spend: ${{ $value }}."

      - alert: OscillationDetected
        expr: rate(metaforge_oscillation_detected_total[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Agent oscillation detected"
          description: "Oscillation detected on {{ $labels.node_type }} nodes."
```
Notification Routing
```yaml
# observability/alerting/routes.yaml
route:
  receiver: default-slack
  group_by: [alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: slack-warning
      repeat_interval: 1h
    - match:
        severity: info
      receiver: slack-info
      repeat_interval: 24h

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_ROUTING_KEY}"
  - name: slack-warning
    slack_configs:
      - channel: "#alerts-warning"
        api_url: "${SLACK_WEBHOOK_URL}"
  - name: slack-info
    slack_configs:
      - channel: "#alerts-info"
        api_url: "${SLACK_WEBHOOK_URL}"
  - name: default-slack
    slack_configs:
      - channel: "#alerts-general"
        api_url: "${SLACK_WEBHOOK_URL}"
```
Grafana Dashboards
Four dashboards provide complete operational visibility:
1. System Overview Dashboard
Purpose: High-level health at a glance for all MetaForge components.
| Panel | Visualization | Data Source | Query |
|---|---|---|---|
| Gateway Status | Stat (up/down) | Prometheus | up{job="metaforge-gateway"} |
| Request Rate | Time series | Prometheus | rate(metaforge_gateway_request_total[5m]) |
| Error Rate | Time series | Prometheus | rate(metaforge_gateway_request_total{status_code=~"5.."}[5m]) |
| Active Sessions | Stat | Prometheus | metaforge_gateway_active_sessions |
| WebSocket Connections | Gauge | Prometheus | metaforge_gateway_websocket_connections |
| Kafka Consumer Lag | Time series | Prometheus | metaforge_kafka_consumer_lag |
| Neo4j Connection Pool | Gauge | Prometheus | metaforge_neo4j_active_connections |
| Recent Alerts | Table | Alertmanager | Active alerts by severity |
2. Agent Performance Dashboard
Purpose: Per-agent execution metrics, LLM costs, and skill breakdowns.
| Panel | Visualization | Data Source | Query |
|---|---|---|---|
| Agent Success Rate | Stat per agent | Prometheus | Success/total by agent_code |
| Execution Duration (p50/p95/p99) | Heatmap | Prometheus | metaforge_agent_execution_duration_seconds |
| LLM Token Usage | Stacked bar | Prometheus | metaforge_agent_llm_tokens_total by model |
| LLM Cost Over Time | Time series | Prometheus | rate(metaforge_agent_llm_cost_usd_total[1h]) |
| Skill Execution Breakdown | Table | Prometheus | metaforge_skill_execution_duration_seconds by skill |
| MCP Tool Call Latency | Heatmap | Prometheus | metaforge_mcp_call_duration_seconds by tool |
| Agent Trace Explorer | Trace panel | Tempo | Traces filtered by agent.code |
3. Data Stores Dashboard
Purpose: Neo4j, pgvector, Kafka, and MinIO operational health.
| Panel | Visualization | Data Source | Query |
|---|---|---|---|
| Neo4j Query Latency | Histogram | Prometheus | metaforge_neo4j_query_duration_seconds |
| Neo4j Operations | Time series | Prometheus | rate(metaforge_neo4j_query_total[5m]) by operation |
| pgvector Search Latency | Histogram | Prometheus | metaforge_pgvector_search_duration_seconds |
| Kafka Consumer Lag by Topic | Time series | Prometheus | metaforge_kafka_consumer_lag by topic |
| Kafka Throughput | Time series | Prometheus | rate(metaforge_kafka_messages_produced_total[5m]) |
| Dead Letter Queue Depth | Stat | Prometheus | metaforge_kafka_dead_letters_total |
| MinIO Operations | Time series | Prometheus | rate(metaforge_minio_operation_total[5m]) |
4. SLO Overview Dashboard
Purpose: Error budget tracking and SLO compliance for all defined objectives.
| Panel | Visualization | Data Source | Query |
|---|---|---|---|
| Gateway Availability (30d) | Stat with threshold | Prometheus | Availability PromQL (see above) |
| Gateway Latency p99 | Stat with threshold | Prometheus | histogram_quantile(0.99, ...) |
| Agent Success Rate (30d) | Stat with threshold | Prometheus | Agent success PromQL |
| Error Budget Remaining | Gauge per SLO | Prometheus | Budget consumed vs total |
| Error Budget Burn Rate | Time series | Prometheus | Burn rate over sliding windows |
| SLO Compliance History | Table | Prometheus | Weekly SLO compliance |
Repository Structure
The observability/ directory contains all observability infrastructure code:
```
observability/
├── bootstrap.py                 # OTel SDK initialization (TracerProvider, MeterProvider, LoggerProvider)
├── logging.py                   # structlog → OTel logging bridge (trace_id/span_id injection)
├── tracing.py                   # Custom span helpers for agent/skill/tool instrumentation
├── metrics.py                   # Prometheus metric definitions (all ~30 metrics)
├── middleware.py                # FastAPI middleware for request tracing and metrics
├── config.py                    # ObservabilityConfig Pydantic model
├── slo/
│   ├── definitions.py           # SLO/SLI definitions as code
│   └── calculator.py            # Error budget calculation utilities
├── alerting/
│   ├── rules.yaml               # Prometheus alerting rules
│   └── routes.yaml              # Alertmanager notification routing
└── dashboards/
    ├── system-overview.json     # Grafana dashboard: system health
    ├── agent-performance.json   # Grafana dashboard: agent metrics
    ├── data-stores.json         # Grafana dashboard: Neo4j/Kafka/pgvector
    └── slo-overview.json        # Grafana dashboard: SLO tracking
```
Phased Delivery
Observability delivery aligns with the existing phased timeline:
| Phase | Deliverable | Timeline | Details |
|---|---|---|---|
| P1 | `structlog` with `trace_id`, basic Prometheus metrics, `/health` enhancement, system Grafana dashboard | Months 1-6 | Add `trace_id` to structlog output; instrument Gateway request count/latency; enhance `/health` with dependency checks (Neo4j, Kafka); create system overview dashboard |
| P1.5 | OTel SDK, distributed tracing (Gateway → Agent → Neo4j), agent LLM metrics, Kafka consumer lag | Months 5-6 | Initialize opentelemetry-sdk; add manual spans for agent.execute, llm.completion, neo4j.query; add LLM token/cost counters; monitor Kafka consumer lag |
| P2 | Full three-pillar stack (Loki, Tempo, Prometheus, Grafana), SLO dashboards, alerting rules, MCP + pgvector metrics | Months 7-12 | Deploy OTel Collector, Loki, Tempo; add all alerting rules; instrument MCP tool calls and pgvector searches; create all four Grafana dashboards; define SLO error budgets |
| P3 | Device telemetry observability, fleet health dashboards, anomaly alerts, incident runbooks | Months 13-18 | Add MQTT/telemetry pipeline metrics; fleet health Grafana dashboard; anomaly detection alerts; on-call runbooks for each critical alert |
| P4 | Simulation performance tracking, multi-tenant observability, cost attribution, enterprise audit export | Months 21-24+ | Per-tenant metric isolation; simulation run performance tracking; LLM cost attribution per project/team; audit log export for compliance |
Phase Dependencies
```mermaid
flowchart LR
    P1["P1: structlog + basic metrics<br/>Months 1-6"] --> P15["P1.5: OTel SDK + tracing<br/>Months 5-6"]
    P15 --> P2["P2: Full three-pillar stack<br/>Months 7-12"]
    P2 --> P3["P3: Device telemetry observability<br/>Months 13-18"]
    P3 --> P4["P4: Multi-tenant + audit<br/>Months 21-24+"]
    style P1 fill:#27ae60,color:#fff
    style P15 fill:#2ecc71,color:#fff
    style P2 fill:#3498db,color:#fff
    style P3 fill:#9b59b6,color:#fff
    style P4 fill:#E67E22,color:#fff
```
Migration from Existing Patterns
All changes are additive — no existing functionality is removed or broken.
| Existing Piece | Migration Path | Breaking Changes |
|---|---|---|
| `structlog` (Gateway logging) | Add `add_trace_context` processor to existing `structlog.configure()` call | None — existing log fields preserved; `trace_id`/`span_id` added as new fields |
| `.forge/traces/` JSONL | Enrich existing JSONL entries with `trace_id`, `span_id`, `parent_span_id` fields | None — new fields are additive; existing consumers ignore unknown fields |
| `GET /health` endpoint | Extend response to include dependency health checks (Neo4j, Kafka, pgvector connectivity) | None — existing 200 OK response preserved; new fields added to response body |
| Performance targets (index.md §9.1) | Formalized into SLO/SLI definitions with PromQL queries and error budgets | None — targets unchanged; SLO framework adds measurement and alerting |
| Dead letter queue (event-sourcing) | Add `metaforge_kafka_dead_letters_total` counter; DLQ depth visible in Grafana | None — existing DLQ behavior unchanged; metrics are new instrumentation |
| Oscillation detection (event-sourcing) | Add `metaforge_oscillation_detected_total` counter; alerting rule triggers on detection | None — existing detection logic unchanged; metric emission added |
| OPA audit logging (orchestrator-technical) | Route OPA decision logs through structlog (with trace context) to Loki | None — existing OPA audit trail preserved; logs gain correlation IDs |
| `ComponentHealth` types (dashboard) | Backend `/health` endpoint populates these types with real metrics from Prometheus | None — existing types unchanged; backed by real data instead of placeholders |
| WebSocket events (gateway) | Add connection/message metrics; propagate trace context in WebSocket handshake | None — existing WebSocket protocol unchanged; metrics are new |
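The extended `/health` response could aggregate per-dependency probes as follows. This is a minimal sketch: the probe callables, timeout, and response field names are assumptions for illustration, not the existing endpoint's contract.

```python
import asyncio

async def check_dependency(name, probe, timeout_s: float = 1.0):
    """Run one dependency probe (e.g. a ping query) with a timeout; never raise."""
    try:
        await asyncio.wait_for(probe(), timeout_s)
        return name, "healthy"
    except Exception:
        return name, "unhealthy"

async def health(probes):
    """Aggregate probes for Neo4j, Kafka, pgvector, etc. into one health response."""
    checks = await asyncio.gather(*(check_dependency(n, p) for n, p in probes.items()))
    deps = dict(checks)
    status = "ok" if all(v == "healthy" for v in deps.values()) else "degraded"
    return {"status": status, "dependencies": deps}
```

Returning 200 with a `degraded` status (rather than a 5xx) preserves the existing liveness contract while exposing readiness detail.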
Configuration
ObservabilityConfig
```python
from pydantic import BaseModel
from typing import Optional

class OTelExporterConfig(BaseModel):
    endpoint: str = "http://localhost:4317"  # OTel Collector OTLP/gRPC
    insecure: bool = True                    # TLS disabled for local dev
    timeout_ms: int = 5000

class PrometheusConfig(BaseModel):
    port: int = 9090
    scrape_interval: str = "15s"

class GrafanaConfig(BaseModel):
    url: str = "http://localhost:3000"

class ObservabilityConfig(BaseModel):
    """Central configuration for all observability components."""
    enabled: bool = True
    service_name: str = "metaforge"
    environment: str = "development"  # development | staging | production

    # Exporters
    otlp: OTelExporterConfig = OTelExporterConfig()
    prometheus: PrometheusConfig = PrometheusConfig()
    grafana: GrafanaConfig = GrafanaConfig()

    # Sampling
    trace_sample_rate: float = 1.0  # 1.0 = sample everything (dev); 0.1 = 10% (prod)
    log_level: str = "INFO"

    # Feature flags
    enable_traces: bool = True
    enable_metrics: bool = True
    enable_logs: bool = True

    # Backends (for docker-compose generation)
    loki_url: Optional[str] = "http://localhost:3100"
    tempo_url: Optional[str] = "http://localhost:3200"
```
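A `trace_sample_rate` like this is typically enforced by a trace-ID-ratio sampler, which makes a deterministic keep/drop decision from the trace ID itself so every service in a trace agrees. A simplified sketch of that decision, modeled loosely on OTel's ratio-based head sampling (not the SDK's exact implementation):

```python
def should_sample(trace_id: int, sample_rate: float) -> bool:
    """Deterministic head sampling: compare the low 63 bits of the trace ID
    against a threshold derived from the configured ratio."""
    threshold = int(sample_rate * (1 << 63))
    return (trace_id & ((1 << 63) - 1)) < threshold
```

Because the decision depends only on the trace ID, a trace is either fully sampled or fully dropped across the Gateway, agents, and consumers.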
Docker Compose: Observability Stack
```yaml
# docker-compose.observability.yml
version: "3.9"

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./observability/otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"  # OTLP gRPC
      - "4318:4318"  # OTLP HTTP
      - "8889:8889"  # Prometheus metrics exporter
    depends_on:
      - tempo
      - loki

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./observability/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./observability/alerting/rules.yaml:/etc/prometheus/rules.yaml
    ports:
      - "9090:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./observability/dashboards:/var/lib/grafana/dashboards
      - ./observability/grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
      - ./observability/grafana-dashboards.yml:/etc/grafana/provisioning/dashboards/dashboards.yml
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD:-admin}"
    depends_on:
      - prometheus
      - tempo
      - loki

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./observability/tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200"  # Tempo API
      - "4317"       # OTLP gRPC (internal)

  loki:
    image: grafana/loki:latest
    command: ["-config.file=/etc/loki/local-config.yaml"]
    ports:
      - "3100:3100"

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./observability/alerting/routes.yaml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"
```
Related Documents
| Document | Relationship |
|---|---|
| Architecture Overview | Parent architecture; Gateway dependencies and performance targets formalized here |
| Event Sourcing | DLQ, oscillation detection, and Kafka consumer patterns instrumented by observability metrics |
| Digital Twin Evolution | Technology stack extended with observability layer; phased delivery aligned |
| Repository Structure | observability/ directory added to repository layout |
| AI Memory & Knowledge | pgvector search metrics and knowledge pipeline monitoring |
| Constraint Engine | Constraint evaluation metrics and alerting on failures |
| Orchestrator Technical | OPA audit logging integrated with centralized observability |
Document Version: v1.0 · Last Updated: 2026-02-28 · Status: Technical Architecture Document
| ← AI Memory & Knowledge | Architecture Home → |