System Observability — Logging, Metrics & Traces

Three-pillar observability framework unifying scattered monitoring decisions into a cohesive OpenTelemetry-based architecture

Table of Contents

  1. Problem Statement
  2. Architecture Overview
    1. Signal Flow
    2. Distributed Trace: Full Agent Execution
  3. Technology Stack
  4. Instrumentation Strategy
    1. Layer 1: Auto-Instrumentation (Zero Code Changes)
    2. Layer 2: Manual Spans (Domain-Specific Instrumentation)
    3. Layer 3: Log Correlation
  5. Metrics Catalog
    1. Gateway Metrics
    2. Agent Metrics
    3. Tool Adapter Metrics
    4. Data Store Metrics
    5. Kafka Metrics
    6. Constraint & Policy Metrics
  6. Distributed Tracing Architecture
    1. Trace Waterfall Example
    2. W3C TraceContext Propagation
    3. Backward-Compatible .forge/traces/ Enrichment
  7. SLO/SLI Framework
    1. Service Level Indicators (SLIs) and Objectives (SLOs)
    2. PromQL Examples for SLO Calculation
  8. Alerting Architecture
    1. Severity Levels
    2. Alerting Rules
    3. Notification Routing
  9. Grafana Dashboards
    1. 1. System Overview Dashboard
    2. 2. Agent Performance Dashboard
    3. 3. Data Stores Dashboard
    4. 4. SLO Overview Dashboard
  10. Repository Structure
  11. Phased Delivery
    1. Phase Dependencies
  12. Migration from Existing Patterns
  13. Configuration
    1. ObservabilityConfig
    2. Docker Compose: Observability Stack
  14. Related Documents

Problem Statement

MetaForge has observability decisions scattered across multiple documents but no unified strategy. The table below maps each existing piece to its current location and identifies gaps:

Existing Piece Current Location What It Provides Gap
structlog Architecture index — Gateway dependencies Structured JSON logging No trace correlation (trace_id/span_id), no centralized log aggregation
.forge/traces/ JSONL Architecture index — Workspace structure Per-session execution traces Local-only, no distributed tracing, no span hierarchy
GET /health Architecture index — API endpoints Basic liveness check No readiness probe, no dependency health (Neo4j, Kafka, pgvector)
Performance targets Architecture index — Section 9.1 Response time and resource limits No SLO/SLI framework, no error budgets, no alerting
Dead letter queue Event Sourcing — Multi-agent coordination Failed event retry and investigation No DLQ depth metrics, no consumer lag monitoring
Oscillation detection Event Sourcing — Propagation control Detects agent feedback loops No oscillation metrics or alerting rules
OPA audit logging Orchestrator Technical — Governance Policy decision audit trail Not integrated with centralized logging or metrics
ComponentHealth types Dashboard types Frontend health display models No backend metrics feeding these types
WebSocket events Architecture index — Gateway service Real-time UI updates No connection metrics, no message rate tracking

What is missing:

  • No OpenTelemetry instrumentation (the industry standard for vendor-neutral telemetry)
  • No metrics collection (Prometheus) or operational dashboarding (Grafana)
  • No distributed tracing across agent → skill → tool → Neo4j call chains
  • No log correlation (trace_id/span_id propagation across services)
  • No alerting or on-call architecture
  • No SLO/SLI/error budget framework
  • No agent-specific metrics (LLM token usage, cost, latency per agent/skill)
  • No Kafka consumer lag, Neo4j query, or pgvector search monitoring

Architecture Overview

Signal Flow

All telemetry signals (logs, metrics, traces) flow through a unified OpenTelemetry pipeline:

flowchart TD
    subgraph SOURCES["Signal Sources"]
        direction LR
        GW["Gateway<br/>FastAPI"]
        AGT["Domain Agents<br/>Python"]
        SKL["Skills<br/>Python"]
        MCP_S["MCP Layer<br/>Tool Adapters"]
        KFK["Kafka<br/>Consumers/Producers"]
        NEO["Neo4j<br/>Graph Queries"]
        PGV["pgvector<br/>Semantic Search"]
    end

    subgraph OTEL_SDK["OpenTelemetry SDK (in-process)"]
        direction LR
        TRACES_SDK["TracerProvider"]
        METRICS_SDK["MeterProvider"]
        LOGS_SDK["LoggerProvider<br/>structlog bridge"]
    end

    subgraph COLLECTOR["OpenTelemetry Collector"]
        direction LR
        RECV["OTLP Receiver"]
        PROC["Processors<br/>batch, filter, attributes"]
        EXP["Exporters"]
    end

    subgraph BACKENDS["Storage Backends"]
        direction LR
        PROM["Prometheus<br/>Metrics"]
        TEMPO["Grafana Tempo<br/>Traces"]
        LOKI["Grafana Loki<br/>Logs"]
    end

    SOURCES --> OTEL_SDK
    OTEL_SDK -->|"OTLP/gRPC"| COLLECTOR
    COLLECTOR --> BACKENDS
    BACKENDS --> GRAFANA["Grafana<br/>Dashboards + Alerts"]
    GRAFANA --> AM["Alertmanager<br/>Notification Routing"]

    style SOURCES fill:#3498db,color:#fff
    style OTEL_SDK fill:#9b59b6,color:#fff
    style COLLECTOR fill:#E67E22,color:#fff
    style BACKENDS fill:#2C3E50,color:#fff
    style GRAFANA fill:#27ae60,color:#fff

Distributed Trace: Full Agent Execution

A single user request generates a trace that spans the entire execution chain:

sequenceDiagram
    participant CLI as CLI / Dashboard
    participant GW as Gateway (FastAPI)
    participant ORCH as Orchestrator
    participant AGT as Domain Agent
    participant SKL as Skill
    participant MCP as MCP Bridge
    participant TOOL as External Tool
    participant NEO as Neo4j
    participant KFK as Kafka

    CLI->>GW: POST /api/v1/agent/run [trace_id generated]
    activate GW
    GW->>ORCH: create_session [child span]
    activate ORCH
    ORCH->>AGT: dispatch_task [child span]
    activate AGT
    AGT->>SKL: skill.execute [child span]
    activate SKL
    SKL->>MCP: mcp.tool_call [child span]
    activate MCP
    MCP->>TOOL: execute [child span]
    TOOL-->>MCP: result
    deactivate MCP
    SKL-->>AGT: skill output
    deactivate SKL
    AGT->>NEO: graph query [child span]
    activate NEO
    NEO-->>AGT: query result
    deactivate NEO
    AGT->>KFK: publish event [child span, trace context in headers]
    AGT-->>ORCH: agent result
    deactivate AGT
    ORCH-->>GW: session result
    deactivate ORCH
    GW-->>CLI: response [trace_id in header]
    deactivate GW

Technology Stack

Component Technology Rationale
Instrumentation opentelemetry-sdk (Python) + @opentelemetry/sdk-node (TS) Industry standard, vendor-neutral; auto-instrumentation for FastAPI, httpx, kafka-python
Log backend Grafana Loki Label-indexed, lightweight, Grafana-native; no full-text indexing overhead
Metrics backend Prometheus Pull-based, PromQL for SLO queries; distinct from InfluxDB (reserved for device telemetry L3)
Trace backend Grafana Tempo Object storage backend reuses MinIO; lighter than Jaeger + Elasticsearch
Dashboarding Grafana Unified UI for all three pillars; alerting built-in
Alerting Grafana Alertmanager Rule-based alerting on metrics, traces, and logs
Collector OpenTelemetry Collector Vendor-neutral pipeline; receives OTLP, exports to all backends
Structured logging structlog (existing) + OTel logging bridge Zero migration from existing logging; adds trace_id/span_id to every log line

Why Prometheus instead of InfluxDB for metrics? InfluxDB is reserved for device telemetry (L3 — high-frequency sensor data). Prometheus serves operational metrics with pull-based scraping, native PromQL for SLO calculations, and Grafana integration. Keeping them separate avoids mixing operational concerns with product telemetry.

Why Tempo instead of Jaeger? Tempo uses object storage (MinIO, already in the stack) instead of Elasticsearch, reducing operational complexity. It integrates natively with Grafana for trace visualization and supports TraceQL for querying.


Instrumentation Strategy

Layer 1: Auto-Instrumentation (Zero Code Changes)

OpenTelemetry provides automatic instrumentation for common libraries. These require only SDK initialization — no code changes in application logic:

Library Auto-Instrumentation Package What It Captures
FastAPI opentelemetry-instrumentation-fastapi HTTP request spans, status codes, latency
httpx opentelemetry-instrumentation-httpx Outbound HTTP calls (LLM API calls, supplier APIs)
kafka-python opentelemetry-instrumentation-kafka-python Producer/consumer spans, topic metadata
psycopg2 opentelemetry-instrumentation-psycopg2 PostgreSQL/pgvector query spans
requests opentelemetry-instrumentation-requests Outbound HTTP (fallback for non-httpx clients)

Layer 2: Manual Spans (Domain-Specific Instrumentation)

Custom spans for MetaForge-specific operations that auto-instrumentation cannot capture:

from opentelemetry import trace

tracer = trace.get_tracer("metaforge.agents")

async def execute_agent(agent_code: str, session_id: str, input_data: dict):
    agent = get_agent(agent_code)  # hypothetical lookup; agent resolution is out of scope here
    with tracer.start_as_current_span(
        "agent.execute",
        attributes={
            "agent.code": agent_code,
            "session.id": session_id,
            "agent.input_size": len(str(input_data)),
        }
    ) as span:
        result = await agent.run(input_data)
        span.set_attribute("agent.output_artifacts", len(result.artifacts))
        span.set_attribute("agent.llm_tokens_total", result.token_usage.total)
        span.set_attribute("agent.llm_cost_usd", result.cost_usd)
        return result

Manual span catalog:

Span Name Attributes Description
agent.execute agent.code, session.id, agent.llm_tokens_total, agent.llm_cost_usd Full agent execution lifecycle
skill.execute skill.name, skill.domain, skill.input_size Individual skill invocation
mcp.tool_call tool.name, tool.adapter, tool.timeout_ms MCP protocol tool execution
neo4j.query db.statement, db.operation, neo4j.result_count Graph database query
pgvector.search pgvector.query_embedding_dim, pgvector.top_k, pgvector.similarity_threshold Semantic similarity search
kafka.produce messaging.destination, messaging.message_id Event publication
kafka.consume messaging.destination, messaging.consumer_group Event consumption
llm.completion llm.provider, llm.model, llm.tokens.prompt, llm.tokens.completion, llm.cost_usd LLM API call
constraint.evaluate constraint.rule_id, constraint.domain, constraint.result Constraint engine evaluation
opa.decision opa.policy, opa.result, opa.input_size OPA policy evaluation

Layer 3: Log Correlation

A structlog processor injects OpenTelemetry trace context into every log line, enabling log-to-trace correlation in Grafana:

import structlog
from opentelemetry import trace

def add_trace_context(logger, method_name, event_dict):
    """structlog processor that injects trace_id and span_id."""
    span = trace.get_current_span()
    if span.is_recording():
        ctx = span.get_span_context()
        event_dict["trace_id"] = format(ctx.trace_id, "032x")
        event_dict["span_id"] = format(ctx.span_id, "016x")
        event_dict["trace_flags"] = ctx.trace_flags
    return event_dict

structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        add_trace_context,  # Inject OTel context
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.JSONRenderer(),
    ]
)

Result: Every log line includes trace_id and span_id, enabling Grafana to jump from a log entry directly to its parent trace.


Metrics Catalog

Approximately 30 metrics organized by subsystem. All metrics use Prometheus naming conventions (snake_case, unit suffix).

Gateway Metrics

Metric Type Labels Description
metaforge_gateway_request_total Counter method, endpoint, status_code Total HTTP requests
metaforge_gateway_request_duration_seconds Histogram method, endpoint Request latency distribution
metaforge_gateway_websocket_connections Gauge state Active WebSocket connections
metaforge_gateway_active_sessions Gauge status Sessions by status (running, pending_approval, etc.)

Agent Metrics

Metric Type Labels Description
metaforge_agent_execution_duration_seconds Histogram agent_code, status Agent execution time
metaforge_agent_execution_total Counter agent_code, status Total agent executions (success/failure)
metaforge_agent_llm_tokens_total Counter agent_code, llm_provider, llm_model, token_type LLM tokens consumed (prompt/completion)
metaforge_agent_llm_cost_usd_total Counter agent_code, llm_provider, llm_model Cumulative LLM API cost
metaforge_agent_llm_request_duration_seconds Histogram agent_code, llm_provider, llm_model LLM API call latency
metaforge_skill_execution_duration_seconds Histogram skill_name, domain Skill execution time
metaforge_skill_execution_total Counter skill_name, domain, status Total skill executions

Tool Adapter Metrics

Metric Type Labels Description
metaforge_mcp_call_duration_seconds Histogram tool_name, adapter MCP tool call latency
metaforge_mcp_call_total Counter tool_name, adapter, status Total MCP tool calls
metaforge_mcp_call_errors_total Counter tool_name, adapter, error_type MCP tool call failures

Data Store Metrics

Metric Type Labels Description
metaforge_neo4j_query_duration_seconds Histogram operation, node_type Neo4j query latency
metaforge_neo4j_active_connections Gauge (none) Neo4j connection pool usage
metaforge_neo4j_query_total Counter operation, status Total Neo4j queries
metaforge_pgvector_search_duration_seconds Histogram knowledge_type pgvector similarity search latency
metaforge_pgvector_search_total Counter knowledge_type, status Total pgvector searches
metaforge_minio_operation_duration_seconds Histogram operation MinIO read/write latency
metaforge_minio_operation_total Counter operation, status Total MinIO operations

Kafka Metrics

Metric Type Labels Description
metaforge_kafka_consumer_lag Gauge consumer_group, topic, partition Consumer lag (messages behind)
metaforge_kafka_messages_produced_total Counter topic Messages published
metaforge_kafka_messages_consumed_total Counter topic, consumer_group Messages consumed
metaforge_kafka_dead_letters_total Counter topic, consumer_group Messages sent to dead letter queue
metaforge_kafka_rebalance_total Counter consumer_group Consumer group rebalances

Constraint & Policy Metrics

Metric Type Labels Description
metaforge_constraint_evaluation_total Counter domain, result Constraint evaluations (pass/fail/error)
metaforge_constraint_evaluation_duration_seconds Histogram domain Constraint evaluation latency
metaforge_opa_decision_total Counter policy, result OPA policy decisions
metaforge_oscillation_detected_total Counter node_type Oscillation detection events

Distributed Tracing Architecture

Trace Waterfall Example

A typical agent execution trace spans multiple services and produces a waterfall like this:

gantt
    title Trace Waterfall: Agent Execution
    dateFormat X
    axisFormat %s

    section Gateway
    HTTP POST /api/v1/agent/run       :gw, 0, 30000

    section Orchestrator
    create_session                     :orch, 100, 500
    dispatch_task                      :orch2, 600, 29000

    section Agent
    agent.execute (EE)                 :agt, 700, 28000
    llm.completion                     :llm1, 800, 5000
    skill.execute (run_erc)            :skl, 6000, 8000
    mcp.tool_call (kicad)              :mcp, 6100, 7500
    llm.completion                     :llm2, 15000, 5000
    neo4j.query (write artifacts)      :neo, 21000, 500
    kafka.produce (graph.mutations)    :kfk, 22000, 100

W3C TraceContext Propagation

MetaForge propagates trace context using the W3C TraceContext standard across all communication boundaries:

Boundary Propagation Mechanism Format
HTTP (Gateway ↔ CLI/Dashboard) traceparent + tracestate headers W3C TraceContext
Kafka messages Message headers: traceparent W3C TraceContext
WebSocket Initial handshake headers + message metadata field W3C TraceContext
MCP tool calls Tool call context metadata W3C TraceContext

Kafka header propagation:

from opentelemetry.propagate import inject

def produce_with_context(producer, topic: str, value: bytes):
    """Inject W3C trace context into Kafka message headers."""
    carrier: dict = {}
    inject(carrier)  # fills carrier with traceparent/tracestate keys
    headers = [(key, val.encode("utf-8")) for key, val in carrier.items()]
    producer.send(topic, value=value, headers=headers)
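Every boundary in the propagation table carries the same W3C TraceContext wire format: a traceparent string of the form 00-&lt;trace-id&gt;-&lt;span-id&gt;-&lt;flags&gt;. A dependency-free sketch of composing and parsing it (the function names are illustrative, not part of the OTel API):

```python
import re

TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})-(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<span_id>[0-9a-f]{16})-(?P<flags>[0-9a-f]{2})$"
)

def format_traceparent(trace_id: int, span_id: int, sampled: bool = True) -> str:
    """Render W3C TraceContext version 00: 00-<trace-id>-<parent-id>-<flags>."""
    flags = 0x01 if sampled else 0x00
    return f"00-{trace_id:032x}-{span_id:016x}-{flags:02x}"

def parse_traceparent(header: str) -> dict:
    """Parse a traceparent header; raises ValueError on malformed input."""
    m = TRACEPARENT_RE.match(header)
    if m is None:
        raise ValueError(f"invalid traceparent: {header!r}")
    return {
        "trace_id": int(m["trace_id"], 16),
        "span_id": int(m["span_id"], 16),
        "sampled": bool(int(m["flags"], 16) & 0x01),
    }
```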

Backward-Compatible .forge/traces/ Enrichment

Existing .forge/traces/ JSONL files are enriched with trace context fields. This is additive — existing consumers that ignore these fields continue to work:

{
  "timestamp": "2026-02-28T10:30:00Z",
  "session_id": "abc-123",
  "agent": "electronics",
  "action": "run_erc",
  "level": "info",
  "data": {"components_checked": 42},
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "parent_span_id": "a1b2c3d4e5f60718"
}
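Because the new fields are purely additive, enrichment reduces to a copy-and-extend of each record before it is written. The helper below is a sketch of that step, not an existing MetaForge function:

```python
import json
from typing import Optional

def enrich_trace_record(record: dict, trace_id: int, span_id: int,
                        parent_span_id: Optional[int] = None) -> str:
    """Return the JSONL line for a record with OTel context fields appended.
    Existing keys are never modified, so current consumers are unaffected."""
    enriched = dict(record)  # shallow copy keeps the caller's record untouched
    enriched["trace_id"] = format(trace_id, "032x")
    enriched["span_id"] = format(span_id, "016x")
    if parent_span_id is not None:
        enriched["parent_span_id"] = format(parent_span_id, "016x")
    return json.dumps(enriched)
```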

SLO/SLI Framework

This section formalizes the existing performance targets (Architecture index — Section 9.1) into proper SLOs with error budgets.

Service Level Indicators (SLIs) and Objectives (SLOs)

Service SLI SLO Error Budget (30d) Measurement
Gateway Availability (non-5xx responses / total) 99.9% 43.2 min downtime metaforge_gateway_request_total
Gateway Latency p99 (non-agent requests) < 100ms 0.1% of requests may exceed metaforge_gateway_request_duration_seconds
Agent execution Success rate > 95% 5% of executions may fail metaforge_agent_execution_total
Agent execution Latency p95 < 30s 5% of executions may exceed metaforge_agent_execution_duration_seconds
Neo4j reads Latency p99 < 50ms 1% of queries may exceed metaforge_neo4j_query_duration_seconds
Neo4j writes Latency p99 < 200ms 1% of queries may exceed metaforge_neo4j_query_duration_seconds
pgvector search Latency p99 < 100ms 1% of searches may exceed metaforge_pgvector_search_duration_seconds
Kafka consumer lag Max lag per partition < 1000 messages Sustained violation triggers alert metaforge_kafka_consumer_lag
Kafka DLQ DLQ rate < 0.1% of consumed messages 0.1% may go to DLQ metaforge_kafka_dead_letters_total
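The error-budget column follows directly from each SLO target: for 99.9% availability over 30 days, the budget is (1 - 0.999) x 30 x 24 x 60 = 43.2 minutes. A sketch of helpers that the slo/calculator.py module could expose (the names are assumptions):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed downtime for an availability SLO over the window."""
    return (1.0 - slo_target) * window_days * 24 * 60

def budget_remaining_fraction(slo_target: float, observed: float) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = 1.0 - slo_target
    spent = 1.0 - observed
    return (budget - spent) / budget
```

For the Gateway row, error_budget_minutes(0.999) reproduces the 43.2 minutes shown in the table.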

PromQL Examples for SLO Calculation

Gateway availability (30-day window):

# Gateway availability SLO: 99.9%
1 - (
  sum(rate(metaforge_gateway_request_total{status_code=~"5.."}[30d]))
  /
  sum(rate(metaforge_gateway_request_total[30d]))
)

Agent success rate:

# Agent success rate SLO: > 95%
sum(rate(metaforge_agent_execution_total{status="success"}[30d]))
/
sum(rate(metaforge_agent_execution_total[30d]))

Error budget burn rate (alerting on fast burn):

# Alert if burning error budget 14.4x faster than sustainable (2% budget consumed in 1h)
(
  sum(rate(metaforge_gateway_request_total{status_code=~"5.."}[1h]))
  /
  sum(rate(metaforge_gateway_request_total[1h]))
) > (14.4 * 0.001)
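The 14.4 factor is the standard burn-rate convention: page when 2% of a 30-day budget is consumed in one hour. Sustainable burn is budget/720 hours, so 2% in one hour is 0.02 x 720 = 14.4 times the sustainable rate; multiplied by the 0.1% budget, that yields the 1.44% error-rate threshold used in the query above:

```python
window_hours = 30 * 24                  # 720 h in the 30-day SLO window
alert_budget_fraction = 0.02            # page when 2% of budget burns in 1 h
burn_multiplier = alert_budget_fraction * window_hours   # 14.4
error_budget = 1 - 0.999                # 0.1% for the gateway availability SLO
threshold = burn_multiplier * error_budget               # ~0.0144 (1.44% 5xx rate)
```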

Neo4j read latency p99:

# Neo4j read latency p99 < 50ms
histogram_quantile(0.99,
  sum(rate(metaforge_neo4j_query_duration_seconds_bucket{operation="read"}[5m])) by (le)
)
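histogram_quantile operates on the cumulative _bucket series, interpolating linearly within the bucket that contains the target rank. A simplified, dependency-free model of that calculation (the bucket boundaries in the test are illustrative; real PromQL additionally handles the +Inf bucket and several edge cases):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate a quantile from cumulative (le, count) histogram buckets,
    interpolating linearly inside the target bucket as PromQL does."""
    total = buckets[-1][1]
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if count == prev_count:
                return le
            # linear interpolation between the bucket's lower and upper bound
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return buckets[-1][0]
```

With 100 queries, 90 of them under 50ms and all under 100ms, the p99 lands 90% of the way through the 50-100ms bucket, i.e. 95ms.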

Alerting Architecture

Severity Levels

Severity Response Time Notification Channel Description
Critical < 15 min PagerDuty / Slack #alerts-critical Service down, data loss risk, SLO breach imminent
Warning < 1 hour Slack #alerts-warning Degraded performance, approaching thresholds
Info Next business day Slack #alerts-info Capacity planning, trend notifications

Alerting Rules

# observability/alerting/rules.yaml
groups:
  - name: metaforge-critical
    rules:
      - alert: GatewayDown
        expr: up{job="metaforge-gateway"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MetaForge Gateway is down"
          description: "Gateway has been unreachable for more than 1 minute."

      - alert: KafkaConsumerStopped
        expr: |
          rate(metaforge_kafka_messages_consumed_total[5m]) == 0
          and on(consumer_group) metaforge_kafka_consumer_lag > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Kafka consumer  has stopped"
          description: "Consumer has non-zero lag but zero consumption rate for 5 minutes."

      - alert: Neo4jUnreachable
        expr: metaforge_neo4j_active_connections == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Neo4j connection pool is empty"
          description: "No active Neo4j connections  all graph operations will fail."

  - name: metaforge-warning
    rules:
      - alert: HighAgentFailureRate
        expr: |
          sum(rate(metaforge_agent_execution_total{status="failure"}[15m]))
          /
          sum(rate(metaforge_agent_execution_total[15m]))
          > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Agent failure rate exceeds 10%"
          description: " of agent executions are failing."

      - alert: KafkaConsumerLagHigh
        expr: metaforge_kafka_consumer_lag > 5000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Kafka consumer lag is high"
          description: "Consumer  on  has lag of ."

      - alert: ErrorBudgetBurnRate
        expr: |
          (
            sum(rate(metaforge_gateway_request_total{status_code=~"5.."}[1h]))
            /
            sum(rate(metaforge_gateway_request_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Gateway error budget burning too fast"
          description: "At current rate, 30-day error budget will be exhausted in  hours."

      - alert: LLMCostSpike
        expr: |
          sum(rate(metaforge_agent_llm_cost_usd_total[1h])) * 3600
          > 10
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "LLM cost exceeding $10/hour"
          description: "Current hourly LLM spend: $."

      - alert: OscillationDetected
        expr: rate(metaforge_oscillation_detected_total[5m]) > 0
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Agent oscillation detected"
          description: "Oscillation detected on  nodes."

Notification Routing

# observability/alerting/routes.yaml
route:
  receiver: default-slack
  group_by: [alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: pagerduty-critical
      repeat_interval: 15m
    - match:
        severity: warning
      receiver: slack-warning
      repeat_interval: 1h
    - match:
        severity: info
      receiver: slack-info
      repeat_interval: 24h

receivers:
  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: "${PAGERDUTY_ROUTING_KEY}"
  - name: slack-warning
    slack_configs:
      - channel: "#alerts-warning"
        api_url: "${SLACK_WEBHOOK_URL}"
  - name: slack-info
    slack_configs:
      - channel: "#alerts-info"
        api_url: "${SLACK_WEBHOOK_URL}"
  - name: default-slack
    slack_configs:
      - channel: "#alerts-general"
        api_url: "${SLACK_WEBHOOK_URL}"

Grafana Dashboards

Four dashboards provide complete operational visibility:

1. System Overview Dashboard

Purpose: High-level health at a glance for all MetaForge components.

Panel Visualization Data Source Query
Gateway Status Stat (up/down) Prometheus up{job="metaforge-gateway"}
Request Rate Time series Prometheus rate(metaforge_gateway_request_total[5m])
Error Rate Time series Prometheus rate(metaforge_gateway_request_total{status_code=~"5.."}[5m])
Active Sessions Stat Prometheus metaforge_gateway_active_sessions
WebSocket Connections Gauge Prometheus metaforge_gateway_websocket_connections
Kafka Consumer Lag Time series Prometheus metaforge_kafka_consumer_lag
Neo4j Connection Pool Gauge Prometheus metaforge_neo4j_active_connections
Recent Alerts Table Alertmanager Active alerts by severity

2. Agent Performance Dashboard

Purpose: Per-agent execution metrics, LLM costs, and skill breakdowns.

Panel Visualization Data Source Query
Agent Success Rate Stat per agent Prometheus Success/total by agent_code
Execution Duration (p50/p95/p99) Heatmap Prometheus metaforge_agent_execution_duration_seconds
LLM Token Usage Stacked bar Prometheus metaforge_agent_llm_tokens_total by model
LLM Cost Over Time Time series Prometheus rate(metaforge_agent_llm_cost_usd_total[1h])
Skill Execution Breakdown Table Prometheus metaforge_skill_execution_duration_seconds by skill
MCP Tool Call Latency Heatmap Prometheus metaforge_mcp_call_duration_seconds by tool
Agent Trace Explorer Trace panel Tempo Traces filtered by agent.code

3. Data Stores Dashboard

Purpose: Neo4j, pgvector, Kafka, and MinIO operational health.

Panel Visualization Data Source Query
Neo4j Query Latency Histogram Prometheus metaforge_neo4j_query_duration_seconds
Neo4j Operations Time series Prometheus rate(metaforge_neo4j_query_total[5m]) by operation
pgvector Search Latency Histogram Prometheus metaforge_pgvector_search_duration_seconds
Kafka Consumer Lag by Topic Time series Prometheus metaforge_kafka_consumer_lag by topic
Kafka Throughput Time series Prometheus rate(metaforge_kafka_messages_produced_total[5m])
Dead Letter Queue Depth Stat Prometheus metaforge_kafka_dead_letters_total
MinIO Operations Time series Prometheus rate(metaforge_minio_operation_total[5m])

4. SLO Overview Dashboard

Purpose: Error budget tracking and SLO compliance for all defined objectives.

Panel Visualization Data Source Query
Gateway Availability (30d) Stat with threshold Prometheus Availability PromQL (see above)
Gateway Latency p99 Stat with threshold Prometheus histogram_quantile(0.99, ...)
Agent Success Rate (30d) Stat with threshold Prometheus Agent success PromQL
Error Budget Remaining Gauge per SLO Prometheus Budget consumed vs total
Error Budget Burn Rate Time series Prometheus Burn rate over sliding windows
SLO Compliance History Table Prometheus Weekly SLO compliance

Repository Structure

The observability/ directory contains all observability infrastructure code:

observability/
├── bootstrap.py              # OTel SDK initialization (TracerProvider, MeterProvider, LoggerProvider)
├── logging.py                # structlog → OTel logging bridge (trace_id/span_id injection)
├── tracing.py                # Custom span helpers for agent/skill/tool instrumentation
├── metrics.py                # Prometheus metric definitions (all ~30 metrics)
├── middleware.py             # FastAPI middleware for request tracing and metrics
├── config.py                 # ObservabilityConfig Pydantic model
├── slo/
│   ├── definitions.py        # SLO/SLI definitions as code
│   └── calculator.py         # Error budget calculation utilities
├── alerting/
│   ├── rules.yaml            # Prometheus alerting rules
│   └── routes.yaml           # Alertmanager notification routing
└── dashboards/
    ├── system-overview.json   # Grafana dashboard: system health
    ├── agent-performance.json # Grafana dashboard: agent metrics
    ├── data-stores.json       # Grafana dashboard: Neo4j/Kafka/pgvector
    └── slo-overview.json      # Grafana dashboard: SLO tracking

Phased Delivery

Observability delivery aligns with the existing phased timeline:

Phase Deliverable Timeline Details
P1 structlog with trace_id, basic Prometheus metrics, /health enhancement, system Grafana dashboard Months 1-6 Add trace_id to structlog output; instrument Gateway request count/latency; enhance /health with dependency checks (Neo4j, Kafka); create system overview dashboard
P1.5 OTel SDK, distributed tracing (Gateway → Agent → Neo4j), agent LLM metrics, Kafka consumer lag Months 5-6 Initialize opentelemetry-sdk; add manual spans for agent.execute, llm.completion, neo4j.query; add LLM token/cost counters; monitor Kafka consumer lag
P2 Full three-pillar stack (Loki, Tempo, Prometheus, Grafana), SLO dashboards, alerting rules, MCP + pgvector metrics Months 7-12 Deploy OTel Collector, Loki, Tempo; add all alerting rules; instrument MCP tool calls and pgvector searches; create all four Grafana dashboards; define SLO error budgets
P3 Device telemetry observability, fleet health dashboards, anomaly alerts, incident runbooks Months 13-18 Add MQTT/telemetry pipeline metrics; fleet health Grafana dashboard; anomaly detection alerts; on-call runbooks for each critical alert
P4 Simulation performance tracking, multi-tenant observability, cost attribution, enterprise audit export Months 21-24+ Per-tenant metric isolation; simulation run performance tracking; LLM cost attribution per project/team; audit log export for compliance

Phase Dependencies

flowchart LR
    P1["P1: structlog + basic metrics<br/>Months 1-6"] --> P15["P1.5: OTel SDK + tracing<br/>Months 5-6"]
    P15 --> P2["P2: Full three-pillar stack<br/>Months 7-12"]
    P2 --> P3["P3: Device telemetry observability<br/>Months 13-18"]
    P3 --> P4["P4: Multi-tenant + audit<br/>Months 21-24+"]

    style P1 fill:#27ae60,color:#fff
    style P15 fill:#2ecc71,color:#fff
    style P2 fill:#3498db,color:#fff
    style P3 fill:#9b59b6,color:#fff
    style P4 fill:#E67E22,color:#fff

Migration from Existing Patterns

All changes are additive — no existing functionality is removed or broken.

Existing Piece Migration Path Breaking Changes
structlog (Gateway logging) Add add_trace_context processor to existing structlog.configure() call None — existing log fields preserved; trace_id/span_id added as new fields
.forge/traces/ JSONL Enrich existing JSONL entries with trace_id, span_id, parent_span_id fields None — new fields are additive; existing consumers ignore unknown fields
GET /health endpoint Extend response to include dependency health checks (Neo4j, Kafka, pgvector connectivity) None — existing 200 OK response preserved; new fields added to response body
Performance targets (index.md §9.1) Formalized into SLO/SLI definitions with PromQL queries and error budgets None — targets unchanged; SLO framework adds measurement and alerting
Dead letter queue (event-sourcing) Add metaforge_kafka_dead_letters_total counter; DLQ depth visible in Grafana None — existing DLQ behavior unchanged; metrics are new instrumentation
Oscillation detection (event-sourcing) Add metaforge_oscillation_detected_total counter; alerting rule triggers on detection None — existing detection logic unchanged; metric emission added
OPA audit logging (orchestrator-technical) Route OPA decision logs through structlog (with trace context) to Loki None — existing OPA audit trail preserved; logs gain correlation IDs
ComponentHealth types (dashboard) Backend /health endpoint populates these types with real metrics from Prometheus None — existing types unchanged; backed by real data instead of placeholders
WebSocket events (gateway) Add connection/message metrics; propagate trace context in WebSocket handshake None — existing WebSocket protocol unchanged; metrics are new

Configuration

ObservabilityConfig

from pydantic import BaseModel
from typing import Optional

class OTelExporterConfig(BaseModel):
    endpoint: str = "http://localhost:4317"  # OTel Collector OTLP/gRPC
    insecure: bool = True                     # TLS disabled for local dev
    timeout_ms: int = 5000

class PrometheusConfig(BaseModel):
    port: int = 9090
    scrape_interval: str = "15s"

class GrafanaConfig(BaseModel):
    url: str = "http://localhost:3000"

class ObservabilityConfig(BaseModel):
    """Central configuration for all observability components."""
    enabled: bool = True
    service_name: str = "metaforge"
    environment: str = "development"          # development | staging | production

    # Exporters
    otlp: OTelExporterConfig = OTelExporterConfig()
    prometheus: PrometheusConfig = PrometheusConfig()
    grafana: GrafanaConfig = GrafanaConfig()

    # Sampling
    trace_sample_rate: float = 1.0            # 1.0 = sample everything (dev); 0.1 = 10% (prod)
    log_level: str = "INFO"

    # Feature flags
    enable_traces: bool = True
    enable_metrics: bool = True
    enable_logs: bool = True

    # Backends (for docker-compose generation)
    loki_url: Optional[str] = "http://localhost:3100"
    tempo_url: Optional[str] = "http://localhost:3200"
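trace_sample_rate feeds a ratio-based head sampler. OTel's TraceIdRatioBased sampler decides deterministically from the trace ID, so every service touching a trace reaches the same verdict; a simplified model of that decision:

```python
TWO_64 = 2 ** 64

def should_sample(trace_id: int, rate: float) -> bool:
    """Deterministic head sampling: keep the trace when the low 64 bits of
    its ID fall below rate * 2**64 (a simplified TraceIdRatioBased model)."""
    if rate >= 1.0:
        return True
    if rate <= 0.0:
        return False
    return (trace_id & (TWO_64 - 1)) < int(rate * TWO_64)
```

At trace_sample_rate = 1.0 (the development default above) every trace is kept; 0.1 keeps roughly 10% of traces in production.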

Docker Compose: Observability Stack

# docker-compose.observability.yml
version: "3.9"

services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./observability/otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8889:8889"   # Prometheus metrics exporter
    depends_on:
      - tempo
      - loki

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./observability/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./observability/alerting/rules.yaml:/etc/prometheus/rules.yaml
    ports:
      - "9090:9090"
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.retention.time=30d"

  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./observability/dashboards:/var/lib/grafana/dashboards
      - ./observability/grafana-datasources.yml:/etc/grafana/provisioning/datasources/datasources.yml
      - ./observability/grafana-dashboards.yml:/etc/grafana/provisioning/dashboards/dashboards.yml
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: "${GRAFANA_ADMIN_PASSWORD:-admin}"
    depends_on:
      - prometheus
      - tempo
      - loki

  tempo:
    image: grafana/tempo:latest
    command: ["-config.file=/etc/tempo.yaml"]
    volumes:
      - ./observability/tempo.yaml:/etc/tempo.yaml
    ports:
      - "3200:3200"   # Tempo API
      - "4317"        # OTLP gRPC (internal)

  loki:
    image: grafana/loki:latest
    command: ["-config.file=/etc/loki/local-config.yaml"]
    ports:
      - "3100:3100"

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./observability/alerting/routes.yaml:/etc/alertmanager/alertmanager.yml
    ports:
      - "9093:9093"

Related Documents

Document Relationship
Architecture Overview Parent architecture; Gateway dependencies and performance targets formalized here
Event Sourcing DLQ, oscillation detection, and Kafka consumer patterns instrumented by observability metrics
Digital Twin Evolution Technology stack extended with observability layer; phased delivery aligned
Repository Structure observability/ directory added to repository layout
AI Memory & Knowledge pgvector search metrics and knowledge pipeline monitoring
Constraint Engine Constraint evaluation metrics and alerting on failures
Orchestrator Technical OPA audit logging integrated with centralized observability

Document Version: v1.0 Last Updated: 2026-02-28 Status: Technical Architecture Document
