Testing Strategy

Layered testing approach from unit tests to end-to-end hardware validation

Table of Contents

  1. Testing Philosophy
  2. Testing Pyramid
  3. Layer 1: Unit Tests
     - 1.1 Skill Testing
     - 1.2 Constraint Engine Testing
     - 1.3 Gate Engine Testing
     - 1.4 BOM Risk Scoring
  4. Layer 2: Contract Tests
     - 2.1 Agent Output Contracts
     - 2.2 MCP Tool Adapter Contracts
     - 2.3 Event Bus Contracts
  5. Layer 3: Integration Tests
     - 3.1 Infrastructure: Testcontainers
     - 3.2 Digital Thread Integration
     - 3.3 Evidence Ingestion Pipeline
     - 3.4 Supply Chain API Integration
  6. Layer 4: End-to-End Scenario Tests
     - 4.1 Scenario: PRD to EVT Gate
     - 4.2 Scenario: Compliance Checklist
  7. Agent Testing Strategy
     - LLM Interaction Testing
     - Stub Provider Example
     - Eval Harness (Nightly)
  8. Tool Adapter Testing
     - Adapter Test Matrix
     - Mock Adapter Test
     - SCPI Lab Equipment Test
  9. CI/CD Pipeline
     - Pipeline Stages
     - PR Gate Checks
     - Coverage Targets
  10. Test Data Management
     - Fixture Strategy
     - Golden File Testing
  11. Performance & Load Testing
     - Benchmarks
     - Load Targets
  12. Security Testing
     - SAST/DAST Integration
     - Security-Specific Tests
  13. Observability in Tests
     - Structured Test Logging
     - Test Failure Triage
  14. Test Execution Summary
  15. Phased Rollout
     - Phase 1 (MVP)
     - Phase 2 (Mid-Size)
     - Phase 3 (Enterprise)
  16. Related Documents

Testing Philosophy

MetaForge is a safety-critical orchestration platform where incorrect outputs can propagate into physical hardware. The testing strategy prioritises:

  1. Determinism — Skills are pure functions; test them as such
  2. Contract enforcement — Every inter-component boundary has schema-validated contracts
  3. Traceability — Every test maps to a requirement or constraint in the Digital Thread
  4. Fail-fast — Catch constraint violations at the earliest possible layer
  5. Reproducibility — All tests run identically in CI and locally
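Point 1 can be made concrete with a small determinism check. The skill and helper below are illustrative sketches, not the real MetaForge API: a pure skill called repeatedly with equal input must return identical results.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a real skill class; a pure function of its input.
@dataclass(frozen=True)
class StressInput:
    max_stress_mpa: float
    yield_strength_mpa: float
    safety_factor: float

def validate_stress(inp: StressInput) -> str:
    # PASS when applied stress stays under yield strength / safety factor
    allowable = inp.yield_strength_mpa / inp.safety_factor
    return "PASS" if inp.max_stress_mpa <= allowable else "FAIL"

def assert_deterministic(fn, inp, runs=100):
    # A pure skill must return the same result on every call with equal input
    first = fn(inp)
    assert all(fn(inp) == first for _ in range(runs))
    return first

result = assert_deterministic(
    validate_stress,
    StressInput(max_stress_mpa=150.0, yield_strength_mpa=276.0, safety_factor=1.5),
)
```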

Testing Pyramid

graph TB
    subgraph Pyramid["Testing Pyramid"]
        E2E["E2E / Scenario Tests<br/>~5% of suite"]
        INT["Integration Tests<br/>~15% of suite"]
        CONTRACT["Contract Tests<br/>~20% of suite"]
        UNIT["Unit Tests<br/>~60% of suite"]
    end

    E2E --> INT --> CONTRACT --> UNIT

    style UNIT fill:#27ae60,color:#fff
    style CONTRACT fill:#3498db,color:#fff
    style INT fill:#f39c12,color:#000
    style E2E fill:#e74c3c,color:#fff

| Layer | Scope | Speed | Isolation |
|---|---|---|---|
| Unit | Single function / class | <1s per test | Full mocks |
| Contract | Schema boundaries between components | <2s per test | Stub external services |
| Integration | Multi-component workflows | <30s per test | Testcontainers (Neo4j, MinIO, Kafka) |
| E2E / Scenario | Full orchestrator flow against example project | 1-5 min per scenario | Real infrastructure, LLM stubs |

Layer 1: Unit Tests

1.1 Skill Testing

Skills are deterministic pure functions — the easiest components to test. Every skill must have 100% branch coverage.

# tests/unit/skills/test_validate_stress.py
import pytest
from metaforge.skills.mechanical import ValidateStressSkill, StressInput, StressOutput

@pytest.fixture
def skill():
    return ValidateStressSkill()

def test_pass_within_yield_strength(skill):
    result = skill.execute(StressInput(
        material="6061-T6",
        max_stress_mpa=200.0,
        yield_strength_mpa=276.0,
        safety_factor=1.5
    ))
    assert result.status == "PASS"
    assert result.margin_of_safety > 0

def test_fail_exceeds_yield(skill):
    result = skill.execute(StressInput(
        material="6061-T6",
        max_stress_mpa=300.0,
        yield_strength_mpa=276.0,
        safety_factor=1.5
    ))
    assert result.status == "FAIL"
    assert "yield" in result.failure_reason.lower()

def test_edge_case_zero_stress(skill):
    result = skill.execute(StressInput(
        material="6061-T6",
        max_stress_mpa=0.0,
        yield_strength_mpa=276.0,
        safety_factor=1.5
    ))
    assert result.status == "PASS"

1.2 Constraint Engine Testing

# tests/unit/test_constraint_engine.py
import pytest
from metaforge.constraints import ConstraintEngine, Constraint

def test_thermal_constraint_violation():
    engine = ConstraintEngine()
    engine.add(Constraint(
        id="THERM-001",
        rule="max_junction_temp_c < 85",
        scope="BOMItem[category='IC']"
    ))

    violations = engine.evaluate({
        "max_junction_temp_c": 92,
        "category": "IC"
    })

    assert len(violations) == 1
    assert violations[0].constraint_id == "THERM-001"

def test_cross_domain_clearance_constraint():
    engine = ConstraintEngine()
    engine.add(Constraint(
        id="MECH-EE-001",
        rule="pcb_edge_clearance_mm >= enclosure_wall_clearance_mm",
        scope="CrossDomain[mechanical, electronics]"
    ))

    violations = engine.evaluate({
        "pcb_edge_clearance_mm": 1.5,
        "enclosure_wall_clearance_mm": 2.0
    })

    assert len(violations) == 1
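The rule strings above suggest a simple comparison grammar. A minimal, hypothetical evaluator is sketched below; the real ConstraintEngine grammar is richer (scopes, units, cross-domain rules), so treat the parsing as an assumption.

```python
import operator
import re

# Map comparison tokens to operator functions
OPS = {"<": operator.lt, "<=": operator.le,
       ">": operator.gt, ">=": operator.ge, "==": operator.eq}

RULE_RE = re.compile(r"^\s*(\w+)\s*(<=|>=|==|<|>)\s*([\w.]+)\s*$")

def evaluate_rule(rule: str, context: dict) -> bool:
    field, op, rhs = RULE_RE.match(rule).groups()
    # Right-hand side is either another context field or a numeric literal
    rhs_val = context[rhs] if rhs in context else float(rhs)
    return OPS[op](context[field], rhs_val)

# The THERM-001 case from the test above: 92 degC violates "< 85"
violated = not evaluate_rule("max_junction_temp_c < 85",
                             {"max_junction_temp_c": 92})
```

The same evaluator handles the cross-domain clearance rule, since the right-hand side may name another context field.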

1.3 Gate Engine Testing

# tests/unit/test_gate_engine.py
from metaforge.gates import GateEngine, GateDefinition

def test_evt_gate_blocks_on_missing_coverage():
    gate = GateEngine()
    gate.load(GateDefinition(
        name="EVT",
        entry_criteria={
            "requirement_coverage": "> 95%",
            "bom_risk_score": "< 30",
            "test_plan_approved": True
        }
    ))

    result = gate.evaluate({
        "requirement_coverage": 0.80,
        "bom_risk_score": 15,
        "test_plan_approved": True
    })

    assert result.ready is False
    assert "requirement_coverage" in result.blockers[0].criterion
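Note that the entry criterion "> 95%" is compared against a metric reported as a fraction (0.80), so the gate engine must normalise percentages before comparing. A hypothetical sketch of that evaluation (threshold parsing is an assumption, not the real GateEngine):

```python
def criterion_met(criterion, value):
    # Boolean criteria compare directly; string criteria are "op threshold"
    if isinstance(criterion, bool):
        return value is criterion
    op, _, threshold = criterion.partition(" ")
    limit = float(threshold.rstrip("%"))
    if threshold.endswith("%"):
        limit /= 100.0  # "95%" -> 0.95, comparable with fractional metrics
    return {"<": value < limit, ">": value > limit}[op]

criteria = {"requirement_coverage": "> 95%", "bom_risk_score": "< 30",
            "test_plan_approved": True}
metrics = {"requirement_coverage": 0.80, "bom_risk_score": 15,
           "test_plan_approved": True}

blockers = [name for name, crit in criteria.items()
            if not criterion_met(crit, metrics[name])]
```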

1.4 BOM Risk Scoring

# tests/unit/test_bom_risk.py
from metaforge.supply_chain import calculate_bom_risk, BOMItem

def test_single_source_risk():
    items = [
        BOMItem(mpn="STM32F407", sources=1, lead_time_weeks=4, eol=False),
        BOMItem(mpn="CAP-100nF", sources=5, lead_time_weeks=2, eol=False),
    ]
    result = calculate_bom_risk(items)
    assert result["level"] in ("LOW", "MEDIUM", "HIGH")
    assert result["score"] > 0  # Single-source penalty

def test_eol_component_raises_risk():
    items = [
        BOMItem(mpn="LM317", sources=3, lead_time_weeks=3, eol=True),
    ]
    result = calculate_bom_risk(items)
    assert result["score"] >= 15  # EOL penalty
    assert any("EOL" in r for r in result["recommendations"])
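A minimal scoring sketch consistent with the assertions above; the penalty weights and thresholds here are illustrative assumptions, not the real algorithm.

```python
def calculate_bom_risk(items):
    score, recommendations = 0, []
    for item in items:
        if item["sources"] == 1:
            score += 10  # single-source penalty (assumed weight)
            recommendations.append(f"Second-source {item['mpn']}")
        if item["eol"]:
            score += 15  # EOL penalty (assumed weight)
            recommendations.append(f"Replace EOL part {item['mpn']}")
        if item["lead_time_weeks"] > 8:
            score += 5   # long lead-time penalty (assumed weight)
    level = "LOW" if score < 15 else "MEDIUM" if score < 30 else "HIGH"
    return {"score": score, "level": level, "recommendations": recommendations}

result = calculate_bom_risk([{"mpn": "LM317", "sources": 3,
                              "lead_time_weeks": 3, "eol": True}])
```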

Layer 2: Contract Tests

Contract tests validate the schema boundaries between MetaForge components. They ensure producers and consumers agree on data shape without requiring live services.

2.1 Agent Output Contracts

Every agent must produce Pydantic-validated output conforming to its declared schema.

# tests/contract/test_agent_output_schemas.py
import pytest
from pydantic import ValidationError
from metaforge.agents.schemas import (
    RequirementsOutput,
    ArchitectureOutput,
    BOMOutput,
    TestPlanOutput,
)

AGENT_SCHEMAS = [
    ("REQ", RequirementsOutput, "fixtures/req_output_valid.json"),
    ("EE", ArchitectureOutput, "fixtures/ee_output_valid.json"),
    ("SC", BOMOutput, "fixtures/sc_output_valid.json"),
    ("TST", TestPlanOutput, "fixtures/tst_output_valid.json"),
]

@pytest.mark.parametrize("agent_id,schema,fixture_path", AGENT_SCHEMAS)
def test_valid_fixture_passes_schema(agent_id, schema, fixture_path):
    data = load_fixture(fixture_path)
    result = schema.model_validate(data)
    assert result is not None

@pytest.mark.parametrize("agent_id,schema,fixture_path", AGENT_SCHEMAS)
def test_missing_required_fields_rejected(agent_id, schema, fixture_path):
    data = load_fixture(fixture_path)
    del data[list(data.keys())[0]]  # Remove a required field
    with pytest.raises(ValidationError):
        schema.model_validate(data)
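The tests above rely on a load_fixture() helper. A minimal sketch (the fixture root and layout are assumptions mirroring tests/fixtures/) simply reads a JSON fixture from disk:

```python
import json
import tempfile
from pathlib import Path

def load_fixture(path: str, root: Path = Path("tests")) -> dict:
    # Read and parse a JSON fixture relative to the fixture root
    return json.loads((root / path).read_text())

# Self-contained demonstration against a throwaway fixture tree
with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "fixtures" / "req_output_valid.json"
    fixture.parent.mkdir(parents=True)
    fixture.write_text(json.dumps({"requirements": []}))
    data = load_fixture("fixtures/req_output_valid.json", root=Path(tmp))
```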

2.2 MCP Tool Adapter Contracts

Tool adapters must conform to the ToolAdapter protocol. Contract tests verify capability detection, input validation, and output shape.

# tests/contract/test_tool_adapter_protocol.py
import pytest
from metaforge.tools.protocol import ToolAdapter

ADAPTERS = [
    "KiCadAdapter",
    "NGSpiceAdapter",
    "FreeCADAdapter",
    "DigiKeyAdapter",
    "CalculiXAdapter",
]

@pytest.mark.parametrize("adapter_name", ADAPTERS)
def test_adapter_implements_protocol(adapter_name):
    adapter_cls = get_adapter_class(adapter_name)
    assert issubclass(adapter_cls, ToolAdapter)
    assert callable(getattr(adapter_cls, "detect", None))
    assert callable(getattr(adapter_cls, "execute", None))

@pytest.mark.parametrize("adapter_name", ADAPTERS)
def test_adapter_returns_structured_result(adapter_name):
    adapter = create_stub_adapter(adapter_name)
    result = adapter.execute(get_sample_action(adapter_name))
    assert hasattr(result, "status")
    assert result.status in ("success", "error")
    assert hasattr(result, "data")
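For reference, the ToolAdapter protocol these contract tests exercise can be sketched with typing.Protocol. Method names follow the tests above; the exact signatures and the ToolResult shape are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable

@dataclass
class ToolResult:
    status: str                       # "success" | "error"
    data: dict[str, Any] = field(default_factory=dict)

@runtime_checkable
class ToolAdapter(Protocol):
    def detect(self) -> bool: ...     # Is the underlying tool available?
    def execute(self, action: dict[str, Any]) -> ToolResult: ...

class NullAdapter:
    """Trivial conforming adapter, usable as a stub in tests."""
    def detect(self) -> bool:
        return False
    def execute(self, action: dict[str, Any]) -> ToolResult:
        return ToolResult(status="error", data={"reason": "tool not installed"})

# runtime_checkable allows structural isinstance checks against the protocol
conforms = isinstance(NullAdapter(), ToolAdapter)
```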

2.3 Event Bus Contracts

Events published to Kafka must conform to declared Avro/JSON schemas.

# tests/contract/test_event_schemas.py
import pytest
from pydantic import ValidationError
from metaforge.events.schemas import (
    ArtifactCreatedEvent,
    GateTransitionEvent,
    ConstraintViolationEvent,
)

def test_artifact_created_event_schema():
    event = ArtifactCreatedEvent(
        artifact_id="art-001",
        artifact_type="BOMItem",
        session_id="sess-123",
        timestamp="2026-03-07T10:00:00Z"
    )
    serialised = event.model_dump_json()
    roundtrip = ArtifactCreatedEvent.model_validate_json(serialised)
    assert roundtrip.artifact_id == event.artifact_id

def test_gate_transition_requires_approver():
    with pytest.raises(ValidationError):
        GateTransitionEvent(
            gate="EVT",
            from_status="BLOCKED",
            to_status="READY",
            # missing: approver_id
        )

Layer 3: Integration Tests

Integration tests verify multi-component workflows using real (containerised) infrastructure.

3.1 Infrastructure: Testcontainers

# tests/integration/conftest.py
import pytest
from testcontainers.neo4j import Neo4jContainer
from testcontainers.minio import MinioContainer
from testcontainers.kafka import KafkaContainer

@pytest.fixture(scope="session")
def neo4j():
    with Neo4jContainer("neo4j:5-community") as container:
        yield container.get_connection_url()

@pytest.fixture(scope="session")
def minio():
    with MinioContainer() as container:
        yield {
            "endpoint": container.get_url(),
            "access_key": container.access_key,
            "secret_key": container.secret_key,
        }

@pytest.fixture(scope="session")
def kafka():
    with KafkaContainer() as container:
        yield container.get_bootstrap_server()

3.2 Digital Thread Integration

# tests/integration/test_digital_thread.py

def test_requirement_to_bom_traceability(neo4j):
    """Verify full traceability chain: Requirement -> BOMItem -> TestEvidence"""
    graph = connect(neo4j)

    # Create requirement
    req = graph.create_node("Requirement", {
        "id": "REQ-001",
        "title": "Operating voltage 3.3V +/- 5%",
        "status": "APPROVED"
    })

    # Create BOM item satisfying requirement
    bom = graph.create_node("BOMItem", {
        "mpn": "TPS63020",
        "description": "3.3V Buck-Boost Converter"
    })
    graph.create_relationship(bom, "SATISFIES", req)

    # Create test evidence
    evidence = graph.create_node("TestEvidence", {
        "type": "voltage_regulation",
        "status": "PASS",
        "measured_value": "3.31V"
    })
    graph.create_relationship(evidence, "VALIDATES", req)

    # Verify traceability query
    chain = graph.query("""
        MATCH (e:TestEvidence)-[:VALIDATES]->(r:Requirement)
              <-[:SATISFIES]-(b:BOMItem)
        WHERE r.id = 'REQ-001'
        RETURN r, b, e
    """)
    assert len(chain) == 1

def test_orphan_requirement_detection(neo4j):
    """Requirements without test coverage should be flagged"""
    graph = connect(neo4j)
    graph.create_node("Requirement", {
        "id": "REQ-ORPHAN",
        "title": "Untested requirement",
        "status": "APPROVED"
    })

    orphans = graph.query("""
        MATCH (r:Requirement)
        WHERE NOT (r)<-[:VALIDATES]-(:TestEvidence)
        RETURN r
    """)
    assert any(r["id"] == "REQ-ORPHAN" for r in orphans)

3.3 Evidence Ingestion Pipeline

# tests/integration/test_evidence_ingestion.py

def test_github_actions_webhook_ingests_evidence(neo4j, minio, kafka):
    """Simulate a GitHub Actions webhook delivering test results"""
    api = create_test_client(neo4j=neo4j, minio=minio, kafka=kafka)

    webhook_payload = {
        "action": "completed",
        "workflow_run": {
            "conclusion": "success",
            "artifacts": [{
                "name": "test-results",
                "content": encode_junit_xml(tests=[
                    {"name": "test_voltage_regulation", "status": "passed",
                     "requirement_ids": ["REQ-001"]},
                    {"name": "test_current_limit", "status": "passed",
                     "requirement_ids": ["REQ-002"]},
                ])
            }]
        }
    }

    response = api.post("/webhooks/github", json=webhook_payload)
    assert response.status_code == 200

    # Verify evidence nodes created and linked
    evidence = api.get("/api/v1/evidence?source=github-actions")
    assert len(evidence.json()) == 2

    # Verify requirement linkage
    coverage = api.get("/api/v1/requirements/REQ-001/coverage")
    assert coverage.json()["covered"] is True
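The encode_junit_xml() helper used in the payload above is hypothetical; one plausible sketch builds a minimal JUnit report, attaches each test's requirement IDs as a property, and base64-encodes the XML for transport:

```python
import base64
from xml.etree import ElementTree as ET

def encode_junit_xml(tests):
    suite = ET.Element("testsuite", name="hardware", tests=str(len(tests)))
    for t in tests:
        case = ET.SubElement(suite, "testcase", name=t["name"])
        props = ET.SubElement(case, "properties")
        # Requirement linkage travels with the evidence as a case property
        ET.SubElement(props, "property", name="requirement_ids",
                      value=",".join(t["requirement_ids"]))
        if t["status"] != "passed":
            ET.SubElement(case, "failure")  # mark non-passing cases
    xml = ET.tostring(suite, encoding="unicode")
    return base64.b64encode(xml.encode()).decode()

payload = encode_junit_xml([{"name": "test_voltage_regulation",
                             "status": "passed",
                             "requirement_ids": ["REQ-001"]}])
```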

3.4 Supply Chain API Integration

# tests/integration/test_supply_chain.py

@pytest.mark.vcr()  # Record/replay HTTP interactions
def test_bom_risk_with_live_pricing():
    """Test BOM risk scoring with recorded distributor API responses"""
    sc_agent = SupplyChainAgent(
        adapters=[DigiKeyAdapter(), MouserAdapter(), NexarAdapter()]
    )

    bom = [
        {"mpn": "STM32F407VGT6", "quantity": 100},
        {"mpn": "GRM155R71C104KA88D", "quantity": 500},
    ]

    result = sc_agent.assess_risk(bom)

    assert "score" in result
    assert "level" in result
    assert all(item["pricing"] is not None for item in result["items"])
    assert all(item["availability"] is not None for item in result["items"])

Layer 4: End-to-End Scenario Tests

E2E tests run the full orchestrator against the example Drone Flight Controller project.

4.1 Scenario: PRD to EVT Gate

# tests/e2e/test_prd_to_evt.py

@pytest.mark.e2e
@pytest.mark.slow
def test_drone_fc_prd_to_evt_gate(orchestrator, drone_fc_project):
    """Full scenario: ingest PRD, extract requirements, assess BOM,
    generate test plan, evaluate EVT gate readiness"""

    # Step 1: Ingest PRD
    session = orchestrator.run("spec", project=drone_fc_project)
    assert session.status == "COMPLETED"
    assert session.has_artifact("constraints.json")

    # Step 2: Generate architecture
    session = orchestrator.run("architecture", project=drone_fc_project)
    assert session.status == "COMPLETED"

    # Step 3: BOM risk assessment
    session = orchestrator.run("bom-risk", project=drone_fc_project)
    result = session.get_artifact("bom_risk.json")
    assert result["level"] != "HIGH"

    # Step 4: Generate test plan
    session = orchestrator.run("test-plan", project=drone_fc_project)
    assert session.has_artifact("test_plan.md")

    # Step 5: Check EVT gate readiness
    gate = orchestrator.evaluate_gate("EVT", project=drone_fc_project)
    assert gate.coverage >= 0.95
    assert gate.bom_risk_score < 30

4.2 Scenario: Compliance Checklist

# tests/e2e/test_compliance_flow.py

@pytest.mark.e2e
def test_multi_market_compliance(orchestrator, drone_fc_project):
    """Generate compliance checklists for UK + EU + USA markets"""

    drone_fc_project.set_target_markets(["UK", "EU", "USA"])

    session = orchestrator.run("compliance-checklist", project=drone_fc_project)
    checklist = session.get_artifact("compliance_checklist.json")

    # Verify all regimes detected
    regimes = {item["regime"] for item in checklist}
    assert "UKCA" in regimes
    assert "CE" in regimes
    assert "FCC" in regimes

    # Verify PSTI detected for connected product
    if drone_fc_project.has_connectivity:
        assert "PSTI" in regimes

    # Verify evidence requirements generated
    for item in checklist:
        assert len(item["evidence_required"]) > 0

Agent Testing Strategy

LLM Interaction Testing

Agent tests must isolate LLM calls to ensure determinism. Three approaches:

flowchart LR
    subgraph Approaches["Agent Test Approaches"]
        A["1. Recorded Responses<br/>VCR cassettes"]
        B["2. Stub Providers<br/>Deterministic output"]
        C["3. Eval Harness<br/>LLM-as-judge scoring"]
    end

    A -->|Fast, deterministic| CI["CI Pipeline"]
    B -->|Schema validation| CI
    C -->|Quality regression| NIGHTLY["Nightly Suite"]

| Approach | Use Case | Speed | When |
|---|---|---|---|
| Recorded responses | Regression tests, contract validation | Fast (<1s) | Every PR |
| Stub providers | Schema validation, error handling paths | Fast (<1s) | Every PR |
| Eval harness | Output quality, prompt regression | Slow (30s+) | Nightly / pre-release |

Stub Provider Example

# tests/agents/conftest.py
import pytest
from pydantic_ai.models.test import TestModel

@pytest.fixture
def stub_llm():
    """Deterministic LLM stub for agent tests"""
    return TestModel(
        custom_result_text="stub response",
        call_tools=["search_parts", "check_availability"],
    )

# tests/agents/test_ee_agent.py  (pytest does not collect tests from conftest.py)
def test_ee_agent_produces_valid_bom(stub_llm):
    agent = ElectronicsAgent(model=stub_llm)
    result = agent.run("Design power supply for 3.3V @ 2A")

    # Schema validation — the real test
    assert isinstance(result.data, ArchitectureOutput)
    assert len(result.data.bom_items) > 0
    for item in result.data.bom_items:
        assert item.mpn is not None
        assert item.quantity > 0

Eval Harness (Nightly)

# tests/eval/test_agent_quality.py

EVAL_CASES = [
    {
        "agent": "REQ",
        "input": "Design a temperature sensor with BLE, battery powered, IP67",
        "expected_constraints": ["operating_temp", "ble_version", "ingress_protection"],
        "min_score": 0.8,
    },
    {
        "agent": "EE",
        "input": "STM32-based motor controller, 24V, 10A per phase",
        "expected_components": ["MCU", "gate_driver", "MOSFETs", "current_sense"],
        "min_score": 0.7,
    },
]

@pytest.mark.nightly
@pytest.mark.parametrize("case", EVAL_CASES)
def test_agent_output_quality(case, live_llm):
    agent = get_agent(case["agent"], model=live_llm)
    result = agent.run(case["input"])

    score = evaluate_output(
        result=result.data,
        expected=case,
        rubric="engineering_completeness"
    )
    assert score >= case["min_score"], (
        f"Agent {case['agent']} scored {score}, minimum {case['min_score']}"
    )

Tool Adapter Testing

Adapter Test Matrix

Each tool adapter is tested at three levels:

| Level | What | Infrastructure | Example |
|---|---|---|---|
| Mock | Protocol compliance, error handling | None | Mock SCPI responses |
| Containerised | Full adapter against real tool | Docker | KiCad in container |
| Live | Real tool on developer machine | Local install | KiCad native |

Mock Adapter Test

# tests/tools/test_kicad_adapter.py

def test_kicad_erc_parse_errors():
    adapter = KiCadAdapter(executable=MockKiCadCLI(
        erc_output=FIXTURE_ERC_WITH_ERRORS
    ))
    result = adapter.run_erc("/fake/project.kicad_sch")

    assert result.status == "error"
    assert result.data["error_count"] == 3
    assert result.data["errors"][0]["type"] == "unconnected_pin"

def test_kicad_bom_export():
    adapter = KiCadAdapter(executable=MockKiCadCLI(
        bom_output=FIXTURE_BOM_CSV
    ))
    bom = adapter.export_bom("/fake/project.kicad_sch", format="csv")

    assert len(bom) == 15
    assert all(item.mpn for item in bom)

SCPI Lab Equipment Test

# tests/tools/test_scpi_adapter.py

def test_scpi_voltage_measurement():
    adapter = SCPIAdapter(transport=MockTCPTransport(
        responses={"MEAS:VOLT:DC?": "3.312\n"}
    ))
    voltage = adapter.measure_voltage(channel=1)
    assert abs(voltage - 3.312) < 0.001

def test_scpi_timeout_handling():
    adapter = SCPIAdapter(transport=MockTCPTransport(
        responses={},  # No response — simulates timeout
        timeout_ms=100
    ))
    with pytest.raises(InstrumentTimeoutError):
        adapter.measure_voltage(channel=1)

CI/CD Pipeline

Pipeline Stages

flowchart LR
    subgraph PR["Pull Request"]
        LINT["Lint<br/>ruff + mypy"]
        UNIT["Unit Tests"]
        CONTRACT["Contract Tests"]
        SEC["Security Scan<br/>bandit + trivy"]
    end

    subgraph Merge["Post-Merge (main)"]
        INT["Integration Tests<br/>Testcontainers"]
        BUILD["Build Images"]
        PUBLISH["Publish to Registry"]
    end

    subgraph Nightly["Nightly"]
        E2E["E2E Scenarios"]
        EVAL["Agent Eval Harness"]
        PERF["Performance Benchmarks"]
    end

    subgraph Release["Release"]
        SMOKE["Smoke Tests"]
        STAGE["Deploy to Staging"]
        ACCEPT["Acceptance Tests"]
        PROD["Deploy to Prod"]
    end

    LINT --> UNIT --> CONTRACT --> SEC
    SEC -->|pass| INT --> BUILD --> PUBLISH
    PUBLISH --> E2E & EVAL & PERF
    PERF --> SMOKE --> STAGE --> ACCEPT --> PROD

PR Gate Checks

# .github/workflows/pr-checks.yml
name: PR Checks

on:
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ruff mypy
      - run: ruff check .
      - run: mypy --strict src/

  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[test]"
      - run: pytest tests/unit/ tests/contract/ -v --cov=metaforge --cov-fail-under=85

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install bandit
      - run: bandit -r src/ -ll
      - run: trivy fs --severity HIGH,CRITICAL .

Coverage Targets

| Component | Minimum Coverage | Rationale |
|---|---|---|
| Skills | 100% | Pure functions, fully testable |
| Constraint Engine | 95% | Safety-critical path |
| Gate Engine | 95% | Governs phase transitions |
| BOM Risk Scoring | 90% | Financial impact |
| Agent harness | 85% | Complex async paths |
| Tool adapters | 80% | External dependency boundaries |
| API endpoints | 85% | User-facing surface |
| Event handlers | 80% | Async processing |
| Overall | 85% | Floor for main branch |

Test Data Management

Fixture Strategy

tests/
  fixtures/
    projects/
      drone-fc/              # Complete example project
        PRD.md
        constraints.json
        bom.csv
        test_plan.md
    agents/
      req_output_valid.json   # Valid agent output fixtures
      ee_output_valid.json
      sc_output_valid.json
    tools/
      kicad/
        erc_pass.txt          # Tool output fixtures
        erc_with_errors.txt
        bom_export.csv
      scpi/
        voltage_response.txt
    events/
      artifact_created.json   # Event schema fixtures
      gate_transition.json
    cassettes/                # VCR recorded HTTP interactions
      digikey_search.yaml
      mouser_search.yaml
      nexar_query.yaml

Golden File Testing

For complex outputs (compliance checklists, test plans, safety cases), use golden file comparison:

# tests/unit/test_compliance_generator.py

def test_uk_eu_compliance_checklist_matches_golden(snapshot):
    generator = ComplianceChecklistGenerator()
    checklist = generator.generate(
        product_features=["wireless", "battery", "iot"],
        target_markets=["UK", "EU"]
    )
    snapshot.assert_match(checklist.model_dump_json(indent=2), "uk_eu_checklist.json")

Performance & Load Testing

Benchmarks

# tests/performance/test_benchmarks.py

@pytest.mark.benchmark
def test_constraint_evaluation_latency(benchmark):
    """Constraint engine must evaluate 100 constraints in <50ms"""
    engine = load_constraint_engine(num_constraints=100)
    context = generate_test_context()

    result = benchmark(engine.evaluate, context)
    assert benchmark.stats["mean"] < 0.050  # 50ms

@pytest.mark.benchmark
def test_graph_traceability_query(benchmark, neo4j):
    """Full traceability query must return in <200ms"""
    seed_test_graph(neo4j, requirements=500, bom_items=200, evidence=300)

    result = benchmark(
        query_traceability, neo4j, requirement_id="REQ-001"
    )
    assert benchmark.stats["mean"] < 0.200  # 200ms

Load Targets

| Operation | P50 | P95 | P99 |
|---|---|---|---|
| Constraint evaluation (100 rules) | <20ms | <50ms | <100ms |
| Graph traceability query | <100ms | <200ms | <500ms |
| BOM risk scoring (50 items) | <500ms | <1s | <2s |
| Gate readiness check | <200ms | <500ms | <1s |
| Evidence ingestion (single) | <100ms | <200ms | <500ms |

Security Testing

SAST/DAST Integration

| Tool | Scope | Frequency |
|---|---|---|
| Ruff | Python linting + security rules | Every PR |
| Bandit | Python security analysis | Every PR |
| Trivy | Container image + dependency CVEs | Every PR + nightly |
| Semgrep | Custom rules for MetaForge patterns | Every PR |
| OWASP ZAP | API endpoint scanning | Weekly |

Security-Specific Tests

# tests/security/test_api_security.py

def test_path_traversal_blocked():
    response = client.post("/api/v1/artifacts", json={
        "path": "../../../etc/passwd"
    })
    assert response.status_code == 400

def test_unauthenticated_write_rejected():
    response = client.post(
        "/api/v1/agent/run",
        json={"skill": "spec"},
        headers={}  # No auth
    )
    assert response.status_code == 401

def test_agent_output_sanitised():
    """Agent output must not contain secrets or PII"""
    output = run_agent_with_fixture("req", FIXTURE_WITH_EMBEDDED_SECRET)
    assert "sk-" not in output.model_dump_json()
    assert "password" not in output.model_dump_json().lower()

Observability in Tests

Structured Test Logging

All integration and E2E tests emit OpenTelemetry spans for debugging failures:

# tests/integration/conftest.py
import pytest
from opentelemetry import trace

tracer = trace.get_tracer("metaforge.tests")

@pytest.fixture(autouse=True)
def trace_test(request):
    with tracer.start_as_current_span(
        f"test.{request.node.nodeid}",
        attributes={"test.layer": "integration"}
    ):
        yield

Test Failure Triage

flowchart TD
    FAIL["Test Failure"] --> TYPE{Failure Type?}

    TYPE -->|Assertion| ASSERT["Check golden files<br/>and fixture drift"]
    TYPE -->|Timeout| TIMEOUT["Check Testcontainer<br/>startup / health"]
    TYPE -->|Schema| SCHEMA["Run contract tests<br/>in isolation"]
    TYPE -->|Flaky| FLAKY["Check for race conditions<br/>Add retry or fix ordering"]

    ASSERT --> FIX["Update fixture or fix code"]
    TIMEOUT --> FIX2["Increase timeout or fix infra"]
    SCHEMA --> FIX3["Update producer or consumer"]
    FLAKY --> FIX4["Add synchronisation or mark known-flaky"]

Test Execution Summary

| Suite | Trigger | Duration Target | Gate |
|---|---|---|---|
| pytest tests/unit/ | Every PR | <2 min | Required to merge |
| pytest tests/contract/ | Every PR | <1 min | Required to merge |
| pytest tests/integration/ | Post-merge to main | <10 min | Required for release |
| pytest tests/e2e/ | Nightly | <30 min | Advisory |
| pytest tests/eval/ | Nightly | <60 min | Advisory (quality trend) |
| pytest tests/performance/ | Nightly | <15 min | Alert on regression >20% |
| pytest tests/security/ | Every PR + weekly | <5 min | Required to merge |

Phased Rollout

Phase 1 (MVP)

  • Unit tests for skills, constraints, gates, BOM risk
  • Contract tests for agent output schemas
  • Integration tests for Digital Thread (Neo4j)
  • CI pipeline with lint + unit + contract gates
  • Coverage floor: 85%

Phase 2 (Mid-Size)

  • Tool adapter mock tests (KiCad, SCPI, CAD)
  • Containerised integration tests (full Testcontainer suite)
  • E2E scenario tests (drone FC project)
  • Agent eval harness (nightly)
  • Security scan integration (Bandit, Trivy)

Phase 3 (Enterprise)

  • Performance benchmarks with regression alerts
  • OWASP ZAP API scanning
  • Chaos testing for Temporal workflow recovery
  • Multi-tenant isolation tests
  • Compliance evidence generation tests (safety cases)

Related Documents

| Document | Description |
|---|---|
| System Vision | Platform architecture and five pillars |
| Orchestrator Technical | Detailed system design with event-driven workflows |
| MVP Roadmap | Phased implementation plan and deliverables |
| Observability | Logging, metrics, tracing, and SLO/SLI framework |
| Constraint Engine | Engineering constraint evaluation rules |
| Testing & Validation Page | Dashboard UI for test coverage and FMEA |

Document Version: v1.0 Last Updated: 2026-03-07
