Testing Strategy

Layered testing approach from unit tests to end-to-end hardware validation

Table of Contents

  1. Testing Philosophy
  2. Testing Pyramid
  3. Layer 1: Unit Tests
     - 1.1 Skill Testing
     - 1.2 Constraint Engine Testing
     - 1.3 Gate Engine Testing
     - 1.4 BOM Risk Scoring
  4. Layer 2: Contract Tests
     - 2.1 Agent Output Contracts
     - 2.2 MCP Tool Adapter Contracts
     - 2.3 Event Bus Contracts
  5. Layer 3: Integration Tests
     - 3.1 Infrastructure: Testcontainers
     - 3.2 Digital Thread Integration
     - 3.3 Evidence Ingestion Pipeline
     - 3.4 Supply Chain API Integration
  6. Layer 4: End-to-End Scenario Tests
     - 4.1 Scenario: PRD to EVT Gate
     - 4.2 Scenario: Compliance Checklist
  7. Agent Testing Strategy
     - LLM Interaction Testing
     - Stub Provider Example
     - Eval Harness (Nightly)
  8. Tool Adapter Testing
     - Adapter Test Matrix
     - Mock Adapter Test
     - SCPI Lab Equipment Test
  9. CI/CD Pipeline
     - Pipeline Stages
     - PR Gate Checks
     - Coverage Targets
  10. Test Data Management
     - Fixture Strategy
     - Golden File Testing
  11. Performance & Load Testing
     - Benchmarks
     - Load Targets
  12. Security Testing
     - SAST/DAST Integration
     - Security-Specific Tests
  13. Observability in Tests
     - Structured Test Logging
     - Test Failure Triage
  14. Test Execution Summary
  15. Phased Rollout
     - Phase 1 (MVP)
     - Phase 2 (Mid-Size)
     - Phase 3 (Enterprise)
  16. Related Documents

Testing Philosophy

MetaForge is a safety-critical orchestration platform where incorrect outputs can propagate into physical hardware. The testing strategy prioritises:

  1. Determinism — Skills are pure functions; test them as such
  2. Contract enforcement — Every inter-component boundary has schema-validated contracts
  3. Traceability — Every test maps to a requirement or constraint in the Digital Thread
  4. Fail-fast — Catch constraint violations at the earliest possible layer
  5. Reproducibility — All tests run identically in CI and locally
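Point 1 can be made concrete with a small determinism check. The skill and helper below are illustrative sketches, not the real MetaForge API: a pure skill called repeatedly with equal input must return identical results.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a real skill class; a pure function of its input.
@dataclass(frozen=True)
class StressInput:
    max_stress_mpa: float
    yield_strength_mpa: float
    safety_factor: float

def validate_stress(inp: StressInput) -> str:
    # PASS when applied stress stays under yield strength / safety factor
    allowable = inp.yield_strength_mpa / inp.safety_factor
    return "PASS" if inp.max_stress_mpa <= allowable else "FAIL"

def assert_deterministic(fn, inp, runs=100):
    # A pure skill must return the same result on every call with equal input
    first = fn(inp)
    assert all(fn(inp) == first for _ in range(runs))
    return first

result = assert_deterministic(
    validate_stress,
    StressInput(max_stress_mpa=150.0, yield_strength_mpa=276.0, safety_factor=1.5),
)
```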

Testing Pyramid

graph TB
    subgraph Pyramid["Testing Pyramid"]
        E2E["E2E / Scenario Tests<br/>~5% of suite"]
        INT["Integration Tests<br/>~15% of suite"]
        CONTRACT["Contract Tests<br/>~20% of suite"]
        UNIT["Unit Tests<br/>~60% of suite"]
    end

    E2E --> INT --> CONTRACT --> UNIT

    style UNIT fill:#27ae60,color:#fff
    style CONTRACT fill:#3498db,color:#fff
    style INT fill:#f39c12,color:#000
    style E2E fill:#e74c3c,color:#fff

| Layer | Scope | Speed | Isolation |
|---|---|---|---|
| Unit | Single function / class | <1s per test | Full mocks |
| Contract | Schema boundaries between components | <2s per test | Stub external services |
| Integration | Multi-component workflows | <30s per test | Testcontainers (Neo4j, MinIO, Kafka) |
| E2E / Scenario | Full orchestrator flow against example project | 1-5 min per scenario | Real infrastructure, LLM stubs |

Layer 1: Unit Tests

1.1 Skill Testing

Skills are deterministic pure functions — the easiest components to test. Every skill must have 100% branch coverage.

# tests/unit/skills/test_validate_stress.py
import pytest
from metaforge.skills.mechanical import ValidateStressSkill, StressInput, StressOutput

@pytest.fixture
def skill():
    return ValidateStressSkill()

def test_pass_within_yield_strength(skill):
    result = skill.execute(StressInput(
        material="6061-T6",
        max_stress_mpa=200.0,
        yield_strength_mpa=276.0,
        safety_factor=1.5
    ))
    assert result.status == "PASS"
    assert result.margin_of_safety > 0

def test_fail_exceeds_yield(skill):
    result = skill.execute(StressInput(
        material="6061-T6",
        max_stress_mpa=300.0,
        yield_strength_mpa=276.0,
        safety_factor=1.5
    ))
    assert result.status == "FAIL"
    assert "yield" in result.failure_reason.lower()

def test_edge_case_zero_stress(skill):
    result = skill.execute(StressInput(
        material="6061-T6",
        max_stress_mpa=0.0,
        yield_strength_mpa=276.0,
        safety_factor=1.5
    ))
    assert result.status == "PASS"

1.2 Constraint Engine Testing

# tests/unit/test_constraint_engine.py
import pytest
from metaforge.constraints import ConstraintEngine, Constraint

def test_thermal_constraint_violation():
    engine = ConstraintEngine()
    engine.add(Constraint(
        id="THERM-001",
        rule="max_junction_temp_c < 85",
        scope="BOMItem[category='IC']"
    ))

    violations = engine.evaluate({
        "max_junction_temp_c": 92,
        "category": "IC"
    })

    assert len(violations) == 1
    assert violations[0].constraint_id == "THERM-001"

def test_cross_domain_clearance_constraint():
    engine = ConstraintEngine()
    engine.add(Constraint(
        id="MECH-EE-001",
        rule="pcb_edge_clearance_mm >= enclosure_wall_clearance_mm",
        scope="CrossDomain[mechanical, electronics]"
    ))

    violations = engine.evaluate({
        "pcb_edge_clearance_mm": 1.5,
        "enclosure_wall_clearance_mm": 2.0
    })

    assert len(violations) == 1
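The rule strings above suggest a simple comparison grammar. A minimal, hypothetical evaluator is sketched below; the real ConstraintEngine grammar is richer (scopes, units, cross-domain rules), so treat the parsing as an assumption.

```python
import operator
import re

# Map comparison tokens to operator functions
OPS = {"<": operator.lt, "<=": operator.le,
       ">": operator.gt, ">=": operator.ge, "==": operator.eq}

RULE_RE = re.compile(r"^\s*(\w+)\s*(<=|>=|==|<|>)\s*([\w.]+)\s*$")

def evaluate_rule(rule: str, context: dict) -> bool:
    field, op, rhs = RULE_RE.match(rule).groups()
    # Right-hand side is either another context field or a numeric literal
    rhs_val = context[rhs] if rhs in context else float(rhs)
    return OPS[op](context[field], rhs_val)

# The THERM-001 case from the test above: 92 degC violates "< 85"
violated = not evaluate_rule("max_junction_temp_c < 85",
                             {"max_junction_temp_c": 92})
```

The same evaluator handles the cross-domain clearance rule, since the right-hand side may name another context field.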

1.3 Gate Engine Testing

# tests/unit/test_gate_engine.py
from metaforge.gates import GateEngine, GateDefinition

def test_evt_gate_blocks_on_missing_coverage():
    gate = GateEngine()
    gate.load(GateDefinition(
        name="EVT",
        entry_criteria={
            "requirement_coverage": "> 95%",
            "bom_risk_score": "< 30",
            "test_plan_approved": True
        }
    ))

    result = gate.evaluate({
        "requirement_coverage": 0.80,
        "bom_risk_score": 15,
        "test_plan_approved": True
    })

    assert result.ready is False
    assert "requirement_coverage" in result.blockers[0].criterion
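Note that the entry criterion "> 95%" is compared against a metric reported as a fraction (0.80), so the gate engine must normalise percentages before comparing. A hypothetical sketch of that evaluation (threshold parsing is an assumption, not the real GateEngine):

```python
def criterion_met(criterion, value):
    # Boolean criteria compare directly; string criteria are "op threshold"
    if isinstance(criterion, bool):
        return value is criterion
    op, _, threshold = criterion.partition(" ")
    limit = float(threshold.rstrip("%"))
    if threshold.endswith("%"):
        limit /= 100.0  # "95%" -> 0.95, comparable with fractional metrics
    return {"<": value < limit, ">": value > limit}[op]

criteria = {"requirement_coverage": "> 95%", "bom_risk_score": "< 30",
            "test_plan_approved": True}
metrics = {"requirement_coverage": 0.80, "bom_risk_score": 15,
           "test_plan_approved": True}

blockers = [name for name, crit in criteria.items()
            if not criterion_met(crit, metrics[name])]
```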

1.4 BOM Risk Scoring

# tests/unit/test_bom_risk.py
from metaforge.supply_chain import calculate_bom_risk, BOMItem

def test_single_source_risk():
    items = [
        BOMItem(mpn="STM32F407", sources=1, lead_time_weeks=4, eol=False),
        BOMItem(mpn="CAP-100nF", sources=5, lead_time_weeks=2, eol=False),
    ]
    result = calculate_bom_risk(items)
    assert result["level"] in ("LOW", "MEDIUM", "HIGH")
    assert result["score"] > 0  # Single-source penalty

def test_eol_component_raises_risk():
    items = [
        BOMItem(mpn="LM317", sources=3, lead_time_weeks=3, eol=True),
    ]
    result = calculate_bom_risk(items)
    assert result["score"] >= 15  # EOL penalty
    assert any("EOL" in r for r in result["recommendations"])
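A minimal scoring sketch consistent with the assertions above; the penalty weights and thresholds here are illustrative assumptions, not the real algorithm.

```python
def calculate_bom_risk(items):
    score, recommendations = 0, []
    for item in items:
        if item["sources"] == 1:
            score += 10  # single-source penalty (assumed weight)
            recommendations.append(f"Second-source {item['mpn']}")
        if item["eol"]:
            score += 15  # EOL penalty (assumed weight)
            recommendations.append(f"Replace EOL part {item['mpn']}")
        if item["lead_time_weeks"] > 8:
            score += 5   # long lead-time penalty (assumed weight)
    level = "LOW" if score < 15 else "MEDIUM" if score < 30 else "HIGH"
    return {"score": score, "level": level, "recommendations": recommendations}

result = calculate_bom_risk([{"mpn": "LM317", "sources": 3,
                              "lead_time_weeks": 3, "eol": True}])
```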

Layer 2: Contract Tests

Contract tests validate the schema boundaries between MetaForge components. They ensure producers and consumers agree on data shape without requiring live services.

2.1 Agent Output Contracts

Every agent must produce Pydantic-validated output conforming to its declared schema.

# tests/contract/test_agent_output_schemas.py
import pytest
from pydantic import ValidationError
from metaforge.agents.schemas import (
    RequirementsOutput,
    ArchitectureOutput,
    BOMOutput,
    TestPlanOutput,
)

AGENT_SCHEMAS = [
    ("REQ", RequirementsOutput, "fixtures/req_output_valid.json"),
    ("EE", ArchitectureOutput, "fixtures/ee_output_valid.json"),
    ("SC", BOMOutput, "fixtures/sc_output_valid.json"),
    ("TST", TestPlanOutput, "fixtures/tst_output_valid.json"),
]

@pytest.mark.parametrize("agent_id,schema,fixture_path", AGENT_SCHEMAS)
def test_valid_fixture_passes_schema(agent_id, schema, fixture_path):
    data = load_fixture(fixture_path)
    result = schema.model_validate(data)
    assert result is not None

@pytest.mark.parametrize("agent_id,schema,fixture_path", AGENT_SCHEMAS)
def test_missing_required_fields_rejected(agent_id, schema, fixture_path):
    data = load_fixture(fixture_path)
    del data[list(data.keys())[0]]  # Remove a required field
    with pytest.raises(ValidationError):
        schema.model_validate(data)
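The tests above rely on a load_fixture() helper. A minimal sketch (the fixture root and layout are assumptions mirroring tests/fixtures/) simply reads a JSON fixture from disk:

```python
import json
import tempfile
from pathlib import Path

def load_fixture(path: str, root: Path = Path("tests")) -> dict:
    # Read and parse a JSON fixture relative to the fixture root
    return json.loads((root / path).read_text())

# Self-contained demonstration against a throwaway fixture tree
with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "fixtures" / "req_output_valid.json"
    fixture.parent.mkdir(parents=True)
    fixture.write_text(json.dumps({"requirements": []}))
    data = load_fixture("fixtures/req_output_valid.json", root=Path(tmp))
```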

2.2 MCP Tool Adapter Contracts

Tool adapters must conform to the ToolAdapter protocol. Contract tests verify capability detection, input validation, and output shape.

# tests/contract/test_tool_adapter_protocol.py
import pytest
from metaforge.tools.protocol import ToolAdapter

ADAPTERS = [
    "KiCadAdapter",
    "NGSpiceAdapter",
    "FreeCADAdapter",
    "DigiKeyAdapter",
    "CalculiXAdapter",
]

@pytest.mark.parametrize("adapter_name", ADAPTERS)
def test_adapter_implements_protocol(adapter_name):
    adapter_cls = get_adapter_class(adapter_name)
    assert issubclass(adapter_cls, ToolAdapter)
    assert callable(getattr(adapter_cls, "detect", None))
    assert callable(getattr(adapter_cls, "execute", None))

@pytest.mark.parametrize("adapter_name", ADAPTERS)
def test_adapter_returns_structured_result(adapter_name):
    adapter = create_stub_adapter(adapter_name)
    result = adapter.execute(get_sample_action(adapter_name))
    assert hasattr(result, "status")
    assert result.status in ("success", "error")
    assert hasattr(result, "data")
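For reference, the ToolAdapter protocol these contract tests exercise can be sketched with typing.Protocol. Method names follow the tests above; the exact signatures and the ToolResult shape are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable

@dataclass
class ToolResult:
    status: str                       # "success" | "error"
    data: dict[str, Any] = field(default_factory=dict)

@runtime_checkable
class ToolAdapter(Protocol):
    def detect(self) -> bool: ...     # Is the underlying tool available?
    def execute(self, action: dict[str, Any]) -> ToolResult: ...

class NullAdapter:
    """Trivial conforming adapter, usable as a stub in tests."""
    def detect(self) -> bool:
        return False
    def execute(self, action: dict[str, Any]) -> ToolResult:
        return ToolResult(status="error", data={"reason": "tool not installed"})

# runtime_checkable allows structural isinstance checks against the protocol
conforms = isinstance(NullAdapter(), ToolAdapter)
```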

2.3 Event Bus Contracts

Events published to Kafka must conform to declared Avro/JSON schemas.

# tests/contract/test_event_schemas.py
import pytest
from pydantic import ValidationError
from metaforge.events.schemas import (
    ArtifactCreatedEvent,
    GateTransitionEvent,
    ConstraintViolationEvent,
)

def test_artifact_created_event_schema():
    event = ArtifactCreatedEvent(
        artifact_id="art-001",
        artifact_type="BOMItem",
        session_id="sess-123",
        timestamp="2026-03-07T10:00:00Z"
    )
    serialised = event.model_dump_json()
    roundtrip = ArtifactCreatedEvent.model_validate_json(serialised)
    assert roundtrip.artifact_id == event.artifact_id

def test_gate_transition_requires_approver():
    with pytest.raises(ValidationError):
        GateTransitionEvent(
            gate="EVT",
            from_status="BLOCKED",
            to_status="READY",
            # missing: approver_id
        )

Layer 3: Integration Tests

Integration tests verify multi-component workflows using real (containerised) infrastructure.

3.1 Infrastructure: Testcontainers

# tests/integration/conftest.py
import pytest
from testcontainers.neo4j import Neo4jContainer
from testcontainers.minio import MinioContainer
from testcontainers.kafka import KafkaContainer

@pytest.fixture(scope="session")
def neo4j():
    with Neo4jContainer("neo4j:5-community") as container:
        yield container.get_connection_url()

@pytest.fixture(scope="session")
def minio():
    with MinioContainer() as container:
        yield {
            "endpoint": container.get_url(),
            "access_key": container.access_key,
            "secret_key": container.secret_key,
        }

@pytest.fixture(scope="session")
def kafka():
    with KafkaContainer() as container:
        yield container.get_bootstrap_server()

3.2 Digital Thread Integration

# tests/integration/test_digital_thread.py

def test_requirement_to_bom_traceability(neo4j):
    """Verify full traceability chain: Requirement -> BOMItem -> TestEvidence"""
    graph = connect(neo4j)

    # Create requirement
    req = graph.create_node("Requirement", {
        "id": "REQ-001",
        "title": "Operating voltage 3.3V +/- 5%",
        "status": "APPROVED"
    })

    # Create BOM item satisfying requirement
    bom = graph.create_node("BOMItem", {
        "mpn": "TPS63020",
        "description": "3.3V Buck-Boost Converter"
    })
    graph.create_relationship(bom, "SATISFIES", req)

    # Create test evidence
    evidence = graph.create_node("TestEvidence", {
        "type": "voltage_regulation",
        "status": "PASS",
        "measured_value": "3.31V"
    })
    graph.create_relationship(evidence, "VALIDATES", req)

    # Verify traceability query
    chain = graph.query("""
        MATCH (e:TestEvidence)-[:VALIDATES]->(r:Requirement)
              <-[:SATISFIES]-(b:BOMItem)
        WHERE r.id = 'REQ-001'
        RETURN r, b, e
    """)
    assert len(chain) == 1

def test_orphan_requirement_detection(neo4j):
    """Requirements without test coverage should be flagged"""
    graph = connect(neo4j)
    graph.create_node("Requirement", {
        "id": "REQ-ORPHAN",
        "title": "Untested requirement",
        "status": "APPROVED"
    })

    orphans = graph.query("""
        MATCH (r:Requirement)
        WHERE NOT (r)<-[:VALIDATES]-(:TestEvidence)
        RETURN r
    """)
    assert any(r["id"] == "REQ-ORPHAN" for r in orphans)

3.3 Evidence Ingestion Pipeline

# tests/integration/test_evidence_ingestion.py

def test_github_actions_webhook_ingests_evidence(neo4j, minio, kafka):
    """Simulate a GitHub Actions webhook delivering test results"""
    api = create_test_client(neo4j=neo4j, minio=minio, kafka=kafka)

    webhook_payload = {
        "action": "completed",
        "workflow_run": {
            "conclusion": "success",
            "artifacts": [{
                "name": "test-results",
                "content": encode_junit_xml(tests=[
                    {"name": "test_voltage_regulation", "status": "passed",
                     "requirement_ids": ["REQ-001"]},
                    {"name": "test_current_limit", "status": "passed",
                     "requirement_ids": ["REQ-002"]},
                ])
            }]
        }
    }

    response = api.post("/webhooks/github", json=webhook_payload)
    assert response.status_code == 200

    # Verify evidence nodes created and linked
    evidence = api.get("/api/v1/evidence?source=github-actions")
    assert len(evidence.json()) == 2

    # Verify requirement linkage
    coverage = api.get("/api/v1/requirements/REQ-001/coverage")
    assert coverage.json()["covered"] is True
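The encode_junit_xml() helper used in the payload above is hypothetical; one plausible sketch builds a minimal JUnit report, attaches each test's requirement IDs as a property, and base64-encodes the XML for transport:

```python
import base64
from xml.etree import ElementTree as ET

def encode_junit_xml(tests):
    suite = ET.Element("testsuite", name="hardware", tests=str(len(tests)))
    for t in tests:
        case = ET.SubElement(suite, "testcase", name=t["name"])
        props = ET.SubElement(case, "properties")
        # Requirement linkage travels with the evidence as a case property
        ET.SubElement(props, "property", name="requirement_ids",
                      value=",".join(t["requirement_ids"]))
        if t["status"] != "passed":
            ET.SubElement(case, "failure")  # mark non-passing cases
    xml = ET.tostring(suite, encoding="unicode")
    return base64.b64encode(xml.encode()).decode()

payload = encode_junit_xml([{"name": "test_voltage_regulation",
                             "status": "passed",
                             "requirement_ids": ["REQ-001"]}])
```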

3.4 Supply Chain API Integration

# tests/integration/test_supply_chain.py

@pytest.mark.vcr()  # Record/replay HTTP interactions
def test_bom_risk_with_live_pricing():
    """Test BOM risk scoring with recorded distributor API responses"""
    sc_agent = SupplyChainAgent(
        adapters=[DigiKeyAdapter(), MouserAdapter(), NexarAdapter()]
    )

    bom = [
        {"mpn": "STM32F407VGT6", "quantity": 100},
        {"mpn": "GRM155R71C104KA88D", "quantity": 500},
    ]

    result = sc_agent.assess_risk(bom)

    assert "score" in result
    assert "level" in result
    assert all(item["pricing"] is not None for item in result["items"])
    assert all(item["availability"] is not None for item in result["items"])

Layer 4: End-to-End Scenario Tests

E2E tests run the full orchestrator against the example Drone Flight Controller project.

4.1 Scenario: PRD to EVT Gate

# tests/e2e/test_prd_to_evt.py

@pytest.mark.e2e
@pytest.mark.slow
def test_drone_fc_prd_to_evt_gate(orchestrator, drone_fc_project):
    """Full scenario: ingest PRD, extract requirements, assess BOM,
    generate test plan, evaluate EVT gate readiness"""

    # Step 1: Ingest PRD
    session = orchestrator.run("spec", project=drone_fc_project)
    assert session.status == "COMPLETED"
    assert session.has_artifact("constraints.json")

    # Step 2: Generate architecture
    session = orchestrator.run("architecture", project=drone_fc_project)
    assert session.status == "COMPLETED"

    # Step 3: BOM risk assessment
    session = orchestrator.run("bom-risk", project=drone_fc_project)
    result = session.get_artifact("bom_risk.json")
    assert result["level"] != "HIGH"

    # Step 4: Generate test plan
    session = orchestrator.run("test-plan", project=drone_fc_project)
    assert session.has_artifact("test_plan.md")

    # Step 5: Check EVT gate readiness
    gate = orchestrator.evaluate_gate("EVT", project=drone_fc_project)
    assert gate.coverage >= 0.95
    assert gate.bom_risk_score < 30

4.2 Scenario: Compliance Checklist

# tests/e2e/test_compliance_flow.py

@pytest.mark.e2e
def test_multi_market_compliance(orchestrator, drone_fc_project):
    """Generate compliance checklists for UK + EU + USA markets"""

    drone_fc_project.set_target_markets(["UK", "EU", "USA"])

    session = orchestrator.run("compliance-checklist", project=drone_fc_project)
    checklist = session.get_artifact("compliance_checklist.json")

    # Verify all regimes detected
    regimes = {item["regime"] for item in checklist}
    assert "UKCA" in regimes
    assert "CE" in regimes
    assert "FCC" in regimes

    # Verify PSTI detected for connected product
    if drone_fc_project.has_connectivity:
        assert "PSTI" in regimes

    # Verify evidence requirements generated
    for item in checklist:
        assert len(item["evidence_required"]) > 0

Agent Testing Strategy

LLM Interaction Testing

Agent tests must isolate LLM calls to ensure determinism. Three approaches:

flowchart LR
    subgraph Approaches["Agent Test Approaches"]
        A["1. Recorded Responses<br/>VCR cassettes"]
        B["2. Stub Providers<br/>Deterministic output"]
        C["3. Eval Harness<br/>LLM-as-judge scoring"]
    end

    A -->|Fast, deterministic| CI["CI Pipeline"]
    B -->|Schema validation| CI
    C -->|Quality regression| NIGHTLY["Nightly Suite"]

| Approach | Use Case | Speed | When |
|---|---|---|---|
| Recorded responses | Regression tests, contract validation | Fast (<1s) | Every PR |
| Stub providers | Schema validation, error handling paths | Fast (<1s) | Every PR |
| Eval harness | Output quality, prompt regression | Slow (30s+) | Nightly / pre-release |

Stub Provider Example

# tests/agents/conftest.py
import pytest
from pydantic_ai.models.test import TestModel

@pytest.fixture
def stub_llm():
    """Deterministic LLM stub for agent tests"""
    return TestModel(
        custom_result_text="stub response",
        call_tools=["search_parts", "check_availability"],
    )

# tests/agents/test_ee_agent.py  (pytest does not collect tests from conftest.py)
def test_ee_agent_produces_valid_bom(stub_llm):
    agent = ElectronicsAgent(model=stub_llm)
    result = agent.run("Design power supply for 3.3V @ 2A")

    # Schema validation — the real test
    assert isinstance(result.data, ArchitectureOutput)
    assert len(result.data.bom_items) > 0
    for item in result.data.bom_items:
        assert item.mpn is not None
        assert item.quantity > 0

Eval Harness (Nightly)

# tests/eval/test_agent_quality.py

EVAL_CASES = [
    {
        "agent": "REQ",
        "input": "Design a temperature sensor with BLE, battery powered, IP67",
        "expected_constraints": ["operating_temp", "ble_version", "ingress_protection"],
        "min_score": 0.8,
    },
    {
        "agent": "EE",
        "input": "STM32-based motor controller, 24V, 10A per phase",
        "expected_components": ["MCU", "gate_driver", "MOSFETs", "current_sense"],
        "min_score": 0.7,
    },
]

@pytest.mark.nightly
@pytest.mark.parametrize("case", EVAL_CASES)
def test_agent_output_quality(case, live_llm):
    agent = get_agent(case["agent"], model=live_llm)
    result = agent.run(case["input"])

    score = evaluate_output(
        result=result.data,
        expected=case,
        rubric="engineering_completeness"
    )
    assert score >= case["min_score"], (
        f"Agent {case['agent']} scored {score}, minimum {case['min_score']}"
    )

Tool Adapter Testing

Adapter Test Matrix

Each tool adapter is tested at three levels:

| Level | What | Infrastructure | Example |
|---|---|---|---|
| Mock | Protocol compliance, error handling | None | Mock SCPI responses |
| Containerised | Full adapter against real tool | Docker | KiCad in container |
| Live | Real tool on developer machine | Local install | KiCad native |

Mock Adapter Test

# tests/tools/test_kicad_adapter.py

def test_kicad_erc_parse_errors():
    adapter = KiCadAdapter(executable=MockKiCadCLI(
        erc_output=FIXTURE_ERC_WITH_ERRORS
    ))
    result = adapter.run_erc("/fake/project.kicad_sch")

    assert result.status == "error"
    assert result.data["error_count"] == 3
    assert result.data["errors"][0]["type"] == "unconnected_pin"

def test_kicad_bom_export():
    adapter = KiCadAdapter(executable=MockKiCadCLI(
        bom_output=FIXTURE_BOM_CSV
    ))
    bom = adapter.export_bom("/fake/project.kicad_sch", format="csv")

    assert len(bom) == 15
    assert all(item.mpn for item in bom)

SCPI Lab Equipment Test

# tests/tools/test_scpi_adapter.py

def test_scpi_voltage_measurement():
    adapter = SCPIAdapter(transport=MockTCPTransport(
        responses={"MEAS:VOLT:DC?": "3.312\n"}
    ))
    voltage = adapter.measure_voltage(channel=1)
    assert abs(voltage - 3.312) < 0.001

def test_scpi_timeout_handling():
    adapter = SCPIAdapter(transport=MockTCPTransport(
        responses={},  # No response — simulates timeout
        timeout_ms=100
    ))
    with pytest.raises(InstrumentTimeoutError):
        adapter.measure_voltage(channel=1)

CI/CD Pipeline

Pipeline Stages

flowchart LR
    subgraph PR["Pull Request"]
        LINT["Lint<br/>ruff + mypy"]
        UNIT["Unit Tests"]
        CONTRACT["Contract Tests"]
        SEC["Security Scan<br/>bandit + trivy"]
    end

    subgraph Merge["Post-Merge (main)"]
        INT["Integration Tests<br/>Testcontainers"]
        BUILD["Build Images"]
        PUBLISH["Publish to Registry"]
    end

    subgraph Nightly["Nightly"]
        E2E["E2E Scenarios"]
        EVAL["Agent Eval Harness"]
        PERF["Performance Benchmarks"]
    end

    subgraph Release["Release"]
        SMOKE["Smoke Tests"]
        STAGE["Deploy to Staging"]
        ACCEPT["Acceptance Tests"]
        PROD["Deploy to Prod"]
    end

    LINT --> UNIT --> CONTRACT --> SEC
    SEC -->|pass| INT --> BUILD --> PUBLISH
    PUBLISH --> E2E & EVAL & PERF
    PERF --> SMOKE --> STAGE --> ACCEPT --> PROD

PR Gate Checks

# .github/workflows/pr-checks.yml
name: PR Checks

on:
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ruff mypy
      - run: ruff check .
      - run: mypy --strict src/

  unit-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -e ".[test]"
      - run: pytest tests/unit/ tests/contract/ -v --cov=metaforge --cov-fail-under=85

  security:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install bandit
      - run: bandit -r src/ -ll
      - run: trivy fs --severity HIGH,CRITICAL .

Coverage Targets

| Component | Minimum Coverage | Rationale |
|---|---|---|
| Skills | 100% | Pure functions, fully testable |
| Constraint Engine | 95% | Safety-critical path |
| Gate Engine | 95% | Governs phase transitions |
| BOM Risk Scoring | 90% | Financial impact |
| Agent harness | 85% | Complex async paths |
| Tool adapters | 80% | External dependency boundaries |
| API endpoints | 85% | User-facing surface |
| Event handlers | 80% | Async processing |
| Overall | 85% | Floor for main branch |

Test Data Management

Fixture Strategy

tests/
  fixtures/
    projects/
      drone-fc/              # Complete example project
        PRD.md
        constraints.json
        bom.csv
        test_plan.md
    agents/
      req_output_valid.json   # Valid agent output fixtures
      ee_output_valid.json
      sc_output_valid.json
    tools/
      kicad/
        erc_pass.txt          # Tool output fixtures
        erc_with_errors.txt
        bom_export.csv
      scpi/
        voltage_response.txt
    events/
      artifact_created.json   # Event schema fixtures
      gate_transition.json
    cassettes/                # VCR recorded HTTP interactions
      digikey_search.yaml
      mouser_search.yaml
      nexar_query.yaml

Golden File Testing

For complex outputs (compliance checklists, test plans, safety cases), use golden file comparison:

# tests/unit/test_compliance_generator.py

def test_uk_eu_compliance_checklist_matches_golden(snapshot):
    generator = ComplianceChecklistGenerator()
    checklist = generator.generate(
        product_features=["wireless", "battery", "iot"],
        target_markets=["UK", "EU"]
    )
    snapshot.assert_match(checklist.model_dump_json(indent=2), "uk_eu_checklist.json")

Performance & Load Testing

Benchmarks

# tests/performance/test_benchmarks.py

@pytest.mark.benchmark
def test_constraint_evaluation_latency(benchmark):
    """Constraint engine must evaluate 100 constraints in <50ms"""
    engine = load_constraint_engine(num_constraints=100)
    context = generate_test_context()

    result = benchmark(engine.evaluate, context)
    assert benchmark.stats["mean"] < 0.050  # 50ms

@pytest.mark.benchmark
def test_graph_traceability_query(benchmark, neo4j):
    """Full traceability query must return in <200ms"""
    seed_test_graph(neo4j, requirements=500, bom_items=200, evidence=300)

    result = benchmark(
        query_traceability, neo4j, requirement_id="REQ-001"
    )
    assert benchmark.stats["mean"] < 0.200  # 200ms

Load Targets

| Operation | P50 | P95 | P99 |
|---|---|---|---|
| Constraint evaluation (100 rules) | <20ms | <50ms | <100ms |
| Graph traceability query | <100ms | <200ms | <500ms |
| BOM risk scoring (50 items) | <500ms | <1s | <2s |
| Gate readiness check | <200ms | <500ms | <1s |
| Evidence ingestion (single) | <100ms | <200ms | <500ms |

Security Testing

SAST/DAST Integration

| Tool | Scope | Frequency |
|---|---|---|
| Ruff | Python linting + security rules | Every PR |
| Bandit | Python security analysis | Every PR |
| Trivy | Container image + dependency CVEs | Every PR + nightly |
| Semgrep | Custom rules for MetaForge patterns | Every PR |
| OWASP ZAP | API endpoint scanning | Weekly |

Security-Specific Tests

# tests/security/test_api_security.py

def test_path_traversal_blocked():
    response = client.post("/api/v1/artifacts", json={
        "path": "../../../etc/passwd"
    })
    assert response.status_code == 400

def test_unauthenticated_write_rejected():
    response = client.post(
        "/api/v1/agent/run",
        json={"skill": "spec"},
        headers={}  # No auth
    )
    assert response.status_code == 401

def test_agent_output_sanitised():
    """Agent output must not contain secrets or PII"""
    output = run_agent_with_fixture("req", FIXTURE_WITH_EMBEDDED_SECRET)
    assert "sk-" not in output.model_dump_json()
    assert "password" not in output.model_dump_json().lower()

Observability in Tests

Structured Test Logging

All integration and E2E tests emit OpenTelemetry spans for debugging failures:

# tests/integration/conftest.py
import pytest
from opentelemetry import trace

tracer = trace.get_tracer("metaforge.tests")

@pytest.fixture(autouse=True)
def trace_test(request):
    with tracer.start_as_current_span(
        f"test.{request.node.nodeid}",
        attributes={"test.layer": "integration"}
    ):
        yield

Test Failure Triage

flowchart TD
    FAIL["Test Failure"] --> TYPE{Failure Type?}

    TYPE -->|Assertion| ASSERT["Check golden files<br/>and fixture drift"]
    TYPE -->|Timeout| TIMEOUT["Check Testcontainer<br/>startup / health"]
    TYPE -->|Schema| SCHEMA["Run contract tests<br/>in isolation"]
    TYPE -->|Flaky| FLAKY["Check for race conditions<br/>Add retry or fix ordering"]

    ASSERT --> FIX["Update fixture or fix code"]
    TIMEOUT --> FIX2["Increase timeout or fix infra"]
    SCHEMA --> FIX3["Update producer or consumer"]
    FLAKY --> FIX4["Add synchronisation or mark known-flaky"]

Test Execution Summary

| Suite | Trigger | Duration Target | Gate |
|---|---|---|---|
| pytest tests/unit/ | Every PR | <2 min | Required to merge |
| pytest tests/contract/ | Every PR | <1 min | Required to merge |
| pytest tests/integration/ | Post-merge to main | <10 min | Required for release |
| pytest tests/e2e/ | Nightly | <30 min | Advisory |
| pytest tests/eval/ | Nightly | <60 min | Advisory (quality trend) |
| pytest tests/performance/ | Nightly | <15 min | Alert on regression >20% |
| pytest tests/security/ | Every PR + weekly | <5 min | Required to merge |

Phased Rollout

Phase 1 (MVP)

  • Unit tests for skills, constraints, gates, BOM risk
  • Contract tests for agent output schemas
  • Integration tests for Digital Thread (Neo4j)
  • CI pipeline with lint + unit + contract gates
  • Coverage floor: 85%

Phase 2 (Mid-Size)

  • Tool adapter mock tests (KiCad, SCPI, CAD)
  • Containerised integration tests (full Testcontainer suite)
  • E2E scenario tests (drone FC project)
  • Agent eval harness (nightly)
  • Security scan integration (Bandit, Trivy)

Phase 3 (Enterprise)

  • Performance benchmarks with regression alerts
  • OWASP ZAP API scanning
  • Chaos testing for Temporal workflow recovery
  • Multi-tenant isolation tests
  • Compliance evidence generation tests (safety cases)

Related Documents

| Document | Description |
|---|---|
| System Vision | Platform architecture and five pillars |
| Orchestrator Technical | Detailed system design with event-driven workflows |
| MVP Roadmap | Phased implementation plan and deliverables |
| Observability | Logging, metrics, tracing, and SLO/SLI framework |
| Constraint Engine | Engineering constraint evaluation rules |
| Testing & Validation Page | Dashboard UI for test coverage and FMEA |

Document Version: v1.0 Last Updated: 2026-03-07
