Testing Strategy
Layered testing approach from unit tests to end-to-end hardware validation
Table of Contents
- Testing Philosophy
- Testing Pyramid
- Layer 1: Unit Tests
- Layer 2: Contract Tests
- Layer 3: Integration Tests
- Layer 4: End-to-End Scenario Tests
- Agent Testing Strategy
- Tool Adapter Testing
- CI/CD Pipeline
- Test Data Management
- Performance & Load Testing
- Security Testing
- Observability in Tests
- Test Execution Summary
- Phased Rollout
- Related Documents
Testing Philosophy
MetaForge is a safety-critical orchestration platform where incorrect outputs can propagate into physical hardware. The testing strategy prioritises:
- Determinism — Skills are pure functions; test them as such
- Contract enforcement — Every inter-component boundary has schema-validated contracts
- Traceability — Every test maps to a requirement or constraint in the Digital Thread
- Fail-fast — Catch constraint violations at the earliest possible layer
- Reproducibility — All tests run identically in CI and locally
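The traceability principle can be enforced mechanically rather than by convention alone. A minimal sketch, assuming a custom `requirement` marker — an illustrative convention, not an existing MetaForge API:

```python
# Sketch: each test declares the requirement(s) it validates via a marker.
# The `requirement` marker name is an assumed convention for illustration.
import pytest

requirement = pytest.mark.requirement  # usage: @requirement("REQ-001")

@requirement("REQ-001")
def test_voltage_regulation_within_tolerance():
    measured_v = 3.31
    assert abs(measured_v - 3.30) <= 3.30 * 0.05  # 3.3 V +/- 5%

def requirement_ids(test_fn) -> list[str]:
    """Requirement IDs attached to a test; an empty list means no traceability."""
    return [rid
            for mark in getattr(test_fn, "pytestmark", [])
            if mark.name == "requirement"
            for rid in mark.args]
```

A `pytest_collection_modifyitems` hook could then fail CI for any collected test whose `requirement_ids` result is empty.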
Testing Pyramid
graph TB
subgraph Pyramid["Testing Pyramid"]
E2E["E2E / Scenario Tests<br/>~5% of suite"]
INT["Integration Tests<br/>~15% of suite"]
CONTRACT["Contract Tests<br/>~20% of suite"]
UNIT["Unit Tests<br/>~60% of suite"]
end
E2E --> INT --> CONTRACT --> UNIT
style UNIT fill:#27ae60,color:#fff
style CONTRACT fill:#3498db,color:#fff
style INT fill:#f39c12,color:#000
style E2E fill:#e74c3c,color:#fff
| Layer | Scope | Speed | Isolation |
|---|---|---|---|
| Unit | Single function / class | <1s per test | Full mocks |
| Contract | Schema boundaries between components | <2s per test | Stub external services |
| Integration | Multi-component workflows | <30s per test | Testcontainers (Neo4j, MinIO, Kafka) |
| E2E / Scenario | Full orchestrator flow against example project | 1-5 min per scenario | Real infrastructure, LLM stubs |
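The layer split maps naturally onto pytest markers — several appear in the examples below (`e2e`, `slow`, `nightly`, `benchmark`). Registering them once in a shared `conftest.py` keeps `--strict-markers` runs clean; the marker descriptions here are assumptions:

```python
# tests/conftest.py (sketch) — register the suite's layer markers so pytest
# can select one layer per CI stage, e.g. `pytest -m "not e2e and not nightly"`.
def pytest_configure(config):
    for marker, description in [
        ("contract", "schema-boundary tests (stubbed external services)"),
        ("integration", "multi-component tests (Testcontainers)"),
        ("e2e", "full orchestrator scenarios"),
        ("slow", "long-running tests, excluded from PR runs"),
        ("nightly", "nightly-only suites (eval harness, benchmarks)"),
        ("benchmark", "performance benchmarks"),
    ]:
        config.addinivalue_line("markers", f"{marker}: {description}")
```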
Layer 1: Unit Tests
1.1 Skill Testing
Skills are deterministic pure functions — the easiest components to test. Every skill must have 100% branch coverage.
# tests/unit/skills/test_validate_stress.py
import pytest
from metaforge.skills.mechanical import ValidateStressSkill, StressInput, StressOutput
@pytest.fixture
def skill():
return ValidateStressSkill()
def test_pass_within_yield_strength(skill):
result = skill.execute(StressInput(
material="6061-T6",
max_stress_mpa=200.0,
yield_strength_mpa=276.0,
safety_factor=1.5
))
assert result.status == "PASS"
assert result.margin_of_safety > 0
def test_fail_exceeds_yield(skill):
result = skill.execute(StressInput(
material="6061-T6",
max_stress_mpa=300.0,
yield_strength_mpa=276.0,
safety_factor=1.5
))
assert result.status == "FAIL"
assert "yield" in result.failure_reason.lower()
def test_edge_case_zero_stress(skill):
result = skill.execute(StressInput(
material="6061-T6",
max_stress_mpa=0.0,
yield_strength_mpa=276.0,
safety_factor=1.5
))
assert result.status == "PASS"
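Because skills are pure functions, determinism itself is testable: identical inputs must yield identical outputs across repeated calls. A self-contained sketch — the stand-in skill and its margin formula are illustrative assumptions, not the real ValidateStressSkill:

```python
# Determinism check for a pure skill. The stand-in below and its simplified
# margin-of-safety formula (safety_factor unused) are assumptions only.
from dataclasses import dataclass

@dataclass(frozen=True)
class StressInput:
    max_stress_mpa: float
    yield_strength_mpa: float
    safety_factor: float

def validate_stress(inp: StressInput) -> dict:
    if inp.max_stress_mpa == 0:
        return {"status": "PASS", "margin_of_safety": float("inf")}
    margin = inp.yield_strength_mpa / inp.max_stress_mpa - 1
    return {"status": "PASS" if margin >= 0 else "FAIL", "margin_of_safety": margin}

# Same input grid evaluated twice must produce identical results.
grid = [StressInput(s, 276.0, 1.5) for s in (0.0, 100.0, 200.0, 276.0, 300.0)]
assert [validate_stress(i) for i in grid] == [validate_stress(i) for i in grid]
```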
1.2 Constraint Engine Testing
# tests/unit/test_constraint_engine.py
import pytest
from metaforge.constraints import ConstraintEngine, Constraint
def test_thermal_constraint_violation():
engine = ConstraintEngine()
engine.add(Constraint(
id="THERM-001",
rule="max_junction_temp_c < 85",
scope="BOMItem[category='IC']"
))
violations = engine.evaluate({
"max_junction_temp_c": 92,
"category": "IC"
})
assert len(violations) == 1
assert violations[0].constraint_id == "THERM-001"
def test_cross_domain_clearance_constraint():
engine = ConstraintEngine()
engine.add(Constraint(
id="MECH-EE-001",
rule="pcb_edge_clearance_mm >= enclosure_wall_clearance_mm",
scope="CrossDomain[mechanical, electronics]"
))
violations = engine.evaluate({
"pcb_edge_clearance_mm": 1.5,
"enclosure_wall_clearance_mm": 2.0
})
assert len(violations) == 1
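To ground these assertions, a minimal sketch of the evaluation core — it handles only numeric-threshold rules, so a cross-field rule like MECH-EE-001 would need a real rule parser; all names are stand-ins for the actual metaforge.constraints implementation:

```python
# Stand-in constraint evaluator: supports `<field> <op> <number>` rules only.
import operator
import re
from dataclasses import dataclass

OPS = {"<": operator.lt, "<=": operator.le, ">": operator.gt, ">=": operator.ge}

@dataclass
class Constraint:
    id: str
    rule: str  # e.g. "max_junction_temp_c < 85"

@dataclass
class Violation:
    constraint_id: str

class ConstraintEngine:
    def __init__(self):
        self._constraints: list[Constraint] = []

    def add(self, constraint: Constraint) -> None:
        self._constraints.append(constraint)

    def evaluate(self, context: dict) -> list[Violation]:
        violations = []
        for c in self._constraints:
            m = re.fullmatch(r"(\w+)\s*(<=|>=|<|>)\s*([\d.]+)", c.rule)
            if m is None:
                continue  # cross-field rules would need a fuller parser
            key, op, threshold = m.group(1), m.group(2), float(m.group(3))
            if key in context and not OPS[op](float(context[key]), threshold):
                violations.append(Violation(constraint_id=c.id))
        return violations
```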
1.3 Gate Engine Testing
# tests/unit/test_gate_engine.py
from metaforge.gates import GateEngine, GateDefinition
def test_evt_gate_blocks_on_missing_coverage():
gate = GateEngine()
gate.load(GateDefinition(
name="EVT",
entry_criteria={
"requirement_coverage": "> 95%",
"bom_risk_score": "< 30",
"test_plan_approved": True
}
))
result = gate.evaluate({
"requirement_coverage": 0.80,
"bom_risk_score": 15,
"test_plan_approved": True
})
assert result.ready is False
assert "requirement_coverage" in result.blockers[0].criterion
1.4 BOM Risk Scoring
# tests/unit/test_bom_risk.py
from metaforge.supply_chain import calculate_bom_risk, BOMItem
def test_single_source_risk():
items = [
BOMItem(mpn="STM32F407", sources=1, lead_time_weeks=4, eol=False),
BOMItem(mpn="CAP-100nF", sources=5, lead_time_weeks=2, eol=False),
]
result = calculate_bom_risk(items)
assert result["level"] in ("LOW", "MEDIUM", "HIGH")
assert result["score"] > 0 # Single-source penalty
def test_eol_component_raises_risk():
items = [
BOMItem(mpn="LM317", sources=3, lead_time_weeks=3, eol=True),
]
result = calculate_bom_risk(items)
assert result["score"] >= 15 # EOL penalty
assert any("EOL" in r for r in result["recommendations"])
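One plausible shape for the scoring function under test, shown only to ground the assertions above — the penalty weights, thresholds, and level bands are invented for illustration:

```python
# One plausible shape for calculate_bom_risk; weights and bands are invented.
from dataclasses import dataclass

@dataclass
class BOMItem:
    mpn: str
    sources: int
    lead_time_weeks: int
    eol: bool

def calculate_bom_risk(items: list[BOMItem]) -> dict:
    score = 0
    recommendations = []
    for item in items:
        if item.sources == 1:
            score += 10  # single-source penalty
            recommendations.append(f"Qualify a second source for {item.mpn}")
        if item.eol:
            score += 15  # EOL penalty
            recommendations.append(f"EOL part: plan a replacement for {item.mpn}")
        if item.lead_time_weeks > 12:
            score += 5   # long-lead penalty
    level = "LOW" if score < 15 else "MEDIUM" if score < 30 else "HIGH"
    return {"score": score, "level": level, "recommendations": recommendations}
```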
Layer 2: Contract Tests
Contract tests validate the schema boundaries between MetaForge components. They ensure producers and consumers agree on data shape without requiring live services.
2.1 Agent Output Contracts
Every agent must produce Pydantic-validated output conforming to its declared schema.
# tests/contract/test_agent_output_schemas.py
import pytest
from pydantic import ValidationError
from metaforge.agents.schemas import (
RequirementsOutput,
ArchitectureOutput,
BOMOutput,
TestPlanOutput,
)
AGENT_SCHEMAS = [
("REQ", RequirementsOutput, "fixtures/req_output_valid.json"),
("EE", ArchitectureOutput, "fixtures/ee_output_valid.json"),
("SC", BOMOutput, "fixtures/sc_output_valid.json"),
("TST", TestPlanOutput, "fixtures/tst_output_valid.json"),
]
@pytest.mark.parametrize("agent_id,schema,fixture_path", AGENT_SCHEMAS)
def test_valid_fixture_passes_schema(agent_id, schema, fixture_path):
data = load_fixture(fixture_path)
result = schema.model_validate(data)
assert result is not None
@pytest.mark.parametrize("agent_id,schema,fixture_path", AGENT_SCHEMAS)
def test_missing_required_fields_rejected(agent_id, schema, fixture_path):
data = load_fixture(fixture_path)
    del data[list(data.keys())[0]]  # Drop the first top-level field (fixtures contain only required fields)
with pytest.raises(ValidationError):
schema.model_validate(data)
2.2 MCP Tool Adapter Contracts
Tool adapters must conform to the ToolAdapter protocol. Contract tests verify capability detection, input validation, and output shape.
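A sketch of what that protocol might look like — the real metaforge.tools.protocol signatures may differ:

```python
# Sketch of the adapter protocol; the real interface may differ.
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable

@dataclass
class ToolResult:
    status: str           # "success" | "error"
    data: dict[str, Any]

@runtime_checkable
class ToolAdapter(Protocol):
    def detect(self) -> bool:
        """Return True if the underlying tool is installed and usable."""
        ...

    def execute(self, action: dict[str, Any]) -> ToolResult:
        """Run one tool action, returning a structured result."""
        ...
```

`@runtime_checkable` is what allows an `issubclass` check against the protocol to work at runtime.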
# tests/contract/test_tool_adapter_protocol.py
import pytest
from metaforge.tools.protocol import ToolAdapter
ADAPTERS = [
"KiCadAdapter",
"NGSpiceAdapter",
"FreeCADAdapter",
"DigiKeyAdapter",
"CalculiXAdapter",
]
@pytest.mark.parametrize("adapter_name", ADAPTERS)
def test_adapter_implements_protocol(adapter_name):
adapter_cls = get_adapter_class(adapter_name)
assert issubclass(adapter_cls, ToolAdapter)
assert callable(getattr(adapter_cls, "detect", None))
assert callable(getattr(adapter_cls, "execute", None))
@pytest.mark.parametrize("adapter_name", ADAPTERS)
def test_adapter_returns_structured_result(adapter_name):
adapter = create_stub_adapter(adapter_name)
result = adapter.execute(get_sample_action(adapter_name))
assert hasattr(result, "status")
assert result.status in ("success", "error")
assert hasattr(result, "data")
2.3 Event Bus Contracts
Events published to Kafka must conform to declared Avro/JSON schemas.
# tests/contract/test_event_schemas.py
import pytest
from pydantic import ValidationError
from metaforge.events.schemas import (
    ArtifactCreatedEvent,
    GateTransitionEvent,
    ConstraintViolationEvent,
)
def test_artifact_created_event_schema():
event = ArtifactCreatedEvent(
artifact_id="art-001",
artifact_type="BOMItem",
session_id="sess-123",
timestamp="2026-03-07T10:00:00Z"
)
serialised = event.model_dump_json()
roundtrip = ArtifactCreatedEvent.model_validate_json(serialised)
assert roundtrip.artifact_id == event.artifact_id
def test_gate_transition_requires_approver():
with pytest.raises(ValidationError):
GateTransitionEvent(
gate="EVT",
from_status="BLOCKED",
to_status="READY",
# missing: approver_id
)
Layer 3: Integration Tests
Integration tests verify multi-component workflows using real (containerised) infrastructure.
3.1 Infrastructure: Testcontainers
# tests/integration/conftest.py
import pytest
from testcontainers.neo4j import Neo4jContainer
from testcontainers.minio import MinioContainer
from testcontainers.kafka import KafkaContainer
@pytest.fixture(scope="session")
def neo4j():
with Neo4jContainer("neo4j:5-community") as container:
yield container.get_connection_url()
@pytest.fixture(scope="session")
def minio():
with MinioContainer() as container:
yield {
"endpoint": container.get_url(),
"access_key": container.access_key,
"secret_key": container.secret_key,
}
@pytest.fixture(scope="session")
def kafka():
with KafkaContainer() as container:
yield container.get_bootstrap_server()
3.2 Digital Thread Integration
# tests/integration/test_digital_thread.py
def test_requirement_to_bom_traceability(neo4j):
"""Verify full traceability chain: Requirement -> BOMItem -> TestEvidence"""
graph = connect(neo4j)
# Create requirement
req = graph.create_node("Requirement", {
"id": "REQ-001",
"title": "Operating voltage 3.3V +/- 5%",
"status": "APPROVED"
})
# Create BOM item satisfying requirement
bom = graph.create_node("BOMItem", {
"mpn": "TPS63020",
"description": "3.3V Buck-Boost Converter"
})
graph.create_relationship(bom, "SATISFIES", req)
# Create test evidence
evidence = graph.create_node("TestEvidence", {
"type": "voltage_regulation",
"status": "PASS",
"measured_value": "3.31V"
})
graph.create_relationship(evidence, "VALIDATES", req)
# Verify traceability query
chain = graph.query("""
MATCH (e:TestEvidence)-[:VALIDATES]->(r:Requirement)
<-[:SATISFIES]-(b:BOMItem)
WHERE r.id = 'REQ-001'
RETURN r, b, e
""")
assert len(chain) == 1
def test_orphan_requirement_detection(neo4j):
"""Requirements without test coverage should be flagged"""
graph = connect(neo4j)
graph.create_node("Requirement", {
"id": "REQ-ORPHAN",
"title": "Untested requirement",
"status": "APPROVED"
})
orphans = graph.query("""
MATCH (r:Requirement)
WHERE NOT (r)<-[:VALIDATES]-(:TestEvidence)
RETURN r
""")
assert any(r["id"] == "REQ-ORPHAN" for r in orphans)
3.3 Evidence Ingestion Pipeline
# tests/integration/test_evidence_ingestion.py
def test_github_actions_webhook_ingests_evidence(neo4j, minio, kafka):
"""Simulate a GitHub Actions webhook delivering test results"""
api = create_test_client(neo4j=neo4j, minio=minio, kafka=kafka)
webhook_payload = {
"action": "completed",
"workflow_run": {
"conclusion": "success",
"artifacts": [{
"name": "test-results",
"content": encode_junit_xml(tests=[
{"name": "test_voltage_regulation", "status": "passed",
"requirement_ids": ["REQ-001"]},
{"name": "test_current_limit", "status": "passed",
"requirement_ids": ["REQ-002"]},
])
}]
}
}
response = api.post("/webhooks/github", json=webhook_payload)
assert response.status_code == 200
# Verify evidence nodes created and linked
evidence = api.get("/api/v1/evidence?source=github-actions")
assert len(evidence.json()) == 2
# Verify requirement linkage
coverage = api.get("/api/v1/requirements/REQ-001/coverage")
assert coverage.json()["covered"] is True
3.4 Supply Chain API Integration
# tests/integration/test_supply_chain.py
import pytest

@pytest.mark.vcr()  # Record/replay HTTP interactions
def test_bom_risk_with_live_pricing():
"""Test BOM risk scoring with recorded distributor API responses"""
sc_agent = SupplyChainAgent(
adapters=[DigiKeyAdapter(), MouserAdapter(), NexarAdapter()]
)
bom = [
{"mpn": "STM32F407VGT6", "quantity": 100},
{"mpn": "GRM155R71C104KA88D", "quantity": 500},
]
result = sc_agent.assess_risk(bom)
assert "score" in result
assert "level" in result
assert all(item["pricing"] is not None for item in result["items"])
assert all(item["availability"] is not None for item in result["items"])
Layer 4: End-to-End Scenario Tests
E2E tests run the full orchestrator against the example Drone Flight Controller project.
4.1 Scenario: PRD to EVT Gate
# tests/e2e/test_prd_to_evt.py
import pytest

@pytest.mark.e2e
@pytest.mark.slow
def test_drone_fc_prd_to_evt_gate(orchestrator, drone_fc_project):
"""Full scenario: ingest PRD, extract requirements, assess BOM,
generate test plan, evaluate EVT gate readiness"""
# Step 1: Ingest PRD
session = orchestrator.run("spec", project=drone_fc_project)
assert session.status == "COMPLETED"
assert session.has_artifact("constraints.json")
# Step 2: Generate architecture
session = orchestrator.run("architecture", project=drone_fc_project)
assert session.status == "COMPLETED"
# Step 3: BOM risk assessment
session = orchestrator.run("bom-risk", project=drone_fc_project)
result = session.get_artifact("bom_risk.json")
assert result["level"] != "HIGH"
# Step 4: Generate test plan
session = orchestrator.run("test-plan", project=drone_fc_project)
assert session.has_artifact("test_plan.md")
# Step 5: Check EVT gate readiness
gate = orchestrator.evaluate_gate("EVT", project=drone_fc_project)
assert gate.coverage >= 0.95
assert gate.bom_risk_score < 30
4.2 Scenario: Compliance Checklist
# tests/e2e/test_compliance_flow.py
import pytest

@pytest.mark.e2e
def test_multi_market_compliance(orchestrator, drone_fc_project):
"""Generate compliance checklists for UK + EU + USA markets"""
drone_fc_project.set_target_markets(["UK", "EU", "USA"])
session = orchestrator.run("compliance-checklist", project=drone_fc_project)
checklist = session.get_artifact("compliance_checklist.json")
# Verify all regimes detected
regimes = {item["regime"] for item in checklist}
assert "UKCA" in regimes
assert "CE" in regimes
assert "FCC" in regimes
# Verify PSTI detected for connected product
if drone_fc_project.has_connectivity:
assert "PSTI" in regimes
# Verify evidence requirements generated
for item in checklist:
assert len(item["evidence_required"]) > 0
Agent Testing Strategy
LLM Interaction Testing
Agent tests must isolate LLM calls to ensure determinism. Three approaches:
flowchart LR
subgraph Approaches["Agent Test Approaches"]
A["1. Recorded Responses<br/>VCR cassettes"]
B["2. Stub Providers<br/>Deterministic output"]
C["3. Eval Harness<br/>LLM-as-judge scoring"]
end
A -->|Fast, deterministic| CI["CI Pipeline"]
B -->|Schema validation| CI
C -->|Quality regression| NIGHTLY["Nightly Suite"]
| Approach | Use Case | Speed | When |
|---|---|---|---|
| Recorded responses | Regression tests, contract validation | Fast (<1s) | Every PR |
| Stub providers | Schema validation, error handling paths | Fast (<1s) | Every PR |
| Eval harness | Output quality, prompt regression | Slow (30s+) | Nightly / pre-release |
Stub Provider Example
# tests/agents/conftest.py
import pytest
from pydantic_ai.models.test import TestModel
@pytest.fixture
def stub_llm():
"""Deterministic LLM stub for agent tests"""
return TestModel(
custom_result_text="stub response",
call_tools=["search_parts", "check_availability"],
)
def test_ee_agent_produces_valid_bom(stub_llm):
agent = ElectronicsAgent(model=stub_llm)
result = agent.run("Design power supply for 3.3V @ 2A")
# Schema validation — the real test
assert isinstance(result.data, ArchitectureOutput)
assert len(result.data.bom_items) > 0
for item in result.data.bom_items:
assert item.mpn is not None
assert item.quantity > 0
Eval Harness (Nightly)
# tests/eval/test_agent_quality.py
import pytest
EVAL_CASES = [
{
"agent": "REQ",
"input": "Design a temperature sensor with BLE, battery powered, IP67",
"expected_constraints": ["operating_temp", "ble_version", "ingress_protection"],
"min_score": 0.8,
},
{
"agent": "EE",
"input": "STM32-based motor controller, 24V, 10A per phase",
"expected_components": ["MCU", "gate_driver", "MOSFETs", "current_sense"],
"min_score": 0.7,
},
]
@pytest.mark.nightly
@pytest.mark.parametrize("case", EVAL_CASES)
def test_agent_output_quality(case, live_llm):
agent = get_agent(case["agent"], model=live_llm)
result = agent.run(case["input"])
score = evaluate_output(
result=result.data,
expected=case,
rubric="engineering_completeness"
)
assert score >= case["min_score"], (
f"Agent {case['agent']} scored {score}, minimum {case['min_score']}"
)
Tool Adapter Testing
Adapter Test Matrix
Each tool adapter is tested at three levels:
| Level | What | Infrastructure | Example |
|---|---|---|---|
| Mock | Protocol compliance, error handling | None | Mock SCPI responses |
| Containerised | Full adapter against real tool | Docker | KiCad in container |
| Live | Real tool on developer machine | Local install | KiCad native |
Mock Adapter Test
# tests/tools/test_kicad_adapter.py
def test_kicad_erc_parse_errors():
adapter = KiCadAdapter(executable=MockKiCadCLI(
erc_output=FIXTURE_ERC_WITH_ERRORS
))
result = adapter.run_erc("/fake/project.kicad_sch")
assert result.status == "error"
assert result.data["error_count"] == 3
assert result.data["errors"][0]["type"] == "unconnected_pin"
def test_kicad_bom_export():
adapter = KiCadAdapter(executable=MockKiCadCLI(
bom_output=FIXTURE_BOM_CSV
))
bom = adapter.export_bom("/fake/project.kicad_sch", format="csv")
assert len(bom) == 15
assert all(item.mpn for item in bom)
SCPI Lab Equipment Test
# tests/tools/test_scpi_adapter.py
import pytest
def test_scpi_voltage_measurement():
adapter = SCPIAdapter(transport=MockTCPTransport(
responses={"MEAS:VOLT:DC?": "3.312\n"}
))
voltage = adapter.measure_voltage(channel=1)
assert abs(voltage - 3.312) < 0.001
def test_scpi_timeout_handling():
adapter = SCPIAdapter(transport=MockTCPTransport(
responses={}, # No response — simulates timeout
timeout_ms=100
))
with pytest.raises(InstrumentTimeoutError):
adapter.measure_voltage(channel=1)
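The canned-response transport pattern those tests rely on can be sketched in a few lines — both classes here are illustrative stand-ins, not the real metaforge implementations:

```python
# Stand-ins for the mock-transport pattern; real interfaces are assumptions.
class InstrumentTimeoutError(Exception):
    """Raised when an instrument does not answer within the timeout."""

class MockTCPTransport:
    def __init__(self, responses: dict, timeout_ms: int = 1000):
        self.responses = responses  # maps SCPI command -> canned reply
        self.timeout_ms = timeout_ms

    def ask(self, command: str) -> str:
        try:
            return self.responses[command]
        except KeyError:
            # A missing canned reply models a silent instrument.
            raise InstrumentTimeoutError(f"no response to {command!r}")

class SCPIAdapter:
    def __init__(self, transport):
        self.transport = transport

    def measure_voltage(self, channel: int) -> float:
        # Simplified: a real adapter would address the channel in the command.
        return float(self.transport.ask("MEAS:VOLT:DC?").strip())
```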
CI/CD Pipeline
Pipeline Stages
flowchart LR
subgraph PR["Pull Request"]
LINT["Lint<br/>ruff + mypy"]
UNIT["Unit Tests"]
CONTRACT["Contract Tests"]
SEC["Security Scan<br/>bandit + trivy"]
end
subgraph Merge["Post-Merge (main)"]
INT["Integration Tests<br/>Testcontainers"]
BUILD["Build Images"]
PUBLISH["Publish to Registry"]
end
subgraph Nightly["Nightly"]
E2E["E2E Scenarios"]
EVAL["Agent Eval Harness"]
PERF["Performance Benchmarks"]
end
subgraph Release["Release"]
SMOKE["Smoke Tests"]
STAGE["Deploy to Staging"]
ACCEPT["Acceptance Tests"]
PROD["Deploy to Prod"]
end
LINT --> UNIT --> CONTRACT --> SEC
SEC -->|pass| INT --> BUILD --> PUBLISH
PUBLISH --> E2E & EVAL & PERF
PERF --> SMOKE --> STAGE --> ACCEPT --> PROD
PR Gate Checks
# .github/workflows/pr-checks.yml
name: PR Checks
on:
pull_request:
branches: [main]
jobs:
lint:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: ruff check .
- run: mypy --strict src/
unit-tests:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: pip install -e ".[test]"
- run: pytest tests/unit/ tests/contract/ -v --cov=metaforge --cov-fail-under=85
security:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: bandit -r src/ -ll
- run: trivy fs --severity HIGH,CRITICAL .
Coverage Targets
| Component | Minimum Coverage | Rationale |
|---|---|---|
| Skills | 100% | Pure functions, fully testable |
| Constraint Engine | 95% | Safety-critical path |
| Gate Engine | 95% | Governs phase transitions |
| BOM Risk Scoring | 90% | Financial impact |
| Agent harness | 85% | Complex async paths |
| Tool adapters | 80% | External dependency boundaries |
| API endpoints | 85% | User-facing surface |
| Event handlers | 80% | Async processing |
| Overall | 85% | Floor for main branch |
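pytest-cov enforces only a single global floor (`--cov-fail-under`); the per-component floors above can be checked by post-processing a `coverage json` report. A sketch — the path prefixes are assumptions about the repository layout:

```python
# enforce_coverage_floors.py (sketch) — run after `coverage json`.
# Path prefixes below are assumptions about the repository layout.
import json

FLOORS = {
    "src/metaforge/skills/": 100.0,
    "src/metaforge/constraints/": 95.0,
    "src/metaforge/gates/": 95.0,
    "src/metaforge/supply_chain/": 90.0,
}

def check_floors(report: dict, floors: dict) -> list[str]:
    """Return one failure message per component below its coverage floor."""
    failures = []
    for prefix, floor in floors.items():
        summaries = [f["summary"] for path, f in report["files"].items()
                     if path.startswith(prefix)]
        if not summaries:
            continue
        covered = sum(s["covered_lines"] for s in summaries)
        total = sum(s["num_statements"] for s in summaries)
        pct = 100.0 * covered / total if total else 100.0
        if pct < floor:
            failures.append(f"{prefix}: {pct:.1f}% < {floor}%")
    return failures
```

In CI this would `json.load` the `coverage.json` file and exit non-zero when `check_floors` returns any failures.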
Test Data Management
Fixture Strategy
tests/
fixtures/
projects/
drone-fc/ # Complete example project
PRD.md
constraints.json
bom.csv
test_plan.md
agents/
req_output_valid.json # Valid agent output fixtures
ee_output_valid.json
sc_output_valid.json
tools/
kicad/
erc_pass.txt # Tool output fixtures
erc_with_errors.txt
bom_export.csv
scpi/
voltage_response.txt
events/
artifact_created.json # Event schema fixtures
gate_transition.json
cassettes/ # VCR recorded HTTP interactions
digikey_search.yaml
mouser_search.yaml
nexar_query.yaml
Golden File Testing
For complex outputs (compliance checklists, test plans, safety cases), use golden file comparison:
# tests/unit/test_compliance_generator.py
def test_uk_eu_compliance_checklist_matches_golden(snapshot):
generator = ComplianceChecklistGenerator()
checklist = generator.generate(
product_features=["wireless", "battery", "iot"],
target_markets=["UK", "EU"]
)
snapshot.assert_match(checklist.model_dump_json(indent=2), "uk_eu_checklist.json")
Performance & Load Testing
Benchmarks
# tests/performance/test_benchmarks.py
import pytest
@pytest.mark.benchmark
def test_constraint_evaluation_latency(benchmark):
"""Constraint engine must evaluate <50ms for 100 constraints"""
engine = load_constraint_engine(num_constraints=100)
context = generate_test_context()
result = benchmark(engine.evaluate, context)
assert benchmark.stats["mean"] < 0.050 # 50ms
@pytest.mark.benchmark
def test_graph_traceability_query(benchmark, neo4j):
"""Full traceability query must return <200ms"""
seed_test_graph(neo4j, requirements=500, bom_items=200, evidence=300)
result = benchmark(
query_traceability, neo4j, requirement_id="REQ-001"
)
assert benchmark.stats["mean"] < 0.200 # 200ms
Load Targets
| Operation | P50 | P95 | P99 |
|---|---|---|---|
| Constraint evaluation (100 rules) | <20ms | <50ms | <100ms |
| Graph traceability query | <100ms | <200ms | <500ms |
| BOM risk scoring (50 items) | <500ms | <1s | <2s |
| Gate readiness check | <200ms | <500ms | <1s |
| Evidence ingestion (single) | <100ms | <200ms | <500ms |
Security Testing
SAST/DAST Integration
| Tool | Scope | Frequency |
|---|---|---|
| Ruff | Python linting + security rules | Every PR |
| Bandit | Python security analysis | Every PR |
| Trivy | Container image + dependency CVEs | Every PR + nightly |
| Semgrep | Custom rules for MetaForge patterns | Every PR |
| OWASP ZAP | API endpoint scanning | Weekly |
Security-Specific Tests
# tests/security/test_api_security.py
def test_path_traversal_blocked():
response = client.post("/api/v1/artifacts", json={
"path": "../../../etc/passwd"
})
assert response.status_code == 400
def test_unauthenticated_write_rejected():
response = client.post(
"/api/v1/agent/run",
json={"skill": "spec"},
headers={} # No auth
)
assert response.status_code == 401
def test_agent_output_sanitised():
"""Agent output must not contain secrets or PII"""
output = run_agent_with_fixture("req", FIXTURE_WITH_EMBEDDED_SECRET)
assert "sk-" not in output.model_dump_json()
assert "password" not in output.model_dump_json().lower()
Observability in Tests
Structured Test Logging
All integration and E2E tests emit OpenTelemetry spans for debugging failures:
# tests/integration/conftest.py
import pytest
from opentelemetry import trace

tracer = trace.get_tracer("metaforge.tests")
@pytest.fixture(autouse=True)
def trace_test(request):
with tracer.start_as_current_span(
f"test.{request.node.nodeid}",
attributes={"test.layer": "integration"}
):
yield
Test Failure Triage
flowchart TD
FAIL["Test Failure"] --> TYPE{Failure Type?}
TYPE -->|Assertion| ASSERT["Check golden files<br/>and fixture drift"]
TYPE -->|Timeout| TIMEOUT["Check Testcontainer<br/>startup / health"]
TYPE -->|Schema| SCHEMA["Run contract tests<br/>in isolation"]
TYPE -->|Flaky| FLAKY["Check for race conditions<br/>Add retry or fix ordering"]
ASSERT --> FIX["Update fixture or fix code"]
TIMEOUT --> FIX2["Increase timeout or fix infra"]
SCHEMA --> FIX3["Update producer or consumer"]
FLAKY --> FIX4["Add synchronisation or mark known-flaky"]
Test Execution Summary
| Suite | Trigger | Duration Target | Gate |
|---|---|---|---|
| `pytest tests/unit/` | Every PR | <2 min | Required to merge |
| `pytest tests/contract/` | Every PR | <1 min | Required to merge |
| `pytest tests/integration/` | Post-merge to main | <10 min | Required for release |
| `pytest tests/e2e/` | Nightly | <30 min | Advisory |
| `pytest tests/eval/` | Nightly | <60 min | Advisory (quality trend) |
| `pytest tests/performance/` | Nightly | <15 min | Alert on regression >20% |
| `pytest tests/security/` | Every PR + weekly | <5 min | Required to merge |
Phased Rollout
Phase 1 (MVP)
- Unit tests for skills, constraints, gates, BOM risk
- Contract tests for agent output schemas
- Integration tests for Digital Thread (Neo4j)
- CI pipeline with lint + unit + contract gates
- Coverage floor: 85%
Phase 2 (Mid-Size)
- Tool adapter mock tests (KiCad, SCPI, CAD)
- Containerised integration tests (full Testcontainer suite)
- E2E scenario tests (drone FC project)
- Agent eval harness (nightly)
- Security scan integration (Bandit, Trivy)
Phase 3 (Enterprise)
- Performance benchmarks with regression alerts
- OWASP ZAP API scanning
- Chaos testing for Temporal workflow recovery
- Multi-tenant isolation tests
- Compliance evidence generation tests (safety cases)
Related Documents
| Document | Description |
|---|---|
| System Vision | Platform architecture and five pillars |
| Orchestrator Technical | Detailed system design with event-driven workflows |
| MVP Roadmap | Phased implementation plan and deliverables |
| Observability | Logging, metrics, tracing, and SLO/SLI framework |
| Constraint Engine | Engineering constraint evaluation rules |
| Testing & Validation Page | Dashboard UI for test coverage and FMEA |
Document Version: v1.0 Last Updated: 2026-03-07