Semantic Boundary Principles

Cross-cutting rules for keeping Weaver's semantic layers explicit, auditable, and hard to accidentally collapse.

This document exists because a large fraction of hard engine bugs are not "algorithm bugs." They are boundary bugs: one layer silently acting as though it were another, one relation type being treated as though it were another, or one provenance tier being reported as though it had stronger support than it actually has.

It complements ARCHITECTURE.md, XPATH.md, ERRORS.md, and URI_RESOLUTION.md.

This document was sharpened by design pressure from the Tonesu language spec and notes, especially its commitments to semantic compositionality, reasoning transparency, explicit relation typing, and treating boundary violations as named events rather than invisible accidents.

Goals

Keep semantic layers distinct even when implementation pressure pushes toward convenience shortcuts.
Make support/provenance tiers explicit in diagnostics and analysis.
Prevent interpreter, codegen, and host integrations from re-deriving meaning differently.
Make boundary crossings visible, reviewable, and testable.

Non-goals

Replacing the architecture doc.
Rewriting W3C semantics in prose.
Prohibiting all abstraction or code reuse.
Turning every implementation detail into a named principle.

What a semantic boundary is

A semantic boundary is any place where Weaver must distinguish between two things that are easy to conflate in code but not equivalent in meaning.

Examples:

source text versus AST
AST versus analyzed overlay
plain IR versus execution plans
URI resolution versus resource loading versus result publication
sequence versus singleton expectation
general comparison versus value comparison versus node comparison
observed fact versus inferred fact versus host-supplied fact
diagnostic prose versus structured diagnostic data
interpreter semantics versus generated-runtime semantics

If the distinction matters to correctness, it is a real boundary.

Core stance

Weaver should prefer explicit boundary objects and explicit relation types over clever helpers that infer meaning later.

This is the same design instinct behind several existing decisions:

plain AST/IR plus overlays rather than mutating nodes into backend bags
Sequence as an owned abstraction rather than incidental JS collection behavior
DiagnosticReport as structured data rather than formatted strings
URI resolve/load/publish separation rather than one helper doing all I/O

The recurring rule is simple:

If two things are semantically different, make that difference survive the implementation boundary.

Principle 1 — Layer identities stay distinct

Source, AST, analyzed overlays, IR, runtime plans, emit plans, generated TS, and diagnostic projections are different artifacts with different jobs.

Rules:

AST and IR stay plain-data contracts.
Derived facts live in overlays or plans, not by mutating the contract objects.
Generated code is a product artifact, not the semantic source of truth.
Formatted diagnostics are a renderer output, not the canonical error object.

If a boundary object starts carrying fields whose meaning only exists for one backend or one renderer, the boundary is being weakened.

Principle 2 — Lower layers do not silently reinterpret enclosing layers

This is the meta-level rule.

The lower-level object should not be able to reach up and redefine the meaning of the layer that contains it.

Examples:

an XPath expression should not implicitly reinterpret its containing stylesheet context without going through explicit static/dynamic context
generated code should not invent its own URI resolution semantics
runtime helpers should not "fix up" compile-time meaning by guesswork
a formatted error string should not be parsed back into structured meaning

One-way visibility is often correct:

the enclosing layer can describe, annotate, or report on the inner one
the inner layer does not get to redefine the contract of the outer one

Principle 3 — Relation types are first-class, not inferred by vibe

Weaver should not collapse distinct relation types into one generic edge unless the shared meaning is already proven.

Examples already present in design:

URI resolve -> load -> publish
XPath comparison families
runtime frames versus related spans versus causes
sequence order versus causal/grounding relationships in analysis

Practical rule:

sequence is not grounding
correlation is not causation
containment is not identity
"same output" is not full semantic parity
"resolved URI" is not "allowed to load"

Early duplication is preferable to an abstraction that erases semantic distinctions before the model is stable.

Principle 4 — Provenance tiers stay explicit

Not every fact in the engine has the same kind of support.

At minimum, Weaver should distinguish between facts that are:

directly present in user source
supplied by the host
derived mechanically by parse or resolution
inferred by analysis
validated by tests or conformance fixtures

This does not require a grand taxonomy everywhere. It does require that diagnostics, analysis results, and boundary contracts avoid presenting an inference as though it were a source fact.

Examples:

a suggestion is not the same as a proven fix
a host canonical URI is not the same as the lexical href
a static-analysis warning is not the same as a parse failure
a typed extension-function signature is not the same as proof that a runtime value already has the required JS shape

Principle 5 — Preserve lexical form and resolved identity separately

When the engine transforms a thing into a more resolved form, it should usually preserve both:

the lexical/user-facing form for diagnostics and auditability
the resolved/normalized form for execution, caching, and identity

Examples:

lexical QName versus resolved QName
lexical URI reference versus resolved/canonical URI
source expression text versus analyzed expression metadata

If we keep only the resolved form, we lose explainability. If we keep only the lexical form, we lose stable execution identity.

Principle 6 — Boundary crossings should be named events

The engine should not treat a boundary crossing as an invisible, uninteresting implementation detail when it affects meaning.

Examples:

sequence-of-many used where singleton is required
host policy denying URI resolution/load
type coercion from XDM space into JS extension-function inputs
diagnostic suggestion upgrading from hint to deterministic fix
moving from belief-level or report-level information to established compile-time knowledge

When practical, these crossings should surface as:

typed APIs
structured diagnostic details
explicit helper names
focused tests

The point is not to make every crossing noisy. The point is to make it hard for meaning-changing crossings to happen by accident.

Principle 7 — Visibility beats prohibition

Weaver will not prevent every bad inference or shortcut. It should make them hard to do invisibly.

That means:

keep structured diagnostics instead of only prose
keep parity tests instead of trusting codegen by inspection
keep source maps instead of asking users to debug generated TS blind
keep explicit context/focus passing instead of hidden evaluator state
keep clear host contracts instead of ambient I/O assumptions

The system does not need to ban every mistake. It needs to expose enough structure that mistakes become reviewable.

Principle 8 — Stable primitives before surface growth

Do not solve every awkward feature request by adding a new semantic kind, new escape hatch, or new public API surface prematurely.

Awkwardness is often diagnostic:

maybe the IR is missing a concept
maybe the Sequence boundary is too weak
maybe provenance/support is being collapsed
maybe a host contract is underspecified

The correct response is usually to strengthen the underlying boundary, not to spray new ad hoc constructs across the surface.

Principle 9 — Same meaning across backends or it is a bug

Interpreter, codegen, watch mode, and future editor integrations may use different execution paths and renderers. They must not invent different semantic rules.

Parity is required for:

core evaluation meaning
base-URI choice where the same source is available
structured diagnostic identity and relevant details
host contract usage for URI operations

If one backend arrives at a different meaning by re-deriving semantics in its own local helpers, that is not implementation freedom. That is drift.

Weaver applications

These are the concrete places where the principles already bite:

Surface	Boundary to protect	Existing/desired rule
XPath AST	syntax vs analyzed meaning	plain AST plus semantic preparation overlays
XSLT IR	contract vs execution helpers	plain IR plus `RuntimePlan` / `EmitPlan`
Diagnostics	structured meaning vs human text	`DiagnosticReport` is canonical
URI handling	resolve vs load vs publish	resolver contract stays split
Sequence handling	semantic sequence vs JS container	owned `Sequence` abstraction
Comparisons	general vs value vs node	separate families, not one vague compare helper
Context	static vs dynamic vs focus	explicit context/focus passing
Codegen	reference semantics vs emitted artifact	generated code uses runtime boundary, not private rules

Litmus tests

We are on the right track if these stay true:

a reviewer can point to where a meaning-changing boundary is crossed
diagnostics can tell whether a fact was lexical, resolved, inferred, or suggested
codegen calls the same runtime boundary for host-dependent behavior
AST and IR can still be serialized and inspected without backend junk attached
relation types stay explicit in tests and APIs

We are drifting if any of these become common:

one helper both resolves, loads, parses, and normalizes because it is convenient
backend-specific caches or bindings start living on AST/IR nodes
formatted text becomes the only durable form of a diagnostic
a "smart" abstraction erases comparison-family or provenance distinctions
source-facing identity is lost once something is normalized for execution

Review questions

When a design change feels plausible but slightly dirty, ask:

Which semantic boundary is this crossing?
Is the crossing explicit in the API/model, or only implicit in code?
Are two relation types being collapsed because they look similar in JS?
Have we preserved both the user-facing form and the execution identity?
Would interpreter and codegen still agree if they both followed this rule literally?

If those questions are hard to answer, the design probably needs a sharper boundary before more code is added.

This document should change when we discover a missing cross-cutting boundary rule, not every time one subsystem gets a new feature.