ShadowPayload mode — implementation status
This doc tracks the work breakdown for ibsr collect-payload (ShadowPayload mode), the second operating mode introduced in docs/safety.md. It exists for cross-conversation continuity and should be deleted once the mode reaches feature-parity with ibsr collect.
Goal
Land the payload-aware traffic-intelligence path: TC ingress/egress BPF programs that sample TCP payload to a ringbuf; userspace TCP-stream reassembler that parses HTTP/JSON-RPC and pairs requests with responses; per-window aggregator that emits ResponseAggregates matching nr-substrate’s offline nr_training/features/responses.py semantics exactly. Used to validate the Phase 1 close-gate criterion of the PRODUCTION-ARCHITECTURE-PLAN (in nr-substrate) on a held-out TLS-fronted lab bundle: production extractor produces numerically identical feature values to the offline extractor for all 7 CIPHER_AGNOSTIC_V2 features.
What’s landed
Framing
docs/safety.md: rewritten for the hyperscaler/cloudflare model. Single load-bearing safety guarantee — no drops/redirects/modifies — with two operating modes (StrictCounter, ShadowPayload) that differ in observation capability underneath.
Schema (ibsr-schema)
SCHEMA_VERSIONbumped to 6.SUPPORTED_VERSIONS = [5, 6]so existing v5 snapshots still round-trip.- New
ResponseAggregatesstruct withcount,resp_bytes_total,req_bytes_max,resp_bytes_max,amp_ratio_{mean,median,max}. Field semantics match offlineresponses.pyexactly. ResponseAggregates::from_pairs(&[(req, resp)])aggregator pinned by tests (offline rule: pairs withrequest_bytes <= 0excluded from ratios).Snapshot::with_resp_aggregates()builder for v6 emission;resp_aggregatesfield is#[serde(skip_serializing_if = Option::is_none)]so v6-without-aggregates is byte-shape-equivalent to v5.
Safety verification (ibsr-bpf::safety)
SafetyProfile { StrictCounter, ShadowPayload }enum.analyze_source_with_profile/analyze_elf_with_profile— per-mode-aware verifiers. Oldanalyze_source/analyze_elfpreserved as thin StrictCounter wrappers (back-compat).- Mode-invariant rules apply to both: no drops/redirects/modifies, no DEVMAP/XSKMAP/CPUMAP, no
bpf_xdp_adjust_*/bpf_skb_change_*/bpf_clone_redirect/bpf_skb_store_bytes. - Mode-specific (StrictCounter only): forbid
bpf_ringbuf_*,bpf_perf_event_output, BPF_MAP_TYPE_RINGBUF, BPF_MAP_TYPE_PERF_EVENT_ARRAY; require BPF_MAP_TYPE_LRU_HASH. - 16+ tests cover both profiles, mode-invariant, and mode-specific.
TC payload BPF program (ibsr-bpf/src/bpf/tc_payload.bpf.c)
- Two TC programs (
tc_payload_ingress,tc_payload_egress), attached at the configured interface (typicallylofor post-term loopback vantage). - Filters by configured server-port set.
- Samples up to
PAYLOAD_SAMPLE_BYTES(1024) of TCP payload per packet to a ringbuf. - Always returns
TC_ACT_OK— no drops. - Ringbuf reservation failure drops the event, never the packet.
- Pinned offset table in source comments; userspace decoder pins the same offsets via
tc_payload_event::tests::raw_payload_event_field_offsets_match_pinned_table. - Safety-verified:
analyze_source_with_profile(source, ShadowPayload)passes; same source rejected under StrictCounter (pinned by test). - Build wired in
ibsr-bpf/build.rs— both BPF sources compile; skeleton generation producescounter.skel.rs+tc_payload.skel.rs.
Userspace decoder + bridge (ibsr-bpf::tc_payload_event, ibsr-collector::payload)
RawPayloadEvent:#[repr(C)]mirror of the BPF struct, exact layout match.EXPECTED_RAW_EVENT_SIZE = 1064.decode_event(&[u8]) -> Result<DecodedEvent, DecodeError>: safe bytes-to-struct decoder with length, sample-size, and direction validation.PayloadEvent::from_decoded(&DecodedEvent, &server_ports)bridge: translates network byte order, infersDirectionfrom server-port set membership (dst_port in set → ToServer; src_port in set → FromServer), canonicalises FlowKey across both directions.
Userspace handler (ibsr-collector::payload)
FlowKey— directional canonicalisation.FlowReassembler— per-flow byte buffers + httparse-based head detection + Content-Length-based body-length accounting + HTTP/1.1 keep-alive support.RpcPair— emitted when request + response complete on the same flow.WindowAggregator— accumulates pairs, emitsResponseAggregatesontake_window().PayloadHandler— top-level orchestrator with bounded flow table- LRU eviction.
- 17 tests (12 reassembler/aggregator + 5 bridge) cover the request/response pairing, message-split-across-events, HTTP/1.1 keep-alive, malformed input, buffer overflow, LRU eviction, end-to-end aggregator-vs-direct-from-pairs equivalence.
What’s remaining
Userspace orchestrator + CLI parser (TDD-COMPLETE)
Now landed:
ibsr-collector/src/cli.rs:Command::CollectPayload(CollectPayloadArgs)variant with full CLI parser. Flags:-p / --dst-port,--dst-ports,-i / --iface(defaultlo),--out-dir,--window-sec,--max-flows,--ringbuf-bytes,--max-files,--max-age,--duration-sec,-v. 21 tests pin parsing + validation.ibsr-collector/src/payload_collector.rs:PayloadEventSourcetrait +MockPayloadEventSourcefor tests.PayloadCollectorConfig+PayloadCollectorError+PayloadWindowResult+PayloadLoopResult.build_payload_snapshot— emits v6 with/withoutresp_aggregatesbased on window data.collect_payload_window— single-window orchestrator with poll-then-deadline-check loop semantics.collect_payload_loop— multi-window runner with the “IBSR runs without dying” contract: writer failures, source failures, decode failures all keep the loop going. Only shutdown terminates.args_to_config/args_to_server_ports— wiring layer between validated CLI args and the orchestrator’s runtime types.- 27 tests pin: empty/non-empty windows, paired RPC handling, malformed-event resilience, source-error resilience, shutdown short-circuit, writer-error propagation, multi-window aggregation, decode-error tallies, max-windows cap, end-to-end CLI-to-snapshot wiring.
BPF loader + TC qdisc attach + ringbuf consumer (LANDED — integration-test-pending)
The kernel-attach Rust glue is now landed:
ibsr-bpf/src/tc_payload_loader.rs:- Pure-function scaffolding (TDD-tested):
build_port_filter_entries(ports) → Vec<(key, value)>InterfaceResolvertrait +NixInterfaceResolver/MockInterfaceResolverimplsPendingEventsthread-safe queueQueueBackedEventSource(production-shaped)
- Production attacher (kernel-bound):
LibbpfPayloadCollector::attach(iface, ports, resolver)- Box::leak’d OpenObject for ‘static skel lifetime (matches BpfMapReader pattern)
- clsact qdisc creation via TcHookBuilder
- TC ingress + egress filter attach
- Port-filter map programming
- RingBuffer consumer with callback pushing into
PendingEvents - Drop order: ringbuf → hooks → qdisc → skel (graceful detach, no orphan state)
- Pure-function scaffolding (TDD-tested):
ibsr-collector/src/payload_collector.rs:PayloadEventSourceimpl forLibbpfPayloadCollector: pump → drain.PayloadEventSourceimpl forQueueBackedEventSource: drain. (Both share the orchestrator’s loop; the difference is whether a kernel pump runs before drain.)
ibsr-collector/src/commands/collect_payload.rs:execute_collect_payload(TDD-tested with mock attacher).TcPayloadAttachertrait +AttachErrorvariants.
ibsr-collector/src/main.rs:LibbpfTcPayloadAttacher— production attacher.- Wires
ibsr collect-payloadsubcommand end-to-end.
Verified non-root: ibsr collect-payload -p 8899 -i lo loads the BPF skeleton (libbpf reports the load attempt), fails cleanly with EPERM, exits 1, prints “BPF program load failed: Operation not permitted (os error 1)”.
Verified live (root) on 2026-05-06: sudo ibsr collect-payload -p 8899 -i lo --window-sec 2 --duration-sec 6 → 3 windows × ~57 RPC pairs from organic Solana validator traffic → 3 v6 snapshots with populated resp_aggregates. Verified clean clsact qdisc teardown on shutdown (tc qdisc show dev lo clean before AND after). Live integration fixes committed (commit 1445ade):
- BPF verifier rejection of variable-size
bpf_skb_load_bytesresolved via bucketed constant-size dispatch (power-of-2 chunks); preserves all sample bytes. - Orphan clsact qdisc resolved via explicit
Dropimpl onLibbpfPayloadCollector(libbpf-rs 0.24’sTcHookhas noDropby design — manual teardown required).
Verified end-to-end pipeline: scripts/inference_loop.py consumed the v6 snapshots, loaded the V8 production model (cipher-agnostic-v2, isotonic-calibrated, 7 features), and emitted predictions with 6/7 features active (only pcap.unique_src_ports NaN, by design — IBSR keys on src_ip). Score 0.0 (benign) on all 3 organic-traffic windows.
Live integration tests (TO DO with root)
The 3 #[ignore]d tests in ibsr-bpf::tc_payload_loader::tests document the operational verification scenarios:
integration_attach_and_detach_on_lo: bring up + tear down on loopback without leaving an orphan clsact qdisc (tc qdisc show dev loclean before + after).integration_round_trip_one_event: drive one HTTP request through localhost:8899, observe one payload event in the ringbuf.integration_orphan_qdisc_cleanup_on_drop: kill the binary mid-run, verify Drop ran (qdisc removed).
To run these:
sudo cargo test -p ibsr-bpf --lib tc_payload_loader -- --ignored --nocapture
Phase 1 close-gate validation (TO DO after live integration)
After the integration tests pass:
- Run a Sui F10 saturating reproducer with TLS-fronting against the local validator.
- Concurrently run
ibsr collect-payload --iface lo --dst-port <validator_port>to capture the bundle’s traffic at the post-term loopback vantage. - Run
phase1_cross_validate.py(in nr-substrate) comparing the IBSR-extracted feature dict against the offlinenr_trainingextractor’s reading of the bundle’sresponses.parquet. - Expected: 7/7 PASS within
PHASE_1_TOLERANCE(cardinality + max-byte exact, amp_ratio ±1e-4 absolute). /methodology-reviewat Stage B close-out — Phase 1 close-gate clears empirically on the live system.
Phase 1 close-gate validation
Once ibsr collect-payload is operational:
- Run a Sui F10 saturating reproducer with TLS-fronting against the local validator.
- Concurrently run
ibsr collect-payload --iface lo --dst-port <validator_port>to capture the bundle’s traffic at the post-term loopback vantage. - Run
phase1_cross_validate.py(in nr-substrate) comparing the IBSR-extracted feature dict against the offlinenr_trainingextractor’s reading of the bundle’sresponses.parquet. - Expected: 7/7 PASS within
PHASE_1_TOLERANCE(cardinality + max-byte exact, amp_ratio ±1e-4 absolute). /methodology-reviewat Stage B close-out — Phase 1 close-gate clears empirically on the live system.
Test coverage so far
ibsr-bpf: 111 tests (was 69 pre-Stage-B; +42; 6 marked#[ignore]for kernel integration).ibsr-schema: 46 tests (was 35 pre-Stage-B; +11).ibsr-collector: 257 tests (was 182 pre-Stage-B; +75).- Workspace total: 506/506 passing.
All Stage-B-related changes are TDD-first: tests written before implementation, and cargo test is run after every meaningful chunk.
What runs without dying (operational robustness pinned by tests)
- Writer fails every iteration → loop continues; windows_failed accumulates; no panic. (
loop_continues_through_writer_failures) - Event source returns errors every poll → snapshot still emits; source_errors counts; no panic. (
loop_recovers_after_intermittent_source_errors) - Malformed event bytes → decode_errors increments; subsequent valid events still aggregate; no panic. (
malformed_event_increments_decode_errors_but_loop_continues) - Shutdown signal mid-window → events drained; final snapshot emitted; clean exit. (
shutdown_during_active_polling_emits_snapshot_before_exit) - 1000 unrelated events in one batch → events_filtered tallies; no pathological behavior. (
many_unrelated_events_in_one_batch_does_not_overflow) - HTTP/1.1 keep-alive / pipelined RPCs → multiple pairs aggregate correctly across the same flow. (
keep_alive_two_request_response_pairs) - Buffer overflow → direction state resets lossy; no panic. (
buffer_overflow_resets_direction) - Aggregator output matches
ResponseAggregates::from_pairsexactly → bridge contract to the offline-extractor’s semantics. (aggregator_matches_offline_semantics,multiple_pairs_aggregate_per_offline_semantics)
Honest framing
ShadowPayload mode is a real safety-posture relaxation relative to the original IBSR contract: under the new mode, payload bytes are observable in userspace, where the original mode kept them inside the kernel. The docs/safety.md rewrite makes this explicit and points operators at the trade-off. The load-bearing no-drop / no-redirect / no-modify guarantee is preserved mechanically across both modes, but the privacy posture differs and operators are responsible for whatever data-handling discipline their context requires.