김태오

[Borg-Orchestrator 06] From Synthetic Traces to Live Kubernetes Validation

ystc1247 2026. 5. 21. 02:12

After the six-layer stack existed, the next problem was making it feel alive. Synthetic traces were useful, but I wanted the dashboard to react to Kubernetes state, not just replayed rows.

This phase had two jobs. First, I needed trace and synthetic inputs to keep exercising the orchestrator when no cluster was running. Second, I needed a live Kubernetes loop that could collect a snapshot, run the agents, emit decisions, and update the dashboard repeatedly. The dashboard had to show motion: changing cluster state, repeated decisions, reward updates, and event logs.

The live loop

The live loop is where the control-plane pieces came together. It collects a Kubernetes snapshot, creates an Observation, asks each agent for a proposal, resolves the proposals through the referee, emits the decision, steps the backend, updates reward state, and then writes the dashboard state files.

# Every agent proposes an action for the same observation.
proposals = [agent.act(obs) for agent in agents]
# The referee picks the single action that will actually be applied.
action = resolve(proposals)
reason = _decision_reason(snapshot, action.agent_name, action.kind.value)
action_label = _action_label(action)

# Record the winning action together with every competing proposal.
decision_payload = {
    "agent": action.agent_name,
    "kind": action.kind.value,
    "target": action.target,
    "payload": dict(action.payload),
    "action_label": action_label,
    "score": float(action.score),
    "proposal_count": len(proposals),
    "proposals": [
        {"agent": p.agent_name, "kind": p.kind.value, "target": p.target, "score": float(p.score)}
        for p in proposals
    ],
}

# Emit the decision so the dashboard can display it.
state.decision(decision_payload)

The detail I cared about here was the proposal list. A dashboard that only shows the final action hides the conflict. When the final action is AgentA:replicate, I still want to know what AgentB and AgentC wanted. Otherwise I cannot tell whether the system was calmly aligned or whether safety just overrode everything.
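
To make that concrete, here is a minimal sketch of how the proposal list in a payload shaped like decision_payload could be reduced to a conflict summary. The helper summarize_conflict and the example kind values (memory_balloon, deprioritize) are hypothetical and are not part of the project code shown above.

```python
from collections import Counter


def summarize_conflict(decision_payload):
    """Describe how much the agents disagreed before the referee picked a winner."""
    proposals = decision_payload["proposals"]
    winner = (decision_payload["agent"], decision_payload["kind"])

    # Count how many proposals asked for each kind of action.
    kinds = Counter(p["kind"] for p in proposals)

    # Losing proposals: anything that is not the winning agent/kind pair.
    overridden = [p for p in proposals if (p["agent"], p["kind"]) != winner]

    # The strongest losing proposal shows how close the call was.
    runner_up = max(overridden, key=lambda p: p["score"], default=None)
    margin = None
    if runner_up is not None:
        margin = decision_payload["score"] - runner_up["score"]

    return {
        "aligned": len(kinds) <= 1,
        "kinds": dict(kinds),
        "overridden": [f'{p["agent"]}:{p["kind"]}' for p in overridden],
        "margin": margin,
    }


# Hypothetical payload; the field names follow decision_payload, the values are made up.
example = {
    "agent": "AgentA", "kind": "replicate", "score": 0.9,
    "proposals": [
        {"agent": "AgentA", "kind": "replicate", "target": "svc-a", "score": 0.9},
        {"agent": "AgentB", "kind": "memory_balloon", "target": "svc-a", "score": 0.7},
        {"agent": "AgentC", "kind": "deprioritize", "target": "svc-b", "score": 0.4},
    ],
}
summary = summarize_conflict(example)
```

A negative margin would mean the referee picked an action that was not the highest-scoring proposal, which is the "safety overrode everything" case worth surfacing on the dashboard.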

Running the live Kubernetes path

The local run command became long because I wanted the loop to be explicit: which config, which event directory, which kubeconfig, how often to sample, whether to tune or skip tuning, and whether to apply exercise stimuli.

PYTHONPATH=orchestrator_stack .venv/bin/python orchestrator_stack/run.py live-kubernetes-run \
  --config orchestrator_stack/config/orchestrator.example.json \
  --event-dir orchestrator_stack/runtime/visualization-experimental \
  --kubeconfig "$HOME/Documents/borg_orchestrator_clusters/kubeconfig-experimental" \
  --interval-seconds 3 \
  --max-iterations 5 \
  --namespace-prefixes borg-orchestrator-exercise,borg-comparison-workload,default,test- \
  --trace-out orchestrator_stack/runtime/visualization-experimental/live_kubernetes_trace.json \
  --trials 3 \
  --prometheus-base-url http://127.0.0.1:19090 \
  --no-policy --no-tune \
  --exercise-cluster \
  --exercise-namespace borg-orchestrator-exercise \
  --exercise-interval-iterations 1 \
  --exercise-randomize --exercise-seed 31

I often ran in fast mode with tuning disabled because I wanted to debug live behavior without waiting for policy training. That is why some dashboard captures show Ray and Optuna as disabled. That was not a bug in the dashboard. It was the run mode I selected to get live Kubernetes state moving quickly.
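
As a rough illustration of how a run mode can end up as a "disabled" label on the dashboard, the sketch below parses a subset of the flags from the command above and derives component statuses from them. The real parser in run.py is not shown in this post, so build_parser and component_status are hypothetical, and the mapping of --no-policy to Ray and --no-tune to Optuna is an assumption.

```python
import argparse


def build_parser():
    # Only a subset of the flags from the command above; the real parser is not shown.
    parser = argparse.ArgumentParser(prog="live-kubernetes-run")
    parser.add_argument("--interval-seconds", type=float, default=3.0)
    parser.add_argument("--max-iterations", type=int, default=5)
    parser.add_argument("--no-policy", action="store_true")
    parser.add_argument("--no-tune", action="store_true")
    return parser


def component_status(args):
    """Map run-mode flags to labels a dashboard could display."""
    # Assumption: --no-policy skips Ray-based policy training and
    # --no-tune skips Optuna tuning. The post does not spell this out.
    return {
        "ray": "disabled" if args.no_policy else "enabled",
        "optuna": "disabled" if args.no_tune else "enabled",
        "mode": "fast" if (args.no_policy and args.no_tune) else "full",
    }


args = build_parser().parse_args(["--interval-seconds", "3", "--no-policy", "--no-tune"])
status = component_status(args)
```

Writing the selected mode into the dashboard state next to the component labels makes it obvious that "disabled" was chosen rather than caused by a failure.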

What the full dashboard showed me

The full dashboard helped me catch mismatches between the story I wanted to tell and the state the system was actually producing. If the active stage was complete but the event log had no cluster samples, something was wrong. If reward changed but decisions did not, I needed to inspect the backend. If the exerciser was active but queue pressure stayed flat, the Kubernetes stimulus was probably not doing what I thought.
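
The three mismatches above can also be checked mechanically. The following is a minimal sketch that assumes a simplified dashboard state dictionary; the key names (stage_status, events, rewards, decisions, queue_pressure, exerciser_active) are hypothetical and do not reflect the project's actual state files.

```python
def find_mismatches(state):
    """Return human-readable warnings for the three mismatches described above."""
    warnings = []
    events = state.get("events", [])

    # 1. Stage marked complete, but the event log has no cluster samples.
    has_samples = any(e.get("type") == "cluster_sample" for e in events)
    if state.get("stage_status") == "complete" and not has_samples:
        warnings.append("stage complete but event log has no cluster samples")

    # 2. Reward moved while every recorded decision was identical.
    rewards = state.get("rewards", [])
    decisions = state.get("decisions", [])
    if len(set(rewards)) > 1 and len(decisions) > 1 and len(set(decisions)) == 1:
        warnings.append("reward changed but decisions did not; inspect the backend")

    # 3. Exerciser active, but queue pressure never changed between samples.
    pressure = state.get("queue_pressure", [])
    if state.get("exerciser_active") and len(pressure) > 1 and len(set(pressure)) == 1:
        warnings.append("exerciser active but queue pressure is flat")

    return warnings


# Hypothetical state that trips all three checks.
suspicious = {
    "stage_status": "complete",
    "events": [{"type": "decision"}],
    "rewards": [0.1, 0.4, 0.2],
    "decisions": ["AgentA:replicate", "AgentA:replicate", "AgentA:replicate"],
    "queue_pressure": [0.0, 0.0, 0.0],
    "exerciser_active": True,
}
warnings = find_mismatches(suspicious)
```

None of these warnings proves a bug. Each one only marks a place where the displayed story and the recorded state diverge and a closer look is warranted.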

In the captured run, the event sequence is visibly not a single static recommendation. It moves through AgentA replicate, AgentB memory balloon proposals, AgentC admission/deprioritize behavior, and then another AgentA replicate as risk and SLA state change. That was the kind of liveness I wanted: not animation for its own sake, but traceable state changes.

The awkward live-data problems

Live Kubernetes data introduced boring but real problems. Metrics Server can lag. Prometheus port-forwards can fail. A Kind cluster can behave differently from EKS. Pending pods can be caused by deliberate unschedulable node selectors, resource pressure, or controller choices. I had to make the dashboard expose enough detail to interpret those cases instead of flattening everything into a single score.
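
For the pending-pod case in particular, one way to avoid flattening is to label each Pending pod with a coarse cause derived from its PodScheduled condition. The sketch below works on a pod object already parsed into a dictionary (for example from kubectl get pod -o json). The function classify_pending and its labels are hypothetical, and the substring matching is an assumption: scheduler message wording varies across Kubernetes versions, and controller-level causes are not covered.

```python
def classify_pending(pod):
    """Guess why a Pending pod is not running, from its PodScheduled condition."""
    status = pod.get("status", {})
    if status.get("phase") != "Pending":
        return "not_pending"

    for cond in status.get("conditions", []):
        if cond.get("type") != "PodScheduled" or cond.get("status") != "False":
            continue
        # Assumption: the scheduler message contains these substrings.
        message = cond.get("message", "").lower()
        if "insufficient" in message:
            return "resource_pressure"
        if "affinity/selector" in message or "node selector" in message:
            return "selector_mismatch"
        if "taint" in message:
            return "taint"
        return "unschedulable_other"

    # Already scheduled but still Pending (for example image pull),
    # or no scheduling condition has been reported yet.
    return "pending_not_scheduling"


# Hypothetical pod that was deliberately given an unschedulable node selector.
pod = {
    "status": {
        "phase": "Pending",
        "conditions": [{
            "type": "PodScheduled",
            "status": "False",
            "reason": "Unschedulable",
            "message": "0/1 nodes are available: 1 node(s) didn't match "
                       "Pod's node affinity/selector.",
        }],
    }
}
cause = classify_pending(pod)
```

When a message names several causes at once, this sketch reports resource pressure first. That ordering is arbitrary, so keeping the raw message alongside the label remains useful.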

That is why the dashboard kept both raw-ish cluster state and interpreted decision state. The raw state tells me what Kubernetes is showing. The decision state tells me what the orchestrator thinks it should do. The interesting debugging happens when those two disagree.

Where this phase ended

By the end of this phase, I had a live experimental dashboard that could keep up with the local Kubernetes loop. It was not yet a fair baseline comparison, but it was the first time the project felt like a control system instead of a batch experiment.