Kubernetes-native scheduler backend + manifest generation in forge package #162

@initializ-mk

Context

The current scheduler (forge-core/scheduler/scheduler.go) uses a
single 30s ticker goroutine and persists state to
<WorkDir>/.forge/memory/SCHEDULES.md. This causes three operational
problems when the agent runs as a container:

  1. No persistence by default — LLM-set schedules (via the
    schedule_set builtin tool) and run history disappear on pod
    restart unless a PVC is mounted at <WorkDir>/.forge.
  2. Not horizontally safe — two replicas both ticking on the same
    SCHEDULES.md fire every schedule twice and race on file rename.
  3. Invisible to standard K8s tooling — operators can't
    kubectl get cronjobs to inspect what's scheduled.

K8s already solves all three (cluster-singleton CronJob controller,
durable in etcd, native kubectl integration). The right
architecture is a hybrid backend that picks the cluster when running
in-cluster and falls back to the file store otherwise.

Proposal — Two parts

Part 1 — Hybrid scheduler backend (runtime side)

A new ScheduleBackend interface with two implementations chosen by
environment detection at startup:

// forge-core/scheduler/backend.go (new)
type ScheduleBackend interface {
    Sync(ctx context.Context, entries []Schedule) error  // declarative; startup + hot-reload
    Add(ctx context.Context, s Schedule) error           // dynamic; from schedule_set tool
    Delete(ctx context.Context, id string) error
    List(ctx context.Context) ([]Schedule, error)
}

type FileBackend struct { ... }         // wraps the existing MemoryScheduleStore + ticker
type KubernetesBackend struct { ... }   // uses k8s.io/client-go to CRUD CronJobs

Detection:

func InCluster() bool {
    _, err := os.Stat("/var/run/secrets/kubernetes.io/serviceaccount/token")
    return err == nil
}

forge.yaml escape hatch:

scheduler:
  backend: auto                 # auto | file | kubernetes
  kubernetes:
    namespace: ""               # defaults to the pod's own namespace
    service_url: ""             # the agent's in-cluster Service URL (required in k8s mode)
    allow_dynamic: false        # whether schedule_set (LLM-driven) can create CronJobs at runtime
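The interaction between scheduler.backend and the detection function can be sketched as below. This is only an illustration: ResolveBackend is a hypothetical helper name, and rejecting backend: kubernetes when no ServiceAccount token is present is an assumption (the issue does not say whether that case should fail or fall back).

```go
package main

import (
	"fmt"
	"os"
)

// tokenPath is the ServiceAccount token location used for in-cluster detection.
const tokenPath = "/var/run/secrets/kubernetes.io/serviceaccount/token"

// InCluster mirrors the detection function from the proposal.
func InCluster() bool {
	_, err := os.Stat(tokenPath)
	return err == nil
}

// ResolveBackend is a hypothetical helper that maps the scheduler.backend
// setting plus the detection result to a concrete backend name.
func ResolveBackend(mode string, inCluster bool) (string, error) {
	switch mode {
	case "", "auto":
		if inCluster {
			return "kubernetes", nil
		}
		return "file", nil
	case "file":
		return "file", nil
	case "kubernetes":
		// Assumption: forcing kubernetes outside a cluster is a config error.
		if !inCluster {
			return "", fmt.Errorf("scheduler.backend=kubernetes but no ServiceAccount token at %s", tokenPath)
		}
		return "kubernetes", nil
	default:
		return "", fmt.Errorf("unknown scheduler.backend %q (want auto, file, or kubernetes)", mode)
	}
}

func main() {
	backend, err := ResolveBackend("auto", InCluster())
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("scheduler backend:", backend)
}
```

Taking the detection result as a parameter keeps the selection logic testable without a real ServiceAccount mount.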

Part 2 — forge package generates CronJob manifests

Today forge package emits a Deployment + Service + ConfigMap for
the agent. It should also emit one CronJob per entry in
forge.yaml schedules[]. Operators then kubectl apply -k ./k8s
once and get the agent + every declarative schedule materialized as
real CronJobs — no runtime CRUD calls needed for static schedules.

The runtime KubernetesBackend's Sync() reconciles in case someone
edits forge.yaml between deployments, but the steady-state expectation
is "declarative schedules are baked into the deploy manifest, dynamic
ones go through the API."

This also covers the case where the operator wants K8s-native
scheduling without granting the agent pod RBAC to create CronJobs:
set scheduler.kubernetes.allow_dynamic: false (the default) and let
forge package generate the CronJobs. The agent then only needs
get/list RBAC for the schedule_list tool.

Authentication — reuse the existing loopback token

Forge already mints an internal bearer token for channel plugins to
call back into the A2A endpoint:

  • Runner.ResolveAuth() generates r.authToken at startup
    (runner.go:201,215).
  • Stored via auth.StoreToken(WorkDir, token) to
    <WorkDir>/.forge/runtime.token with 0600 permissions
    (forge-core/auth/token.go:47) so internal callers can read it.
  • A static_token auth provider is prepended to the chain keyed on
    this token, identity {UserID: "forge-internal", Source: "internal"} (runner.go:2425-2436).
  • Channel plugins consume it via Runner.AuthToken().

CronJobs reuse the exact same token, so there is no new auth surface
to design. Each CronJob sends the token as
Authorization: Bearer <token>; the existing loopback static_token
provider validates it, and the auth_verify event lands with
Source: "internal", identical to a channel callback.
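For illustration, the check a static-token provider performs on the CronJob's request might look roughly like this. It is a sketch, not forge's actual provider: verifyStaticToken and the Identity struct are hypothetical names, and the constant-time comparison is an assumption about how the token should be compared.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"strings"
)

// Identity carries the two fields quoted in the issue; the struct itself
// is a hypothetical stand-in for forge's real identity type.
type Identity struct {
	UserID string
	Source string
}

// verifyStaticToken is a hypothetical stand-in for the static_token provider.
// It accepts "Authorization: Bearer <token>" and compares in constant time.
func verifyStaticToken(authHeader, want string) (Identity, bool) {
	const prefix = "Bearer "
	if want == "" || !strings.HasPrefix(authHeader, prefix) {
		return Identity{}, false
	}
	got := strings.TrimPrefix(authHeader, prefix)
	if subtle.ConstantTimeCompare([]byte(got), []byte(want)) != 1 {
		return Identity{}, false
	}
	return Identity{UserID: "forge-internal", Source: "internal"}, true
}

func main() {
	id, ok := verifyStaticToken("Bearer example-token", "example-token")
	fmt.Println(id.UserID, id.Source, ok)
}
```

Rejecting an empty expected token guards against a pod that started without a token accepting "Bearer " with nothing after it.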

Token provisioning — manifest is a template, NOT a credential

forge package MUST NOT embed the token value in any generated file.
The build pipeline runs in CI and on developer workstations, and its
output ends up in git repos, container registries, and operator
laptops, none of which are appropriate places for a long-lived
bearer token. Base64-encoded data inside a Secret manifest is
plaintext as far as version control is concerned.

Instead, forge package emits a Secret template with no data field,
plus inline instructions for the operator:

# k8s/internal-token-secret.yaml — generated by `forge package`
apiVersion: v1
kind: Secret
metadata:
  name: my-agent-internal-token
  namespace: default
  labels:
    forge.agent.id: my-agent
type: Opaque
# data:
#   token: <BASE64-OF-RUNTIME-TOKEN>
#
# This Secret is intentionally generated WITHOUT a `data` field.
# The token is a security credential and must not be checked into
# version control. Populate it once per deployment via one of:
#
#   1. kubectl create secret generic my-agent-internal-token \
#        --from-literal=token="$(cat .forge/runtime.token)" \
#        -n default --dry-run=client -o yaml | kubectl apply -f -
#
#   2. Use your secret-manager operator (ExternalSecrets / Sealed
#      Secrets / SOPS / Vault Agent Injector) and point its
#      external-secret manifest at this Secret name.
#
#   3. For first-deploy bootstrap from a clean checkout where no
#      .forge/runtime.token exists yet, run `forge auth show-token`
#      against the deployed pod (after the Deployment has minted a
#      token on its volume) or pre-mint one with
#      `forge auth mint-token`.

The Deployment manifest references the Secret by name as today.
Applying the Deployment before the Secret exists keeps the pod from
becoming Ready, with a clear event:
MountVolume.SetUp failed for volume "internal-token": secret "my-agent-internal-token" not found.
Operators get a loud "you forgot the token" signal rather than a
silent fallback.

A new forge auth subcommand (small follow-on, in scope here):

| Command | Behavior |
| --- | --- |
| forge auth show-token | Print the token from <WorkDir>/.forge/runtime.token. Exit 1 with a clear error if absent. |
| forge auth mint-token | Generate a fresh token, store it via auth.StoreToken, print it to stdout. Useful for first-time deploys. |
| forge auth secret-yaml | Print a ready-to-apply Secret YAML with the token loaded from the local store. Pipe straight to kubectl apply -f -. Default behavior matches the option-1 example above, but in one command. |

These forge auth subcommands belong in the same PR as the K8s backend
because they're the operator-facing primitives that close the loop on
the "manifest-without-credential" decision.

Generated CronJob shape

For each forge.yaml schedules[] entry:

apiVersion: batch/v1
kind: CronJob
metadata:
  name: forge-aibuilderdemo-daily-summary
  namespace: default
  labels:
    forge.agent.id: aibuilderdemo
    forge.schedule.id: daily-summary
    forge.schedule.source: yaml         # "yaml" or "llm"
spec:
  schedule: "0 9 * * *"
  concurrencyPolicy: Forbid              # K8s-native overlap prevention
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger
              image: curlimages/curl:8.10.1
              env:
                - name: FORGE_AUTH_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: my-agent-internal-token
                      key: token
              command: ["/bin/sh", "-c"]   # shell needed so $(date +%s) expands at run time
              args:
                - |
                  curl -sX POST http://my-agent.default.svc:8383/ \
                    -H "Authorization: Bearer ${FORGE_AUTH_TOKEN}" \
                    -H "X-Forge-Schedule-Id: daily-summary" \
                    -H "Content-Type: application/json" \
                    --data '{"jsonrpc":"2.0","id":"1","method":"tasks/send","params":{"id":"sched-daily-summary-'"$(date +%s)"'","message":{"role":"user","parts":[{"type":"text","text":"<schedule description from forge.yaml>"}]}}}'

concurrencyPolicy: Forbid is K8s's native equivalent of the current
scheduler's schedule_skip on overlap — same semantic, free.

Audit-event linkage

The agent recognizes a scheduled fire by the X-Forge-Schedule-Id
request header. Middleware reads it at the A2A boundary and stashes
it in ctx alongside the existing workflow / tenancy context. The
runner emits schedule_fire itself before dispatching, then the
normal session_start → llm_call → invocation_complete chain runs,
capped with schedule_complete. Same audit shape as today; the
cluster is just the remote ticker.

RBAC

When running with the KubernetesBackend, the agent's ServiceAccount needs:

- apiGroups: ["batch"]
  resources: ["cronjobs"]
  verbs:
    - get          # always
    - list         # always (powers schedule_list tool)
    - create       # only when allow_dynamic: true
    - patch        # only when allow_dynamic: true
    - delete       # only when allow_dynamic: true OR a yaml schedule was removed

forge package emits a Role + RoleBinding scoped to the agent's
own namespace with the minimum verbs based on
scheduler.kubernetes.allow_dynamic. Default false → get,
list only.

Granting create/delete is a meaningful privilege escalation —
essentially "let the LLM schedule arbitrary HTTP calls back to me"
when allow_dynamic: true. Document loudly.
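The verb selection that forge package would apply when emitting the Role is simple enough to show directly. cronJobVerbs is a hypothetical helper; it follows the stated default (get and list only) and does not model the "a yaml schedule was removed" case for delete.

```go
package main

import "fmt"

// cronJobVerbs is a hypothetical helper returning the minimum verbs on
// batch/cronjobs for the generated Role.
func cronJobVerbs(allowDynamic bool) []string {
	verbs := []string{"get", "list"} // always needed (schedule_list)
	if allowDynamic {
		// Only granted when the LLM may create schedules at runtime.
		verbs = append(verbs, "create", "patch", "delete")
	}
	return verbs
}

func main() {
	fmt.Println("allow_dynamic=false:", cronJobVerbs(false))
	fmt.Println("allow_dynamic=true: ", cronJobVerbs(true))
}
```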

On restart — the user-described behavior

KubernetesBackend's Sync() is idempotent. On restart:

  1. List all CronJobs in the namespace with label forge.agent.id=<self>.
  2. For each forge.yaml entry: if CronJob exists with matching spec →
    leave. If exists with stale spec → patch. If absent → create.
  3. For each existing CronJob NOT in forge.yaml AND labeled
    forge.schedule.source: yaml → delete (handles renamed/removed
    schedules).
  4. CronJobs labeled forge.schedule.source: llm are left alone (the
    LLM owns those; user code shouldn't reap them on restart).

Steady state: the cluster is the source of truth. schedule_list
returns the live CronJob set. No SCHEDULES.md to keep in sync.
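The four reconciliation steps above can be expressed as a pure planning function that decides what to do before any API call is made. This is a sketch under simplifying assumptions: the types and planSync are hypothetical, and "matching spec" is reduced to comparing the cron expression.

```go
package main

import (
	"fmt"
	"sort"
)

// desired is one forge.yaml schedules[] entry (hypothetical shape).
type desired struct{ ID, Cron string }

// existing is one CronJob already in the cluster, reduced to what the
// planner needs; Source is the forge.schedule.source label.
type existing struct{ ID, Cron, Source string }

// plan lists schedule ids per action.
type plan struct{ Create, Patch, Delete, Keep []string }

// planSync computes the idempotent reconcile plan described in the issue.
func planSync(want []desired, have []existing) plan {
	haveByID := make(map[string]existing, len(have))
	for _, e := range have {
		haveByID[e.ID] = e
	}
	wanted := make(map[string]bool, len(want))
	var p plan
	for _, w := range want {
		wanted[w.ID] = true
		e, ok := haveByID[w.ID]
		switch {
		case !ok:
			p.Create = append(p.Create, w.ID)
		case e.Cron != w.Cron: // assumption: spec equality == same cron
			p.Patch = append(p.Patch, w.ID)
		default:
			p.Keep = append(p.Keep, w.ID)
		}
	}
	for _, e := range have {
		if wanted[e.ID] {
			continue
		}
		if e.Source == "yaml" {
			p.Delete = append(p.Delete, e.ID) // removed or renamed in forge.yaml
		} else {
			p.Keep = append(p.Keep, e.ID) // llm-owned: never reaped on restart
		}
	}
	for _, s := range [][]string{p.Create, p.Patch, p.Delete, p.Keep} {
		sort.Strings(s)
	}
	return p
}

func main() {
	p := planSync(
		[]desired{{"daily-summary", "0 9 * * *"}, {"hourly-check", "0 * * * *"}},
		[]existing{
			{"daily-summary", "0 8 * * *", "yaml"},
			{"old-report", "0 0 * * 0", "yaml"},
			{"remind-me", "*/30 * * * *", "llm"},
		},
	)
	fmt.Printf("%+v\n", p)
}
```

Separating planning from execution also makes it easy to log or dry-run the changes a restart would make.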

Local fallback unchanged

Outside the cluster — forge run on a laptop, CI, a non-k8s VM —
detection returns false, backend resolves to FileBackend, today's
30s-ticker + SCHEDULES.md behavior is byte-identical to current
main. No regression risk for the dev path.

Implementation footprint

~300-400 lines total:

| File | Change |
| --- | --- |
| forge-core/scheduler/backend.go (new) | ScheduleBackend interface |
| forge-core/scheduler/file_backend.go (new) | Wraps existing MemoryScheduleStore + ticker behind the interface |
| forge-core/scheduler/k8s_backend.go (new) | client-go based; uses BatchV1().CronJobs(ns) for CRUD |
| forge-core/scheduler/k8s_manifest.go (new) | Pure-Go CronJob YAML generation (no client-go dep; usable from forge package without API access) |
| forge-core/types/config.go | scheduler.backend + scheduler.kubernetes block |
| forge-cli/runtime/runner.go | Pick backend at startup; thread service URL + auth token into KubernetesBackend |
| forge-cli/build/k8s_stage.go (or wherever forge package emits manifests) | Emit one CronJob per schedules[] entry + the credential-less Secret template + the optional Role/RoleBinding |
| forge-cli/cmd/auth.go (new) | forge auth show-token / mint-token / secret-yaml subcommands |
| forge-cli/server/middleware | Read X-Forge-Schedule-Id header, stash in ctx |
| forge-cli/runtime/runner.go (schedule dispatch) | If X-Forge-Schedule-Id is set on the inbound, emit schedule_fire / schedule_complete around the dispatch |
| docs/deployment/scheduler-kubernetes.md (new) | RBAC table, manifest examples, token-provisioning runbook, security model, comparison with file backend |
| Tests | Unit tests against a fake kubernetes.Interface; manifest-generation golden tests asserting Secret has NO data field; e2e against kind cluster (optional) |

Out of scope

  • Schedule history retrieval in k8s mode — could read K8s Job
    status, but easier and more uniform: keep history from the audit
    stream (schedule_complete events already carry status +
    duration). schedule_history tool reads from a small in-memory
    ring buffer fed by the audit emitter, regardless of backend.
  • Cross-namespace deployments — first cut assumes CronJob and
    agent live in the same namespace.
  • Multi-cluster — one cluster per agent.
  • Refresh of agent service URL on Service IP changes — once the
    CronJob is created, it points at the Service's DNS name (stable);
    Pod IP changes are irrelevant.
  • Real-time interactive replacement: schedule_set from an
    in-flight chat re-creates a CronJob, and the CronJob controller
    picks it up on its next sync, so the lag should be negligible
    at cron's one-minute granularity.
  • Auto-rotating the internal token — initial implementation
    treats the token as a long-lived credential. Operators rotate by
    re-deploying with a fresh token in the Secret + the agent pod
    picking it up on restart. Auto-rotation with a transitional grace
    window is a separate follow-on.

Verification

  1. forge run on a laptop: confirm FileBackend is selected,
    SCHEDULES.md is still written, and in-cluster detection runs
    in-process with no client-go in its import graph. Zero behavior
    change.
  2. forge package on the same agent — confirm k8s/ directory now
    contains:
    • cronjob-<sched-id>.yaml files matching the manifest shape
    • internal-token-secret.yaml with no data field (golden
      test asserts this — committing a token-bearing manifest must
      be impossible)
    • role-scheduler.yaml + rolebinding-scheduler.yaml
  3. forge auth secret-yaml | kubectl apply -f - populates the
    Secret out-of-band.
  4. kubectl apply -k ./k8s and confirm CronJobs appear; kubectl get cronjobs shows them; on schedule time, a Job pod fires and curls
    the agent's Service URL.
  5. Tail the audit socket and confirm schedule_fire →
    session_start → llm_call → invocation_complete →
    schedule_complete lands, with the inbound auth_verify showing
    Source: "internal".
  6. With scheduler.kubernetes.allow_dynamic: true, call
    schedule_set mid-conversation and confirm a new CronJob is
    created in the namespace.
  7. Restart the agent pod and confirm CronJobs persist; schedule_list
    returns the same set; no SCHEDULES.md is written to disk.
  8. Apply only the Deployment without the Secret; confirm the pod
    stays NotReady with a clear secret "..." not found event,
    rather than starting with no scheduling.

Related

  • See the conversation that led to this issue: hybrid backend reusing
    the loopback static_token (runner.go:2425-2436) that channel
    plugins already use.
  • forge-core/scheduler/scheduler.go: current ticker implementation
  • forge-cli/runtime/scheduler_store.go: current file-backed store
  • forge-core/auth/token.go: StoreToken / LoadToken against
    <WorkDir>/.forge/runtime.token
