Kubernetes-native scheduler backend + manifest generation in forge package

## Context

The current scheduler (`forge-core/scheduler/scheduler.go`) uses a
single 30s ticker goroutine and persists state to
`<WorkDir>/.forge/memory/SCHEDULES.md`. This causes three operational
problems when the agent runs as a container:

1. **No persistence by default** — LLM-set schedules (via the
   `schedule_set` builtin tool) and run history disappear on pod
   restart unless a PVC is mounted at `<WorkDir>/.forge`.
2. **Not horizontally safe** — two replicas both ticking on the same
   SCHEDULES.md fire every schedule twice and race on file rename.
3. **Invisible to standard K8s tooling** — operators can't
   `kubectl get cronjobs` to inspect what's scheduled.

K8s already solves all three (cluster-singleton CronJob controller,
durable in etcd, native `kubectl` integration). The right
architecture is a hybrid backend that picks the cluster when running
in-cluster and falls back to the file store otherwise.

## Proposal — Two parts

### Part 1 — Hybrid scheduler backend (runtime side)

A new `ScheduleBackend` interface with two implementations chosen by
environment detection at startup:

```go
// forge-core/scheduler/backend.go (new)
type ScheduleBackend interface {
    Sync(ctx context.Context, entries []Schedule) error  // declarative; startup + hot-reload
    Add(ctx context.Context, s Schedule) error           // dynamic; from schedule_set tool
    Delete(ctx context.Context, id string) error
    List(ctx context.Context) ([]Schedule, error)
}

type FileBackend struct { ... }         // wraps the existing MemoryScheduleStore + ticker
type KubernetesBackend struct { ... }   // uses k8s.io/client-go to CRUD CronJobs
```

Detection:

```go
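import "os"

// InCluster reports whether a ServiceAccount token is mounted, which is
// the case for a pod that has been given API credentials.
func InCluster() bool {
    _, err := os.Stat("/var/run/secrets/kubernetes.io/serviceaccount/token")
    return err == nil
}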
```
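
The same check is easier to unit-test when the mount directory is a parameter. The sketch below is a hypothetical variant: `inClusterAt` and `podNamespace` are illustrative names, not existing forge functions. It also reads the namespace file from the standard ServiceAccount mount, which is one way an empty `scheduler.kubernetes.namespace` could default to the pod's own namespace.

```go
package main

import (
	"os"
	"path/filepath"
	"strings"
)

// saDir is the standard ServiceAccount mount inside a pod.
const saDir = "/var/run/secrets/kubernetes.io/serviceaccount"

// inClusterAt reports whether a ServiceAccount token exists under dir.
// Hypothetical helper: dir is a parameter so tests can point it elsewhere.
func inClusterAt(dir string) bool {
	_, err := os.Stat(filepath.Join(dir, "token"))
	return err == nil
}

// podNamespace returns the namespace recorded in the ServiceAccount mount,
// or fallback when the file is missing or empty.
func podNamespace(dir, fallback string) string {
	b, err := os.ReadFile(filepath.Join(dir, "namespace"))
	if err != nil {
		return fallback
	}
	if ns := strings.TrimSpace(string(b)); ns != "" {
		return ns
	}
	return fallback
}
```

In production code both helpers would be called with `saDir`; on a laptop the directory does not exist, so detection returns false and the fallback namespace is used.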

forge.yaml escape hatch:

```yaml
scheduler:
  backend: auto                 # auto | file | kubernetes
  kubernetes:
    namespace: ""               # defaults to the pod's own namespace
    service_url: ""             # the agent's in-cluster Service URL (required in k8s mode)
    allow_dynamic: false        # whether schedule_set (LLM-driven) can create CronJobs at runtime
```
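
How the three `backend` values might resolve can be sketched as follows. This is a minimal illustration under stated assumptions: `SchedulerConfig` and `resolveBackend` are hypothetical names, an empty value is treated like `auto`, and explicit `file` / `kubernetes` settings are honored without consulting detection.

```go
package main

import (
	"errors"
	"fmt"
)

// SchedulerConfig mirrors part of the scheduler block above.
// The Go field names are assumptions made for this sketch.
type SchedulerConfig struct {
	Backend    string // "auto", "file", or "kubernetes"
	ServiceURL string // scheduler.kubernetes.service_url
}

// resolveBackend maps the configured value plus the detection result to a
// concrete backend name. "auto" (or empty) follows detection.
func resolveBackend(cfg SchedulerConfig, inCluster bool) (string, error) {
	backend := cfg.Backend
	if backend == "" || backend == "auto" {
		if inCluster {
			backend = "kubernetes"
		} else {
			backend = "file"
		}
	}
	switch backend {
	case "file":
		return "file", nil
	case "kubernetes":
		// service_url is documented above as required in k8s mode.
		if cfg.ServiceURL == "" {
			return "", errors.New("scheduler.kubernetes.service_url is required in kubernetes mode")
		}
		return "kubernetes", nil
	default:
		return "", fmt.Errorf("unknown scheduler.backend %q", cfg.Backend)
	}
}
```

Failing fast on a missing `service_url` keeps a misconfigured pod from creating CronJobs that point nowhere.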

### Part 2 — `forge package` generates CronJob manifests

Today `forge package` emits a Deployment + Service + ConfigMap for
the agent. It should also emit one CronJob per entry in
`forge.yaml` `schedules[]`. Operators then `kubectl apply -k ./k8s`
once and get the agent + every declarative schedule materialized as
real CronJobs — no runtime CRUD calls needed for static schedules.
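
One detail the generator has to handle is naming: Kubernetes limits CronJob names to 52 characters, because the controller appends a suffix when it creates Jobs. A possible name derivation is sketched below; `cronJobName` is a hypothetical helper, and the `forge-<agent>-<schedule>` pattern is inferred from the example manifest in this issue rather than from existing code.

```go
package main

import "strings"

// maxCronJobName is the Kubernetes limit for CronJob names.
const maxCronJobName = 52

// cronJobName builds "forge-<agent>-<schedule>", lowercases it, replaces
// runs of characters outside [a-z0-9] with a single dash, and truncates
// to the CronJob name limit.
func cronJobName(agentID, scheduleID string) string {
	raw := strings.ToLower("forge-" + agentID + "-" + scheduleID)
	var b strings.Builder
	prevDash := false
	for _, r := range raw {
		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') {
			b.WriteRune(r)
			prevDash = false
			continue
		}
		if !prevDash {
			b.WriteByte('-')
			prevDash = true
		}
	}
	name := strings.Trim(b.String(), "-")
	if len(name) > maxCronJobName {
		// Truncation can make two long IDs collide; a real generator
		// would probably append a short hash to keep names unique.
		name = strings.TrimRight(name[:maxCronJobName], "-")
	}
	return name
}
```

For example, `cronJobName("aibuilderdemo", "daily-summary")` yields `forge-aibuilderdemo-daily-summary`.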

The runtime KubernetesBackend's `Sync()` reconciles in case someone
edits forge.yaml between deployments, but the steady-state expectation
is "declarative schedules are baked into the deploy manifest, dynamic
ones go through the API."

This also covers the case where the operator wants K8s-native
scheduling without granting the agent pod RBAC to create CronJobs:
leave `scheduler.kubernetes.allow_dynamic: false` (the default) and let
`forge package` generate the CronJobs. The agent then only needs
`get`/`list` RBAC for the `schedule_list` tool.

## Authentication — reuse the existing loopback token

Forge already mints an internal bearer token for channel plugins to
call back into the A2A endpoint:

- `Runner.ResolveAuth()` generates `r.authToken` at startup
  (`runner.go:201,215`).
- Stored via `auth.StoreToken(WorkDir, token)` to
  `<WorkDir>/.forge/runtime.token` with `0600` permissions
  (`forge-core/auth/token.go:47`) so internal callers can read it.
- A `static_token` auth provider is prepended to the chain keyed on
  this token, identity `{UserID: "forge-internal", Source:
  "internal"}` (`runner.go:2425-2436`).
- Channel plugins consume it via `Runner.AuthToken()`.

CronJobs reuse the **exact same token**, so there is no new auth
surface to design. Each CronJob sends the token as
`Authorization: Bearer <token>`, the existing loopback static_token
provider validates it, and the `auth_verify` event lands with
`Source: "internal"`, identical to a channel callback.
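
For readers unfamiliar with the check, validating a static bearer token generally looks like the sketch below. This is not the existing `static_token` provider; `bearerToken` and `validInternalToken` are hypothetical names, and the constant-time comparison is a common hardening step assumed here rather than confirmed from the forge source.

```go
package main

import (
	"crypto/subtle"
	"strings"
)

// bearerToken extracts the token from an Authorization header value.
// The scheme name is matched case-insensitively.
func bearerToken(header string) (string, bool) {
	const prefix = "Bearer "
	if len(header) <= len(prefix) || !strings.EqualFold(header[:len(prefix)], prefix) {
		return "", false
	}
	return strings.TrimSpace(header[len(prefix):]), true
}

// validInternalToken reports whether header carries exactly the expected
// token. An empty expected token never validates.
func validInternalToken(header, want string) bool {
	got, ok := bearerToken(header)
	if !ok || want == "" {
		return false
	}
	// Constant-time comparison avoids leaking how many leading bytes matched.
	return subtle.ConstantTimeCompare([]byte(got), []byte(want)) == 1
}
```

A CronJob request and a channel-plugin callback would both pass through the same check, which is why they produce the same `auth_verify` identity.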

## Token provisioning — manifest is a template, NOT a credential

**`forge package` MUST NOT embed the token value in any generated
file.** The build pipeline runs in CI / developer workstations and
its output ends up in git repos, container registries, and operator
laptops — none of which are appropriate places for a long-lived
bearer token. Base64-encoded data inside a Secret manifest is
plaintext as far as version control is concerned.

Instead, `forge package` emits a **Secret template with no `data`
field** plus human-readable instructions for the operator:

```yaml
# k8s/internal-token-secret.yaml — generated by `forge package`
apiVersion: v1
kind: Secret
metadata:
  name: my-agent-internal-token
  namespace: default
  labels:
    forge.agent.id: my-agent
type: Opaque
# data:
#   token: <BASE64-OF-RUNTIME-TOKEN>
#
# This Secret is intentionally generated WITHOUT a `data` field.
# The token is a security credential and must not be checked into
# version control. Populate it once per deployment via one of:
#
#   1. kubectl create secret generic my-agent-internal-token \
#        --from-literal=token="$(cat .forge/runtime.token)" \
#        -n default --dry-run=client -o yaml | kubectl apply -f -
#
#   2. Use your secret-manager operator (ExternalSecrets / Sealed
#      Secrets / SOPS / Vault Agent Injector) and point its
#      external-secret manifest at this Secret name.
#
#   3. For first-deploy bootstrap from a clean checkout where no
#      .forge/runtime.token exists yet, run `forge auth show-token`
#      against the deployed pod (after the Deployment has minted a
#      token on its volume) or pre-mint one with
#      `forge auth mint-token`.
```
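
The out-of-band population step could be scripted along the following lines. This is a sketch under assumptions: `mintToken` and `secretYAML` are hypothetical helpers, and the 32-byte hex token format is illustrative since the issue does not specify how `Runner.ResolveAuth()` formats its token. The rendered output is meant to be piped to `kubectl apply -f -`, never written into the repository.

```go
package main

import (
	"crypto/rand"
	"encoding/base64"
	"encoding/hex"
	"fmt"
)

// mintToken returns 32 random bytes as a hex string.
// The format is an assumption made for this sketch.
func mintToken() (string, error) {
	buf := make([]byte, 32)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	return hex.EncodeToString(buf), nil
}

// secretYAML renders a populated Secret. Kubernetes expects values under
// `data` to be base64-encoded; that is an encoding, not encryption.
func secretYAML(name, namespace, token string) string {
	return fmt.Sprintf(`apiVersion: v1
kind: Secret
metadata:
  name: %s
  namespace: %s
type: Opaque
data:
  token: %s
`, name, namespace, base64.StdEncoding.EncodeToString([]byte(token)))
}
```

Keeping the populated form in a function that only writes to stdout makes it harder to leave a token-bearing file on disk by accident.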

The Deployment manifest references the Secret by name as today. If
the Deployment is applied before the Secret exists, the pod never
starts: it stays in `ContainerCreating` with a `FailedMount` event
reading `MountVolume.SetUp failed for volume "internal-token": secret
"my-agent-internal-token" not found`. Operators get a loud "you forgot
the token" signal rather than a silent fallback.

A new `forge auth` subcommand (small follow-on, in scope here):

| Command | Behavior |
|---------|----------|
| `forge auth show-token` | Print the token from `<WorkDir>/.forge/runtime.token`. Exit 1 + clear error if absent. |
| `forge auth mint-token` | Generate a fresh token, store it via `auth.StoreToken`, print it to stdout. Useful for first-time deploys. |
| `forge auth secret-yaml` | Print a ready-to-apply Secret YAML with the token loaded from local store. Pipe straight to `kubectl apply -f -`. Default behavior matches the option-1 example above but in one command. |

These belong in the same PR as the K8s backend because they're the
operator-facing primitives that close the loop on the
"manifest-without-credential" decision.

## Generated CronJob shape

For each `forge.yaml` `schedules[]` entry:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: forge-aibuilderdemo-daily-summary
  namespace: default
  labels:
    forge.agent.id: aibuilderdemo
    forge.schedule.id: daily-summary
    forge.schedule.source: yaml         # "yaml" or "llm"
spec:
  schedule: "0 9 * * *"
  concurrencyPolicy: Forbid              # K8s-native overlap prevention
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trigger
              image: curlimages/curl:8.10.1
              env:
                - name: FORGE_AUTH_TOKEN
                  valueFrom:
                    secretKeyRef:
                      name: my-agent-internal-token
                      key: token
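              # Run curl through a shell: Kubernetes expands $(VAR) env references
              # in args but never runs commands, so $(date +%s) would otherwise be
              # sent literally. -f makes HTTP error responses fail the Job.
              command: ["/bin/sh", "-c"]
              args:
                - |
                  curl -fsS -X POST http://my-agent.default.svc:8383/ \
                    -H "Authorization: Bearer ${FORGE_AUTH_TOKEN}" \
                    -H "X-Forge-Schedule-Id: daily-summary" \
                    -H "Content-Type: application/json" \
                    --data '{"jsonrpc":"2.0","id":"1","method":"tasks/send","params":{"id":"sched-daily-summary-'"$(date +%s)"'","message":{"role":"user","parts":[{"type":"text","text":"<schedule description from forge.yaml>"}]}}}'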
```

`concurrencyPolicy: Forbid` is K8s's native equivalent of the current
scheduler's `schedule_skip` on overlap — same semantic, free.
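
Since the schedule description is free text from forge.yaml and may contain quotes, the JSON-RPC body is safer to build with a JSON encoder than by string concatenation. The sketch below mirrors the payload shape from the manifest above; the Go type names and `scheduleRequestBody` are hypothetical, and passing the timestamp in as a parameter is a choice made here to keep the function deterministic.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type textPart struct {
	Type string `json:"type"`
	Text string `json:"text"`
}

type message struct {
	Role  string     `json:"role"`
	Parts []textPart `json:"parts"`
}

type taskParams struct {
	ID      string  `json:"id"`
	Message message `json:"message"`
}

type rpcRequest struct {
	JSONRPC string     `json:"jsonrpc"`
	ID      string     `json:"id"`
	Method  string     `json:"method"`
	Params  taskParams `json:"params"`
}

// scheduleRequestBody builds the tasks/send body for one scheduled fire.
// unixSeconds makes the task ID unique per fire, like $(date +%s) above.
func scheduleRequestBody(scheduleID, description string, unixSeconds int64) (string, error) {
	req := rpcRequest{
		JSONRPC: "2.0",
		ID:      "1",
		Method:  "tasks/send",
		Params: taskParams{
			ID: fmt.Sprintf("sched-%s-%d", scheduleID, unixSeconds),
			Message: message{
				Role:  "user",
				Parts: []textPart{{Type: "text", Text: description}},
			},
		},
	}
	b, err := json.Marshal(req)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

The encoder escapes quotes and newlines in the description, which a hand-assembled string inside a shell argument would not.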

## Audit-event linkage

The agent recognizes a scheduled fire by the `X-Forge-Schedule-Id`
request header. Middleware reads it at the A2A boundary and stashes
it in ctx alongside the existing workflow / tenancy context. The
runner emits `schedule_fire` itself before dispatching, then the
normal `session_start → llm_call → invocation_complete` chain runs,
capped with `schedule_complete`. Same audit shape as today; the
cluster is just the remote ticker.
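
A minimal version of that middleware, using only `net/http` and `context`, might look like the sketch below. The names `withScheduleID`, `scheduleIDFrom`, and `observedScheduleID` are hypothetical, and the existing workflow / tenancy context plumbing is not shown.

```go
package main

import (
	"context"
	"net/http"
)

// scheduleIDKey is an unexported context key type to avoid collisions.
type scheduleIDKey struct{}

const scheduleHeader = "X-Forge-Schedule-Id"

// withScheduleID copies the schedule header, when present, into the
// request context before calling the next handler.
func withScheduleID(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if id := r.Header.Get(scheduleHeader); id != "" {
			r = r.WithContext(context.WithValue(r.Context(), scheduleIDKey{}, id))
		}
		next.ServeHTTP(w, r)
	})
}

// scheduleIDFrom returns the schedule ID stashed by withScheduleID, if any.
func scheduleIDFrom(ctx context.Context) (string, bool) {
	id, ok := ctx.Value(scheduleIDKey{}).(string)
	return id, ok
}

// observedScheduleID is a demo helper: it sends one request through the
// middleware and reports what the inner handler saw in its context.
func observedScheduleID(headerValue string) (string, bool) {
	var id string
	var ok bool
	h := withScheduleID(http.HandlerFunc(func(_ http.ResponseWriter, r *http.Request) {
		id, ok = scheduleIDFrom(r.Context())
	}))
	req, err := http.NewRequest(http.MethodPost, "http://agent.invalid/", nil)
	if err != nil {
		return "", false
	}
	if headerValue != "" {
		req.Header.Set(scheduleHeader, headerValue)
	}
	h.ServeHTTP(nil, req) // the inner handler never touches the writer
	return id, ok
}
```

Because any caller can set a header, a real implementation would probably want to honor it only when the request authenticated as the internal identity; that check is outside this sketch.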

## RBAC

In the KubernetesBackend at runtime the agent's ServiceAccount needs:

```yaml
- apiGroups: ["batch"]
  resources: ["cronjobs"]
  verbs:
    - get          # always
    - list         # always (powers schedule_list tool)
    - create       # only when allow_dynamic: true
    - patch        # only when allow_dynamic: true
    - delete       # only when allow_dynamic: true OR a yaml schedule was removed
```

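`forge package` emits a Role + RoleBinding scoped to the agent's
own namespace with the minimum verbs based on
`scheduler.kubernetes.allow_dynamic`. Default `false` → `get`,
`list` only. With only those two verbs the API server rejects the
create/patch/delete steps of `Sync()`, so in that mode changes to yaml
schedules reach the cluster by re-running `forge package` and
re-applying the manifests.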

Granting create/delete is a meaningful privilege escalation —
essentially "let the LLM schedule arbitrary HTTP calls back to me"
when `allow_dynamic: true`. Document loudly.
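
The verb selection that `forge package` would perform is small enough to show directly. `cronJobVerbs` is a hypothetical helper name; the verb sets follow the comments in the RBAC snippet above.

```go
package main

// cronJobVerbs returns the verbs for the generated Role's cronjobs rule.
// Read verbs are always granted; write verbs only with allow_dynamic.
func cronJobVerbs(allowDynamic bool) []string {
	verbs := []string{"get", "list"}
	if allowDynamic {
		verbs = append(verbs, "create", "patch", "delete")
	}
	return verbs
}
```

Deriving the Role from a single boolean keeps the generated manifest and the runtime behavior from drifting apart.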

## On restart — the user-described behavior

KubernetesBackend's `Sync()` is idempotent. On restart:

1. List all CronJobs in the namespace with label `forge.agent.id=<self>`.
2. For each forge.yaml entry: if CronJob exists with matching spec →
   leave. If exists with stale spec → patch. If absent → create.
3. For each existing CronJob NOT in forge.yaml AND labeled
   `forge.schedule.source: yaml` → delete (handles renamed/removed
   schedules).
4. CronJobs labeled `forge.schedule.source: llm` are left alone (the
   LLM owns those; user code shouldn't reap them on restart).

Steady state: the cluster is the source of truth. `schedule_list`
returns the live CronJob set. No SCHEDULES.md to keep in sync.
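
The four reconcile steps can be expressed as a pure planning function that is testable without a cluster. In this sketch `cronJob`, `syncPlan`, and `planSync` are hypothetical names, and a CronJob "spec" is reduced to a single comparable string (for example the cron expression) for brevity.

```go
package main

import "sort"

// cronJob is a simplified view of an existing CronJob owned by this agent.
type cronJob struct {
	ID     string // value of the forge.schedule.id label
	Spec   string // comparable summary of the spec
	Source string // value of forge.schedule.source: "yaml" or "llm"
}

// syncPlan lists the schedule IDs to create, patch, and delete.
type syncPlan struct {
	Create, Patch, Delete []string
}

// planSync compares desired yaml schedules (ID -> spec) with existing
// CronJobs. Only yaml-sourced CronJobs are patched or deleted; llm-sourced
// ones are left alone.
func planSync(desired map[string]string, existing []cronJob) syncPlan {
	var p syncPlan
	seen := make(map[string]bool)
	for _, cj := range existing {
		if cj.Source != "yaml" {
			continue // owned by the LLM, never reaped on restart
		}
		seen[cj.ID] = true
		want, ok := desired[cj.ID]
		switch {
		case !ok:
			p.Delete = append(p.Delete, cj.ID) // removed or renamed in forge.yaml
		case want != cj.Spec:
			p.Patch = append(p.Patch, cj.ID) // stale spec
		}
	}
	for id := range desired {
		if !seen[id] {
			p.Create = append(p.Create, id)
		}
	}
	// Sort for deterministic output; map iteration order is random.
	sort.Strings(p.Create)
	sort.Strings(p.Patch)
	sort.Strings(p.Delete)
	return p
}
```

Running the plan twice against an unchanged cluster yields an empty plan the second time, which is the idempotence property the restart path relies on. A yaml schedule whose ID collides with an llm-sourced CronJob would land in `Create` and fail at the API; how to resolve that is left open here.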

## Local fallback unchanged

Outside the cluster — `forge run` on a laptop, CI, a non-k8s VM —
detection returns false, backend resolves to FileBackend, today's
30s-ticker + SCHEDULES.md behavior is byte-identical to current
main. No regression risk for the dev path.

## Implementation footprint

~300-400 lines total:

| File | Change |
|------|--------|
| `forge-core/scheduler/backend.go` (new) | `ScheduleBackend` interface |
| `forge-core/scheduler/file_backend.go` (new) | Wraps existing MemoryScheduleStore + ticker behind the interface |
| `forge-core/scheduler/k8s_backend.go` (new) | `client-go` based; uses BatchV1().CronJobs(ns) for CRUD |
| `forge-core/scheduler/k8s_manifest.go` (new) | Pure-Go CronJob YAML generation (no client-go dep — usable from forge package without API access) |
| `forge-core/types/config.go` | `scheduler.backend` + `scheduler.kubernetes` block |
| `forge-cli/runtime/runner.go` | Pick backend at startup; thread service URL + auth token into KubernetesBackend |
| `forge-cli/build/k8s_stage.go` (or wherever forge package emits manifests) | Emit one CronJob per schedules[] entry + the credential-less Secret template + the optional Role/RoleBinding |
| `forge-cli/cmd/auth.go` (new) | `forge auth show-token` / `mint-token` / `secret-yaml` subcommands |
| `forge-cli/server/middleware` | Read `X-Forge-Schedule-Id` header, stash in ctx |
| `forge-cli/runtime/runner.go` schedule dispatch | If `X-Forge-Schedule-Id` is set on the inbound, emit `schedule_fire` / `schedule_complete` around the dispatch |
| `docs/deployment/scheduler-kubernetes.md` (new) | RBAC table, manifest examples, token-provisioning runbook, security model, comparison with file backend |
| Tests | Unit tests against a fake `kubernetes.Interface`; manifest-generation golden tests asserting Secret has NO `data` field; e2e against kind cluster (optional) |
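
The "Secret has NO `data` field" assertion from the Tests row could be as simple as the line scan below. It is a sketch that assumes a single-document manifest with top-level keys at column 0; `hasTopLevelKey` and `secretCarriesCredential` are hypothetical names, and a real test might parse the YAML instead.

```go
package main

import "strings"

// hasTopLevelKey reports whether manifest contains an uncommented
// top-level mapping key. Commented lines start with '#', so they
// never match.
func hasTopLevelKey(manifest, key string) bool {
	for _, line := range strings.Split(manifest, "\n") {
		if strings.HasPrefix(line, key+":") {
			return true
		}
	}
	return false
}

// secretCarriesCredential reports whether a Secret manifest has either of
// the two fields that can hold secret values.
func secretCarriesCredential(manifest string) bool {
	return hasTopLevelKey(manifest, "data") || hasTopLevelKey(manifest, "stringData")
}
```

Checking `stringData` as well closes the obvious bypass where a token is written unencoded.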

## Out of scope

- **Schedule history retrieval in k8s mode** — could read K8s Job
  status, but easier and more uniform: keep history from the audit
  stream (`schedule_complete` events already carry status +
  duration). `schedule_history` tool reads from a small in-memory
  ring buffer fed by the audit emitter, regardless of backend.
- **Cross-namespace deployments** — first cut assumes CronJob and
  agent live in the same namespace.
- **Multi-cluster** — one cluster per agent.
- **Refresh of agent service URL on Service IP changes** — once the
  CronJob is created, it points at the Service's DNS name, which stays
  stable, so neither Pod IP nor ClusterIP changes matter.
- **Real-time interactive replacement** — `schedule_set` from an
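  in-flight chat re-creates a CronJob, and the CronJob controller in
  kube-controller-manager picks it up through its watch rather than a
  slow poll. The first fire still waits for the next matching minute of
  the cron expression, since CronJob schedules have one-minute
  granularity.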
- **Auto-rotating the internal token** — initial implementation
  treats the token as a long-lived credential. Operators rotate by
  re-deploying with a fresh token in the Secret + the agent pod
  picking it up on restart. Auto-rotation with a transitional grace
  window is a separate follow-on.
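
The in-memory ring buffer mentioned in the first bullet could be as small as the sketch below. `runRecord` and `historyRing` are hypothetical names, and the record fields are guesses at what a `schedule_complete` event carries (status and duration are named above; the rest is assumed).

```go
package main

import "sync"

// runRecord is one completed scheduled run, as fed by the audit emitter.
type runRecord struct {
	ScheduleID string
	Status     string
	DurationMS int64
}

// historyRing keeps the most recent n records and overwrites the oldest.
type historyRing struct {
	mu   sync.Mutex
	buf  []runRecord
	next int  // index the next record will be written to
	full bool // true once the buffer has wrapped at least once
}

func newHistoryRing(n int) *historyRing {
	if n < 0 {
		n = 0
	}
	return &historyRing{buf: make([]runRecord, n)}
}

// add stores r, evicting the oldest record when the ring is full.
func (h *historyRing) add(r runRecord) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if len(h.buf) == 0 {
		return
	}
	h.buf[h.next] = r
	h.next = (h.next + 1) % len(h.buf)
	if h.next == 0 {
		h.full = true
	}
}

// snapshot returns a copy of the stored records, oldest first.
func (h *historyRing) snapshot() []runRecord {
	h.mu.Lock()
	defer h.mu.Unlock()
	if !h.full {
		return append([]runRecord(nil), h.buf[:h.next]...)
	}
	out := make([]runRecord, 0, len(h.buf))
	out = append(out, h.buf[h.next:]...)
	return append(out, h.buf[:h.next]...)
}
```

Being in-memory, the history is lost on pod restart in either backend; that trade-off is implied by the bullet above rather than introduced here.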

## Verification

1. `forge run` on a laptop: confirm FileBackend is selected,
   SCHEDULES.md is still written, and in-cluster detection does not
   call into client-go. Zero behavior change.
2. `forge package` on the same agent — confirm `k8s/` directory now
   contains:
   - `cronjob-<sched-id>.yaml` files matching the manifest shape
   - `internal-token-secret.yaml` with **no `data` field** (golden
     test asserts this — committing a token-bearing manifest must
     be impossible)
   - `role-scheduler.yaml` + `rolebinding-scheduler.yaml`
3. `forge auth secret-yaml | kubectl apply -f -` populates the
   Secret out-of-band.
4. `kubectl apply -k ./k8s` and confirm CronJobs appear; `kubectl get
   cronjobs` shows them; on schedule time, a Job pod fires and curls
   the agent's Service URL.
5. Tail the audit socket and confirm `schedule_fire` →
   `session_start` → `llm_call` → `invocation_complete` →
   `schedule_complete` lands, with the inbound `auth_verify` showing
   `Source: "internal"`.
6. With `scheduler.kubernetes.allow_dynamic: true`, call
   `schedule_set` mid-conversation and confirm a new CronJob is
   created in the namespace.
7. Restart the agent pod and confirm CronJobs persist; `schedule_list`
   returns the same set; no SCHEDULES.md is written to disk.
8. Apply only the Deployment without the Secret; confirm the pod
   stays in `ContainerCreating` (never Ready) with a clear
   `secret "..." not found` event, rather than starting with no
   scheduling.

## Related

- See the conversation that led to this issue: hybrid backend reusing
  the loopback static_token (`runner.go:2425-2436`) that channel
  plugins already use.
- `forge-core/scheduler/scheduler.go` — current ticker implementation
- `forge-cli/runtime/scheduler_store.go` — current file-backed store
- `forge-core/auth/token.go` — `StoreToken` / `LoadToken` against
  `<WorkDir>/.forge/runtime.token`
