
GoatFlow 0.8.2: The Honest Release

In 0.8.1 we made GoatFlow fit in your pocket. In 0.8.2 we made it stop silently losing plugin state when nobody was looking. That’s not a marketing line — it’s the actual bug we chased for half a day this week, and it was hiding inside the platform’s own plugin manager.


An admin loading the dashboard triggered AllWidgets(), which “ensured” every plugin was loaded, which tried to spawn a second instance of a plugin that was already running, which died in 300ms, which left the manager routing requests to a ghost. Stateful plugins — anything holding a long-lived peer table, connection pool, or scheduler — silently dropped state. 0.8.2 fixes that bug, and it also fixes the class of bugs it came from.

Oh, and it ships MCP v2 too. Which is supposed to be the headline. But honestly the plugin manager work is what’s kept us up this week, so we’re leading with it.

Plugin Manager Resilience

Three changes that collectively mean “no more silent failure modes when a plugin goes sideways.”

EnsureLoaded uses the manager’s registry, not a cache flag

The EnsureLoaded function used to check a boolean discovered[].Loaded flag to decide whether to spawn a plugin. The flag was maintained by several different reload/replace/unregister code paths, at least one of which could desync it from the manager’s actual registry. When the flag said “not loaded” but a process was still running, the loader spawned a duplicate. The duplicate crashed instantly with acceptAndServe error: timeout waiting for accept — a socket collision with the original — and the manager happily started routing to the ghost.

Ground truth is now Manager.Get(name), which queries the actual registry. The cache flag is still there as a fast-path hint but never makes policy decisions. A desync logs a warning so we can see it happening, but the request doesn’t break.
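
As a rough illustration of the new decision logic (a hedged sketch; registry, ensureLoaded, and the cached map below are stand-in names, not the actual GoatFlow internals):

  package plugins

  import "log/slog"

  // registry is a stand-in for the manager's live plugin registry;
  // Get reports whether a plugin process is actually registered and running.
  type registry interface {
      Get(name string) bool
      Spawn(name string) error
  }

  // ensureLoaded spawns a plugin only when the registry says it isn't running.
  // The cached Loaded flag stays a fast-path hint: a desync gets logged and
  // repaired rather than spawning a duplicate onto a socket that's still in use.
  func ensureLoaded(reg registry, cached map[string]bool, name string) error {
      if reg.Get(name) {
          if !cached[name] {
              slog.Warn("plugin cache flag desynced from registry", "plugin", name)
              cached[name] = true
          }
          return nil
      }
      return reg.Spawn(name)
  }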

Graceful shutdown with per-plugin timeouts

GRPCPlugin.Shutdown(ctx) previously ignored its context — the Shutdown RPC blocked forever if a plugin was hung. Manager.ShutdownAll called every plugin serially with context.Background(). A single hung plugin wedged the whole process exit.

Now every plugin gets its declared ResourcePolicy.ShutdownTimeout (10s default), the Shutdown RPC runs in a goroutine with a select-on-ctx bail-out, client.Kill() always runs afterwards as a supervised teardown, and cmd/goats wraps the whole pass in a 30s ceiling. A plugin that refuses to cooperate gets killed; the process exits.
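
Here is a sketch of one pass of that loop, with stand-in types (shutdownTarget, shutdownOne); the real code hangs off the Manager and the plugin client, but the shape is the same bounded wait plus unconditional Kill():

  package plugins

  import (
      "context"
      "log/slog"
      "time"
  )

  // shutdownTarget is a stand-in for the real plugin handle: the Shutdown RPC
  // plus the plugin client's Kill() for hard teardown.
  type shutdownTarget interface {
      Name() string
      ShutdownTimeout() time.Duration // from ResourcePolicy; 0 means "use default"
      Shutdown(ctx context.Context) error
      Kill()
  }

  // shutdownOne asks one plugin to stop, waits at most its declared timeout
  // (10s by default), and always calls Kill() afterwards so a hung plugin
  // can't wedge process exit.
  func shutdownOne(ctx context.Context, p shutdownTarget) {
      timeout := p.ShutdownTimeout()
      if timeout <= 0 {
          timeout = 10 * time.Second
      }
      ctx, cancel := context.WithTimeout(ctx, timeout)
      defer cancel()

      done := make(chan error, 1)
      go func() { done <- p.Shutdown(ctx) }() // the Shutdown RPC, off the critical path

      select {
      case err := <-done:
          if err != nil {
              slog.Warn("plugin shutdown RPC failed", "plugin", p.Name(), "err", err)
          }
      case <-ctx.Done():
          slog.Warn("plugin shutdown timed out", "plugin", p.Name())
      }

      p.Kill() // supervised teardown, runs no matter what
  }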

Health checks


A new background goroutine probes every loaded plugin every 60s via the reserved __health_ping__ function name on the existing Call path. No protocol changes, no plugin rebuilds — any response within 5s (including “unknown function” errors from plugins that don’t handle the name) means the gRPC channel is alive. Only a context-deadline-exceeded counts as a failure. Three consecutive failures flip HealthStatus.Healthy to false and log a warn-level transition; recovery flips it back.

  ticker(60s)                   plugin
      │                            │
      │  Call("__health_ping__")   │
      │───────────────────────────►│
      │                            │ (either returns OK, "unknown
      │  response (ok or err)      │  function" err, or times out)
      │◄───────────────────────────│
      │                            │
  healthy=true                     │
  lastCheck=now                    │
  consecutiveFailures=0            │

State lives in Manager.HealthStatus(name) and Manager.AllHealthStatuses(), ready for an admin UI widget. We explicitly did not add auto-restart — that needs a proper design pass on backoff shape and crash-loop guards, and it’s queued for 0.8.3. Opt out with GOATFLOW_PLUGIN_HEALTH_CHECK=false if you need to.
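
In sketch form the probe-and-flip logic looks roughly like this; the caller interface and free functions are stand-ins, and in the real code the state lives on the Manager and the timeout surfaces as a gRPC deadline error rather than a bare context error:

  package plugins

  import (
      "context"
      "errors"
      "log/slog"
      "time"
  )

  // caller is a stand-in for the existing plugin Call path used by the prober.
  type caller interface {
      Call(ctx context.Context, plugin, fn string) error
  }

  // probe runs one health check. Any response inside the 5s budget, including
  // an "unknown function" error, proves the channel is alive; only a deadline
  // counts as a failure.
  func probe(c caller, plugin string) bool {
      ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
      defer cancel()
      err := c.Call(ctx, plugin, "__health_ping__")
      return !errors.Is(err, context.DeadlineExceeded)
  }

  // watch probes one plugin every 60s and flips its health flag only after
  // three consecutive failures, logging each transition.
  func watch(ctx context.Context, c caller, plugin string) {
      ticker := time.NewTicker(60 * time.Second)
      defer ticker.Stop()

      healthy, failures := true, 0
      for {
          select {
          case <-ctx.Done():
              return
          case <-ticker.C:
              if probe(c, plugin) {
                  failures = 0
                  if !healthy {
                      healthy = true
                      slog.Info("plugin recovered", "plugin", plugin)
                  }
                  continue
              }
              failures++
              if failures >= 3 && healthy {
                  healthy = false
                  slog.Warn("plugin unhealthy", "plugin", plugin, "consecutiveFailures", failures)
              }
          }
      }
  }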

MCP v2 — Dynamic Tool Discovery & SSE

The MCP server shipped in 0.6.5 had 14 hardcoded tools. Every new endpoint meant editing three places: the handler, the OpenAPI spec, and internal/mcp/tools.go. Adding a field to a tool meant editing two. It was a maintenance tax per feature.

0.8.2 replaces all of that with dynamic generation. Tools come from two sources:

  1. YAML route definitions + OpenAPI spec. Every /api/v1/ endpoint is automatically an MCP tool. The input schema comes from OpenAPI, the description from the route’s mcp_description (or OpenAPI summary as fallback), and RBAC from the existing middleware stack.
  2. Plugin MCPToolSpec. Plugins can declare their own tools in GKRegistration with full JSON Schema input schemas. These override auto-generated tools by name. Plugin tools are namespaced with the plugin name, e.g. myplugin_run_task.
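
For the second source, a plugin declaration might look roughly like the sketch below; the MCPToolSpec and GKRegistration fields shown are assumptions about the SDK shape, not its published surface:

  package example

  // Stand-in types sketching the shape described above; the real GKRegistration
  // and MCPToolSpec live in the plugin SDK and their exact fields may differ.
  type MCPToolSpec struct {
      Name        string
      Description string
      InputSchema map[string]any // full JSON Schema
  }

  type GKRegistration struct {
      Name  string
      Tools []MCPToolSpec
  }

  // A plugin named "myplugin" declaring one tool. The MCP server exposes it as
  // "myplugin_run_task" (namespaced by plugin name) and lets it override any
  // auto-generated tool of the same name.
  var registration = GKRegistration{
      Name: "myplugin",
      Tools: []MCPToolSpec{{
          Name:        "run_task",
          Description: "Run a named background task and return its result.",
          InputSchema: map[string]any{
              "type": "object",
              "properties": map[string]any{
                  "task": map[string]any{"type": "string"},
              },
              "required": []string{"task"},
          },
      }},
  }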

The API bridge (internal/mcp/api_bridge.go) executes each tool by invoking the real Gin handler with a synthetic context. This was the cleanest insight in the redesign — RBAC, org scoping, rate limits, audit logging all come for free because MCP tools go through the same middleware stack as REST. No parallel permission system.
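
The pattern itself is ordinary Go: build a synthetic request, hand it to the engine, and let the middleware chain do its job. A hedged sketch (the function name and wiring are illustrative, not the contents of api_bridge.go):

  package mcpbridge

  import (
      "bytes"
      "encoding/json"
      "net/http/httptest"

      "github.com/gin-gonic/gin"
  )

  // callTool turns an MCP tool call into a synthetic HTTP request and pushes it
  // through the real Gin engine, so the full middleware chain (auth, RBAC, org
  // scoping, rate limits, audit logging) runs before the handler does.
  func callTool(
      engine *gin.Engine, method, path, bearer string, args map[string]any,
  ) (int, []byte, error) {
      body, err := json.Marshal(args)
      if err != nil {
          return 0, nil, err
      }
      req := httptest.NewRequest(method, path, bytes.NewReader(body))
      req.Header.Set("Content-Type", "application/json")
      req.Header.Set("Authorization", "Bearer "+bearer) // the caller's token, not a service identity

      rec := httptest.NewRecorder()
      engine.ServeHTTP(rec, req) // same stack REST requests go through
      return rec.Code, rec.Body.Bytes(), nil
  }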

Two new YAML route fields give fine-grained control:

  Field             Purpose
  mcp_description   Override the auto-generated tool description with LLM-friendly text
  mcp: false        Opt this route out of MCP tool generation entirely

Streamable HTTP / SSE transport (MCP spec version 2025-03-26) replaces the previous stdio proxy. Three endpoints:

  Method   Path           Purpose
  POST     /api/mcp/sse   JSON-RPC request, SSE-framed response
  GET      /api/mcp/sse   Server-to-client notification stream (30s heartbeat)
  DELETE   /api/mcp/sse   Session termination

Session manager has a configurable inactivity timeout (30 min default) and background cleanup. Protocol version negotiation supports both 2024-11-05 and 2025-03-26. Auth covers both JWT and API tokens via unified_auth middleware — API tokens now resolve their actual admin role from the database instead of the synthesised one the old code used. ListChanged: true is advertised so SSE clients get notified when plugins are added or removed.
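
Calling the new transport from a client is plain HTTP. A minimal sketch with a placeholder base URL and token; the envelope is a standard MCP JSON-RPC request:

  package main

  import (
      "bufio"
      "fmt"
      "net/http"
      "strings"
  )

  func main() {
      // One JSON-RPC request over the streamable HTTP transport; the response
      // comes back SSE-framed as "data: {...}" lines.
      body := `{"jsonrpc":"2.0","id":1,"method":"tools/list"}`
      req, err := http.NewRequest("POST", "https://goatflow.example/api/mcp/sse", strings.NewReader(body))
      if err != nil {
          panic(err)
      }
      req.Header.Set("Content-Type", "application/json")
      req.Header.Set("Accept", "text/event-stream")
      req.Header.Set("Authorization", "Bearer <jwt-or-api-token>")

      resp, err := http.DefaultClient.Do(req)
      if err != nil {
          panic(err)
      }
      defer resp.Body.Close()

      scanner := bufio.NewScanner(resp.Body)
      for scanner.Scan() {
          if line := scanner.Text(); strings.HasPrefix(line, "data: ") {
              fmt.Println(strings.TrimPrefix(line, "data: "))
          }
      }
  }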

Server.go went from ~1050 lines to ~130.

Admin SQL, promoted

/admin/sql existed as an MCP-only tool. It’s now a first-class REST endpoint at POST /api/v1/admin/sql with the same allowlist (SELECT, DESCRIBE, EXPLAIN, SHOW TABLES, SHOW COLUMNS), admin middleware enforcement, and dialect-portable placeholder conversion. Same permissions, same allowlist, just reachable by any HTTP client. OpenAPI spec updated with request/response schemas.
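
For reference, the allowlist behaves like a prefix check on the normalised statement; this is a hedged sketch of the idea, not the actual handler code:

  package adminsql

  import "strings"

  // allowed reports whether a statement is read-only enough for the admin SQL
  // endpoint: uppercase, trim, and prefix-match against the allowlist.
  func allowed(query string) bool {
      q := strings.ToUpper(strings.TrimSpace(query))
      for _, prefix := range []string{"SELECT", "DESCRIBE", "EXPLAIN", "SHOW TABLES", "SHOW COLUMNS"} {
          if strings.HasPrefix(q, prefix) {
              return true
          }
      }
      return false
  }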

Go 1.25

Two Dependabot alerts forced the issue:

  • High: go-jose/go-jose/v3 < 3.0.5 — JWE decryption panic. Upgraded to 3.0.5.
  • Medium: golang.org/x/image < 0.38.0 — TIFF decoder OOM. The fix version requires Go 1.25.

We took the Go bump. It touches a lot of files — every Go-using Dockerfile (7 of them), the main go.mod directive, the SDK toolchain directive, the Makefile default, .env.development and .env.example (the Makefile reads GO_IMAGE from .env as the single source of truth), three helper scripts’ fallback images, the CI workflow’s setup-go pin, and the README badge. All bumped from various 1.24.x to 1.25.

The toolbox image needed a second round of pin bumps. goimports v0.24.0, gosec v2.21.4, staticcheck 2024.1.1, and golangci-lint v1.64.8 all transitively depended on older golang.org/x/tools releases whose constant-arithmetic source doesn’t compile under Go 1.25 (invalid array length -delta * delta in tokeninternal.go). New pins:

  Tool            Old                   New
  goimports       v0.24.0               v0.42.0
  golangci-lint   v1.64.8               v2.5.0 (note: the v2 import path is /v2/cmd/...)
  gosec           v2.21.4               latest
  staticcheck     2024.1.1              latest
  govulncheck     v1.1.4 (was pinned)   latest (un-pinned)

gosec and staticcheck are on latest until we spot known-good post-Go-1.25 tags in CI logs worth pinning to.

Known follow-up: Dockerfile’s WASM-builder stage still uses tinygo/tinygo:0.32.0, which only supports Go source up to ~1.22. If a WASM plugin declares go 1.25 in its go.mod, the build will fail there. TinyGo 0.34+ is required. We’ll bump that in 0.8.3 alongside the WASM plugin rebuild.

By the Numbers

  • 1050 → 130 — lines in internal/mcp/server.go after the rewrite
  • 14 hardcoded tool implementations deleted
  • 1 reserved function name: __health_ping__
  • 60s health probe interval, 5s per-probe timeout, 3 consecutive failures to flip unhealthy
  • 10s default per-plugin shutdown timeout; 30s overall ceiling
  • 2 Dependabot alerts resolved (1 high, 1 medium)
  • 7 Go-using Dockerfiles bumped to 1.25
  • 10 new unit tests covering shutdown timeouts and health state transitions
  • 0 breaking changes for plugin authors

What’s Next


0.8.3 (June 2026) picks up the plugin-manager auto-recovery work: auto-restart on health-check failure with exponential backoff and crash-loop guard, an admin UI widget for per-plugin health status (consuming AllHealthStatuses()), parallel shutdown to bound total release time, and the TinyGo bump. Plus the already-queued plugin UI offline-cache work, WebAuthn/FIDO2 hardware keys, and performance benchmarks.

0.9.0 (August 2026) ships the first-party open source plugins — FAQ/Knowledge Base, Calendar, Process Management.

1.0.0 (November 2026) is the production cut.

Bonus Track: The Bug That Fixed Itself

For months, MCP requests authenticated with an API token (rather than a JWT) couldn’t reach admin-only endpoints. The token’s owner was an admin, the database said so, but the MCP middleware didn’t know that — it had its own role-resolution path that synthesised a role from the token’s grants rather than looking up the underlying account. We knew about it. There was a TODO in the code. It was on the backlog.

We didn’t fix it in this release. Or, more accurately: we didn’t try to fix it. The MCP rewrite around the API bridge made it go away as a side effect.

Here’s why. The old MCP server reimplemented authorisation inline because it executed tool handlers itself. Every MCP tool call went through the MCP server’s idea of “is this user allowed?” — a parallel, simpler, slightly-wrong copy of what the REST middleware did. The REST middleware looked up the database role; the MCP path looked up a synthesised one.

When the MCP server stopped having tool handlers and started invoking the real Gin handlers via the API bridge, every tool call started going through the real auth middleware — the same one REST uses. The database lookup happens. The admin role resolves correctly. The bug is gone.

We didn’t write a single line of code targeting this. We deleted the parallel implementation, and the parallel bug deleted itself.

This is what “API bridges beat parallel implementations” looks like in practice. Every time you find yourself reimplementing authorisation, validation, or rate limiting for the B-protocol version of an A-protocol endpoint, stop. The parallel implementation will drift, will accumulate its own bugs, and the only winning move is to bridge through the original. The MCP rewrite shaved 920 lines off server.go and silently fixed a backlog ticket — both because we removed code rather than added it.


Questions? Feedback? Open a GitHub Discussion and let us know what you think!
