Open Responses Solves the Wrong Problem

llms · agents · infrastructure

A new spec dropped: Open Responses. It promises interoperability across LLM providers. One schema for OpenAI, Anthropic, Gemini, local models. Write once, run anywhere.

The spec is thorough. Items are polymorphic, stateful, streamable. Semantic events instead of raw deltas. Provider-specific extensions via namespaced prefixes. RFC-style rigor.
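
For flavor, here's a rough TypeScript sketch of what a polymorphic item with a namespaced provider extension might look like. The field names are illustrative, not lifted from the spec.

// Illustrative only - these field names are not taken from the Open Responses spec.
// The idea: one discriminated union for items, with provider-specific data behind
// a namespaced key so it can't collide with standard fields.
type ResponseItem =
  | { type: 'message'; role: 'assistant'; content: string }
  | { type: 'tool_call'; name: string; arguments: Record<string, unknown> }
  | {
      type: 'reasoning';
      summary: string;
      // hypothetical namespaced extension carrying provider-specific metadata
      'openai.reasoning_tokens'?: number;
    };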

There’s just one problem: this was already solved.

The commoditized layer

Response normalization has been table stakes since GPT-3.5. LiteLLM does it. OpenRouter does it. The Vercel AI SDK does it. Every multi-provider abstraction layer figured this out years ago.
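
The normalization itself isn't much code. Here's a hedged sketch of the kind of adapter all of these libraries already ship; the provider payload shapes below are simplified approximations, not the real wire formats.

// Simplified sketch of response normalization. The provider payloads are
// abbreviated approximations, not the actual wire formats.
interface NormalizedResponse {
  text: string;
  finishReason: 'stop' | 'length';
}

function normalizeOpenAI(res: {
  choices: { message: { content: string }; finish_reason: string }[];
}): NormalizedResponse {
  const choice = res.choices[0];
  return {
    text: choice.message.content,
    finishReason: choice.finish_reason === 'length' ? 'length' : 'stop',
  };
}

function normalizeAnthropic(res: {
  content: { type: string; text?: string }[];
  stop_reason: string;
}): NormalizedResponse {
  return {
    text: res.content.filter((b) => b.type === 'text').map((b) => b.text ?? '').join(''),
    finishReason: res.stop_reason === 'max_tokens' ? 'length' : 'stop',
  };
}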

The spec acknowledges error handling. It mentions response.failed events and defines error types. But it glosses over the hard part. What happens when your stream dies mid-response?

Three categories of errors

When you’re building agent infrastructure, errors fall into three buckets:

  1. Harness → LLM provider (overloaded, auth, rate limits): Solved. Every framework handles this.
  2. Agent execution (bugs, tool failures, token limits): Implementation details. Each case is self-contained.
  3. Frontend → harness stream failures: This is where the pain is.

Mid-stream failures are barely handled. Retry mechanisms are fragile. Debugging is a nightmare. And here’s the kicker: even when you use a provider abstraction like OpenRouter, each backend (AWS Bedrock, Azure, Anthropic direct) has different error semantics for the same model.
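
To make that concrete, here's a hedged sketch of the wrapper you end up writing: when the stream dies partway through, throw away the partial output and start over, because there's no portable way to resume. The startStream callback is a placeholder, not any particular SDK's API.

// Sketch: restart the whole stream on a mid-stream failure. startStream is a
// placeholder for an SDK streaming call.
async function streamWithRestart(
  startStream: () => AsyncIterable<string>,
  maxAttempts = 3,
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    let partial = '';
    try {
      for await (const delta of startStream()) {
        partial += delta;
      }
      return partial; // stream completed
    } catch (error) {
      // Mid-stream failure: the partial text is unusable without a way to
      // resume, so it gets discarded and the whole request is retried.
      lastError = error;
    }
  }
  throw lastError;
}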

The war story

I built a granular error classifier. Thirty-plus cases covering OpenRouter error codes, connection-level errors, provider-specific quirks:

// OpenRouter 401 errors - retry (OpenRouter has transient 401 bugs)
if (statusCode === 401) {
  return {
    isRetryable: true,
    statusCode,
    errorType: 'server_error', // Treat as server error since it's a provider bug
    originalError: error,
  };
}

Rate limits, server errors, timeouts, ECONNRESET, UND_ERR_HEADERS_TIMEOUT, problematic finish reasons. I tried to be smart about what’s retryable vs terminal.
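
To give a flavor of how the branches pile up, here's a simplified sketch in the same spirit; these aren't the literal cases from my classifier, and the classifyGranular name is just for illustration.

// Simplified sketch in the same spirit as the real classifier - not the
// literal branches. Each case encodes a guess about a provider quirk.
function classifyGranular(error: any, statusCode?: number, finishReason?: string) {
  if (statusCode === 429) {
    // Rate limited - back off and retry
    return { isRetryable: true, statusCode, errorType: 'rate_limit', originalError: error };
  }
  if (error?.code === 'ECONNRESET' || error?.code === 'UND_ERR_HEADERS_TIMEOUT') {
    // Connection dropped before or during the response - retry
    return { isRetryable: true, errorType: 'connection_error', originalError: error };
  }
  if (finishReason === 'content_filter') {
    // Retrying won't help; the same prompt gets filtered again
    return { isRetryable: false, errorType: 'terminal', originalError: error };
  }
  return { isRetryable: false, errorType: 'unknown', originalError: error };
}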

Then I gave up and wrote this:

/**
 * Optimistic error classifier - retry everything except user aborts
 *
 * Philosophy: Retry on any error unless the user explicitly cancelled.
 * Max retry attempts protect against infinite loops.
 * Transient failures are common, so retrying is usually the right call.
 */
export function classifyErrorOptimistic(error, options) {
  if (options?.abortSignal?.aborted) {
    return { isRetryable: false, errorType: 'user_abort', originalError: error };
  }
  return { isRetryable: true, errorType: 'retryable', originalError: error };
}

The sophisticated classifier still exists in my codebase. I don’t use it. The only reliable strategy is “retry everything.” Provider error semantics are undocumented, inconsistent, and change without notice.
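
For context, the optimistic classifier sits inside a plain retry loop. Here's a sketch of how it gets used; the callWithRetry helper and the attempt cap are placeholders, not real API surface.

// Sketch of the retry loop around classifyErrorOptimistic. callWithRetry and
// MAX_ATTEMPTS are placeholders, not real API surface.
const MAX_ATTEMPTS = 5;

async function callWithRetry(
  callLLM: () => Promise<string>,
  abortSignal?: AbortSignal,
): Promise<string> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      return await callLLM();
    } catch (error) {
      const classified = classifyErrorOptimistic(error, { abortSignal });
      if (!classified.isRetryable) throw error; // user aborted - stop immediately
      lastError = error; // everything else: try again
    }
  }
  throw lastError; // attempts exhausted
}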

What’s missing

Open Responses could standardize:

  • Server-side checkpointing: Provider tracks progress, client can request “resume from sequence X”
  • Partial response semantics: What does a “partial but usable” response look like?
  • Recovery event types: Specific events for “stream interrupted,” “resumable,” “non-recoverable”
  • Client acknowledgment protocol: Client confirms receipt, server knows what was delivered

None of this is in the spec. The previous_response_id field assumes a completed response to resume from. Useless when your response never finished.
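
To be concrete, here's a purely hypothetical sketch of what resumable-stream events and a resume request could look like. None of these types or fields exist in the spec; they're what I'd want from it.

// Hypothetical only - none of these event types or fields are in Open Responses.
type RecoveryEvent =
  | { type: 'response.interrupted'; response_id: string; last_sequence: number; resumable: boolean }
  | { type: 'response.resumed'; response_id: string; from_sequence: number }
  | { type: 'response.unrecoverable'; response_id: string; reason: string };

// Hypothetical resume request: "send me everything after sequence N"
interface ResumeRequest {
  response_id: string;
  after_sequence: number; // last event the client acknowledged receiving
}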

The real interoperability problem

An open standard for LLM APIs is genuinely useful. But if Open Responses only normalizes the easy layer (response formats) while ignoring stream resilience, it’s solving a problem that was already solved.

The hard problem isn’t “how do I parse a tool call from Claude vs GPT.” It’s “what do I do when my stream dies at token 847 of a 2000-token response, across three different backends, each with different failure modes.”

Until a spec addresses that, we’re all writing our own optimistic retry classifiers.

I’ve opened an issue on the Open Responses repo to discuss this.