Recently, while implementing OpenAI structured outputs using json_schema with MCP (Model Context Protocol), I ran into a serious performance issue.
The model would:
- Start streaming normally
- Then call `mcp.list`
- And suddenly… pause for 30 seconds
- After that, the next chunk appeared
After investigation, the root cause turned out to be that I never specified tool_choice. When I added tool_choice, the pause dropped to 8-10 seconds.
Let's break down why this happens.
🔍 What is json_schema in OpenAI?
OpenAI now allows enforcing structured outputs using:
```javascript
response_format: {
  type: "json_schema",
  json_schema: { ... }
}
```
This guarantees:
- Strict JSON
- No malformed outputs
- Predictable structure
- Production-ready parsing
This is much more reliable than "please return JSON" prompting.
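The fragment above can be sketched as a small request builder. A minimal sketch, assuming a Chat Completions-style request body; the model name and the schema fields (`support_reply`, `answer`, `confidence`) are made-up examples, not from the original post.

```javascript
// Builds a request body that enforces structured output via json_schema.
// Schema name and fields are hypothetical placeholders.
function buildStructuredRequest(userMessage) {
  return {
    model: "gpt-4o", // assumption: any model that supports json_schema
    messages: [{ role: "user", content: userMessage }],
    response_format: {
      type: "json_schema",
      json_schema: {
        name: "support_reply",
        strict: true, // reject any output that deviates from the schema
        schema: {
          type: "object",
          properties: {
            answer: { type: "string" },
            confidence: { type: "number" },
          },
          required: ["answer", "confidence"],
          additionalProperties: false,
        },
      },
    },
  };
}
```

With `strict: true`, the parsed response is guaranteed to match the schema, so downstream code can read `answer` and `confidence` without defensive checks.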
🔍 What is MCP?
MCP (Model Context Protocol) allows models to dynamically:
- List tools (`mcp.list`)
- Call tools
- Fetch tool schemas
- Interact with external systems
If you're using something like https://mcp.botsify.com/mcp/list_actions, the model dynamically evaluates available tools before deciding what to call.
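To make the evaluation step concrete, here is an assumed shape of a tool-listing result. This is illustrative only; the real wire format of `mcp.list` / `list_actions` may differ, and the tool names are hypothetical.

```javascript
// Assumed (not official) shape of an MCP tool-listing result.
const listedTools = [
  {
    name: "update_bot_settings",
    description: "Update a bot's configuration",
    inputSchema: { type: "object" },
  },
  {
    name: "get_bot_status",
    description: "Fetch a bot's current status",
    inputSchema: { type: "object" },
  },
];

// Every entry carries a schema the model must read and weigh before it
// can decide whether (and what) to call. This is the mid-stream
// evaluation step that shows up as latency.
const schemasToEvaluate = listedTools.map((t) => t.inputSchema);
```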
🚨 The Real Problem: 30-Second Pause
Here's what was happening:
- Model begins streaming.
- It internally triggers `mcp.list`.
- It evaluates all tools.
- It thinks deeply.
- It decides whether to call a tool.
- Streaming pauses for ~30 seconds.
Why? Because I did not define:
```javascript
tool_choice: "auto"
```
Or:
```javascript
tool_choice: { "type": "function", "function": { "name": "myTool" } }
```

🧠 Why Missing tool_choice Causes Delay
When tool_choice is NOT provided, the model must:
- Evaluate all available tools
- Decide whether to call one
- Compare with schema requirements
- Validate output format
- Possibly retry internally
This internal reasoning phase is expensive, especially when:
- Using `json_schema`
- Using MCP dynamic tools
- Streaming responses
- Large tool lists
The model enters a "decision paralysis" loop: it has to satisfy both the structured output contract and the possibility of calling tools, so it spends a long time reasoning before emitting the next token.
⚡ Why Adding tool_choice Reduced It to 8-10s
When I added tool_choice, the model:
- No longer had to evaluate whether to use a tool
- Skipped tool selection reasoning
- Directly executed the intended flow
- Reduced internal retries
The pause dropped from 30s to 8-10s.
There's still some delay because MCP still loads, tool schema validation still occurs, and structured output validation still runs, but the heavy decision-making phase is reduced.
🔍 What's Happening Internally (Advanced)
When using json_schema, MCP tools, streaming, and no tool_choice, the model must satisfy two constraints at once:
- It must follow the strict JSON schema for the final response.
- It must decide whether a tool call is needed (and which one).
If tool output also needs to match the schema, the model may internally "simulate" or plan tool outputs before streaming. That planning step adds significant latency. By setting tool_choice, you remove the need for that decision step.
🔍 Best Practices If You're Using MCP + json_schema
✅ 1. Always Define tool_choice
If you know the tool:
```javascript
tool_choice: {
  "type": "function",
  "function": { "name": "updateBotSettings" }
}
```
If you want auto:
```javascript
tool_choice: "auto"
```
Even "auto" is better than leaving it undefined: it gives the model a clear instruction to consider tools without re-evaluating from scratch.
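Putting the pieces together, a single request can pin both the output contract and the tool decision up front. A minimal sketch; the model name, tool parameters, and result schema are assumptions for illustration.

```javascript
// One streaming request that sets tool_choice AND response_format,
// so the model skips the "should I call a tool?" reasoning phase.
const request = {
  model: "gpt-4o", // assumption
  stream: true,
  messages: [{ role: "user", content: "Turn off the welcome message." }],
  tools: [
    {
      type: "function",
      function: {
        name: "updateBotSettings",
        description: "Update a bot setting",
        parameters: {
          type: "object",
          properties: { key: { type: "string" }, value: { type: "string" } },
          required: ["key", "value"],
        },
      },
    },
  ],
  // Pin the decision: no tool-selection reasoning needed.
  tool_choice: { type: "function", function: { name: "updateBotSettings" } },
  response_format: {
    type: "json_schema",
    json_schema: {
      name: "settings_update_result", // hypothetical schema
      strict: true,
      schema: {
        type: "object",
        properties: { status: { type: "string" } },
        required: ["status"],
        additionalProperties: false,
      },
    },
  },
};
```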
✅ 2. Reduce Tool Count
If you expose many tools (e.g. 20+) with complex schemas and nested properties, the model's reasoning time increases. Keep only the necessary tools per request when possible.
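A simple way to do this is to filter the catalog before each request. A sketch under the assumption that you know (e.g. from routing or intent detection) which tools a given request can need; the tool names are hypothetical.

```javascript
// Keeps only the tools relevant to this request instead of forwarding
// the full MCP catalog to the model.
function selectTools(allTools, allowedNames) {
  const allowed = new Set(allowedNames);
  return allTools.filter((t) => allowed.has(t.function.name));
}

// Usage: a 3-tool catalog narrowed down to the one tool this request needs.
const catalog = [
  { type: "function", function: { name: "updateBotSettings" } },
  { type: "function", function: { name: "deleteBot" } },
  { type: "function", function: { name: "getBotStatus" } },
];
const tools = selectTools(catalog, ["updateBotSettings"]);
```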
✅ 3. Avoid Huge JSON Schemas
Large json_schema definitions increase validation time, retry loops, and token usage. Flatten or simplify where you can.
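One cheap guardrail is to measure nesting depth before shipping a schema. Depth is only a rough proxy (property count and `anyOf`/`oneOf` use matter too), and the example schemas are made up, but it catches the worst offenders.

```javascript
// Returns the maximum nesting depth of a JSON value.
function schemaDepth(node) {
  if (node === null || typeof node !== "object") return 0;
  const children = Array.isArray(node) ? node : Object.values(node);
  let max = 0;
  for (const child of children) max = Math.max(max, schemaDepth(child));
  return 1 + max;
}

// Hypothetical schemas for comparison: a flat one and a nested one.
const flatSchema = { type: "object", properties: { a: { type: "string" } } };
const nestedSchema = {
  type: "object",
  properties: {
    a: { type: "object", properties: { b: { type: "string" } } },
  },
};
```

If the depth keeps climbing, that is a signal to flatten: hoist nested objects into top-level properties where the structure allows it.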
✅ 4. Measure Streaming Gaps
Don't just measure total time. Measure first-token time, time before a tool call, and time between chunks: that's where hidden latency shows up.
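Per-chunk gaps can be captured by wrapping the stream. A sketch that works with any async iterable (a streamed SDK response is one); the fake stream below just simulates a delayed middle chunk for demonstration.

```javascript
// Records the gap (ms) before each chunk of an async-iterable stream.
async function measureGaps(stream) {
  const gaps = [];
  let last = Date.now();
  for await (const _chunk of stream) {
    const now = Date.now();
    gaps.push(now - last); // a large value here = hidden decision latency
    last = now;
  }
  return gaps;
}

// Demo: a fake stream of three chunks where the second one is delayed,
// standing in for the mid-stream pause described above.
async function* fakeStream() {
  yield "a";
  await new Promise((r) => setTimeout(r, 50));
  yield "b";
  yield "c";
}
```

In the demo, the delay before "b" shows up as one large entry in the gaps array, which is exactly the pattern the 30-second pause produced in production.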
💡 Key Insight
The delay was not the network, not MCP server speed, and not streaming itself. It was model decision latency. The fix was giving the model clearer instructions: the more freedom you give the model (e.g. no tool_choice), the slower it may think before producing the next token.