Over the past year, it feels like everything in AI revolves around LLMs.
From chatbots to content generation to coding assistants, many products today are thin layers on top of APIs from OpenAI and Anthropic.
What happens if LLM access suddenly becomes unreliable or unaffordable?
The Hidden Risk: Subsidized Intelligence
- Heavy infrastructure costs are absorbed by big companies
- Pricing does not fully reflect true compute usage
- Startups are building aggressively on top of this assumption
If token prices rise and free tiers disappear, many AI products become expensive to run or hard to scale.
Scenario: When LLMs Break or Become Expensive
1. Greater reliance on traditional systems alongside LLMs
Rule engines, classical ML, and deterministic workflows never left production. Under tighter cost constraints, teams lean on them harder and reserve LLM calls for tasks where language reasoning adds clear value; one common pattern is sketched below.
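As a rough sketch (the helper names here are placeholders, not a specific library): a cheap classifier handles the confident cases, and only ambiguous input ever reaches the LLM.

type Classified = { intent: string; confidence: number };

// Hypothetical helpers; the real classifier, handlers, and LLM client live elsewhere
declare function classifyIntent(input: string): Promise<Classified>;
declare function llmAnswer(input: string): Promise<string>;
declare const handlers: Record<string, (input: string) => Promise<string>>;

async function answer(input: string): Promise<string> {
  const { intent, confidence } = await classifyIntent(input);
  const handler = handlers[intent];
  // Confident, known intents get deterministic handling and spend zero tokens
  if (handler && confidence >= 0.85) return handler(input);
  // Only ambiguous or genuinely open-ended input pays for an LLM call
  return llmAnswer(input);
}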
2. Rise of hybrid AI
Rules -> ML -> Retrieval -> LLM (last step only)
In this model, the LLM is a premium reasoning and language layer, not the foundation.
3. Intentional LLM usage
- Simple tasks: rules/templates
- Data queries: retrieval systems
- Predictions: ML models
- Complex reasoning: LLMs
Routing work this way can cut LLM spend substantially, often in the 30-80% range depending on workload; a minimal dispatcher is sketched below.
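As a sketch of that routing table (the handler objects are illustrative, not a specific framework), each task category maps to the cheapest system that can serve it:

type TaskKind = "simple" | "data_query" | "prediction" | "complex_reasoning";

// Illustrative backends; each stands in for the corresponding system above
declare const templates: { render: (input: string) => string };
declare const retrieval: { answer: (input: string) => Promise<string> };
declare const mlModel: { predict: (input: string) => Promise<string> };
declare const llm: { reason: (input: string) => Promise<string> };

const routes: Record<TaskKind, (input: string) => string | Promise<string>> = {
  simple: (i) => templates.render(i),          // rules/templates
  data_query: (i) => retrieval.answer(i),      // retrieval systems
  prediction: (i) => mlModel.predict(i),       // classical ML models
  complex_reasoning: (i) => llm.reason(i),     // LLM, the expensive path
};

async function handle(kind: TaskKind, input: string): Promise<string> {
  return routes[kind](input);
}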
A New Hiring Signal: Cost-Aware Engineers
Companies will evaluate who can reduce token usage, design fallbacks, and avoid unnecessary LLM calls.
What Does a Resilient AI System Look Like?
A production-grade request path:
User Request
|
v
Cache Layer (check first)
|
v
Rule Engine (cheap, deterministic)
|
v
Retrieval System (facts/data)
|
v
LLM (only if necessary)
With multi-model fallback, caching, and model routing, you keep quality while controlling cost.
Infrastructure diagram
+---------------------+
| API Gateway / BFF |
+----------+----------+
|
+-----v------+
| Rate Limit |
+-----+------+
|
+--------------------+--------------------+
| |
+-----v------+ +------v------+
| Redis Cache| | Rule Engine |
+-----+------+ +------+------+
| |
| +----------v----------+
| | Retrieval (PG + VDB)|
| +----------+----------+
| |
+-------------------+---------------------+
|
+-----v------+
| LLM Router |
+--+-----+---+
| |
+-----------+ +-------------+
| |
+-------v--------+ +-------v--------+
| Primary Model | | Secondary Model|
+-------+--------+ +-------+--------+
| |
+---------------+---------------+
|
+------v--------+
| Observability |
+---------------+
Real code: model routing + fallback + retrieval + cache
This example follows the diagram: cheap layers first, LLM last, with retries and observability.
import crypto from "node:crypto";
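// Assumed external helpers, wired up elsewhere in the codebase: redis, metrics, logger,
// auth, classifyIntent, callProvider, safeJsonParse, vectorStore.
// This file only shows the request path.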
type Tier = "simple" | "medium" | "complex";
type Intent = "faq" | "order_status" | "policy" | "analytics" | "draft_email";
type AiResult = { answer: string; source: string; model?: string; cached?: boolean };
const providers = {
simple: ["local-mini", "openai-gpt-4o-mini"],
medium: ["openai-gpt-5-mini", "anthropic-haiku"],
complex: ["openai-gpt-5", "anthropic-sonnet"],
} as const;
function sha(text: string) {
return crypto.createHash("sha256").update(text).digest("hex");
}
function chooseTier(args: { intent: Intent; confidence: number; hasNumbers: boolean }): Tier {
if (args.intent === "faq" || args.intent === "order_status") return "simple";
if (args.intent === "policy" && args.confidence > 0.8) return "simple";
if (args.intent === "analytics" && args.hasNumbers) return "complex";
return args.confidence > 0.7 ? "medium" : "complex";
}
function buildCacheKey(args: {
tenantId: string;
userId: string;
prompt: string;
tier: Tier;
policyVersion: string;
kbVersion: string;
}) {
// Include tenant/user/versions to avoid context mismatch and stale responses
return `ai:v3:${args.tenantId}:${args.userId}:${args.tier}:${args.policyVersion}:${args.kbVersion}:${sha(args.prompt)}`;
}
function tryRuleEngine(prompt: string): AiResult | null {
if (/^(hi|hello|thanks)$/i.test(prompt.trim())) {
return { answer: "Hi. How can I help?", source: "rule-engine" };
}
if (/^pricing$/i.test(prompt.trim())) {
return { answer: "Pricing starts at $49 per month.", source: "rule-engine" };
}
return null;
}
async function retrieveFacts(prompt: string): Promise<string[]> {
// Example only: could query Postgres + vector DB
return vectorStore.search(prompt, { topK: 4 });
}
export async function generateAnswer(userId: string, prompt: string): Promise<AiResult> {
const started = Date.now();
const tenantId = auth.getTenantId();
const classified = await classifyIntent(prompt); // { intent, confidence, hasNumbers }
const tier = chooseTier(classified);
const cacheKey = buildCacheKey({
tenantId,
userId,
prompt,
tier,
policyVersion: "2026-04",
kbVersion: "kb-172",
});
const cached = await redis.get(cacheKey);
if (cached) {
metrics.increment("ai.cache_hit", 1, { tier });
return { ...JSON.parse(cached), cached: true };
}
metrics.increment("ai.cache_miss", 1, { tier });
const ruleResult = tryRuleEngine(prompt);
if (ruleResult) {
await redis.set(cacheKey, JSON.stringify(ruleResult), "EX", 1800);
return ruleResult;
}
const context = await retrieveFacts(prompt);
if (!context.length && classified.intent !== "draft_email") {
metrics.increment("ai.retrieval_empty", 1, { intent: classified.intent });
}
const chain = providers[tier];
for (const model of chain) {
try {
const response = await callProvider(model, {
prompt,
context,
timeoutMs: 9000,
maxOutputTokens: tier === "complex" ? 1200 : 500,
});
const parsed = safeJsonParse(response.text);
if (!parsed.ok) {
metrics.increment("ai.invalid_output", 1, { model });
throw new Error("invalid_output_schema");
}
const result: AiResult = {
answer: response.text,
source: "llm",
model,
};
await redis.set(cacheKey, JSON.stringify(result), "EX", 3600);
metrics.timing("ai.success_latency_ms", Date.now() - started, { model, tier });
metrics.gauge("ai.tokens_input", response.usage.inputTokens, { model });
metrics.gauge("ai.tokens_output", response.usage.outputTokens, { model });
metrics.gauge("ai.cost_usd", response.usage.costUsd, { model });
return result;
} catch (error) {
logger.warn({ model, tier, error }, "provider_failed_trying_next");
metrics.increment("ai.provider_failover", 1, { model, tier });
}
}
metrics.increment("ai.template_fallback", 1, { tier });
return {
answer: "Service is busy right now. Please retry in a moment.",
source: "template-fallback",
};
}
How to build cost-optimized AI infrastructure
1) Use LLMs selectively
// Use deterministic flows first; call LLM only for language-heavy cases
if (intent === "track_order" && orderId) {
const order = await ordersApi.get(orderId);
return {
source: "workflow",
message: `Order ${order.id} is ${order.status} and will arrive on ${order.eta}.`,
};
}
if (intent === "refund_policy") {
return {
source: "kb-retrieval",
message: await faqStore.get("refund_policy_v3"),
};
}
// Only here, for nuanced or ambiguous requests, do we fall back to the LLM
return await llmAnswer(userQuery);
2) Add prompt/output caching
// Cache is powerful but can return stale or wrong context
const key = `ai:v3:${tenantId}:${userId}:${policyVersion}:${kbVersion}:${sha(normalizedPrompt)}`;
const cached = await redis.get(key);
if (cached) return JSON.parse(cached);
await redis.set(key, JSON.stringify(result), "EX", 900); // short TTL for dynamic content
Use TTL strategy and versioned keys. Include personalization fields to prevent context mismatch.
3) Route models by intent, confidence, and cost target
if (intent === "faq" && confidence > 0.9) model = "local-mini";
else if (intent === "analytics") model = "gpt-5";
else if (monthlySpendUsd > budgetCap) model = "local-mini"; // over budget: downgrade
else model = "gpt-5-mini";
4) Design multi-provider fallback
const fallbackChain = ["openai-gpt-4.1", "anthropic-sonnet", "local-mistral"];
for (const model of fallbackChain) {
  try { return await callProvider(model, payload); }
  catch (error) { logger.warn({ model, error }, "provider_failed_trying_next"); }
}
return templateFallback(); // never end the chain by silently returning undefined
5) Use hybrid pipelines: rules + ML + retrieval + LLM
const result =
  runRules(input) ??
  runClassifier(input) ??
  (await answerFromRetrieval(input)) ??
  (await answerFromLLM(input));
Real-world pain points teams hit in production
Caching challenges
- Stale responses after pricing or policy updates
- Context mismatch when the same prompt comes from different users or tenants
- Need for invalidation strategy, not just TTL
Architecture
API -> Cache (tenant+user scoped) -> Policy/KB version check -> LLM
Code
const key = `resp:${tenantId}:${userId}:${policyVersion}:${kbVersion}:${sha(prompt)}`;
const cached = await redis.get(key);
if (cached) return JSON.parse(cached);
await redis.set(key, JSON.stringify(result), "EX", 900);
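One way to invalidate beyond TTL, as a sketch: keep the policy and KB versions in Redis, read them per request, and embed them in the key, so bumping a version instantly orphans every stale entry (the key names are illustrative).

// Bump on deploy or content update; old cache entries simply stop being read
await redis.set("ai:policy_version", "2026-05");
await redis.set("ai:kb_version", "kb-173");

// Per request: versions become part of the key, so stale entries are never hit
const [policyVersion, kbVersion] = await redis.mget("ai:policy_version", "ai:kb_version");
const key = `resp:${tenantId}:${userId}:${policyVersion}:${kbVersion}:${sha(prompt)}`;

Orphaned entries age out via TTL, so no explicit deletion pass is needed.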
Prompt and model incompatibility
The same prompt can produce very different output quality across models. Teams often maintain model-specific prompt templates and response validators.
Architecture
Prompt Registry -> Model Adapter -> Provider
Code
const promptByModel = {
"openai-gpt-5": buildOpenAIPrompt(input),
"anthropic-sonnet": buildAnthropicPrompt(input),
};
const payload = promptByModel[model] ?? buildDefaultPrompt(input);
const response = await callProvider(model, payload);
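The validator half of that pattern, continuing the snippet above (the shape checks are assumptions, not a fixed contract):

// Minimal per-model validators; each provider tends to fail in its own way
const validators: Record<string, (text: string) => boolean> = {
  "openai-gpt-5": (t) => t.trim().startsWith("{"),  // this route expects JSON output
  "anthropic-sonnet": (t) => t.trim().length > 0,   // this route expects plain prose
};
const isValid = (validators[model] ?? ((t: string) => t.length > 0))(response.text);
if (!isValid) {
  metrics.increment("ai.invalid_output", 1, { model });
  throw new Error("invalid_output_for_model"); // lets the fallback chain try the next model
}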
Retrieval bottlenecks
- Bad embeddings reduce answer quality before the LLM is even called
- Irrelevant chunks increase hallucination risk
- Vector and database latency can dominate total response time
Architecture
Query -> Re-ranker -> Top chunks -> LLM
Code
const chunks = await vectorStore.search(query, { topK: 20 });
const reranked = await rerank(query, chunks);
const selected = reranked.slice(0, 5);
if (!selected.length) return fallbackNoContext();
return callProvider(model, { query, context: selected });
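To keep vector and database latency from dominating response time, a simple budget can race the search against a timeout, continuing the snippet above (the 300 ms figure and helpers are illustrative):

// Give retrieval a hard latency budget; past it, answer without context rather than stall
function raceTimeout<T>(promise: Promise<T>, ms: number): Promise<T | null> {
  return Promise.race([
    promise,
    new Promise<null>((resolve) => setTimeout(() => resolve(null), ms)),
  ]);
}

const fastChunks = await raceTimeout(vectorStore.search(query, { topK: 20 }), 300);
if (!fastChunks) {
  metrics.increment("ai.retrieval_timeout", 1);
  return fallbackNoContext(); // a generic answer beats a slow one here
}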
Observability requirements
Track token usage, cost per request, fallback rates, invalid output rates, and latency per model/provider. Without this, debugging is guesswork.
Architecture
Request Path -> Metrics + Logs + Traces -> Alerting
Code
metrics.gauge("ai.cost_usd", usage.costUsd, { model, provider });
metrics.gauge("ai.tokens_in", usage.inputTokens, { model });
metrics.gauge("ai.tokens_out", usage.outputTokens, { model });
metrics.timing("ai.latency_ms", elapsedMs, { model, provider });
metrics.increment("ai.fallback_count", didFallback ? 1 : 0, { provider });
Failure modes in production
- Cache poisoning from malformed or low-quality responses
- Fallback loops that retry too aggressively and spike cost
- Silent downgrade to a weaker model without alerting
- Budget explosions during traffic spikes
Architecture
Guardrails -> Fallback Controller -> Budget Gate -> Alerts
Code
if (!isValidSchema(output)) throw new Error("reject_bad_output");
if (retryCount >= 2) return templateFallback();
if (monthlySpendUsd > hardCapUsd) return deferToQueue(request);
if (model !== expectedModel) alert("silent_model_downgrade", { model });
What real scale still needs
- Queue-based async processing for spikes and long tasks
- Rate limiting per provider to avoid bans and throttling
- Budget guardrails that downgrade or defer expensive requests
- Output validation layer (schema + policy checks)
- Multi-region failover for provider and network incidents
Architecture
API -> Queue -> Workers -> Provider Pools (multi-region) -> Validator
Code
await queue.publish("ai.jobs", job);
const permit = await limiter.consume(`${provider}:${region}`);
if (!permit) throw new Error("provider_rate_limited");
if (costSoFarUsd > userBudgetUsd) return downgradeModel();
const parsed = validateWithSchema(result, OutputSchema);
if (!parsed.ok) throw new Error("schema_validation_failed");
const activeRegion = await pickHealthyRegion(["us-east-1", "eu-west-1"]);
Positioning this correctly
LLM-first systems are fragile at scale. The next wave requires deeper engineering beyond prompting: reliability, cost control, and system design discipline.
Where this shows up in practice
- Support chatbots with strict SLA and policy constraints
- SaaS automation where cost per workflow must stay predictable
- Internal tools where fallback behavior matters more than demo quality
Architecture examples
Support chatbot: API -> Rules -> Retrieval -> LLM -> Policy Validator
SaaS automation: Webhook -> Queue -> Workflow Engine -> LLM step -> Audit Log
Internal tools: UI -> Cache -> Fallback chain -> Human handoff
Code
if (channel === "support" && slaMs < 2000) return fastPathAnswer(input);
if (workflow.costUsd > workflow.maxCostUsd) return stopAndNotify(workflow.id);
if (fallbackLevel > 1) return handoffToHuman(ticketId);
Final thought
LLMs are powerful, but they are a tool, not the system.
If LLM costs go 10x tomorrow, your architecture should still survive.