We Built an AI Customer Service in 3 Months, Ran 100K Conversations, Then Killed It — A Postmortem


🤖 This article was generated by AI. Content is for informational purposes only.

Let me start with the conclusion: we spent three months building an AI customer service system, ran 100,000 conversations through it, watched the resolution rate go up, and then killed the whole thing.


It wasn't a technical failure. Honestly, technically speaking, it was quite successful. But this experience taught me something — AI customer service usually doesn't die from technical problems. It dies from business problems.


Today I'm breaking down the entire process, hoping to help teams considering AI customer service avoid the same detours.

Why We Wanted AI Customer Service

We're an enterprise SaaS company with roughly 50,000 users and around 10,000 daily actives. The support team had 4 people handling about 300-400 tickets per day, more during peak periods.


The boss's math was simple: 4 agents at ¥8K/month each, about ¥400K/year total. If AI could handle 70% of queries, keeping 1 person as backup would save ¥300K/year. Plus, AI doesn't sleep — 24/7 availability means better user experience.
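The projection above can be reproduced in a few lines. This is only a sketch of the salary side of the ledger using the figures quoted in the paragraph; the variable and function names are illustrative, and it deliberately leaves out API fees, build costs, and churn.

```python
# Back-of-the-envelope version of the staffing projection.
# Figures come from the paragraph above; names are illustrative.
AGENTS = 4
MONTHLY_SALARY_CNY = 8_000
AGENTS_KEPT = 1  # one human kept as backup


def annual_cost(headcount: int, monthly_salary: int = MONTHLY_SALARY_CNY) -> int:
    """Yearly salary cost in CNY for a given headcount."""
    return headcount * monthly_salary * 12


current_cost = annual_cost(AGENTS)
projected_cost = annual_cost(AGENTS_KEPT)
projected_savings = current_cost - projected_cost

print(f"current support payroll: {current_cost:,}")    # 384,000 (the "about 400K")
print(f"projected savings:       {projected_savings:,}")  # 288,000 (the "300K")
```

Note what the model has no line items for: anything other than salaries.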


The math was sound. On paper, at least.

How We Built It in Three Months

Month 1 was mainly tool selection and framework setup. We tried several solutions before settling on Coze for conversation orchestration, with the Claude API as the underlying model. Why Coze? Because it's relatively friendly to non-technical people: product managers could adjust conversation flows themselves without needing developers to write code for everything. Anthropic had just released Claude Opus 4.6, and its long-text comprehension and multi-turn reasoning were genuinely strong. We figured it would be more than enough for customer service conversations.


Month 2 was all about feeding the knowledge base. We imported all our ticket records from the past two years, FAQ documents, and product manuals — about 8,000+ QA pairs. We even had the support lead annotate which answers were “good” and which were “bad,” and used that data for a round of fine-tuning.


Month 3 was a gray release (a canary rollout). We routed 10% of traffic to the AI agent first, with human agents online as backup. We reviewed data daily, tuned prompts, and fixed bugs. By the end of month three, the resolution rate stabilized at around 72%, and we felt ready for full deployment.
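The article doesn't say how the 10% split was implemented. One common approach, sketched here as an assumption rather than a description of our setup, is to hash a stable user ID into a bucket so the same user always lands on the same side. The function name `routes_to_ai` and the ID format are hypothetical.

```python
import hashlib


def routes_to_ai(user_id: str, rollout_percent: int = 10) -> bool:
    """Deterministically send roughly rollout_percent% of users to the AI agent.

    Hashing the user ID (instead of picking randomly per request) keeps
    routing sticky: a given user never flips between AI and human mid-rollout.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < rollout_percent


# Rough check of the split over synthetic user IDs.
users = [f"user-{i}" for i in range(10_000)]
share = sum(routes_to_ai(u) for u in users) / len(users)
print(f"{share:.1%} of users routed to the AI agent")
```

Raising `rollout_percent` only ever adds users to the AI bucket, so nobody who was already on the AI path gets moved back as the rollout widens.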

The Real Data from 100K Conversations

After going live, we ran for a month and processed 100,000 conversations. The numbers looked okay:

  • Resolution rate: 72%
  • Average response time: down from 4 minutes (human) to 8 seconds
  • First-response satisfaction: 87%

But a few metrics were buried deep, and nobody noticed them at first.


First, complaint rate went up 15%. Yes, you read that right — complaints increased, not decreased. I'll explain why below.


Second, human handoff rate was 28%. That means nearly 3 out of every 10 conversations still ended up with a human. But here's the problem: when conversations were handed off, the context was often a mess. The AI had chatted with the user for a while, but either didn't ask for key information or recorded it incorrectly. The human agent basically had to start from scratch.


Third, API costs far exceeded expectations. We originally estimated about ¥5,000/month in API fees. The actual cost was over ¥12,000, more than double. The reason: many conversations were very long. A user asks a question, the AI answers, the user isn't satisfied and rephrases, and this goes back and forth for several rounds, multiplying token consumption.
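One plausible reason long conversations are so expensive: with a stateless chat API, each request resends the earlier turns, so billed input tokens grow faster than the number of rounds. The sketch below illustrates that effect. The token counts are made-up placeholders, not our measured numbers, and an orchestration layer that truncates or summarizes history would behave differently.

```python
def conversation_tokens(
    rounds: int,
    system_tokens: int = 1500,  # hypothetical: system prompt plus retrieved KB text
    user_tokens: int = 60,      # hypothetical average user message
    reply_tokens: int = 200,    # hypothetical average AI reply
) -> tuple[int, int]:
    """Return (input_tokens, output_tokens) billed over a whole conversation,
    assuming every request resends the full history so far."""
    input_total = 0
    history = 0  # tokens from earlier turns that must be resent
    for _ in range(rounds):
        input_total += system_tokens + history + user_tokens
        history += user_tokens + reply_tokens
    return input_total, rounds * reply_tokens


for rounds in (1, 2, 4, 8):
    input_tokens, output_tokens = conversation_tokens(rounds)
    print(f"{rounds} rounds: input={input_tokens}, output={output_tokens}")
```

Under these assumptions a four-round conversation bills 7,800 input tokens against 1,560 for a single round, five times as much rather than four, and the gap keeps widening with every rephrase.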

Why We Finally Killed It

What made us decide to pull the plug wasn't money. It was complaints.


That 15% increase in complaint rate, when broken down, fell into two main categories:


The first: “Your AI doesn't understand humans at all.” These were usually users with complex problems requiring multi-step troubleshooting. Something like “my data sync keeps failing” — the root cause could be permissions, network, version, you name it. The AI could only offer generic solutions, and when users tried them and they didn't work, they got furious.


The second type was more devastating: “Are you guys just trying to avoid solving my problem by shoving a robot at me?” Once that sentiment takes hold, no amount of remediation can salvage it. We lost several key accounts, all because they felt we “didn't care.”


Frankly, users don't want “instant replies.” They want to feel “taken seriously.” The AI's speed was impressive, but that speed actually made users feel dismissed.

What We Got Wrong

Mistake 1: Knowledge Base Quality Matters More Than Model Capability

We spent a lot of time tuning prompts, but eventually realized the real impact came from the knowledge base. 8,000 QA pairs sounds like a lot, but many contained outdated information — the product had been updated but the FAQ hadn't, features had been deprecated but old docs remained. The AI would confidently recommend a feature that no longer existed. How could users not be angry?

Mistake 2: The Human Handoff Was Broken

This was the biggest pitfall. The handoff between AI and human was too crude. When an agent picked up a conversation, they either couldn't see what the AI had discussed, or they could see it but the information was structured so poorly that it was unusable. We later added a conversation summary feature where the AI auto-generates a summary before handing off, but the results were mediocre: the AI's idea of what was important didn't match the agent's. What the agent wanted to know, the AI hadn't recorded.
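In hindsight, one way to attack this mismatch is to let the agents define the fields they need and have the AI fill a fixed structure, instead of letting it write a free-form summary. The sketch below is not what we built; the `HandoffNote` fields and the `REQUIRED` list are hypothetical examples of what a support team might ask for.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class HandoffNote:
    """Structured context passed from the AI to a human agent (hypothetical fields)."""
    account_id: Optional[str] = None
    product_version: Optional[str] = None
    problem_statement: Optional[str] = None
    error_message: Optional[str] = None
    steps_already_tried: list[str] = field(default_factory=list)


# Fields the agents say they cannot work without (an assumption for illustration).
REQUIRED = ("account_id", "product_version", "problem_statement")


def missing_fields(note: HandoffNote) -> list[str]:
    """List required fields the AI has not collected yet."""
    return [name for name in REQUIRED if not getattr(note, name)]


note = HandoffNote(
    problem_statement="Data sync keeps failing",
    steps_already_tried=["logged out and back in"],
)
print(missing_fields(note))  # ['account_id', 'product_version']
```

A check like `missing_fields` could gate the handoff: the AI asks for whatever is still missing before transferring, so the agent does not have to start from scratch.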

Mistake 3: We Underestimated the Damage from Long-Tail Problems

A 72% resolution rate looks good, but it means 28% of problems weren't resolved. Some of that 28% were genuinely complex issues the AI couldn't handle. But another portion were issues the AI “handled” — incorrectly. The latter is the real bomb. Users don't know the answer is wrong, follow the AI's advice, create bigger problems, and then come back furious.

Which Scenarios Actually Justify AI Customer Service

After killing it, we did a long retrospective and came to a few conclusions:


Good fits for AI customer service:

  • Highly standardized, high-repetition scenarios (tracking shipments, checking balances, resetting passwords)
  • Scenarios where user expectation is “get a quick answer” rather than “deeply solve my problem”
  • High-volume, low-per-interaction-value scenarios (like consumer-facing e-commerce inquiries)

Bad fits:

  • B2B SaaS — user questions often involve business logic and require understanding context
  • High-LTV products — every customer is precious, can't risk it with AI
  • Long problem chains — scenarios requiring multi-round investigation and cross-department coordination

We happened to hit all three bad-fit criteria simultaneously.

The Final Tally

Let's do the full accounting: 3 months of build costs (3-person team + API fees) came to about ¥180K. One month of live API costs: ¥12K. Several key accounts lost due to complaints: combined annual contract value around ¥500K. Discounts and compensation to win back those clients: another ¥100K or so.


And the agent salary savings? About ¥60K over 3 months.


Do the math however you want — it's a loss.
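For anyone who wants the ledger spelled out, here is the same tally as a short script. It uses only the figures quoted above, in thousands of yuan, and takes the lost annual contract value at face value as a cost, as the accounting above does; the dictionary labels are just descriptive.

```python
# All amounts in thousands of CNY, taken from the tally above.
costs_k_cny = {
    "build: 3-person team + API fees, 3 months": 180,
    "live API costs, 1 month": 12,
    "annual contract value of lost key accounts": 500,
    "discounts and compensation to win clients back": 100,
}
savings_k_cny = 60  # agent salary savings

total_cost = sum(costs_k_cny.values())
net = savings_k_cny - total_cost

print(f"total cost: {total_cost}K")  # 792K
print(f"net result: {net}K")         # -732K
```

Even if you drop the lost contracts entirely, the remaining ¥292K of costs still outweighs ¥60K of savings.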


One last thing: I'm not anti-AI customer service. In the right scenarios, it genuinely reduces costs and improves efficiency, no question about it. What I'm saying is, don't get seduced by the “AI can replace customer service” narrative. It replaces repetitive labor, but it can't replace “making users feel valued.” That's the core value of customer service.


If you're considering AI customer service, ask yourself one question first: do your users want speed, or do they want to be taken seriously? If it's the latter, think about how much you're willing to pay for that.


We paid the tuition. Hopefully, you won't have to pay it again.