Hindi Voice AI QA Scorecard

Figure: Hindi Voice AI QA: a practical scorecard for ASR, intent, and accent testing (technical diagram)

What should Voice AI QA measure?

A Hindi voice AI demo can look easy, but production brings a different set of challenges: noisy audio, mixed Hindi-English speech, local accents, incomplete commands, and emotional customer support conversations.

That is why checking only whether "the transcript came out right" is not enough. Voice AI QA should measure ASR, intent, response, escalation, and business outcome.

Fact base

Microsoft's Speech documentation describes word error rate (WER) as the standard metric for speech recognition accuracy. WER adds up insertion, deletion, and substitution errors and divides the sum by the total number of words in the reference transcript.
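As a rough illustration of the WER definition above, the following minimal Python sketch computes WER with a word-level edit distance. The function name `word_error_rate` and the plain whitespace tokenization are assumptions made for this example; a real evaluation would normally also normalize casing, punctuation, and script (Devanagari vs. Latin) before comparing.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing in hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            )
        prev = curr
    return prev[-1] / len(ref)
```

For the reference "payment kat gaya par order confirm nahi hua" and the hypothesis "payment cut gaya order confirm nahi hua", there is one substitution ("kat" to "cut") and one deletion ("par") over 8 reference words, so the function returns 2/8 = 0.25.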

The IndicVoices paper describes a dataset of 7,348 hours of speech from 16,237 speakers across 145 districts and 22 Indian languages. This shows how much geographic and linguistic diversity Indian speech data contains.

According to PIB's BHASHINI note, BHASHINI supports voice services in 22 languages and text services in 36 languages. In other words, Indian-language voice workflows have become a national-scale use case.

Minimum scorecard

ASR WER: measure separately for clean, noisy, mobile, and regional-accent buckets.
Intent accuracy: whether the system identified the intent the user actually had.
Slot accuracy: whether entities such as amount, date, order ID, language, and location were extracted correctly.
Escalation accuracy: whether cases such as refund, fraud, KYC, or an angry user reached a human at the right time.
Resolution quality: whether the user received a useful next step.

Layer	Metric	Human QA check
Audio	WER / noise	Does the transcript match the real audio?
NLU	Intent / slots	Was the user's business intent captured correctly?
Policy	Escalation	Did the risky case reach a human agent?
Experience	Completion	Did the customer get a clear answer or next step?
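One way to turn the scorecard into numbers is to aggregate reviewer labels per audio bucket. The sketch below is only an assumption about how such records might look: the field names (`bucket`, `intent_gold`, `intent_pred`, `escalated_gold`, `escalated_pred`, `slots_gold`, `slots_pred`) and the sample calls are hypothetical, not a TrainPlex schema.

```python
from collections import defaultdict

def scorecard_by_bucket(calls):
    """Aggregate intent, slot, and escalation accuracy per audio bucket."""
    totals = defaultdict(lambda: {"calls": 0, "intent_ok": 0, "escalation_ok": 0,
                                  "slots_ok": 0, "slots_total": 0})
    for call in calls:
        row = totals[call["bucket"]]
        row["calls"] += 1
        row["intent_ok"] += int(call["intent_pred"] == call["intent_gold"])
        row["escalation_ok"] += int(call["escalated_pred"] == call["escalated_gold"])
        # Each reviewer-labeled slot counts once; a missing or wrong value is an error.
        for name, gold_value in call["slots_gold"].items():
            row["slots_total"] += 1
            row["slots_ok"] += int(call["slots_pred"].get(name) == gold_value)

    report = {}
    for bucket, row in totals.items():
        report[bucket] = {
            "intent_accuracy": row["intent_ok"] / row["calls"],
            "escalation_accuracy": row["escalation_ok"] / row["calls"],
            # None when no slots were labeled in this bucket.
            "slot_accuracy": (row["slots_ok"] / row["slots_total"]
                              if row["slots_total"] else None),
        }
    return report

# Hypothetical reviewed calls, for illustration only.
sample_calls = [
    {"bucket": "clean", "intent_gold": "refund", "intent_pred": "refund",
     "escalated_gold": True, "escalated_pred": True,
     "slots_gold": {"order_id": "A12", "amount": "500"},
     "slots_pred": {"order_id": "A12", "amount": "500"}},
    {"bucket": "noisy", "intent_gold": "order_status", "intent_pred": "payment_issue",
     "escalated_gold": False, "escalated_pred": False,
     "slots_gold": {"order_id": "B77"}, "slots_pred": {}},
    {"bucket": "noisy", "intent_gold": "kyc", "intent_pred": "kyc",
     "escalated_gold": True, "escalated_pred": False,
     "slots_gold": {}, "slots_pred": {}},
]
report = scorecard_by_bucket(sample_calls)
```

Reporting per bucket rather than as one global number keeps a weak noisy or regional-accent bucket from being averaged away by a strong clean bucket.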

TrainPlex workflow

TrainPlex reviewers provide a transcript check, an intent label, a response score, a failure reason, and an improvement note. These feed a weekly failure taxonomy: accent miss, Hinglish miss, domain word miss, wrong escalation, unsafe answer, or incomplete answer.

This gives the AI team more than an "accuracy number": it gives them an actionable backlog that product and model teams can work through.
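The weekly failure taxonomy can be thought of as a simple count of reviewer-assigned failure reasons, sorted by frequency. The sketch below assumes each review is a dictionary with a `failure_reason` field (or `None` when the call passed); the label strings and the function name `weekly_failure_taxonomy` are hypothetical.

```python
from collections import Counter

# The six failure reasons named in the workflow, as hypothetical label strings.
FAILURE_REASONS = {
    "accent_miss", "hinglish_miss", "domain_word_miss",
    "wrong_escalation", "unsafe_answer", "incomplete_answer",
}

def weekly_failure_taxonomy(reviews):
    """Return (reason, count) pairs, most frequent first."""
    counts = Counter()
    for review in reviews:
        reason = review.get("failure_reason")
        if reason is None:
            continue  # the call passed review, nothing to count
        if reason not in FAILURE_REASONS:
            raise ValueError(f"unknown failure reason: {reason!r}")
        counts[reason] += 1
    return counts.most_common()

# Hypothetical reviews, for illustration only.
reviews = [
    {"call_id": "c1", "failure_reason": "hinglish_miss"},
    {"call_id": "c2", "failure_reason": None},
    {"call_id": "c3", "failure_reason": "hinglish_miss"},
    {"call_id": "c4", "failure_reason": "wrong_escalation"},
]
backlog = weekly_failure_taxonomy(reviews)
```

Rejecting unknown labels keeps the taxonomy closed, so that week-over-week counts stay comparable.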

Production QA plan: how to run a 7-day pilot

The best way to start Voice AI QA is a small but controlled pilot. First take 100–300 real or realistic calls, divide them into language, accent, channel, and intent buckets, then draw a sample from each bucket and send it for human review.
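The bucket-then-sample step of the pilot can be sketched as stratified sampling. This is a minimal example under stated assumptions: calls are dictionaries, the bucket fields passed in `key_fields` are hypothetical names, and a fixed per-bucket quota is used, which is one reasonable choice rather than the only one.

```python
import random
from collections import defaultdict

def stratified_sample(calls, key_fields, per_bucket, seed=0):
    """Draw up to `per_bucket` calls from every combination of `key_fields`."""
    rng = random.Random(seed)  # fixed seed so the review sample is reproducible
    buckets = defaultdict(list)
    for call in calls:
        key = tuple(call[field] for field in key_fields)
        buckets[key].append(call)

    sample = []
    for key in sorted(buckets):  # sorted so the output order is stable
        members = buckets[key]
        # Small buckets are taken whole instead of raising an error.
        sample.extend(rng.sample(members, min(per_bucket, len(members))))
    return sample
```

A call such as `stratified_sample(calls, ("language", "accent", "channel", "intent"), per_bucket=5)` would give every bucket the same review budget, so rare accents or channels are not crowded out by the most common call type.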

If your bot handles workflows such as sales qualification, customer support, payment follow-up, or collections, each workflow needs its own rubric. On a sales call, "lead intent" matters most; on a support call, "resolution" and "escalation" matter more.

Reviewer rubric example

Transcript: Did ASR capture the user's words correctly?
Intent: Did the system understand what the user needed?
Entity: Were details such as amount, date, city, product, and ticket ID extracted correctly?
Response: Was the answer useful, safe, and context-aware?
Escalation: Was an unclear or risky case sent to the human team?
Tone: Did the reply sound natural in an Indian customer context?
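The rubric above could be captured as a small structured record so that reviews are easy to aggregate later. The sketch assumes a simple pass/fail answer per dimension; the class name `RubricReview`, its field names, and the pass/fail scale are all illustrative assumptions, since the rubric itself does not prescribe a scoring scale.

```python
from dataclasses import dataclass, fields

@dataclass
class RubricReview:
    """One reviewer's pass/fail answers for a single call (hypothetical schema)."""
    call_id: str
    transcript_ok: bool
    intent_ok: bool
    entity_ok: bool
    response_ok: bool
    escalation_ok: bool
    tone_ok: bool
    note: str = ""

    def failed_checks(self) -> list[str]:
        # Collect the rubric dimensions (fields ending in "_ok") marked False.
        return [f.name[:-3] for f in fields(self)
                if f.name.endswith("_ok") and not getattr(self, f.name)]

    def passed(self) -> bool:
        return not self.failed_checks()

# Hypothetical review, for illustration only.
review = RubricReview(
    call_id="c42",
    transcript_ok=True, intent_ok=True, entity_ok=False,
    response_ok=True, escalation_ok=False, tone_ok=True,
    note="Order ID misheard; KYC case was not escalated.",
)
```

Here `review.failed_checks()` returns `["entity", "escalation"]`, which can be mapped onto failure reasons when the taxonomy is built.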

Common Hindi/Hinglish failure cases

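Some failures come up again and again in Indian voice AI. For example, should the bot treat "payment kat gaya par order confirm nahi hua" (the payment was deducted but the order was not confirmed) as a payment issue or an order status issue? In "kal wala plan cancel karna hai" (I want to cancel the plan from "kal"), which date does "kal wala" refer to, given that "kal" can mean either yesterday or tomorrow? In "Mera KYC stuck hai" (my KYC is stuck), the KYC workflow is compliance-sensitive, so a wrong answer can be high-risk.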

It is easy to bury these cases inside a generic accuracy metric. A trained human reviewer captures them in the failure taxonomy so the product team can make targeted fixes.

What output should the client receive?

Deliverable	What the client receives	Use
QA scorecard	WER, intent accuracy, escalation accuracy, resolution score	Weekly quality tracking
Failure taxonomy	Top failure reasons by language/accent/workflow	Model and prompt improvement
Reviewed samples	Accepted/rejected examples with notes	Training and audit trail
Improvement backlog	Prioritized fixes for product/model team	Next sprint planning

When to involve humans

Not every call needs human review. But human review is very valuable during the launch phase, new language rollouts, high-risk workflows, and complaint-heavy categories. In a mature system, use a sampling strategy: review fewer high-confidence calls and more low-confidence or high-risk calls.

This balanced model keeps cost under control while improving quality. TrainPlex builds this operating layer.
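The sampling strategy described above could be sketched as a review probability that depends on risk and model confidence. Everything specific here is an assumption for illustration: the set of high-risk intents, the 0.6 confidence threshold, the 100% / 50% / 5% review rates, and the field names `confidence`, `intent`, and `new_language`.

```python
import random

# Assumed high-risk intents; adjust to the workflows the bot actually handles.
HIGH_RISK_INTENTS = {"refund", "fraud", "kyc"}

def review_probability(confidence, intent, new_language=False):
    """Return the chance (0.0 to 1.0) that a call is sent to human review."""
    if intent in HIGH_RISK_INTENTS or new_language:
        return 1.0   # always review high-risk workflows and new rollouts
    if confidence < 0.6:
        return 0.5   # review low-confidence calls heavily
    return 0.05      # spot-check high-confidence calls

def select_for_review(calls, seed=0):
    """Pick calls for human review according to review_probability."""
    rng = random.Random(seed)  # seeded for a reproducible selection
    return [
        call for call in calls
        if rng.random() < review_probability(
            call["confidence"], call["intent"], call.get("new_language", False)
        )
    ]
```

Because `random.Random.random()` returns values in [0.0, 1.0), a probability of 1.0 guarantees that every high-risk or new-language call is selected, while the lower rates trade review cost against coverage.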