Core System Architecture

Every ANET call involves three coordinated systems.

Speech-to-Text (STT)

STT:

  • Converts caller speech into transcript text

  • Detects pauses and end-of-turn signals

  • Provides confidence signals

  • Supports language detection

Transcription accuracy directly affects intent identification, because the LLM works from the transcript text that STT produces.
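As an illustration only, the four STT outputs listed above could be carried in a single record per caller turn. The field names, the 0.0 to 1.0 confidence scale, and the threshold value below are assumptions made for this sketch, not ANET's actual data model.

```python
from dataclasses import dataclass


@dataclass
class SttResult:
    """Hypothetical container for what STT reports on one caller turn."""
    transcript: str     # caller speech converted to text
    end_of_turn: bool   # True once a pause signals the caller has finished
    confidence: float   # assumed scale: 0.0 (unreliable) to 1.0 (certain)
    language: str       # detected language code, e.g. "en"


# Assumed cut-off for this sketch; a real deployment would tune this value.
CONFIDENCE_THRESHOLD = 0.6


def ready_for_intent(result: SttResult) -> bool:
    """Return True when the turn is complete and the transcript is trusted."""
    return result.end_of_turn and result.confidence >= CONFIDENCE_THRESHOLD
```

In this sketch, a low-confidence or unfinished turn is held back rather than passed on, which reflects why transcription quality matters for intent identification.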


Language Model (LLM)

The LLM:

  • Interprets the caller's meaning

  • Identifies likely intent

  • Determines the next best question

  • Applies configured routing logic

  • Generates structured summaries

The LLM does not operate independently. It follows configured intent and action rules.
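One way to picture "follows configured intent and action rules" is a lookup that constrains whatever intent the model proposes. The intent names, action names, and fallback behaviour below are hypothetical examples, not ANET's real configuration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical configuration: each known intent maps to exactly one action.
INTENT_ACTIONS = {
    "billing_question": "transfer_to_billing",
    "schedule_appointment": "collect_appointment_details",
}

# Assumed fallback when the proposed intent is not configured.
FALLBACK_ACTION = "ask_clarifying_question"


@dataclass
class CallDecision:
    intent: Optional[str]  # None when no configured intent matched
    action: str            # always one of the configured actions or the fallback
    summary: dict          # structured summary of the decision


def decide(proposed_intent: str, transcript: str) -> CallDecision:
    """Accept the proposed intent only if configuration allows it."""
    if proposed_intent in INTENT_ACTIONS:
        intent = proposed_intent
        action = INTENT_ACTIONS[proposed_intent]
    else:
        intent = None
        action = FALLBACK_ACTION
    summary = {"intent": intent, "action": action, "transcript": transcript}
    return CallDecision(intent=intent, action=action, summary=summary)
```

The point of the sketch is that the model's output never selects an action directly; an unconfigured intent falls through to a clarifying question instead.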


Text-to-Speech (TTS)

TTS:

  • Converts system responses into audio

  • Maintains language alignment with STT

  • Produces real-time conversational responses

When language changes, the entire stack (STT, LLM, TTS) changes together.
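The "changes together" rule can be sketched as a single configuration object that is replaced as a whole, so the three components can never disagree on language. The class and function names here are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StackConfig:
    """Hypothetical per-call language settings for all three systems."""
    stt_language: str
    llm_language: str
    tts_language: str


def stack_for(language: str) -> StackConfig:
    """Build a configuration in which STT, LLM, and TTS share one language."""
    return StackConfig(language, language, language)


def on_language_detected(current: StackConfig, detected: str) -> StackConfig:
    """Swap the whole stack when STT detects a different language."""
    if detected == current.stt_language:
        return current  # no change needed
    return stack_for(detected)
```

Because the configuration is immutable and only ever built through `stack_for`, a partial switch (for example, Spanish STT with English TTS) cannot be expressed in this sketch.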