Semantic Analysis Pipeline

Automated topic detection and clustering system processing 10K+ daily Discord messages for Z.ai community insights

Z.ai's Discord server receives tens of thousands of messages every day. Conversations overlap constantly: someone asks about an API issue, another user raises a billing question, then someone else continues the API thread. That interleaving made it nearly impossible for server staff to understand what users were actually talking about, or to find past discussions.

I built a full pipeline that automatically detects topic boundaries, clusters conversations, and makes them searchable, running every 12 hours across all channels.

Messages are pulled from Supabase in paginated chunks and pre-cleaned: bot messages, URLs, and other noise are stripped out. Every message is then embedded individually with an embedding model, and a custom TextTiling algorithm, adapted for chat, scans for topic boundaries. A sliding window moves through the conversation, comparing the three messages to the left of each gap against the three to the right using cosine similarity; a drop in similarity signals a topic shift. Depth scores measure how sharp each drop is, a moving average smooths them, and a threshold confirms real boundaries. Segments smaller than 3 messages get merged, and segments larger than 80 get force-split.
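The boundary-detection steps above can be sketched in plain Python. This is a minimal illustration, not the project's actual code: function names, the smoothing window, and the threshold are my assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def boundary_scores(embeddings, window=3):
    """Similarity across each gap: mean of `window` messages on either side."""
    sims = []
    for gap in range(window, len(embeddings) - window + 1):
        left = mean_vector(embeddings[gap - window:gap])
        right = mean_vector(embeddings[gap:gap + window])
        sims.append(cosine(left, right))
    return sims

def depth_scores(sims):
    """How deep each valley is: rise to the nearest peak on both sides."""
    depths = []
    for i, s in enumerate(sims):
        left_peak = s
        for j in range(i, -1, -1):
            if sims[j] >= left_peak:
                left_peak = sims[j]
            else:
                break
        right_peak = s
        for j in range(i, len(sims)):
            if sims[j] >= right_peak:
                right_peak = sims[j]
            else:
                break
        depths.append((left_peak - s) + (right_peak - s))
    return depths

def smooth(scores, k=3):
    """Moving average with window k."""
    half = k // 2
    return [sum(scores[max(0, i - half):i + half + 1])
            / len(scores[max(0, i - half):i + half + 1])
            for i in range(len(scores))]

def boundaries(depths, threshold, window=3):
    """Depth index i maps back to a boundary before message window + i."""
    return [window + i for i, d in enumerate(depths) if d > threshold]
```

On a toy conversation whose first half and second half embed into different directions, the deepest smoothed valley lands exactly at the topic switch.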

After segmentation, I build context blocks in which each message carries its two previous messages for richer meaning, never crossing a segment boundary so unrelated topics don't mix. These context blocks go through a second embedding pass, because single-message embeddings lack conversational context. This two-pass design gives accurate boundary detection in the first pass and high-quality search embeddings in the second.
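The context-block construction is simple enough to sketch directly; the function name and the `(start, end)` segment representation are illustrative assumptions:

```python
def build_context_blocks(messages, segments, context=2):
    """For each message, prepend up to `context` previous messages,
    clamping at the segment start so blocks never cross a boundary."""
    blocks = []
    for start, end in segments:  # each segment is (start, end), end exclusive
        for i in range(start, end):
            lo = max(start, i - context)
            blocks.append("\n".join(messages[lo:i + 1]))
    return blocks
```

The first message of every segment produces a block containing only itself, which is the behavior that keeps unrelated topics out of each other's embeddings.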

I then run a two-pass LLM classification. First, I sample 15 segments and let the LLM discover which categories actually exist in the server, such as API Issues, Bug Reports, or Billing Questions. Then I use those discovered categories to classify every segment, keeping labels grounded in real user conversations rather than a taxonomy guessed upfront.
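The discover-then-classify flow can be sketched with the LLM call injected as a plain callable. The prompts, sample size handling, and JSON output format here are my assumptions, not the project's actual prompts:

```python
import json
import random

DISCOVERY_PROMPT = (
    "Here are {n} conversation segments from a Discord server.\n"
    "List the distinct topic categories you see, as a JSON array of strings.\n\n"
    "{segments}"
)

def discover_categories(segments, llm, sample_size=15, seed=0):
    """Pass 1: let the model name the categories that actually occur."""
    sample = random.Random(seed).sample(segments, min(sample_size, len(segments)))
    prompt = DISCOVERY_PROMPT.format(n=len(sample), segments="\n---\n".join(sample))
    return json.loads(llm(prompt))

def classify_segments(segments, categories, llm):
    """Pass 2: force every segment into one of the discovered categories."""
    labels = []
    for seg in segments:
        prompt = (
            f"Categories: {', '.join(categories)}\n"
            f"Segment:\n{seg}\n"
            "Answer with exactly one category name."
        )
        labels.append(llm(prompt).strip())
    return labels
```

Injecting `llm` as a parameter keeps the two passes testable with a stub and independent of any particular model provider.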

Everything is stored in Qdrant, a dedicated vector database, with metadata such as timestamps, channel IDs, and boundary scores to support filtered search. Structured cluster data goes to Supabase.
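The shape of one stored point might look like the sketch below. The exact field names are my guesses from the metadata listed above; with the real `qdrant-client` this dict would become a `PointStruct` passed to `upsert`, and the payload fields are what make filtered search (by channel, time range, or category) possible:

```python
def make_point(message_id, vector, meta):
    """Assemble one Qdrant-style point: id, vector, and a filterable payload.
    Field names are illustrative, not the project's actual schema."""
    return {
        "id": message_id,
        "vector": vector,  # second-pass context embedding
        "payload": {
            "channel_id": meta["channel_id"],
            "timestamp": meta["timestamp"],        # e.g. ISO-8601 string
            "segment_id": meta["segment_id"],
            "boundary_score": meta["boundary_score"],
            "category": meta["category"],          # label from the LLM pass
        },
    }
```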

The result: staff can now see topic distribution across the server, search past conversations by meaning rather than keywords, spot spikes in specific issues, and correlate problems with release timelines. What used to take hours of manual reading now runs automatically.

The pipeline uses Redis distributed locking to prevent duplicate runs, and its batch orchestration is fault-tolerant, with exponential-backoff retries throughout.
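The retry behavior can be sketched as a small wrapper; the attempt count, base delay, and jitter factor here are illustrative defaults, not the project's configuration:

```python
import random
import time

def with_retries(fn, *, attempts=5, base_delay=1.0, max_delay=60.0):
    """Call fn, retrying on any exception with exponential backoff plus
    a little jitter; re-raise after the final attempt fails."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
            time.sleep(delay + random.uniform(0, delay * 0.1))
```

Each batch step (Supabase fetch, embedding call, Qdrant upsert) can be wrapped this way so a transient failure costs one delayed retry instead of a failed run.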

Created by Hasin Raiyan