NLP

Categorify-Clustering

Developed a text clustering and profiling application that uses unsupervised machine learning to group similar text descriptions. The system uses Sentence Transformers to generate dense text embeddings, which are then clustered with K-Means, Gaussian Mixture Models (GMM), or Hierarchical Clustering. A FastAPI web interface lets users select a clustering model, submit text, and receive a human-readable profiling insight.

April 12, 2026

Technologies Used

FastAPI, Scikit-Learn, Sentence Transformers, K-Means, GMM, Python

Problem Statement

Organizations often hold large volumes of unstructured text, such as user profiles, feedback, or descriptions, but lack a systematic method to group it into actionable segments. Manual categorization is time-consuming and prone to human bias. The business problem is to extract thematic insights and cluster similar text entries without predefined labels.

Solution

Categorify-Clustering automates the text profiling process by applying unsupervised machine learning. By transforming raw text into dense vector embeddings using Sentence Transformers, the application clusters the data using K-Means, GMM, or Hierarchical Clustering. The integrated FastAPI backend provides a scalable interface for users to submit text and instantly receive a human-readable profile, enabling businesses to dynamically segment users or text data at scale.
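
The embed-then-cluster flow described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the model name `all-MiniLM-L6-v2`, the helper names `embed`, `fit_kmeans`, and `assign`, and the choice of K-Means for the example are all assumed.

```python
import numpy as np
from sklearn.cluster import KMeans


def embed(texts, model_name="all-MiniLM-L6-v2"):
    """Encode texts into dense vectors (model name is an assumption)."""
    # Imported lazily so the clustering helpers below work without the library.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # normalize_embeddings=True returns unit-length vectors.
    return model.encode(texts, normalize_embeddings=True)


def fit_kmeans(embeddings, n_clusters, seed=0):
    """Fit K-Means on a 2-D array of embeddings and return the fitted model."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(embeddings)


def assign(model, embedding):
    """Return the cluster id of a single embedding vector."""
    vector = np.asarray(embedding).reshape(1, -1)
    return int(model.predict(vector)[0])
```

With real data, `fit_kmeans(embed(texts), n_clusters=5)` would fit the model once, and `assign(model, embed([new_text])[0])` would label an incoming description. The value 5 is only a placeholder.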

Key Features

Sentence Transformer based text embeddings

Support for K-Means, GMM, and Hierarchical Clustering

FastAPI web interface for interactive profiling

Automated model serialization and inference pipeline
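
One way to offer the three algorithms behind a single interface is a small registry of factories. The sketch below is an assumption about how this could look, not the project's design: `REGISTRY`, `build`, and `CentroidAssigner` are hypothetical names. scikit-learn's `AgglomerativeClustering` has no `predict` method, so the wrapper assigns new points to the nearest cluster centroid, which is one possible workaround among several.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.mixture import GaussianMixture


class CentroidAssigner:
    """Gives a fit-only clusterer a predict() via nearest-centroid lookup."""

    def __init__(self, clusterer):
        self.clusterer = clusterer

    def fit(self, X):
        X = np.asarray(X)
        labels = self.clusterer.fit_predict(X)
        # np.unique returns sorted labels, so row k is the centroid of cluster k.
        self.centroids_ = np.vstack(
            [X[labels == k].mean(axis=0) for k in np.unique(labels)]
        )
        return self

    def predict(self, X):
        X = np.asarray(X)
        # Euclidean distance from every point to every centroid.
        dists = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return dists.argmin(axis=1)


# Each factory takes the number of clusters and returns an unfitted model
# exposing fit(X) and predict(X).
REGISTRY = {
    "kmeans": lambda k: KMeans(n_clusters=k, n_init=10, random_state=0),
    "gmm": lambda k: GaussianMixture(n_components=k, random_state=0),
    "hierarchical": lambda k: CentroidAssigner(AgglomerativeClustering(n_clusters=k)),
}


def build(name, n_clusters):
    """Look up a clustering model by name."""
    if name not in REGISTRY:
        raise ValueError(f"unknown clustering model: {name!r}")
    return REGISTRY[name](n_clusters)
```

A request handler could then call `build(selected_name, k).fit(embeddings)` without branching on the algorithm.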

Engineering Challenges

Tuning hyperparameters (like n_clusters) for optimal silhouette scores in an unsupervised setting

Managing embedding generation latency for real-time web interface requests

Designing a modular architecture to support multiple clustering algorithms seamlessly
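
The `n_clusters` tuning challenge can be illustrated with a simple sweep that keeps the value with the highest silhouette score. This is a hedged sketch: the helper name `best_k`, the candidate range, and the use of K-Means are assumptions for the example, and the source does not describe the search actually used.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def best_k(X, candidates=range(2, 9), seed=0):
    """Return (k with the highest silhouette score, all scores by k)."""
    scores = {}
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        # Silhouette ranges from -1 to 1; higher means tighter, better separated clusters.
        scores[k] = silhouette_score(X, labels)
    return max(scores, key=scores.get), scores
```

Silhouette scoring compares every point against the others, so on large corpora passing `sample_size` to `silhouette_score` may keep the sweep affordable.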

Results & Metrics

Successfully deployed a dynamic text clustering web service

Achieved more robust cluster separation with Sentence Transformer embeddings than with traditional TF-IDF vectors

Reduced manual text categorization effort to near-zero

Lessons Learned

Sentence Transformers significantly outperform sparse vector approaches for semantic clustering

Gaussian Mixture Models provide useful probabilistic cluster assignments for borderline data points

Exposing ML pipelines via FastAPI requires careful handling of model loading and memory management
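
The point about GMM and borderline data can be made concrete with `predict_proba`, which returns a membership probability per component instead of a single hard label. The sketch below is illustrative only: the helper names `soft_assignments` and `is_borderline` and the 0.6 threshold are assumptions, not values taken from the project.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def soft_assignments(X, n_components, seed=0):
    """Fit a GMM and return (model, per-point membership probabilities)."""
    gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X)
    # Each row of the result sums to 1 across the components.
    return gmm, gmm.predict_proba(X)


def is_borderline(probs, threshold=0.6):
    """Flag points whose strongest membership falls below the threshold."""
    return np.asarray(probs).max(axis=1) < threshold
```

Points flagged this way could be routed to manual review or reported with their top two candidate clusters rather than forced into one.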