Categorify-Clustering
Categorify-Clustering is a text clustering and profiling application that uses unsupervised machine learning to group similar text descriptions. It uses Sentence Transformers to generate dense text embeddings, which are then clustered with K-Means, Gaussian Mixture Models (GMM), or Hierarchical Clustering. A FastAPI web interface lets users select a clustering model, submit text, and receive a human-readable profiling insight.
Technologies Used
Sentence Transformers, K-Means, Gaussian Mixture Models (GMM), Hierarchical Clustering, FastAPI
Problem Statement
Organizations often hold large volumes of unstructured text, such as user profiles, feedback, or descriptions, but lack a systematic way to group it into actionable segments. Manual categorization is time-consuming and prone to human bias. The business problem is to extract thematic insights and cluster similar text entries without predefined labels.
Solution
Categorify-Clustering automates text profiling with unsupervised machine learning. It transforms raw text into dense vector embeddings with Sentence Transformers, then clusters those embeddings with K-Means, GMM, or Hierarchical Clustering. A FastAPI backend lets users submit text and receive a human-readable profile in response, so businesses can segment users or text data without manual labeling.
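The flow described above (embed the text, then hand the vectors to one of three clusterers) could be sketched roughly as follows. This is a minimal illustration and not the project's actual code: it assumes scikit-learn provides the clustering algorithms, and the model name all-MiniLM-L6-v2, the CLUSTERERS registry, and every helper name are hypothetical. Heavy libraries are imported inside the functions so that the module itself stays cheap to load.

```python
from typing import Callable, Dict, List, Sequence


def embed_texts(texts: Sequence[str], model_name: str = "all-MiniLM-L6-v2"):
    """Encode texts into dense vectors. The model name is an assumption."""
    from sentence_transformers import SentenceTransformer

    # Loaded per call for brevity; a service would normally reuse one instance.
    model = SentenceTransformer(model_name)
    return model.encode(list(texts), normalize_embeddings=True)


def _make_kmeans(n_clusters: int):
    from sklearn.cluster import KMeans

    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0)


def _make_gmm(n_clusters: int):
    from sklearn.mixture import GaussianMixture

    return GaussianMixture(n_components=n_clusters, random_state=0)


def _make_hierarchical(n_clusters: int):
    from sklearn.cluster import AgglomerativeClustering

    return AgglomerativeClustering(n_clusters=n_clusters, linkage="ward")


# Hypothetical registry: algorithm name -> factory taking the cluster count.
CLUSTERERS: Dict[str, Callable[[int], object]] = {
    "kmeans": _make_kmeans,
    "gmm": _make_gmm,
    "hierarchical": _make_hierarchical,
}


def build_clusterer(algorithm: str, n_clusters: int):
    """Look up the requested algorithm and build an unfitted estimator."""
    try:
        factory = CLUSTERERS[algorithm]
    except KeyError:
        raise ValueError(
            f"unknown algorithm {algorithm!r}; choose from {sorted(CLUSTERERS)}"
        ) from None
    return factory(n_clusters)


def cluster_texts(
    texts: Sequence[str], algorithm: str = "kmeans", n_clusters: int = 3
) -> List[int]:
    """Embed the texts and return one cluster label per text."""
    embeddings = embed_texts(texts)
    labels = build_clusterer(algorithm, n_clusters).fit_predict(embeddings)
    return [int(label) for label in labels]
```

One design point worth noting: scikit-learn's AgglomerativeClustering offers fit_predict but no predict method, so placing new text into existing hierarchical clusters would need an extra step, for example assigning it to the nearest cluster centroid.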
Key Features
Sentence Transformer-based text embeddings
Support for K-Means, GMM, and Hierarchical Clustering
FastAPI web interface for interactive profiling
Automated model serialization and inference pipeline
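The serialization and inference feature listed above might look something like the sketch below. It is an illustration under stated assumptions, not the project's pipeline: ClusterProfileModel, save_model, and load_model are hypothetical names, the profile texts are made-up placeholders, and the fitted model is reduced to one centroid per cluster stored as JSON. Fitted scikit-learn estimators are more commonly persisted with joblib or pickle.

```python
import json
import math
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import List, Sequence


@dataclass
class ClusterProfileModel:
    """Hypothetical fitted artifact: one centroid and one profile text per cluster."""

    centroids: List[List[float]]
    profiles: List[str]

    def predict(self, embedding: Sequence[float]) -> int:
        """Return the index of the nearest centroid (Euclidean distance)."""
        distances = [math.dist(embedding, centroid) for centroid in self.centroids]
        return distances.index(min(distances))

    def profile(self, embedding: Sequence[float]) -> str:
        """Map an embedding to the human-readable text of its cluster."""
        return self.profiles[self.predict(embedding)]


def save_model(model: ClusterProfileModel, path: Path) -> None:
    """Write the model to disk as JSON."""
    path.write_text(json.dumps(asdict(model)), encoding="utf-8")


def load_model(path: Path) -> ClusterProfileModel:
    """Rebuild the model from the JSON file written by save_model."""
    return ClusterProfileModel(**json.loads(path.read_text(encoding="utf-8")))


# Toy two-dimensional "embeddings"; real sentence embeddings are much wider.
model = ClusterProfileModel(
    centroids=[[0.0, 0.0], [5.0, 5.0]],
    profiles=["Cluster 0: short product feedback", "Cluster 1: detailed user bios"],
)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / "cluster_model.json"
    save_model(model, model_path)      # training side: persist once
    restored = load_model(model_path)  # serving side: load at startup

insight = restored.profile([4.6, 5.2])
```

Storing only centroids keeps the file small and readable, at the cost of discarding anything else the clusterer learned, such as GMM covariances.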
Engineering Challenges
Tuning hyperparameters (like n_clusters) for optimal silhouette scores in an unsupervised setting
Managing embedding generation latency for real-time web interface requests
Designing a modular architecture to support multiple clustering algorithms seamlessly
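To make the first challenge concrete, the sketch below computes the silhouette score by hand and uses it to choose between two candidate clusterings of the same points. It is a plain-Python illustration on toy data, not the project's tuning code. The points and both labelings are invented for the example, and in practice a library routine such as scikit-learn's silhouette_score would normally be used instead.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

Point = Sequence[float]


def silhouette(points: Sequence[Point], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all points, using Euclidean distance."""
    clusters: Dict[int, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        clusters[label].append(index)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own if j != i) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denominator = max(a, b)
        if denominator > 0:
            total += (b - a) / denominator
    return total / len(points)


# Two well-separated blobs of three points each (invented toy data).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

# Candidate labelings, as a clusterer might return for n_clusters = 2 and 3.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 1, 2, 2, 2],  # splits the first blob unnecessarily
}

scores = {k: silhouette(points, labels) for k, labels in candidates.items()}
best_k = max(scores, key=scores.get)
```

Because no ground-truth labels exist in an unsupervised setting, the score only measures how compact and separated the clusters are. A high value does not guarantee that the segments are meaningful to the business.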
Results & Metrics
Successfully deployed a dynamic text clustering web service
Achieved clearer cluster separation with Sentence Transformer embeddings than with traditional TF-IDF vectors
Reduced manual text categorization effort to near-zero
Lessons Learned
Sentence Transformers significantly outperform sparse vector approaches for semantic clustering
Gaussian Mixture Models provide useful probabilistic cluster assignments for borderline data points
Exposing ML pipelines via FastAPI requires careful handling of model loading and memory management
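The point about probabilistic assignments can be illustrated with a small worked example. The sketch below computes the posterior probability (responsibility) of each mixture component for a single point. It is deliberately simplified: it uses one-dimensional data and invented mixture parameters rather than real embeddings, and the function names are hypothetical. With scikit-learn's GaussianMixture, the equivalent values come from predict_proba, which works in log space for numerical stability.

```python
import math
from typing import List, Sequence


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a one-dimensional normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def responsibilities(
    x: float,
    weights: Sequence[float],
    means: Sequence[float],
    stds: Sequence[float],
) -> List[float]:
    """Posterior probability of each component given x (Bayes' rule)."""
    weighted = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(weighted)
    return [value / total for value in weighted]


# Invented two-component mixture: equal weights, unit variance, means 0 and 4.
weights = [0.5, 0.5]
means = [0.0, 4.0]
stds = [1.0, 1.0]

clear_case = responsibilities(0.5, weights, means, stds)  # close to component 0
borderline = responsibilities(2.0, weights, means, stds)  # halfway between means
```

A hard clusterer such as K-Means would force the borderline point into one cluster, while the mixture model reports that it belongs about equally to both, which a profiling service could surface as low confidence.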