Approaches for Training Models on Binary Categories with Sub-Categories

·Jun 08, 2024 01:49 PM

Hello folks, I'm trying to train a model on binary categories, each of which comprises multiple sub-categories. Any ideas on what type of approach could work well for this kind of top-level classification? Also, are there ways to filter/clean these larger binary datasets where each category consists of multiple sub-categories? Think of a Priority vs. Other classification where, within both Priority and Other, there could be many sub-categories.

4 comments


Jason
·
Are you trying to classify the sub-categories, the categories, or both? I could imagine doing this a couple of ways, with a single model or more than one model, depending on your goals. Are you using XGBoost? If the categories are text-based, you might want to consider OpenAI for cleaning up the category groups.
Deepanshu
·
Hey Jason, I'm doing both through two different systems. The second (sub-category) system is based on a fine-tuned SBERT model. The issue I'm facing is with the binary classifier at the top, which currently uses LinearSVC; it's not performing well in prod.
Deepanshu
·
Jason, any thoughts/ideas on this?
Jason
·
It sounds like you are doing SBERT -> Embedding -> LinearSVC (classify). It's hard to say without seeing the data. The steps I would probably take are: first check whether the embeddings generated by SBERT actually capture information that separates the data, then check whether the chosen classifier (in your case a linear one, I think) can actually work well on them. I would make sure the embeddings out of SBERT give you solid separation for the sub-categories. You could do a quick visual check with UMAP, colored by category, to see if there are clear groups, or use something more quantifiable (k-means inertia, etc.). If the embeddings show clear separation by class, then the question becomes whether the clusters are discernible by a linear classifier: are you overfitting, or just hitting the limit of the classifier you chose? You might want to test different classifiers.
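
A rough sketch of the two checks described above, assuming scikit-learn is available. Everything specific here is an assumption for illustration: the embeddings are synthetic blobs standing in for real SBERT output, the mapping of sub-categories 0-2 to "Priority" is hypothetical, and the silhouette score is used as one quantifiable separation measure in place of k-means inertia. The helper names `separation_report` and `compare_classifiers` are made up for this example.

```python
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC, LinearSVC

# Stand-in for SBERT embeddings (assumption): 6 synthetic sub-category
# clusters in 32 dimensions. Replace X with your real embedding matrix.
X, sub_labels = make_blobs(
    n_samples=600, centers=6, n_features=32, cluster_std=2.0, random_state=0
)

# Hypothetical mapping: sub-categories 0-2 are "Priority" (1), 3-5 are "Other" (0).
top_labels = (sub_labels < 3).astype(int)


def separation_report(X, sub_labels, top_labels):
    """Silhouette score of the embeddings under each labelling (range -1 to 1)."""
    return {
        "sub_category": float(silhouette_score(X, sub_labels)),
        "top_level": float(silhouette_score(X, top_labels)),
    }


def compare_classifiers(X, y, cv=5):
    """Mean cross-validated F1 for a few candidate top-level classifiers."""
    candidates = {
        "linear_svc": LinearSVC(max_iter=10000),
        "logistic_regression": LogisticRegression(max_iter=1000),
        "rbf_svc": SVC(kernel="rbf"),  # non-linear baseline for comparison
    }
    return {
        name: float(cross_val_score(clf, X, y, cv=cv, scoring="f1").mean())
        for name, clf in candidates.items()
    }


report = separation_report(X, sub_labels, top_labels)
scores = compare_classifiers(X, top_labels)
print(report)
print(scores)
```

One thing this kind of check can surface: a top-level class made of several distinct sub-category clusters may score much lower on separation than the sub-categories themselves, even when the embeddings are fine. If a non-linear classifier clearly beats the linear ones in cross-validation, that would hint that the top-level boundary is not linear in embedding space; if all of them score well offline but production is poor, the gap is more likely in the data (label noise or distribution shift) than in the classifier.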