Hello folks, I'm trying to train a model on binary categories, each of which comprises multiple sub-categories. Any ideas on what type of approach could work well for such top-level classification? Also, are there ways to filter or clean these larger binary datasets where each category consists of multiple sub-categories? Think of Priority vs. Other classification, where Priority and Other could each contain many sub-categories.
Are you trying to classify the sub-categories, the categories, or both? I could imagine doing this a couple of ways, with more than one model or a single model, depending on your goals. Are you using XGBoost? If the categories are text-based, you might want to consider OpenAI for cleaning up the category groups.
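To make the "single model vs. more than one model" idea concrete, here is a rough sketch, not a recommendation. Everything in it is an assumption for illustration: the data is synthetic (`make_blobs` standing in for real features), the `SUB_TO_TOP` mapping of sub-categories to Priority/Other is hypothetical, and scikit-learn's `LogisticRegression` is used in place of XGBoost just to keep it short.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 4 sub-categories in 16 dimensions (assumption).
X, sub = make_blobs(n_samples=600, centers=4, n_features=16,
                    cluster_std=3.0, random_state=1)

# Hypothetical mapping, indexed by sub-category id:
# sub-categories 0 and 1 -> Priority (1), 2 and 3 -> Other (0).
SUB_TO_TOP = np.array([1, 1, 0, 0])
top = SUB_TO_TOP[sub]

X_tr, X_te, sub_tr, sub_te, top_tr, top_te = train_test_split(
    X, sub, top, test_size=0.25, stratify=sub, random_state=1)

# Option A: a single binary model trained directly on the top-level labels.
direct = LogisticRegression(max_iter=1000).fit(X_tr, top_tr)
f1_direct = f1_score(top_te, direct.predict(X_te))

# Option B: a multiclass model on the sub-categories, with its
# predictions rolled up to the top-level label afterwards.
sub_clf = LogisticRegression(max_iter=1000).fit(X_tr, sub_tr)
rolled = SUB_TO_TOP[sub_clf.predict(X_te)]
f1_rolled = f1_score(top_te, rolled)

print(f"direct binary F1:      {f1_direct:.3f}")
print(f"rolled-up multiclass F1: {f1_rolled:.3f}")
```

Which option does better depends on the data: the roll-up only helps if the sub-category labels are reliable and the sub-categories are actually distinguishable.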
It sounds like you are doing SBERT -> embedding -> LinearSVC (classify). It's hard to say without seeing the data. The steps I would probably take are first checking whether the embeddings generated by SBERT actually extract information that separates the data, and then checking whether the chosen classifier (in your case, I think, a linear one) can actually work well. I would make sure the embeddings out of SBERT give you solid separation for the sub-categories. You could do a quick visual check with UMAP colored by category to see if there are clear groups, or use a more quantifiable technique (k-means inertia, etc.). If the embeddings show clear separation based on your classes, then the question is whether the clusters are discernible by a linear classifier, and whether you are overfitting or just doing the best you can with the chosen classifier. You might want to test different classifiers.
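The checks above could be sketched roughly like this. Treat it as a minimal example under stated assumptions: the embeddings are synthetic (`make_blobs` standing in for SBERT output), the Priority/Other mapping is hypothetical, the silhouette score is swapped in as the quantifiable separation measure instead of k-means inertia, and the random forest is just one arbitrary non-linear classifier to compare against LinearSVC.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

# Stand-in for SBERT embeddings: 4 sub-categories in 32 dimensions (assumption).
X, sub = make_blobs(n_samples=400, centers=4, n_features=32,
                    cluster_std=2.0, random_state=0)
# Hypothetical roll-up: sub-categories 0 and 1 are Priority (1), the rest Other (0).
y = np.isin(sub, [0, 1]).astype(int)


def separation_report(X, sub, y):
    """Return silhouette scores for both label levels and CV F1 per classifier."""
    # Silhouette is in [-1, 1]; higher means tighter, better-separated groups.
    sil_sub = silhouette_score(X, sub)  # separation by sub-category
    sil_top = silhouette_score(X, y)    # separation by binary label
    classifiers = {
        "linear_svc": LinearSVC(),
        "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    }
    # 5-fold cross-validated F1 on the binary label for each classifier.
    scores = {
        name: cross_val_score(clf, X, y, cv=5, scoring="f1").mean()
        for name, clf in classifiers.items()
    }
    return sil_sub, sil_top, scores


sil_sub, sil_top, scores = separation_report(X, sub, y)
print(f"silhouette (sub-categories): {sil_sub:.3f}")
print(f"silhouette (binary label):   {sil_top:.3f}")
for name, f1 in scores.items():
    print(f"{name}: mean CV F1 = {f1:.3f}")
```

If the sub-category silhouette is low, the problem is likely in the embeddings rather than the classifier. If it is high but the linear model trails a non-linear one under cross-validation, the classes may simply not be linearly separable.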
