We're classifying a large set of documents into a few categories and trying to understand how much data needs to be labelled by annotators. So far we took a sample of 100k docs out of some 40M and have had around 52k of them labelled. A few questions:
Is there a better data-science-y/analytical way to determine which categories to label further, and how many? I thought about embedding these large annotated documents with some embedding model and clustering them.
Secondly, is there a better embedding model or technique to get richer embeddings from these long texts to use for document classification?
Deepanshu, is your goal to build a classifier and generate labels for the documents? Or is it being used in a more general LLM RAG retrieval system?

1) A lot of NLP teams are moving toward using LLMs as classifiers given their flexibility and strength, but this depends heavily on your use case. If you have low volume, I would probably start with an LLM: ask GPT-4 to "classify your documents into classes [X, Y, Z] based on a description". If you have a high volume of classifications you need to generate daily (100k to millions+), the approach you are thinking of (create an embedding, then cluster, or add a classifier on top) is an option. In terms of how much to label: in the world of using pre-trained LLMs as classifiers, the number of labels you need is orders of magnitude smaller than if you are trying to build a training set. The labeling is just to benchmark your system, and can be thousands of samples depending on the number of categories. So the question is: are you training something, or is the output of the LLM/embedding good enough to classify/label without training? That will depend on system volume.

2) Given the latest context windows and strengths of LLMs, I would try the text-embedding-3-small and text-embedding-3-large models that replace Ada. If you are using an LLM as the final classifier & embedder (that is what goes on internally), embeddings won't matter. If you are building a classifier at the end of the embedding stage, you can get fancier with embeddings, but start simple. ColBERT is a search-and-retrieval example of how complex you can get.
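To make point 1 concrete, here is a minimal sketch of the zero-shot "ask an LLM to classify" approach. The category names, prompt wording, and truncation limit are placeholders I've invented for illustration, not anything from this thread; the actual API call is only hinted at in a comment since it depends on your provider.

```python
# Sketch of the zero-shot LLM-classifier idea.
# Categories are hypothetical placeholders based on the doc types mentioned later.
CATEGORIES = ["medical report", "medical bill", "police report", "other"]

def build_classification_prompt(doc_text: str, categories=CATEGORIES) -> str:
    """Build a zero-shot classification prompt for a chat LLM."""
    labels = ", ".join(categories)
    return (
        f"Classify the document below into exactly one of these categories: {labels}.\n"
        "Answer with the category name only.\n\n"
        f"Document:\n{doc_text[:4000]}"  # naive truncation to respect context limits
    )

def parse_label(llm_answer: str, categories=CATEGORIES) -> str:
    """Map the raw LLM answer back to a known category; fall back to 'other'."""
    answer = llm_answer.strip().lower()
    for cat in categories:
        if cat in answer:
            return cat
    return "other"

# The actual call would look roughly like this (requires the openai package and a key):
#   response = client.chat.completions.create(
#       model="gpt-4", messages=[{"role": "user", "content": prompt}])
#   label = parse_label(response.choices[0].message.content)
```

The benchmarking point above then just means running this over your few thousand expert-labelled docs and comparing `parse_label` output against the annotations.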
Sounds like a fun problem to play around with. I've done similar stuff in the past, pre-LLMs, and there is some traditional ML work around the optimal number of clusters and so on, but I haven't really followed how the field has progressed in the past 3+ years. I can give some pointers on what you may want to follow up on. It should be fairly easy to take a small subset of your data and try this approach, since it's still a pretty good way to learn about embeddings imo (also def try Phoenix to see some sick plots of the embeddings). That said, the embedding-and-clustering route is fairly out of date and not really the norm anymore, so I'd really recommend following Jason's advice on using LLMs for this, since they're great at unseen classification tasks.

Relevant links:
- Embedding Leaderboard
- sk-learn clustering performance docs
- Elbow Method
- Rand Index
- https://phoenix.arize.com/
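For the "optimal number of clusters" pointer above, a quick sketch of the elbow method with scikit-learn: run k-means across a range of k and look for where the inertia curve flattens. Random vectors stand in for real document embeddings here, so the numbers are synthetic.

```python
# Elbow method sketch: fit k-means for several k and track inertia.
# Synthetic vectors stand in for real document embeddings.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Three well-separated synthetic "topic" clusters in a 16-dim embedding space.
embeddings = np.vstack([
    rng.normal(loc=center, scale=0.5, size=(100, 16))
    for center in (0.0, 5.0, 10.0)
])

ks = range(2, 8)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(embeddings)
    inertias.append(km.inertia_)

# Inertia always decreases as k grows; the "elbow" is where the drop flattens.
drops = [inertias[i] - inertias[i + 1] for i in range(len(inertias) - 1)]
# Here the largest drop is from k=2 to k=3, matching the 3 true clusters.
```

On real embeddings the elbow is rarely this clean, which is partly why metrics like the Rand Index (against your 52k expert labels) are linked above as a complement.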
If you have 40M docs, I would look into some of the open-source models and either use them as-is with good prompting, or build a multi-step pipeline with some chain-of-thought / guardrails if cost is a factor, since GPT-4 is $$. You could also consider fine-tuning one of them, or fine-tuning a traditional non-LLM transformer for classification (just know the pros and cons: having to retrain, not really being able to reuse it if you add more categories, etc.).
Cheap, high-scale and fast: a pretty common setup is BERT embeddings (average the chunk embeddings to get a doc embedding, if you need to break the doc into chunks) plus a supervised classifier on top of the doc embeddings.
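A rough sketch of that chunk-average-then-classify setup. To keep it self-contained, a `HashingVectorizer` stands in for a real BERT encoder (in practice you'd swap in something like sentence-transformers); the toy docs and labels are invented placeholders for the annotated corpus.

```python
# Sketch: average per-chunk embeddings into a doc embedding, then train a
# supervised classifier on top. HashingVectorizer is a stand-in for BERT.
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

encoder = HashingVectorizer(n_features=256)  # placeholder for a BERT encoder

def doc_embedding(doc: str, chunk_words: int = 50) -> np.ndarray:
    """Split a long doc into fixed-size word chunks, embed each, and average."""
    words = doc.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, max(len(words), 1), chunk_words)]
    vecs = encoder.transform(chunks).toarray()
    return vecs.mean(axis=0)

# Toy labelled docs (placeholders for the expert-annotated sample).
docs = [
    "patient diagnosis xray hospital discharge summary",
    "invoice amount due payment hospital billing statement",
    "officer incident case number report filed station",
    "diagnosis treatment patient clinical notes followup",
]
labels = ["medical report", "medical bill", "police report", "medical report"]

X = np.vstack([doc_embedding(d) for d in docs])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```

The averaging step is the lossy part: it works well when a doc is topically uniform, less well when one chunk (say, a billing table inside a medical report) dominates the signal.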
Thanks Jason. I was trying not to use an LLM at this stage, given it wouldn't be possible to customise, and secondly we are collecting training data with domain experts labelling it. A bit more context: we had a decent document classifier system, a multi-step pipeline with first an inlier/outlier classifier into some priority categories and then a simple ML classifier. The issue was that it worked well but not superbly; there were many categories where the model would get confused, etc. So now, as we label more data, we would like to utilise this data in better ways. The documents are things like medical reports, medical bills, police reports, etc.
