hey all! i’m trying to pull out structured data from an (unstructured) text transcript of a meeting. any advice on what best practices are? first thing that comes to mind is just using LLMs and prompting to get desired outputs. but would love to hear what tools y’all have found to be best when pulling structured data (dictionary with desired keys and values of data) out of unstructured data
Chaninder R. I'll let some others jump in as well, but if you decide to go the LLM route - which I think is totally valid - then I'd recommend:
Doing as much preprocessing of your text as you can to create a standardized format. Not sure what your text looks like, but the more noise you can remove from it before you pass it into your model, the better job your model will do
If you need the output in a structured format, using a model's structured outputs feature, or a library like instructor, generally works better than just asking the model in the prompt to respond in a structured format
We have a small example that shows experimenting with different models for structured text extraction. The experiment starts with a predefined golden dataset and uses it to benchmark how well each model does at the task. Could be a good starting point, even if you just want to pull out the "Set Up LangChain Task" section, which is the part that actually defines the extraction tool
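To make the preprocessing and golden-dataset points above a bit more concrete, here's a rough stdlib-only sketch. Everything in it is an assumption for illustration: the noise patterns, the `GOLDEN` examples, and `toy_extract`, which is a regex stand-in for whatever model call you end up using. It is not the example linked above.

```python
import re
from typing import Callable

# Hypothetical noise patterns; adjust to whatever your transcripts actually contain.
TIMESTAMP = re.compile(r"\[\d{1,2}:\d{2}(?::\d{2})?\]")
FILLER = re.compile(r"\b(?:um+|uh+)\b[,.]?\s*", re.IGNORECASE)


def preprocess(text: str) -> str:
    """Strip bracketed timestamps and filler words, then collapse whitespace."""
    text = TIMESTAMP.sub(" ", text)
    text = FILLER.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


def toy_extract(text: str) -> dict:
    """Stand-in for a model call: only understands 'set the X to Y'."""
    return dict(re.findall(r"set the (\w+) to (\w+)", text, re.IGNORECASE))


def field_accuracy(extract: Callable[[str], dict], golden: list) -> float:
    """Fraction of expected fields, over all examples, that were extracted exactly."""
    correct = total = 0
    for text, expected in golden:
        predicted = extract(preprocess(text))
        for key, value in expected.items():
            total += 1
            correct += predicted.get(key) == value
    return correct / total if total else 0.0


# Made-up golden dataset: (raw transcript snippet, expected fields).
GOLDEN = [
    ("[00:01:23] um, set the budget to 5000", {"budget": "5000"}),
    ("[00:02:10] uh set the owner to Dana and set the status to blocked",
     {"owner": "Dana", "status": "blocked"}),
    ("[00:03:05] change the deadline to Friday", {"deadline": "Friday"}),
]

if __name__ == "__main__":
    print(field_accuracy(toy_extract, GOLDEN))
```

Swapping `toy_extract` for a function that calls each candidate model gives you one comparable score per model on the same data. The third example is there on purpose: the toy extractor misses it, which is the kind of gap a golden dataset is meant to surface.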
thank you for this - i spent some time looking thru options, this helps a lot. the text will be a few seconds of instructions on what data to update in my structured data dictionary. i have found 4o w/ prompting and testing edge cases to be pretty reliable
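For the "update my dictionary from a short instruction" case, one thing that may help with edge cases is validating the model's reply before merging it in. A minimal sketch, assuming the model is prompted to reply with a JSON object of just the fields to change; the `ALLOWED_KEYS` schema and the `apply_update` name are hypothetical:

```python
import json

# Hypothetical schema: the keys the dictionary may contain and their expected types.
ALLOWED_KEYS = {"owner": str, "budget": int, "status": str}


def apply_update(state: dict, raw_reply: str) -> dict:
    """Parse a JSON reply and merge it into a copy of state, rejecting anything off-schema."""
    try:
        update = json.loads(raw_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(update, dict):
        raise ValueError("reply must be a JSON object")
    for key, value in update.items():
        expected = ALLOWED_KEYS.get(key)
        if expected is None:
            raise ValueError(f"unexpected key: {key!r}")
        # Exact type check so that e.g. true is not accepted where an int is expected.
        if type(value) is not expected:
            raise ValueError(f"{key!r} should be {expected.__name__}")
    return {**state, **update}


if __name__ == "__main__":
    state = {"owner": "Dana", "budget": 5000, "status": "open"}
    print(apply_update(state, '{"budget": 6500}'))
    # Edge cases worth keeping in a test suite: each of these should be rejected.
    for bad in ['{"budgett": 6500}', '{"budget": "6500"}', "Sure! Here is the JSON:"]:
        try:
            apply_update(state, bad)
        except ValueError as err:
            print("rejected:", err)
```

Rejecting a bad reply outright, rather than merging whatever parses, means a misspelled key or a stringified number shows up as an error you can retry on instead of quietly corrupting the dictionary.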
