Hey everyone! Check out our new blog post on the types of LLM guardrails, covering everything from input validation to output monitoring and dynamic guards. Learn how to balance safety and functionality while implementing important measures to protect your LLM applications. Read it here: https://arize.com/blog-course/llm-guardrails-types-of-guards/
LlamaIndex just released a new instrumentation module! Check out some basics and the integration with Arize Phoenix here: https://arize.com/blog/llamaindexs-newly-released-instrumentation-module-and-arize-phoenix-integration/
For my latest collaboration with Arize AI’s Aparna Dhinakaran, we set out to investigate the following question: Given a large set of time series data within the context window, how well can LLMs detect anomalies or movements in the data? In other words, should you trust a stock-picking GPT-4 or Claude 3 agent with your money? To answer, we conducted a series of experiments comparing the performance of large language models at detecting anomalous time series patterns. You don’t want to miss these results (one model clearly stands out): https://arize.com/blog-course/large-language-model-performance-in-time-series-analysis
For my latest piece in partnership with Arize AI co-founder Aparna Dhinakaran, we ran several experiments to evaluate and compare the generation capabilities of GPT-4, Claude 2.1, and Claude 3.0 Opus. A few big takeaways:
- Inherent model behaviors and prompt engineering matter A LOT in RAG systems.
- Simply adding “Please explain yourself then answer the question” to a prompt template significantly improves (more than 2x) GPT-4’s performance on an array of tasks. It seems that when an LLM talks its answers out, it helps it unfold ideas. It’s possible that by explaining, a model is reinforcing the right answer in embedding/attention space.
- The verbosity of a model’s responses introduces a variable that can significantly influence its perceived performance. This nuance may suggest that future model evaluations should consider the average length of responses as a noted factor, providing a better understanding of a model’s capabilities and ensuring a fairer comparison.
Read it: https://arize.com/blog-course/research-techniques-for-better-retrieved-generation-rag/
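To make the prompt tweak concrete, here is a minimal sketch of prepending the “explain yourself” instruction to a RAG prompt template. The template wording and function name are illustrative, not the exact ones used in the experiments:

```python
# Sketch: optionally prepend an "explain first" instruction to a RAG prompt.
# BASE_TEMPLATE and build_prompt are hypothetical stand-ins for a real template.

BASE_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
)

EXPLAIN_FIRST = "Please explain yourself then answer the question.\n\n"

def build_prompt(context: str, question: str, explain_first: bool = False) -> str:
    """Render the prompt, optionally adding the explain-first instruction."""
    prefix = EXPLAIN_FIRST if explain_first else ""
    return prefix + BASE_TEMPLATE.format(context=context, question=question)
```

Running the same evaluation set with `explain_first=True` and `explain_first=False` is enough to A/B test the effect on your own tasks.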
Using LLMs to conduct numeric evals, while increasingly popular, is finicky and unreliable. That's the main takeaway of my latest blog with Aparna Dhinakaran, which tackles research on how well several major LLMs -- OpenAI's GPT-4, Anthropic's Claude, and Mistral AI's Mixtral-8x7b -- conduct numeric evaluations (in short, not great). TL;DR research takeaways:
Numeric score evaluations across LLMs are not consistent, and small differences in prompt templates can lead to massive discrepancies in results.
Even holding all independent variables (model, prompt template, context) constant can lead to varying results across multiple rounds of testing. LLMs are not deterministic, and some are not at all consistent in their numeric judgements.
We don't believe ChatGPT, Claude, or Mixtral (the three models we tested) handle continuous ranges well enough to use them for numeric score evals.
Read it: https://arize.com/blog-course/numeric-evals-for-llm-as-a-judge/
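The consistency finding above can be reproduced with a very small harness: call the same numeric eval repeatedly and look at the spread of scores. This is a sketch, assuming a hypothetical `score_fn` that wraps whatever LLM you are testing:

```python
import statistics

def numeric_eval_consistency(score_fn, prompt: str, rounds: int = 10) -> dict:
    """Call the same numeric eval repeatedly and summarize the spread.

    score_fn is a stand-in for an LLM call (e.g. GPT-4, Claude, Mixtral)
    that returns a numeric score, say on a 0-10 scale.
    """
    scores = [score_fn(prompt) for _ in range(rounds)]
    return {
        "scores": scores,
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores) if len(scores) > 1 else 0.0,
        "range": max(scores) - min(scores),
    }
```

A nonzero `stdev` or a wide `range` on identical inputs is exactly the kind of inconsistency the post documents.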
Have you heard of the Needle In a Haystack Test? My latest piece in partnership with Aparna Dhinakaran, Co-Founder of Arize AI, covers the ins and outs of the test for evaluating the performance of LLM RAG systems across different sizes of context, summarizing the great work from Greg Kamradt and adding to it with new research. The main takeaways from the research:
Not all LLMs are the same. Models are trained with different objectives and requirements in mind. For example, Anthropic's Claude is known for being a slightly wordier model, which often stems from its objective to not make unsubstantiated claims.
Because of this, minute differences in prompts can lead to drastically different outcomes across models. Some LLMs need more tailored prompting to perform well at specific tasks.
When building on top of LLMs – especially when those models are connected to private data – it is necessary to evaluate retrieval and model performance throughout development and deployment. Seemingly insignificant differences can lead to incredibly large differences in performance, and in turn, customer satisfaction.
Read on for more context: https://arize.com/blog-course/the-needle-in-a-haystack-test-evaluating-the-performance-of-llm-rag-systems/
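For a sense of the mechanics, here is a minimal sketch of how a needle-in-a-haystack context can be constructed: bury a known fact at a chosen depth in filler text, then ask the model a question only that fact answers. The function and its word-count proxy for tokens are illustrative:

```python
def build_haystack(filler: str, needle: str, target_words: int, depth_pct: float) -> str:
    """Bury `needle` at roughly `depth_pct` (0-100) of a context of
    about `target_words` words, returning the full haystack string.

    Word count is used as a crude proxy for token count here.
    """
    base = filler.split()
    words = (base * (target_words // max(len(base), 1) + 1))[:target_words]
    insert_at = int(len(words) * depth_pct / 100)
    words.insert(insert_at, needle)
    return " ".join(words)
```

Sweeping `target_words` (context size) and `depth_pct` (needle position) yields the grid of retrieval scores the test is known for.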
Despite its seeming simplicity, there is a lot to unpack about effective prompt engineering. For my latest collaboration with Arize AI co-founder and Chief Product Officer Aparna Dhinakaran, we set out to create the ultimate intro to effective prompt engineering for developers — covering templating, common tools (e.g. prompt registries and playgrounds), evaluation of prompts, and more! Includes code-along examples. Read it: https://arize.com/blog-course/evaluating-prompt-playground/
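As a flavor of the prompt-evaluation idea, here is a toy sketch that scores competing template variants against a small labeled set. The `llm` callable, template names, and examples are all hypothetical placeholders:

```python
# Toy sketch: compare prompt template variants on (input, expected) pairs.
# `llm` is a stand-in for a real model call.

def evaluate_templates(llm, templates: dict, examples: list) -> dict:
    """Return accuracy per template over (input, expected) pairs."""
    results = {}
    for name, template in templates.items():
        correct = sum(
            llm(template.format(input=x)) == expected
            for x, expected in examples
        )
        results[name] = correct / len(examples)
    return results
```

Swapping the stub for a real client turns this into the kind of side-by-side template comparison a prompt playground automates.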
Hey everybody! Really enjoyed collaborating with Jason Erick S. and the Arize team on this piece on the efficacy of different retrieval approaches and best practices around things like chunk size and chunking strategy. Includes both results and test scripts, so you can parameterize retrieval on your own docs, measure performance with LLM evaluations, and reproduce results with a repeatable framework. Read it here: https://arize.com/blog-course/evaluation-of-llm-rag-chunking-strategy/
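A minimal sketch of the kind of chunking parameterization explored in the piece: fixed-size character chunks with overlap, so retrieval quality can be compared across settings. The default values are illustrative, not recommendations:

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list:
    """Split text into fixed-size character chunks with overlap.

    Varying chunk_size and overlap lets you compare retrieval
    performance across chunking strategies on your own docs.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]
```

Indexing each setting's chunks separately and scoring retrieval with LLM evals is the repeatable loop the post describes.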
