Trusting your LLM-as-a-Judge
The problem with using LLM judges is that it’s hard to trust them. If an LLM judge rates your output as “clear”, how do you know what it means by clear? How clear is clear for an LLM? What kinds of things does it let slide? Or how reliable is it over time?
In this post, I’m going to show you how to align your LLM judges so that you can trust them to some measurable degree of confidence. I’m going to do this with as little setup and tooling as possible, and I’m writing it in TypeScript, because there aren’t enough posts about this for non-Python developers.
Step 0 — Setting up your project
Let’s create a simple command-line customer support bot. You ask it a question, and it uses some context to respond with a helpful reply.
mkdir SupportBot
cd SupportBot
pnpm init
Install the necessary dependencies (we’re going to use the ai-sdk and evalite for testing).
pnpm add ai @ai-sdk/openai dotenv...
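With the dependencies in place, here’s a rough sketch of what the bot could look like. The file name supportBot.ts, the gpt-4o-mini model, and the hard-coded context are just placeholders of mine, and I’m assuming an OPENAI_API_KEY in a .env file; swap in whatever fits your project.

// supportBot.ts: a minimal command-line support bot (a sketch, not the final implementation)
import 'dotenv/config'; // loads OPENAI_API_KEY from .env
import { generateText } from 'ai';
import { openai } from '@ai-sdk/openai';

// Example context the bot can draw on; in a real app this would come from your docs or a retrieval step.
const context = `
Our store ships worldwide. Standard shipping takes 5-7 business days.
Refunds are available within 30 days of purchase with proof of receipt.
`;

async function main() {
  // The user's question is passed as a command-line argument.
  const question = process.argv.slice(2).join(' ');

  const { text } = await generateText({
    model: openai('gpt-4o-mini'),
    system: `You are a helpful customer support agent. Answer using only this context:\n${context}`,
    prompt: question,
  });

  console.log(text);
}

main();

Run it with your TypeScript runner of choice, e.g. npx tsx supportBot.ts "How long does shipping take?" (tsx isn’t in the install command above, so add it or use whatever you already have).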
