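LLM-as-a-judge is the fastest way to scale evaluation and the easiest way to fool yourself. Three biases hurt teams the most: position bias (the judge favors an answer because of where it appears in the prompt, often the first slot), verbosity bias (longer answers win regardless of quality), and self-preference (a model rates its own outputs higher than comparable outputs from other models).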
Counter them with randomized answer order, length-normalized scoring, and a panel of judges drawn from more than one model family. Calibrate every judge against a human-rated subset, and confirm that it agrees with the human labels, before trusting its scores in CI.
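
As a rough illustration of two of these steps, the sketch below runs a pairwise comparison in both answer orders and only accepts a winner when the two verdicts agree, then measures judge-versus-human agreement with Cohen's kappa. Everything named here is an assumption made for the example: the `Judge` signature, the `"first"`/`"second"`/`"tie"` return values, the toy `always_first_judge`, and the choice of kappa as the agreement metric. Swap in your own judge call and whatever agreement threshold your team settles on.

```python
import random
from collections import Counter
from typing import Callable, Optional, Sequence

# Hypothetical judge signature (an assumption for this sketch):
# takes (prompt, first_answer, second_answer), returns "first", "second", or "tie".
Judge = Callable[[str, str, str], str]


def debiased_compare(
    judge: Judge,
    prompt: str,
    answer_a: str,
    answer_b: str,
    rng: Optional[random.Random] = None,
) -> str:
    """Query the judge with both answer orders; return "A", "B", or "tie".

    A winner is reported only when both orderings pick the same answer,
    so a judge that simply prefers one slot ends up with a tie.
    """
    rng = rng or random.Random()
    orders = [
        (answer_a, answer_b, ("A", "B")),
        (answer_b, answer_a, ("B", "A")),
    ]
    rng.shuffle(orders)  # randomize which ordering is asked first

    verdicts = []
    for first, second, labels in orders:
        raw = judge(prompt, first, second)
        if raw == "first":
            verdicts.append(labels[0])
        elif raw == "second":
            verdicts.append(labels[1])
        else:
            verdicts.append("tie")

    return verdicts[0] if verdicts[0] == verdicts[1] else "tie"


def cohens_kappa(judge_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Chance-corrected agreement between judge labels and human labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[k] * human_counts[k] for k in judge_counts) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def always_first_judge(prompt: str, first: str, second: str) -> str:
    """Toy judge with pure position bias, used only to exercise the sketch."""
    return "first"


if __name__ == "__main__":
    print(debiased_compare(always_first_judge, "Q?", "short", "a longer answer"))
    print(cohens_kappa(["A", "B", "A", "B"], ["A", "B", "B", "B"]))
```

Requiring agreement across both orders doubles the judge calls per comparison. The sketch accepts that cost so that position-driven verdicts show up as ties instead of wins. It does not address verbosity bias or self-preference, which need length-aware scoring and a second judge model respectively.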