@fuenfgeld

@fuenfgeld/pydantic-evals

skill

Test and evaluate AI agents and LLM outputs using code-first evaluation framework with strong typing. Use when the user wants to: (1) Create evaluation datasets with test cases for AI agents, (2) Define evaluators (deterministic, LLM-as-Judge, custom, or span-based), (3) Run evaluations and generate reports, (4) Compare model performance across experiments, (5) Integrate evaluations with Pydantic AI agents, (6) Set up observability with Logfire, (7) Generate test datasets using LLMs, (8) Implement regression testing for AI systems.

★ 5MIT

@fuenfgeld/pydantic-evals

@fuenfgeld/pydantic-ai-agents