EFEVALFORGE

AI Engineering, not just AI features

CI/CD for reliable AI agents.

EvalForge is a reliability cockpit for RAG and agent systems. It runs prompt regression evals, structured-output validation, guardrails, traces, cost and latency tracking, and a deploy gate that says PASS or FAIL based on measured results. Compare a baseline RAG chatbot against an engineered system and see the difference live.

BAML-shaped typed outputRAGAS-style metricsLangfuse-shape tracesPydantic schema enforcementGuardrails AI patternsHyDE + rerank

System status

API
offline
Mode
live
Default model
Knowledge chunks
0
Eval threshold (faithfulness)
Eval threshold (citation accuracy)

The pipeline

1

Retrieve

Baseline: weak hash-embedding cosine. Engineered: BM25 + HyDE union + lexical-semantic rerank.

2

Generate

Baseline: bare prompt, raw string. Engineered: BAML-shape system prompt with typed JSON contract enforced by Pydantic parse + 1-shot retry.

3

Guard

PII, prompt injection, jailbreak, refusal correctness, and citation-faithfulness guards run inline. Failures are visible on every trace.

4

Gate

A run of 25 golden questions is scored on 10 metrics. The deploy gate emits PASS/FAIL and exact reasons.

Knowledge base