AI Engineering, not just AI features
CI/CD for reliable AI agents.
EvalForge is a reliability cockpit for RAG and agent systems. It runs prompt regression evals, structured-output validation, guardrails, traces, cost and latency tracking, and a deploy gate that says PASS or FAIL based on measured results. Compare a baseline RAG chatbot against an engineered system and see the difference live.
System status
- API
- offline
- Mode
- live
- Default model
- —
- Knowledge chunks
- 0
- Eval threshold (faithfulness)
- ≥ —
- Eval threshold (citation accuracy)
- ≥ —
The pipeline
Retrieve
Baseline: weak hash-embedding cosine. Engineered: BM25 + HyDE union + lexical-semantic rerank.
Generate
Baseline: bare prompt, raw string. Engineered: BAML-shape system prompt with typed JSON contract enforced by Pydantic parse + 1-shot retry.
Guard
PII, prompt injection, jailbreak, refusal correctness, and citation-faithfulness guards run inline. Failures are visible on every trace.
Gate
A run of 25 golden questions is scored on 10 metrics. The deploy gate emits PASS/FAIL and exact reasons.
Knowledge base