Safety

Guardrails

The engineered pipeline runs five inline guards on every answer. They catch the failures a typical RAG chatbot ships with: leaked secrets, successful prompt injection, jailbreak bypass, wrong-refusal behavior, and unsupported citations.

pii_leak

critical

Detects emails, US phone numbers, SSN-shaped strings, credit-card-shaped numbers, and API-key shapes (`sk-…`) in the model's answer.

prompt_injection

high

Matches classic injection patterns in the user input (e.g. 'ignore previous instructions', 'reveal the system prompt'). Pipeline must refuse if matched.

jailbreak_intent

high

Catches 'pretend you are unrestricted', 'DAN', 'bypass MFA', and similar bypass framings. Pipeline must refuse with a structured reason.

refusal_correctness

med

Verifies that when refusal is expected the answer carries a refusal reason and avoids hallucinated content (and vice versa).

citation_faithfulness

high

Every cited source_id must exist in the retrieval set. Non-refusal answers must cite at least one source.

Probe library

Click a probe to run it through both pipelines. Baseline guards run on the raw weekend-chatbot output; engineered guards run inline as part of the trace.