Scenario
Expected output
Eval suite correctly labels answer faithfulness and gives useful explanations for 24 held-out cases
Dataset files
How the three files work together
manifest.json
Lists all available files and the evaluation contract. Read this first to understand the challenge structure.
dataset.json
The full source data (FAQ entries, documents, issues, etc.) your solution is built on. Index it, embed it, or process it.
test_inputs.json
10–20 test queries the evaluator will run against your solution at submission time. Your code reads this file and writes results.json.
Scoring rubric
Eval report shows label, evidence, and rationale per answer
LLM-as-judge prompt requires quotes or evidence spans from context
Handles partial support, subtle contradictions, and unsupported extra claims
Labels match ground truth for at least 20/24 held-out cases
Faithfulness, contradiction, and omission checks are separate and interpretable
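One way to keep the three checks separate and to enforce the evidence requirement is sketched below. It is a minimal illustration, not the evaluator's format: the names CheckResult, unsupported_quotes, and build_report, and the shape of the report dictionary, are all assumptions made for this example. The idea is that every quote a judge cites is verified to actually occur in the context before the report is accepted.

```python
from dataclasses import dataclass, field


@dataclass
class CheckResult:
    """Outcome of one check (hypothetical structure, not from the brief)."""
    name: str                      # e.g. "faithfulness", "contradiction", "omission"
    passed: bool
    evidence: list[str] = field(default_factory=list)  # quotes claimed to come from the context
    rationale: str = ""


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores layout differences."""
    return " ".join(text.lower().split())


def unsupported_quotes(context: str, quotes: list[str]) -> list[str]:
    """Return the quotes that do not occur in the context after normalization."""
    haystack = normalize(context)
    return [q for q in quotes if normalize(q) not in haystack]


def build_report(answer_id: str, label: str, checks: list[CheckResult], context: str) -> dict:
    """Assemble one per-answer report entry with label, evidence, and rationale.

    Each check is reported on its own, and any cited quote that cannot be
    found in the context is listed under "invalid_evidence".
    """
    return {
        "id": answer_id,
        "label": label,  # label vocabulary is an assumption; use the dataset's own labels
        "checks": [
            {
                "name": c.name,
                "passed": c.passed,
                "evidence": c.evidence,
                "invalid_evidence": unsupported_quotes(context, c.evidence),
                "rationale": c.rationale,
            }
            for c in checks
        ],
    }
```

A non-empty "invalid_evidence" list is a signal that the judge paraphrased or invented a quote, so that verdict can be retried or flagged rather than trusted.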
Language-free evaluation
Build your solution in any language or framework — Python, TypeScript, Go, Rust, Java, C#, or anything else. The dataset artifacts may be in one language; your implementation does not need to match. TryCrucible evaluates the behaviour of your system, the quality of your AI workflow, your verification strategy, and the reproducibility of your submission — not your language choice.
Submission requirements
- A public GitHub repository link
- A Dockerfile in the repo root — any language or framework; the evaluator builds and runs your container
- Your solution reads test_inputs.json from /workspace/test_inputs.json (use the TEST_INPUTS_PATH env var) and writes results.json to /workspace/results.json (use the RESULTS_PATH env var)
- A decisions.md — 3–5 sentences on the key architectural and AI-workflow choices you made
- The system must be fully reproducible — we clone, build, and run it against real test inputs
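The file-handling requirement above can be sketched as follows. This is a minimal example under stated assumptions: the paths and environment variable names come from the requirements, but the evaluate placeholder and the shape of each results entry ("id" plus "output") are hypothetical, since the required results.json schema is not shown here. Check manifest.json for the actual contract.

```python
import json
import os
from pathlib import Path

DEFAULT_INPUTS = "/workspace/test_inputs.json"
DEFAULT_RESULTS = "/workspace/results.json"


def load_inputs(path: str) -> list:
    """Read the list of test cases, each with an "id" and an "input"."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def evaluate(case_input) -> dict:
    """Hypothetical placeholder for the real evaluation logic."""
    return {"label": "unknown"}


def run(inputs_path: str, results_path: str, solve=evaluate) -> list:
    """Process every test case and write the results file.

    The {"id", "output"} entry shape is an assumption for illustration.
    """
    cases = load_inputs(inputs_path)
    results = [{"id": case["id"], "output": solve(case["input"])} for case in cases]
    Path(results_path).parent.mkdir(parents=True, exist_ok=True)
    with open(results_path, "w", encoding="utf-8") as f:
        json.dump(results, f, indent=2)
    return results


def main() -> list:
    """Resolve paths from the environment, falling back to the documented defaults."""
    inputs_path = os.environ.get("TEST_INPUTS_PATH", DEFAULT_INPUTS)
    results_path = os.environ.get("RESULTS_PATH", DEFAULT_RESULTS)
    return run(inputs_path, results_path)
```

The container entrypoint would call main(). Reading the paths from the environment with the documented locations as fallbacks keeps the same code usable both in the evaluator's sandbox and in local runs.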
Evaluation contract
When you submit, the evaluator runs these steps in order:
1. Clone your public GitHub repository
2. Build your container from the Dockerfile in the repo root
3. Mount test_inputs.json at /workspace/test_inputs.json (TEST_INPUTS_PATH env var)
4. Run your solution in a network-isolated sandbox (5 min limit, 512 MB RAM)
5. Read results.json from /workspace/results.json (RESULTS_PATH env var)
6. Score correctness against hidden ground truth, then score architecture, AI workflow, robustness, and clarity
Input (provided by evaluator)
// test_inputs.json
[
{ "id": "t1", "input": { ... } },
{ "id": "t2", "input": { ... } }
]
Output (written by your solution)