TryCrucible — Prove your AI engineering skills

Scenario

You are given 50 research questions, each with 5 source documents. Build a three-stage pipeline: a planner breaks each question into sub-queries, specialist agents process each sub-query, and a synthesiser produces a final answer that contains all required key facts.
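The planner, specialist, and synthesiser stages can be sketched as three plain functions wired together. This is a minimal illustration of the control flow only: the function names, the naive split on " and ", and the word-overlap retrieval are all assumptions standing in for model-backed agents, not part of the challenge.

```python
from typing import List

# Hypothetical stand-ins for LLM-backed agents. Each one is a plain function
# so the planner -> specialist -> synthesiser flow can run without a model.

def plan(question: str) -> List[str]:
    """Planner: split a question into sub-queries (naive split on ' and ')."""
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [p for p in parts if p]

def specialist(sub_query: str, documents: List[str]) -> str:
    """Specialist: return the sentence that shares the most words with the sub-query."""
    query_words = {w.lower() for w in sub_query.split() if len(w) > 3}
    best, best_score = "", 0
    for doc in documents:
        for sentence in doc.split("."):
            words = {w.lower().strip(",") for w in sentence.split()}
            score = len(query_words & words)
            if score > best_score:
                best, best_score = sentence.strip(), score
    return best

def synthesise(findings: List[str]) -> str:
    """Synthesiser: join the non-empty findings into one answer."""
    return ". ".join(f for f in findings if f) + "."

def run_pipeline(question: str, documents: List[str]) -> str:
    sub_queries = plan(question)
    findings = [specialist(q, documents) for q in sub_queries]
    return synthesise(findings)

answer = run_pipeline(
    "Who founded Acme and when was it listed?",
    ["Acme was founded by Jo Park. It was listed in 1999."],
)
print(answer)
```

In a real submission each function would wrap a model call with its own prompt; keeping them as separate callables is what makes the roles easy to test in isolation.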

Expected output

The final synthesis contains all required key facts for at least 8 of 10 test questions.

Dataset files

How the three files work together

manifest.json

Lists all available files and the evaluation contract. Read this first to understand the challenge structure.

dataset.json

The full source data (FAQ entries, documents, issues, etc.) your solution is built on. Index it, embed it, or process it.

test_inputs.json

10–20 test queries the evaluator will run against your solution at submission time. Your code reads this file and writes results.json.

Scoring rubric

Clarity: 5%

Full agent trace is logged and interpretable
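One way to make a trace interpretable is to record one structured event per agent step and emit it as JSON Lines. The sketch below is an assumption about what such a trace could look like; the `Trace` class and its field names are hypothetical, and the challenge does not prescribe a trace format.

```python
import json
import time
from typing import Any, Dict, List

class Trace:
    """Collects one structured event per agent step (hypothetical format)."""

    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def log(self, agent: str, question_id: str, inputs: Any, output: Any) -> None:
        self.events.append({
            "step": len(self.events) + 1,  # 1-based order of execution
            "ts": time.time(),
            "agent": agent,
            "question_id": question_id,
            "input": inputs,
            "output": output,
        })

    def to_jsonl(self) -> str:
        """Serialise the trace as one JSON object per line."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)

trace = Trace()
trace.log("planner", "t1", "Who founded Acme?", ["founder of Acme"])
trace.log("specialist", "t1", "founder of Acme", "Jo Park")
print(trace.to_jsonl())
```

Because every event carries the agent name and question id, a reviewer can filter the log to replay a single question end to end.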

LLM usage: 20%

Each agent has a tight focused prompt, no agent is doing multiple unrelated jobs

Robustness: 10%

Handles sub-agent failures, contradictory source documents, and incomplete retrieval
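Two of these failure modes can be illustrated with small helpers: a wrapper that retries a failing sub-agent and degrades to `None` instead of aborting the run, and a check that flags facts on which source documents disagree. Both helpers, their names, and the `(fact_key, value, doc_id)` claim shape are assumptions for illustration only.

```python
from typing import Callable, Dict, List, Optional, Tuple

def call_with_fallback(agent: Callable[[str], str], sub_query: str,
                       retries: int = 2) -> Tuple[Optional[str], List[str]]:
    """Run an agent; after repeated failure return None plus the error messages."""
    errors: List[str] = []
    for attempt in range(1, retries + 2):  # one initial try plus `retries` retries
        try:
            return agent(sub_query), errors
        except Exception as exc:  # a sub-agent failure must not abort the whole run
            errors.append(f"attempt {attempt}: {exc}")
    return None, errors

def find_conflicts(claims: List[Tuple[str, str, str]]) -> Dict[str, Dict[str, List[str]]]:
    """Group (fact_key, value, doc_id) claims; keep keys with more than one distinct value."""
    by_key: Dict[str, Dict[str, List[str]]] = {}
    for key, value, doc_id in claims:
        by_key.setdefault(key, {}).setdefault(value, []).append(doc_id)
    return {k: v for k, v in by_key.items() if len(v) > 1}

def flaky(sub_query: str) -> str:
    """Stub agent that always fails, to exercise the fallback path."""
    raise TimeoutError("model did not respond")

result, errors = call_with_fallback(flaky, "founder of Acme")
conflicts = find_conflicts([
    ("founding_year", "1998", "doc1"),
    ("founding_year", "1999", "doc3"),
    ("founder", "Jo Park", "doc2"),
])
print(result, len(errors), conflicts)
```

The synthesiser can then decide how to treat a `None` finding or a flagged conflict, for example by stating the disagreement and citing both documents rather than silently picking one.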

Correctness: 25%

Key facts from ground truth appear in final synthesis for >= 8/10 questions
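Since the ground truth is hidden, a local self-check has to rely on key facts you label yourself. The sketch below assumes a case-insensitive, whitespace-normalised substring match; the evaluator's actual matching rule is not stated, so treat this as a rough proxy. The function names are hypothetical.

```python
from typing import Dict, List

def _normalise(text: str) -> str:
    """Lowercase and collapse runs of whitespace."""
    return " ".join(text.lower().split())

def missing_facts(synthesis: str, key_facts: List[str]) -> List[str]:
    """Return the key facts that do not appear in the synthesis (assumed substring match)."""
    text = _normalise(synthesis)
    return [fact for fact in key_facts if _normalise(fact) not in text]

def passes_threshold(missing_by_question: Dict[str, List[str]], required: int = 8) -> bool:
    """True when at least `required` questions have no missing key facts."""
    complete = sum(1 for missing in missing_by_question.values() if not missing)
    return complete >= required

synthesis = "Acme was founded by Jo Park in 1998."
print(missing_facts(synthesis, ["jo park", "1998", "Delaware"]))
```

Running this over your own labelled questions before submitting gives an early signal on whether the synthesiser is dropping facts the specialists found.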

Architecture: 20%

Agent roles are distinct, state handoff between agents is explicit and inspectable
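An explicit handoff can be modelled as a single immutable state object that each agent receives and returns, with a snapshot stored after every step. The `PipelineState` fields and the stub agents below are hypothetical; they show the pattern, not a required design.

```python
import json
from dataclasses import asdict, dataclass, field, replace
from typing import Dict, List

@dataclass(frozen=True)
class PipelineState:
    """Everything one agent hands to the next (hypothetical field names)."""
    question_id: str
    question: str
    sub_queries: List[str] = field(default_factory=list)
    findings: Dict[str, str] = field(default_factory=dict)
    answer: str = ""

def planner(state: PipelineState) -> PipelineState:
    # Stub: a real planner would call a model to decompose the question.
    return replace(state, sub_queries=[state.question])

def specialist(state: PipelineState) -> PipelineState:
    # Stub: a real specialist would read the source documents.
    findings = {q: f"finding for: {q}" for q in state.sub_queries}
    return replace(state, findings=findings)

def synthesiser(state: PipelineState) -> PipelineState:
    return replace(state, answer=" ".join(state.findings.values()))

state = PipelineState(question_id="t1", question="Who founded Acme?")
snapshots = [asdict(state)]
for agent in (planner, specialist, synthesiser):
    state = agent(state)
    snapshots.append(asdict(state))  # inspectable record of each handoff

print(json.dumps(snapshots[-1], indent=2))
```

Because each agent returns a new state instead of mutating shared data, diffing consecutive snapshots shows exactly what each role contributed.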

Language-free evaluation

Build your solution in any language or framework — Python, TypeScript, Go, Rust, Java, C#, or anything else. The dataset artifacts may be in one language; your implementation does not need to match. TryCrucible evaluates the behaviour of your system, the quality of your AI workflow, your verification strategy, and the reproducibility of your submission — not your language choice.

Submission requirements

A public GitHub repository link
A Dockerfile in the repo root — any language or framework; the evaluator builds and runs your container
Your solution reads test_inputs.json from /workspace/test_inputs.json (use the TEST_INPUTS_PATH env var) and writes results.json to /workspace/results.json (use the RESULTS_PATH env var)
A decisions.md — 3–5 sentences on the key architectural and AI-workflow choices you made
The system must be fully reproducible — we clone, build, and run it against real test inputs
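The file-handling requirement above can be sketched as follows. The environment variable names and default paths come from the requirements; the `answer` placeholder and the `"output"` key in each result are assumptions, because the output schema is not shown here. Check manifest.json for the actual contract.

```python
import json
import os
from pathlib import Path
from typing import Any, Dict, List

def load_inputs() -> List[Dict[str, Any]]:
    """Read test inputs from TEST_INPUTS_PATH, falling back to the documented default."""
    path = Path(os.environ.get("TEST_INPUTS_PATH", "/workspace/test_inputs.json"))
    return json.loads(path.read_text(encoding="utf-8"))

def write_results(results: List[Dict[str, Any]]) -> None:
    """Write results to RESULTS_PATH, falling back to the documented default."""
    path = Path(os.environ.get("RESULTS_PATH", "/workspace/results.json"))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results, indent=2), encoding="utf-8")

def answer(item_input: Any) -> str:
    """Hypothetical placeholder for the planner/specialist/synthesiser pipeline."""
    return ""

def main() -> None:
    # ASSUMPTION: one result per input, keyed by the same "id"; the "output"
    # field name is a guess, not taken from the challenge description.
    results = [{"id": item["id"], "output": answer(item["input"])}
               for item in load_inputs()]
    write_results(results)
```

Calling `main()` from the container's entry point keeps the I/O boundary in one place, so the pipeline itself never needs to know about file paths.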

Evaluation contract

When you submit, the evaluator runs these steps in order:

1. Clone your public GitHub repository
2. Build your container from the Dockerfile in the repo root
3. Mount test_inputs.json at /workspace/test_inputs.json (TEST_INPUTS_PATH env var)
4. Run your solution in a network-isolated sandbox (5 min limit, 512 MB RAM)
5. Read results.json from /workspace/results.json (RESULTS_PATH env var)
6. Score correctness against hidden ground truth, then score architecture, AI workflow, robustness, and clarity

// I/O contract — same schema for every challenge

Input (provided by evaluator)

// test_inputs.json
[
  { "id": "t1", "input": { ... } },
  { "id": "t2", "input": { ... } }
]

Output (written by your solution)

What you earn

Complete this — it lives on your profile permanently

Your score, code, and decisions doc become a public, verifiable artifact. Free forever.

Objective score (0–100), AI + human reviewed, public profile link
