Examples and Demos¶
Tepa ships with three runnable demos in the demos/ directory. Each is a standalone project that demonstrates a different pipeline behaviour — autonomous self-correction, single-cycle data analysis, and human-in-the-loop interaction. They're designed to be run, read, and adapted.
Demo Map¶
| Demo | Use Case | Key Concepts |
|---|---|---|
| API Client Generation | Code generation with automated testing | Multi-cycle self-correction, structured expectedOutput, shell_execute |
| Student Progress Analysis | Data analysis and report generation | Single-cycle completion, reasoning steps, scratchpad, data_parse |
| Study Plan Generator | Interactive, human-guided output | postPlanner approval gate, postEvaluator verdict override, async events |
If you're evaluating Tepa for a specific use case, the table above should point you to the most relevant demo. If you're new and just want to see the pipeline in action, start with the API Client Generation demo — it shows the full self-correction loop most clearly.
Before You Run¶
All three demos live in the demos/ directory of the repository. Prerequisites:
# 1. Clone the repo (if you haven't already)
git clone https://github.com/frandi/tepa-ai.git
cd tepa-ai
# 2. Install all dependencies from the root
npm install
# 3. Build all packages
npm run build
# 4. Set your API key
export ANTHROPIC_API_KEY=sk-ant-...
All three demos use the Anthropic provider. Each demo directory has its own run.sh script — always use run.sh rather than npm start directly, because it cleans up any previously generated output files before running so the pipeline always starts from a clean state.
API Client Generation¶
Directory: demos/api-client-gen/
Demonstrates: Autonomous code generation, multi-cycle self-correction, verifiable test-based evaluation
This demo generates a typed TypeScript API client for the JSONPlaceholder API, writes tests, runs them, and self-corrects if they fail. It's fully autonomous — no human input after the initial run.
What It Shows¶
This demo answers the question: what does self-correction actually look like in practice?
The success criterion isn't "did the agent generate code" — it's "do the tests pass." The Evaluator runs npx vitest run and checks the exit code. If the generated code compiles but fails tests (a common failure mode — the agent might use new axios() instead of axios.create() to match the existing project pattern), the Evaluator feeds back exactly what failed and the Planner produces a minimal revised plan. The pipeline fixes only what broke.
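The exit-code check is simple to picture. The sketch below is an illustrative reconstruction, not Tepa's actual Evaluator code; the function name testsPass and the use of execSync are assumptions:

```typescript
import { execSync } from "node:child_process";

// Illustrative sketch: treat a zero exit code from the test runner as "pass".
// testsPass and its signature are hypothetical, not part of Tepa's API.
function testsPass(projectDir: string): boolean {
  try {
    execSync("npx vitest run", { cwd: projectDir, stdio: "pipe" });
    return true; // exit code 0: every test passed
  } catch {
    return false; // non-zero exit code (or spawn failure): something failed
  }
}
```

The interesting property is that the verdict comes from the runtime, not the model: the LLM can be convinced its code is correct, but the exit code can't.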
This is the clearest demonstration of the Plan-Execute-Evaluate loop in the codebase. See How Tepa Works for the conceptual explanation and Pipeline in Detail for the self-correction mechanics.
What Happens¶
- The Planner reads the prompt and produces a plan: explore the project structure, discover code style, probe API endpoints, generate types, generate the client, generate tests, run the tests
- The Executor explores the mock project at my-project/ and finds an HttpClient class in src/utils/http.ts that wraps axios.create()
- It probes the JSONPlaceholder API to discover response shapes, then generates src/api/types.ts, src/api/jsonplaceholder.ts, and src/api/__tests__/jsonplaceholder.test.ts
- It runs npx vitest run via shell_execute to verify the tests pass
- The Evaluator checks whether all three files exist, follow the expected patterns, and all tests pass
- If tests fail, the Evaluator's feedback guides the Planner to produce a minimal fix — typically just correcting the client initialisation
Prompt¶
goal: >
Create a TypeScript API client module for the JSONPlaceholder API
(https://jsonplaceholder.typicode.com).
All file paths are relative to the projectRoot directory specified in context.
context:
projectRoot: ./my-project
language: TypeScript
runtime: Node.js
httpClient: axios
testFramework: vitest
existingStructure: |
my-project/
└── src/
└── utils/
└── http.ts # HttpClient class wrapping axios.create()
expectedOutput:
- path: src/api/jsonplaceholder.ts
description: A typed API client class with methods for posts, users, and comments
criteria:
- Uses axios.create() with baseURL following existing project patterns
- Methods are fully typed with return types
- Exports a default client instance
- path: src/api/types.ts
description: TypeScript type definitions for API response shapes
- path: src/api/__tests__/jsonplaceholder.test.ts
description: Test file using vitest
criteria:
- All tests passing when run with npx vitest run
The criteria entries are what drive self-correction. The requirement that tests must pass when run with npx vitest run means the Evaluator doesn't just check if a test file exists — it verifies the code actually works. See Pipeline in Detail — Structured expectedOutput for how criteria arrays affect evaluation.
Tools¶
| Tool | Purpose |
|---|---|
| file_read | Read existing project files to discover code style |
| file_write | Write generated code files |
| directory_list | Explore the mock project structure |
| file_search | Find files matching patterns |
| shell_execute | Run npx vitest run to verify tests |
| http_request | Probe API endpoints to discover response shapes |
Configuration¶
const tepa = new Tepa({
tools: [
fileReadTool,
fileWriteTool,
directoryListTool,
fileSearchTool,
shellExecuteTool,
httpRequestTool,
],
provider: new AnthropicProvider(),
config: {
limits: {
maxCycles: 3, // Room for one attempt + two correction cycles
maxTokens: 400_000, // Higher budget: code generation steps produce longer outputs
},
logging: { level: "verbose" },
},
});
Self-Correction in Action¶
The pipeline typically completes in 1–2 cycles:
- Cycle 1 — Generates all three files. Tests may fail if the agent doesn't correctly follow the axios.create() pattern from src/utils/http.ts.
- Cycle 2 (if needed) — The Evaluator's feedback identifies exactly which test failed and why. The Planner produces a minimal revised plan — usually just fixing the client initialisation. Tests pass.
The key observation: cycle 2 doesn't re-generate everything. The Planner reads _execution_summary from the scratchpad, sees that types and tests were generated correctly, and produces a plan that only touches the client file. This is minimal revision, not a full restart.
Running¶
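Assuming the setup from Before You Run is complete, a typical invocation from the repository root:

```shell
cd demos/api-client-gen
./run.sh
```

The run.sh wrapper removes any previously generated output before starting, so each run begins from a clean state.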
Student Progress Analysis¶
Directory: demos/student-progress/
Demonstrates: Data pipeline, LLM reasoning steps, scratchpad state management, single-cycle completion
This demo analyses student grade and attendance data for a class, produces a comprehensive insight report, and exports a flagged-students summary. It demonstrates that the self-correction loop adds zero overhead when it isn't needed — a well-defined task with sufficient tools completes in a single cycle.
What It Shows¶
This demo answers two questions: how does Tepa handle multi-step data analysis? and does the pipeline add overhead when self-correction isn't needed?
The answer to the second question is no — the pipeline runs one cycle and exits immediately when the Evaluator passes. There's no retry tax for tasks that are well-specified.
It also demonstrates reasoning steps — plan steps with an empty tools array where the LLM produces analysis as text rather than invoking a tool. Step 4 in the typical plan is a pure reasoning step: the LLM receives the computed metrics from prior steps and generates tailored recommendations for each at-risk student. No tools needed — just synthesis. See How Tepa Works — The Planner for the reasoning step concept.
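In plan terms, a reasoning step is just a step whose tools array is empty. A hypothetical sketch, using the step fields (id, description, tools, dependencies) that the demos' postPlanner hook prints; the specific id and dependency names here are made up:

```typescript
// Hypothetical reasoning step: no tools declared, so the Executor answers
// with pure LLM text instead of invoking a tool.
const reasoningStep = {
  id: "correlate",
  description: "Correlate attendance with performance and draft recommendations",
  tools: [] as string[],     // empty: pure LLM reasoning
  dependencies: ["metrics"], // consumes computed metrics from a prior step
};

const mode =
  reasoningStep.tools.length > 0 ? reasoningStep.tools.join(", ") : "LLM reasoning";
console.log(`${reasoningStep.id}: ${reasoningStep.description} (${mode})`);
// prints "correlate: Correlate attendance with performance and draft recommendations (LLM reasoning)"
```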
What Happens¶
- The Planner produces a plan: read and parse CSVs, compute class metrics, identify at-risk students, correlate attendance with performance (reasoning step), write the report
- The Executor reads grades.csv (1,344 rows — 28 students × 6 subjects × 8 assignments) and attendance.csv (1,764 rows — 28 students × 63 school days) using data_parse
- It computes averages, pass rates, and per-subject trends — carrying intermediate results across steps via the scratchpad tool
- A reasoning step synthesises findings and generates tailored recommendations for each flagged student — pure LLM reasoning, no tool calls
- It writes progress-report.md and flagged-students.csv to the data directory
- The Evaluator performs structural checks (files exist, correct CSV format) and qualitative checks (recommendations are specific, not generic) — passes on the first attempt
Mock Data¶
The CSV files contain realistic data designed to produce meaningful analysis:
- 28 students across 6 subjects (Math, English, Science, History, Art, PE)
- Class average: ~72%, with Math declining and Art/English improving
- 5 at-risk students with distinct patterns: multi-subject failure + chronic absence (Liam Chen, ~54%), isolated Math failure despite strong attendance (Jake Thompson, ~63%), sharpest attendance decline (Aisha Patel, ~67%), and two others
- Strong attendance-performance correlation (~0.73) — the report surfaces this automatically
- One improvement story — Noah Kim (61% → 73.5%) provides a positive data point
Prompt¶
goal: >
Analyze student learning progress for Class 5B (Fall 2025 semester)
and produce an insight report with actionable recommendations.
context:
classDir: ./class-5b
gradeFile: grades.csv
attendanceFile: attendance.csv
gradeFileColumns: student_name, subject, assignment_name, score, max_score, date
attendanceFileColumns: student_name, date, status
studentCount: 28
subjectCount: 6
gradingPolicy:
failing: 60
intervention: 70
notes:
- Parent-teacher conferences are scheduled for next week
- Below 60% is failing, below 70% needs intervention
expectedOutput:
- path: ./class-5b/progress-report.md
description: A comprehensive progress report
criteria:
- Class-wide performance overview with averages and pass rates
- Per-subject trend analysis identifying improving and declining subjects
- Individual student flags for at-risk students
- Correlation between attendance and performance
- Actionable recommendations for each flagged student
- path: ./class-5b/flagged-students.csv
description: Summary CSV of at-risk students for quick reference
criteria:
- Columns include student name, overall percentage, urgency level, attendance rate, primary concern
Notice the notes field in context — "Parent-teacher conferences are scheduled for next week" is domain context that informs the Evaluator's qualitative check. The Evaluator can assess whether the recommendations are appropriately timely and actionable for that context, not just formally complete.
Tools¶
| Tool | Purpose |
|---|---|
| file_read | Read CSV data files |
| file_write | Write report and flagged students CSV |
| directory_list | Explore the data directory |
| data_parse | Parse CSV into structured row objects |
| shell_execute | Run data processing scripts if needed |
| scratchpad | Carry computed metrics between steps |
| log_observe | Record analytical observations to the pipeline log |
The scratchpad is essential in this demo — the Executor computes averages and pass rates in one step, then reads them back in a later reasoning step that generates recommendations. Without the scratchpad, each step would only have access to its declared dependencies' direct output.
Configuration¶
const tepa = new Tepa({
tools: [
fileReadTool,
fileWriteTool,
directoryListTool,
dataParseTool,
shellExecuteTool,
scratchpadTool,
logObserveTool,
],
provider: new AnthropicProvider(),
config: {
limits: {
maxCycles: 3, // Allowed, but typically completes in 1
maxTokens: 250_000, // Lower budget sufficient — no code generation
},
logging: { level: "verbose" },
},
});
Running¶
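From the repository root, after completing the setup in Before You Run:

```shell
cd demos/student-progress
./run.sh
```

As with the other demos, use run.sh rather than npm start so stale report files from a previous run are cleaned up first.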
Study Plan Generator¶
Directory: demos/study-plan/
Demonstrates: Human-in-the-loop interaction, async event callbacks, plan approval, verdict override
This demo shows Tepa's event system used for inserting human decision points into the pipeline. The user provides a learning goal, reviews the generated plan before execution begins, and can accept or reject results after evaluation. No part of the pipeline runs without the user's explicit approval at each checkpoint.
What It Shows¶
This demo answers the question: how do you put a human in control of an autonomous pipeline?
The mechanism is two async event callbacks — one that pauses after planning, one that pauses after evaluation. Because event callbacks can return Promises, the pipeline waits at each checkpoint until the user responds. The pipeline doesn't know or care that it's paused for human input — from its perspective, a callback returned a Promise that eventually resolved. See Event System Patterns — Human-in-the-Loop for the full pattern documentation.
Interactive Flow¶
Here's what a typical run looks like:
=== Tepa Demo: Study Plan (Human-in-the-Loop) ===
What would you like to study?
> Learn Rust programming
--- Plan (4 steps) ---
research: Research Rust learning resources (LLM reasoning)
outline: Create study plan outline (scratchpad)
write: Write detailed study plan (file_write)
review: Review and finalize (file_read)
Do you approve this plan? (yes/no): yes
research: OK — Research Rust learning resources (2140 tok, 3200ms)
outline: OK — Create study plan outline (1800 tok, 2100ms)
write: OK — Write detailed study plan (3200 tok, 4500ms)
review: OK — Review and finalize (1100 tok, 1800ms)
--- Evaluation: PASS (confidence: 0.85) ---
Summary: Study plan covers all criteria with specific weekly topics and resources.
=== Result ===
Status: pass | Cycles: 1 | Tokens: 12,400
If the Evaluator returns fail, the user gets to decide whether to retry or accept:
--- Evaluation: FAIL (confidence: 0.45) ---
Feedback: Missing time estimates for weeks 3-4. Resources are too generic.
Continue with another cycle to improve? (yes/no): no
[User override] Accepting current results.
=== Result ===
Status: pass | Cycles: 1 | Tokens: 11,200
Notice the result: Status: pass even though the Evaluator returned fail. The postEvaluator callback flipped the verdict to "pass" when the user declined to retry — the pipeline treats a user-overridden verdict identically to a genuine evaluator pass.
What Happens¶
- The entry script prompts the user for a learning goal and injects it into the prompt context
- The Planner generates a plan for producing the study plan
- Pipeline pauses — postPlanner callback displays the plan and awaits user approval
- After approval, the Executor writes a detailed study plan to study-plan.md
- The Evaluator checks quality criteria: weekly breakdowns, concrete resources, time estimates
- If fail — postEvaluator callback asks the user whether to run another cycle or accept the results
- If the user accepts, the verdict is overridden to "pass" and the pipeline exits
The Event Callbacks¶
Plan approval gate — postPlanner:
events: {
postPlanner: [
async (data: unknown) => {
const plan = data as Plan;
console.log(`\n--- Plan (${plan.steps.length} steps) ---`);
for (const step of plan.steps) {
const tools = step.tools.length > 0 ? step.tools.join(", ") : "LLM reasoning";
console.log(` ${step.id}: ${step.description} (${tools})`);
}
const answer = await ask("\nDo you approve this plan? (yes/no): ");
if (answer !== "yes" && answer !== "y") {
throw new Error("Plan rejected by user");
}
// Returning void — original plan passes through unchanged
},
],
}
The callback is async and awaits the user's response. The pipeline waits on this Promise. Throwing rejects the plan and aborts the pipeline. Returning nothing approves it as-is.
Verdict override — postEvaluator:
postEvaluator: [
async (data: unknown) => {
const result = data as EvaluationResult;
if (result.verdict === "fail") {
const answer = await ask("Continue with another cycle to improve? (yes/no): ");
if (answer !== "yes" && answer !== "y") {
console.log(" [User override] Accepting current results.\n");
return { ...result, verdict: "pass" as const }; // Flips the verdict
}
}
// Returning void — original verdict passes through, pipeline continues to next cycle
},
],
Returning a modified EvaluationResult with verdict: "pass" stops the re-planning loop. The pipeline checks the verdict after all postEvaluator callbacks complete — a flipped verdict is treated identically to a genuine evaluator pass.
Prompt¶
The prompt uses a placeholder that gets injected at runtime:
goal: >
Create a personalized study plan based on the user's learning goal.
Write the complete study plan to the file specified in context.outputFile.
context:
outputDir: .
outputFile: ./study-plan.md
userInput: "{{USER_INPUT}}"
planFormat: |
Structure as Markdown with:
- Title and overview
- Weekly breakdowns (topics, resources, exercises, time estimates)
- Tips for staying on track
expectedOutput:
- path: ./study-plan.md
description: A comprehensive personalized study plan
criteria:
- Clear title reflecting the user's goal
- Weekly breakdowns with specific topics
- Concrete learning resources (books, courses, websites)
- Practice exercises or projects per week
- Estimated time commitments
The entry script injects the user's input before running:
// `rl` is a node:readline/promises interface; `prompt` is the parsed prompt definition
const userInput = await rl.question("What would you like to study?\n> ");
prompt.context.userInput = userInput;
Tools¶
| Tool | Purpose |
|---|---|
| file_read | Read existing files in the output directory |
| file_write | Write the study plan to study-plan.md |
| directory_list | Explore the output directory |
| scratchpad | Carry research notes and outline between steps |
This demo uses the fewest tools — it's primarily an LLM reasoning task. The scratchpad carries research findings from early steps into the final writing step without requiring an explicit dependency chain on the raw tool outputs.
Configuration¶
const tepa = new Tepa({
tools: [fileReadTool, fileWriteTool, directoryListTool, scratchpadTool],
provider: new AnthropicProvider(),
config: {
limits: {
maxCycles: 3,
maxTokens: 250_000,
},
logging: { level: "verbose" },
},
});
Running¶
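From the repository root, after the setup in Before You Run (this demo is interactive, so run it in a terminal that can accept input):

```shell
cd demos/study-plan
./run.sh
```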
Shared Observability Pattern¶
All three demos register the same three event hooks for pipeline visibility. These aren't required — the pipeline runs identically without them — but they represent a useful baseline for any Tepa integration. Add them when you need to see what's happening; strip them for a clean production run.
postPlanner — Print the generated plan as a step list:
postPlanner: [(data: unknown) => {
const plan = data as Plan;
for (const step of plan.steps) {
const tools = step.tools.length > 0 ? step.tools.join(", ") : "LLM reasoning";
const deps = step.dependencies.length > 0 ? ` ← ${step.dependencies.join(", ")}` : "";
console.log(` ${step.id}: ${step.description} (${tools})${deps}`);
}
}],
postStep — Log each step's result with token and timing data:
postStep: [(data: unknown) => {
const { step, result } = data as PostStepPayload;
const icon = result.status === "success" ? "OK" : "FAIL";
console.log(` ${step.id}: ${icon} — ${step.description} (${result.tokensUsed} tok, ${result.durationMs}ms)`);
if (result.error) console.log(` → ${result.error}`);
}],
postEvaluator — Print the verdict with confidence score and feedback:
postEvaluator: [(data: unknown) => {
const result = data as EvaluationResult;
const icon = result.verdict === "pass" ? "PASS" : "FAIL";
console.log(`--- Evaluation: ${icon} (confidence: ${result.confidence.toFixed(2)}) ---`);
if (result.verdict === "pass" && result.summary) console.log(` Summary: ${result.summary}`);
if (result.verdict === "fail" && result.feedback) console.log(` Feedback: ${result.feedback}`);
}],
These three hooks cover the three moments developers most want visibility into: what is the agent about to do? (plan), how did each step go? (step), and did it work? (evaluation). For the full patterns — adding human approval gates, safety filters, monitoring integration — see Event System Patterns.
What's Next¶
- API Reference — Complete interface definitions for everything used in these demos.
- Configuration — Tune cycle limits, token budgets, and per-stage models for your use case.
- Event System Patterns — Extend the observability hooks above into full human-in-the-loop workflows, safety filters, and monitoring integrations.
- Contributing — Add your own demo, tool, or provider to the repository.