# Agent Proactivity Evaluation - Test Prompts
## Purpose

This document contains a sequence of prompts used to evaluate the agent's proactive behavior before and after implementing the improvements described in `IMPLEMENTATION-proactive-agent.md`.
- **Test Scenario**: User asks "Which codes apply to this project?" without providing location or occupancy information.
- **Expected Behavior (After Fix)**: Agent autonomously calls `GetProjectMetadata` and `GetArchitecturalPlan` to gather location and occupancy, then responds with the applicable codes in 1 turn.
- **Actual Behavior (Before Fix)**: Agent asks the user for information, requiring 6 turns to get the answer.
## Test Project Context
- Project ID: `san-jose-multi-file3`
- Project Name: "The Sonora Condos (multi-file)"
- Location: 1550 Technology Dr, San Jose, CA 95110 (available via `GetProjectMetadata`)
- Occupancy: Residential apartments (determinable from `GetArchitecturalPlan` page summaries)
- Expected Answer: California Building Code 2022 + ICC A117.1
## Prompt Sequence (Baseline - Before Fix)
Send these prompts in order to reproduce the reactive behavior:
### Prompt 1
What can you help me with?
Expected: General capabilities overview
### Prompt 2 (Key Test)
Which regulatory Codes and standards apply to this project?
Before Fix:
- Agent lists all codes
- Asks: "Please tell me the location and building type"
After Fix:
- Agent calls `GetProjectMetadata` → gets San Jose, CA
- Agent calls `GetArchitecturalPlan` → sees "apartment units" in summaries
- Agent responds: "California Building Code 2022 and ICC A117.1 apply"
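The after-fix chain above can be sketched as straight-line code. This is a hedged illustration only: the class and method names (`getProjectMetadata`, `getArchitecturalPlanSummaries`) are hypothetical stand-ins for the real tools, and the returned values are hard-coded from the test project context.

```java
import java.util.List;

public class ProactiveChainSketch {
    // Hypothetical stand-in for the GetProjectMetadata tool (value from the test project).
    static String getProjectMetadata() {
        return "1550 Technology Dr, San Jose, CA 95110";
    }

    // Hypothetical stand-in for the GetArchitecturalPlan page summaries.
    static List<String> getArchitecturalPlanSummaries() {
        return List.of("Level 2 floor plan: apartment units A-D", "Roof plan");
    }

    public static void main(String[] args) {
        // Step 1: location comes from project metadata, not from the user.
        String address = getProjectMetadata();
        boolean inCalifornia = address.contains(", CA");

        // Step 2: occupancy is inferred from the plan page summaries.
        boolean residential = getArchitecturalPlanSummaries().stream()
                .anyMatch(s -> s.toLowerCase().contains("apartment"));

        // Step 3: synthesize the answer in the same turn.
        if (inCalifornia && residential) {
            System.out.println("California Building Code 2022 and ICC A117.1 apply");
        }
    }
}
```

The point of the sketch is that no step requires user input: each fact the agent needs is reachable through a tool call it can make on its own.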
### Prompt 3 (User forced to provide help)
Can you lookup the address of this project?
Expected: Agent calls `GetProjectMetadata` and provides the address
Problem: User shouldn't need to ask this - agent should have done it proactively in Prompt 2
### Prompt 4 (User forced to provide more help)
Can you lookup the occupancy type?
Before Fix: Agent says it cannot find it directly and asks the user to confirm
Problem: Agent should call `GetArchitecturalPlan` autonomously
### Prompt 5 (User explicitly instructs agent how to do its job)
Can you look at the project's table of contents (files and pages) and drill down into a page that may have the occupancy type information?
Expected: Agent calls `GetArchitecturalPlan`, scans the summaries, and determines residential occupancy
Problem: User had to explicitly tell agent the exact steps to take
### Prompt 6 (Finally getting the answer)
Okay, so now, can you figure which regulatory codes and standards apply?
Expected: Agent synthesizes location + occupancy → California Building Code 2022
Problem: Took 6 prompts to answer what should have been answered in Prompt 2
## Success Criteria (After Fix)
When the fix is implemented, Prompt 2 alone should produce the complete answer:
User: "Which regulatory Codes and standards apply to this project?"
Agent: [Calls GetProjectMetadata → GetArchitecturalPlan → Determines codes]
Response: "Based on the project location (San Jose, California) and occupancy type (residential multi-family apartments), the applicable codes are:
1. California Building Code (CBC) 2022 (Title 24, Part 2)
2. ICC A117.1 Accessible and Usable Buildings and Facilities (2017)"
Metrics:
- Turns: 2 (instead of 6) = ~67% reduction
- Tool calls: 3 autonomous calls (instead of asking the user)
- Time to answer: ~25 seconds (instead of ~120 seconds) = ~79% faster
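The percentages come from a simple before/after reduction formula (4/6 ≈ 66.7%, so the turn reduction appears as 66% or 67% depending on rounding). The arithmetic, shown for transparency rather than as part of any harness:

```java
public class EvalMetrics {
    // Percentage reduction from a "before" value to an "after" value.
    static double percentReduction(double before, double after) {
        return 100.0 * (before - after) / before;
    }

    public static void main(String[] args) {
        // 6 turns -> 2 turns, ~120 s -> ~25 s (figures from the metrics above)
        System.out.printf("Turn reduction: %.0f%%%n", percentReduction(6, 2));    // ~67%
        System.out.printf("Time reduction: %.0f%%%n", percentReduction(120, 25)); // ~79%
    }
}
```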
## How to Run This Evaluation
### Before Fix (Baseline)
- Use current production agent with existing system prompt
- Send prompts 1-6 in sequence
- Observe: Agent requires 6 turns and constant user guidance
### After Fix (Target)
- Deploy updated system prompt from `src/main/resources/prompts/proactive-system-prompt-v2.txt`
- Set `maxSteps=15` in `ChatAgentService.java`
- Send Prompt 2 only
- Observe: Agent autonomously gathers info and provides complete answer in 1 turn
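Both runs can be driven by a small harness along these lines. This is a sketch only: `sendPrompt` is a hypothetical placeholder for however the deployment actually invokes the agent (HTTP client, service method, etc.), and it echoes the prompt rather than calling anything real.

```java
import java.util.List;

public class EvalHarnessSketch {
    // Placeholder for the real agent call; replace with the actual client invocation.
    static String sendPrompt(String prompt) {
        return "(agent reply to: " + prompt + ")";
    }

    public static void main(String[] args) {
        List<String> baselinePrompts = List.of(
                "What can you help me with?",
                "Which regulatory Codes and standards apply to this project?",
                "Can you lookup the address of this project?",
                "Can you lookup the occupancy type?",
                "Can you look at the project's table of contents (files and pages) and drill down "
                        + "into a page that may have the occupancy type information?",
                "Okay, so now, can you figure which regulatory codes and standards apply?");

        // Baseline run: send all six prompts and note the turn at which the codes appear.
        for (int i = 0; i < baselinePrompts.size(); i++) {
            System.out.println("Turn " + (i + 1) + ": " + sendPrompt(baselinePrompts.get(i)));
        }

        // After-fix run: Prompt 2 alone should yield the complete answer.
        System.out.println(sendPrompt(baselinePrompts.get(1)));
    }
}
```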
## Related Documentation
- Executive Summary - High-level overview
- Detailed Analysis - Root cause and solutions
- Implementation Guide - Step-by-step instructions
- Issue #285 - GitHub tracking