Testing Ask CDP AI

This repo includes a repeatable workflow for testing reviewed Ask CDP AI questions against the local AI server. Use it when improving answer quality, grounding, prompt behavior, or response reliability.

Run these commands from the ai-server directory unless noted otherwise.

Test Data

Reviewed free-text feedback lives in:

bash

comments.csv

Reviewed location fixtures live in:

bash

org-data/<orgId>.json

The fixtures are endpoint-shaped location profiles fetched from the backend path GET /api/v1/locations/id/{organization_id}. They are used as locationData when replaying questions.

Approved follow-up question candidates live in:

bash

app/data/approved_follow_up_questions.json

Run One Question

Use this when tuning a specific bad answer:

bash

script/run-test-question 831391 "has the city ranked the hazards in their disclosure?"

The script:

Starts a temporary local location stub backed by org-data.
Starts the AI server on a free local port.
Sends the question to /v1/chat/completions.
Prints the assistant response.
Cleans up both temporary servers.

It reads API keys from .env. Transient failures are retried by default.

Optional retry controls:

bash

TEST_QUESTION_RETRIES=4 TEST_QUESTION_BACKOFF=2 \
  script/run-test-question 10894 "tell me about climate mitigation in LA"

Run A Batch

Start the API server separately:

bash

uv run uvicorn app.main:app --host 127.0.0.1 --port 8088

In another shell, run reviewed comment cases:

bash

uv run python utility_scripts/run_chat_eval.py \
  --comments-csv comments.csv \
  --org-data-dir org-data \
  --output tmp-reviewed-question-results.json

For transient upstream issues:

bash

uv run python utility_scripts/run_chat_eval.py \
  --comments-csv comments.csv \
  --org-data-dir org-data \
  --retries 2 \
  --retry-backoff 1 \
  --output tmp-reviewed-question-results.json

To inspect normalized payloads without calling the LLM:

bash

uv run python utility_scripts/run_chat_eval.py \
  --comments-csv comments.csv \
  --org-data-dir org-data \
  --dry-run-summary \
  --limit 5

Focus A Subset

Run only a few cases:

bash

uv run python utility_scripts/run_chat_eval.py \
  --comments-csv comments.csv \
  --org-data-dir org-data \
  --limit 5

Filter by location text:

bash

uv run python utility_scripts/run_chat_eval.py \
  --comments-csv comments.csv \
  --org-data-dir org-data \
  --location-filter "Rio"

Improving A Specific Result

Recommended loop:

Pick a reviewed case from comments.csv.
Run it with script/run-test-question.
Compare the answer with reviewer feedback.
Make the smallest targeted change.
Rerun the single question.
Rerun the batch or relevant filtered subset.
Run pytest.

Common fix locations:

ai-server/app/prompts/system_prompt.md: user-facing behavior, refusal style, grounding rules, terminology.
ai-server/app/prompts/suggest_follow_ups.md: follow-up selection behavior.
ai-server/app/prompts.py: prompt loading and location-context formatting.
ai-server/app/follow_ups.py: follow-up candidate selection and request shaping.
ai-server/utility_scripts/run_chat_eval.py: batch case loading, retry behavior, normalized eval payloads.
ai-server/comments.csv: reviewed question source.
ai-server/org-data: endpoint-shaped reviewed location fixtures.

Local prompt files are re-read on each request, so edits to app/prompts/system_prompt.md should affect the next local API call without a server restart. Remote prompts loaded through the SYSTEM_PROMPT URL are cached for SYSTEM_PROMPT_CACHE_SECONDS.

Automated Tests

Run all AI server tests:

bash

uv run pytest

Relevant test files:

ai-server/tests/test_chatbot.py: chat behavior and error mapping.
ai-server/tests/test_follow_ups.py: follow-up selection.
ai-server/tests/test_openai_api.py: OpenAI-compatible API contract.
ai-server/tests/test_prompts.py: prompt loading and formatting.
ai-server/tests/test_run_chat_eval.py: batch evaluation helper behavior.

Testing Ask CDP AI ​

Test Data ​

Run One Question ​

Run A Batch ​

Focus A Subset ​

Improving A Specific Result ​

Automated Tests ​

Testing Ask CDP AI

Test Data

Run One Question

Run A Batch

Focus A Subset

Improving A Specific Result

Automated Tests