SPOUT provides two prompt testing utilities to evaluate responses across different models and scenarios:
- Prompt Runner: Tests prompts against a single model
- Prompt Gamut: Tests prompts across multiple models
Directory Structure
prompt_runners/
├── prompt_runner.bat # Single model testing (Windows)
├── prompt_runner.sh # Single model testing (Unix)
├── prompt_gamut.bat # Multi-model testing (Windows)
└── prompt_gamut.sh # Multi-model testing (Unix)
Single Model Testing
The prompt runner tests prompts against your currently selected model:
# Windows
./prompt_runner.bat
# Unix
./prompt_runner.sh
Features
- Interactive file/directory selection
- Processes individual files or entire directories
- Timestamps all results
- Records execution time for each prompt
- Maintains consistent conversation context
- Saves detailed response logs
Usage Example
- Run the prompt runner
- Select a file or directory from the list
- If selecting a directory, choose specific file or "All files"
- Results are saved to tests/conversation_test-[timestamp].txt (see the snippet after this list for a quick way to open the latest report)
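The report format itself is shown under Test Results below. As a minimal sketch for finding the newest report on Unix, assuming results land in tests/ as described (these are plain shell helpers, not part of SPOUT):
# Unix: open the most recent single-model report
latest=$(ls -t tests/conversation_test-*.txt | head -n 1)
less "$latest"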
Multi-Model Testing
The prompt gamut tests prompts across all active models defined in models.ini (a hypothetical example of that file is sketched after the commands below):
# Windows
./prompt_gamut.bat
# Unix
./prompt_gamut.sh
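The actual schema of models.ini is defined by SPOUT and is not reproduced here; purely to illustrate the idea of marking models as active, a hypothetical file might look like this (the section and key names are assumptions, not SPOUT's documented format):
; hypothetical models.ini - names and keys are illustrative only
[gpt-3.5-turbo]
active = true

[gpt-4]
active = true

[claude-3-haiku]
active = false   ; inactive models are skipped by the gamut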
Features
- Tests against all active models
- Automatic model switching
- Calculates per-model timing
- Provides summary statistics
- Maintains consistent testing environment
- Supports batch processing
Usage Example
- Run the prompt gamut
- Select prompt file or directory
- If selecting a directory, choose specific file or "All files"
- Results are saved to tests/prompt_gamut-[timestamp].txt (see the snippet after this list for a quick way to compare summaries across runs)
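Each gamut report starts with the performance summary shown under Test Results below; assuming that header text, the summaries from several runs can be compared with a single command on Unix:
# Unix: pull the summary block from every gamut report (assumes the header shown below)
grep -A 4 "Model Performance Summary" tests/prompt_gamut-*.txt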
Test Results
Single Model Results
Results include:
- Timestamp for each test
- Current model information
- Individual prompt responses
- Execution time per prompt
- Clear formatting for analysis
Example output:
Prompting Results for basic.txt using gpt-3.5-turbo - 2024-03-21_14-30-22
===================
Prompt:
How many vowels are in "hello world"?
Response (1234ms):
There are 3 vowels in "hello world": 'e', 'o', 'o'
-------------------
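Because each response line carries its duration, per-prompt timings can be pulled from a report with standard tools. A minimal sketch on Unix, assuming the "Response (1234ms):" format above:
# Unix: list per-prompt durations from the single-model reports
grep -o "Response ([0-9]*ms)" tests/conversation_test-*.txt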
Multi-Model Results
Results include:
- Summary statistics
- Per-model performance
- Total and average durations
- Comparative responses
- Model-specific timing
Example output:
Model Performance Summary - 2024-03-21_14-30-22
=================================
Total Models Tested: 3
Total Duration: 0h 15m 45s
Average Duration: 5m 15s
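The average here is simply the total duration divided by the number of models tested: 15m 45s is 945 seconds, and 945 s / 3 models = 315 s, i.e. the 5m 15s shown above.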
Best Practices
Organizing Prompts
- Group related prompts in directories (see the example layout after this list)
- Use clear, descriptive filenames
- Keep one prompt per line
- Include expected responses
- Test edge cases
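As a concrete illustration of these points, one way to lay out a prompt collection might be (the directory and file names are hypothetical, not part of SPOUT):
prompts/
├── counting/
│   ├── basic.txt        # straightforward prompts, one per line
│   └── edge_cases.txt   # empty strings, very long input, unusual characters
└── summarization/
    └── short_articles.txt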
Running Tests
- Test new prompts against a single model first (see the sketch after this list)
- Use prompt gamut for final validation
- Monitor execution times
- Compare responses across models
- Document unexpected behaviors
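Putting the first two points together, a typical session simply runs the two utilities in order; both are interactive, so this is only the sequence of invocations on Unix:
# Unix: check new prompts against the current model first, then validate across all active models
./prompt_runner.sh
./prompt_gamut.sh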
Analyzing Results
- Review timing patterns (see the sketch after this list)
- Compare model responses
- Look for consistency
- Check for errors
- Track performance trends
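For tracking performance trends, the millisecond markers in the reports can be averaged across runs. A rough sketch on Unix (this is ad-hoc analysis, not a SPOUT feature, and assumes the "Response (1234ms):" format shown earlier):
# Unix: average per-prompt duration across all single-model reports
grep -ho "([0-9]*ms)" tests/conversation_test-*.txt \
  | tr -d '(ms)' \
  | awk '{ total += $1; n++ } END { if (n) printf "%d prompts, %.0f ms average\n", n, total / n }'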
Regular testing across different models identifies which prompts work best with specific models and helps optimize your model selection strategy.