SPOUT provides two prompt testing utilities to evaluate responses across different models and scenarios:

  • Prompt Runner: Tests prompts against a single model
  • Prompt Gamut: Tests prompts across multiple models

Directory Structure

prompt_runners/
├── prompt_runner.bat    # Single model testing (Windows)
├── prompt_runner.sh     # Single model testing (Unix)
├── prompt_gamut.bat     # Multi-model testing (Windows)
└── prompt_gamut.sh      # Multi-model testing (Unix)

Single Model Testing

The prompt runner tests prompts against your currently selected model:

# Windows
./prompt_runner.bat

# Unix
./prompt_runner.sh

Features

  • Interactive file/directory selection
  • Processes individual files or entire directories
  • Timestamps all results
  • Records execution time for each prompt
  • Maintains consistent conversation context
  • Saves detailed response logs

Usage Example

  1. Run the prompt runner
  2. Select a file or directory from the list
  3. If selecting a directory, choose a specific file or "All files"
  4. Results are saved to tests/conversation_test-[timestamp].txt
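
Under the hood, the runner follows a simple pattern: read prompts one per line, time each call, and append formatted results to the timestamped log. The sketch below illustrates that pattern in bash; the spout command it calls is hypothetical, and the real script's internals will differ:

#!/usr/bin/env bash
# Minimal sketch of the single-model runner loop.
# Assumption: a hypothetical `spout` command that sends one prompt to the
# currently selected model and prints the response; real flags will differ.

prompt_file="$1"
timestamp=$(date +%Y-%m-%d_%H-%M-%S)
log="tests/conversation_test-${timestamp}.txt"
mkdir -p tests

while IFS= read -r prompt; do
    [ -z "$prompt" ] && continue          # skip blank lines
    start=$(date +%s%3N)                  # millisecond clock (GNU date)
    response=$(spout "$prompt")           # hypothetical CLI call
    end=$(date +%s%3N)
    {
        echo "Prompt:"
        echo "$prompt"
        echo
        echo "Response ($((end - start))ms):"
        echo "$response"
        echo "-------------------"
    } >> "$log"
done < "$prompt_file"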

Multi-Model Testing

The prompt gamut tests prompts across all active models defined in models.ini:

# Windows
./prompt_gamut.bat

# Unix
./prompt_gamut.sh
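
The exact schema of models.ini is defined by your SPOUT installation; as a rough illustration, an active-model list might look something like this (section names and keys here are hypothetical):

; Hypothetical models.ini layout -- adapt to the real schema
[gpt-3.5-turbo]
active = true

[gpt-4]
active = true

[claude-3-haiku]
; inactive models are skipped by the gamut
active = false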

Features

  • Tests against all active models
  • Automatic model switching
  • Calculates per-model timing
  • Provides summary statistics
  • Maintains consistent testing environment
  • Supports batch processing

Usage Example

  1. Run the prompt gamut
  2. Select prompt file or directory
  3. If selecting a directory, choose a specific file or "All files"
  4. Results are saved to tests/prompt_gamut-[timestamp].txt
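
Conceptually, the gamut wraps the single-model flow in a loop over the active models: switch to each model, time the pass, and accumulate totals for the summary. A rough bash sketch of that pattern (the spout subcommands for listing and switching models are hypothetical, as is passing a file argument to the runner):

#!/usr/bin/env bash
# Rough sketch of the multi-model gamut pattern.
# Assumptions: hypothetical `spout models --active` (list active models) and
# `spout model set` (switch model) subcommands; real names will differ.

prompt_file="$1"
timestamp=$(date +%Y-%m-%d_%H-%M-%S)
log="tests/prompt_gamut-${timestamp}.txt"
total=0
count=0

for model in $(spout models --active); do        # assumes no spaces in names
    spout model set "$model"                     # hypothetical model switch
    start=$(date +%s)
    ./prompt_runner.sh "$prompt_file" >> "$log"  # assumes the runner takes a file arg
    duration=$(( $(date +%s) - start ))
    echo "$model: ${duration}s" >> "$log"
    total=$(( total + duration ))
    count=$(( count + 1 ))
done

echo "Total Models Tested: $count" >> "$log"
echo "Average Duration: $(( total / count ))s" >> "$log"   # h/m/s formatting omitted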

Test Results

Single Model Results

Results include:

  • Timestamp for each test
  • Current model information
  • Individual prompt responses
  • Execution time per prompt
  • Clear formatting for analysis

Example output:

Prompting Results for basic.txt using gpt-3.5-turbo - 2024-03-21_14-30-22
===================

Prompt:
How many vowels are in "hello world"?

Response (1234ms):
There are 3 vowels in "hello world": 'e', 'o', 'o'

-------------------

Multi-Model Results

Results include:

  • Summary statistics
  • Per-model performance
  • Total and average durations
  • Comparative responses
  • Model-specific timing

Example output:

Model Performance Summary - 2024-03-21_14-30-22
=================================
Total Models Tested: 3
Total Duration: 0h 15m 45s
Average Duration: 5m 15s
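
The average is simply the total duration divided by the number of models tested: 15m 45s is 945 seconds, and 945s / 3 = 315s, i.e. 5m 15s per model.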

Best Practices

Organizing Prompts

  • Group related prompts in directories
  • Use clear, descriptive filenames
  • Write one prompt per line (see the example after this list)
  • Include expected responses
  • Test edge cases
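
For example, a prompt file following the one-prompt-per-line convention might look like this (contents are illustrative):

How many vowels are in "hello world"?
Translate "good morning" into French.
Summarize the plot of Hamlet in one sentence.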

Running Tests

  • Test new prompts on single model first
  • Use prompt gamut for final validation
  • Monitor execution times
  • Compare responses across models
  • Document unexpected behaviors

Analyzing Results

  • Review timing patterns
  • Compare model responses
  • Look for consistency
  • Check for errors
  • Track performance trends

Regular testing across different models helps identify which prompts work best with which models and can inform your model selection strategy.