SPOUT provides two prompt testing utilities to evaluate responses across different models and scenarios:
- Prompt Runner: Tests prompts against a single model
- Prompt Gamut: Tests prompts across multiple models
Directory Structure
prompt_runners/
├── prompt_runner.bat # Single model testing (Windows)
├── prompt_runner.sh # Single model testing (Unix)
├── prompt_gamut.bat # Multi-model testing (Windows)
└── prompt_gamut.sh # Multi-model testing (Unix)
Single Model Testing
The prompt runner tests prompts against your currently selected model:
# Windows
./prompt_runner.bat
# Unix
./prompt_runner.sh
Features
- Interactive file/directory selection
- Processes individual files or entire directories
- Timestamps all results
- Records execution time for each prompt
- Maintains consistent conversation context
- Saves detailed response logs
Usage Example
- Run the prompt runner
- Select a file or directory from the list
- If selecting a directory, choose specific file or "All files"
- Results are saved to tests/conversation_test-[timestamp].txt (see the snippet after this list for a quick way to open the latest report)
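The report format itself is shown under Test Results below. As a minimal sketch for finding the newest report on Unix, assuming results land in tests/ as described (these are plain shell helpers, not part of SPOUT):
# Unix: open the most recent single-model report
latest=$(ls -t tests/conversation_test-*.txt | head -n 1)
less "$latest"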
Multi-Model Testing
The prompt gamut tests prompts across all active models defined in models.ini (a hypothetical example of that file is sketched after the commands below):
# Windows
./prompt_gamut.bat
# Unix
./prompt_gamut.sh
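The actual schema of models.ini is defined by SPOUT and is not reproduced here; purely to illustrate the idea of marking models as active, a hypothetical file might look like this (the section and key names are assumptions, not SPOUT's documented format):
; hypothetical models.ini - names and keys are illustrative only
[gpt-3.5-turbo]
active = true

[gpt-4]
active = true

[claude-3-haiku]
active = false   ; inactive models are skipped by the gamut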
Features
- Tests against all active models
- Automatic model switching
- Calculates per-model timing
- Provides summary statistics
- Maintains consistent testing environment
- Supports batch processing
Usage Example
- Run the prompt gamut
- Select prompt file or directory
- If selecting a directory, choose specific file or "All files"
- Results are saved to tests/prompt_gamut-[timestamp].txt (see the snippet after this list for a quick way to compare summaries across runs)
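Each gamut report starts with the performance summary shown under Test Results below; assuming that header text, the summaries from several runs can be compared with a single command on Unix:
# Unix: pull the summary block from every gamut report (assumes the header shown below)
grep -A 4 "Model Performance Summary" tests/prompt_gamut-*.txt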
Test Results
Single Model Results
Results include:
- Timestamp for each test
- Current model information
- Individual prompt responses
- Execution time per prompt
- Clear formatting for analysis
Example output:
Prompting Results for basic.txt using gpt-3.5-turbo - 2024-03-21_14-30-22
===================
Prompt:
How many vowels are in "hello world"?
Response (1234ms):
There are 3 vowels in "hello world": 'e', 'o', 'o'
-------------------
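Because each response line carries its duration, per-prompt timings can be pulled from a report with standard tools. A minimal sketch on Unix, assuming the "Response (1234ms):" format above:
# Unix: list per-prompt durations from the single-model reports
grep -o "Response ([0-9]*ms)" tests/conversation_test-*.txt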
Multi-Model Results
Results include:
- Summary statistics
- Per-model performance
- Total and average durations
- Comparative responses
- Model-specific timing
Example output:
Model Performance Summary - 2024-03-21_14-30-22
=================================
Total Models Tested: 3
Total Duration: 0h 15m 45s
Average Duration: 5m 15s
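The average here is simply the total duration divided by the number of models tested: 15m 45s is 945 seconds, and 945 s / 3 models = 315 s, i.e. the 5m 15s shown above.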
Best Practices
Organizing Prompts
- Group related prompts in directories (see the example layout after this list)
- Use clear, descriptive filenames
- Keep one prompt per line
- Include expected responses
- Test edge cases
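As a concrete illustration of these points, one way to lay out a prompt collection might be (the directory and file names are hypothetical, not part of SPOUT):
prompts/
├── counting/
│   ├── basic.txt        # straightforward prompts, one per line
│   └── edge_cases.txt   # empty strings, very long input, unusual characters
└── summarization/
    └── short_articles.txt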
Running Tests
- Test new prompts against a single model first (see the sketch after this list)
- Use prompt gamut for final validation
- Monitor execution times
- Compare responses across models
- Document unexpected behaviors
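Putting the first two points together, a typical session simply runs the two utilities in order; both are interactive, so this is only the sequence of invocations on Unix:
# Unix: check new prompts against the current model first, then validate across all active models
./prompt_runner.sh
./prompt_gamut.sh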
Analyzing Results
- Review timing patterns (see the sketch after this list)
- Compare model responses
- Look for consistency
- Check for errors
- Track performance trends
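For tracking performance trends, the millisecond markers in the reports can be averaged across runs. A rough sketch on Unix (this is ad-hoc analysis, not a SPOUT feature, and assumes the "Response (1234ms):" format shown earlier):
# Unix: average per-prompt duration across all single-model reports
grep -ho "([0-9]*ms)" tests/conversation_test-*.txt \
  | tr -d '(ms)' \
  | awk '{ total += $1; n++ } END { if (n) printf "%d prompts, %.0f ms average\n", n, total / n }'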
Regular testing across different models identifies which prompts work best with specific models and helps optimize your model selection strategy.