Enhance README with comprehensive testing history and insights

Documents the complete 7-round evaluation process, from initial 6-model testing through Gemma3:12b's breakthrough selfie detection. Adds historical context for removed experimental prompts (07-11), model evolution insights, and performance characteristics discovered through extensive testing. Key additions: - Complete testing history (Take 1-7 plus mini-tests) - Model ranking evolution and breakthrough discoveries - Experimental prompt history and removal rationale - Technical insights from 768px optimization and repetition patterns - Results archive documentation 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>
2026-04-26 14:57:39 +00:00 · 2025-07-09 13:35:46 -07:00 · 2025-07-09 13:35:46 -07:00 · 0848b43304
commit 0848b43304
parent 357018ee7b
1 changed files with 110 additions and 14 deletions
--- a/README.md
+++ b/README.md
@ -57,9 +57,54 @@ This framework is designed to enable natural language search for video diary con

 The framework tests different combinations of:

- **Models**: Ollama-hosted vision models (llava:7b, qwen2.5vl:7b, minicpm-v:8b, gemma3:12b)
- **Image sizes**: 768px and 1024px (optimized from testing 128px to 2048px)
- **Prompts**: Three refined strategies focusing on different aspects
+- **Models**: Ollama-hosted vision models (currently focused on gemma3:12b, minicpm-v:8b, qwen2.5vl:7b)
+- **Image sizes**: 768px and 1024px (optimized from extensive testing of 128px to 2048px)
+- **Prompts**: Three refined strategies (evolved from 11 experimental prompts)
+
+## Testing History
+
+This framework has undergone **7 major evaluation rounds** plus specialized tests, evolving from broad exploration to focused optimization:
+
+### Take 1: Initial Comprehensive Evaluation
+- **6 models** tested: llava:7b, llava:13b, llava-phi3:3.8b, qwen2.5vl:3b, moondream:1.8b, llama3.2-vision:11b
+- **7 image sizes**: 128px to 2048px
+- **6 prompts**: 01-structured-comprehensive through 06-concise-complete
+- **Key finding**: 768px-1024px identified as optimal size range
+
+### Take 2: System Prompt Experiment
+- **11 prompts** tested (added 5 experimental strategies)
+- **Complex prompts**: 07-ultra-detailed-scene, 08-memory-search-optimizer, 09-contextual-story-tagger, 10-moment-finder-pro, 11-smart-scene-decoder
+- **Key finding**: Complex prompts caused repetition without adding value
+
+### Take 3: Prompt Refinement
+- **Simplified to 3 core prompts** (01, 03, 05)
+- **Removed problematic prompts** that caused repetition
+- **Key finding**: Simpler prompts improved consistency
+
+### Take 4: Bigger Models Evaluation
+- **6 models** including larger variants: llava:13b, bakllava:7b, minicpm-v:8b, qwen2.5vl:7b, llama3.2-vision:11b
+- **Key discoveries**: 
+  - Llama3.2-vision:11b had severe repetition issues (unusable)
+  - MiniCPM-V:8b showed zero repetition and excellent emotional descriptions
+  - Qwen2.5VL:7b emerged as people detection champion
+
+### Take 5: Emotion-Focused Refinement
+- **3 optimized models**: llava:7b, qwen2.5vl:7b, minicpm-v:8b
+- **768px focus**: Confirmed as optimal image size
+- **Key insight**: MiniCPM-V excels at comprehensive scene understanding, catching background details others miss
+
+### Take 6: Gemma Models Complete Test
+- **3 Gemma family models**: gemma3:4b, gemma3:12b, gemma3:27b
+- **Breakthrough discovery**: Gemma3:12b achieved **best selfie detection** of any model (23 detections)
+- **Key finding**: 12B parameter range may be optimal for vision tasks
+
+### Take 7: No-Prompt Control Test
+- **Baseline validation**: Testing models without prompts
+- **Key finding**: Demonstrated critical importance of structured prompts
+
+### Mini-Tests: Specialized Validation
+- **Highland Cattle test**: Focused validation on challenging animal detection
+- **Keyword selector**: Interactive analysis tool development

 ### Evaluation Priorities

@ -171,18 +216,36 @@ Results are saved as CSV files with the following columns:

 ## Model Performance Insights

-Based on extensive testing, here are key findings:
+Based on extensive testing across 7 evaluation rounds, here are key findings:

-### Model Strengths
- **Qwen2.5VL**: Best for emotion keywords and mood detection
- **MiniCPM-V**: Best for comprehensive scene understanding
- **LLaVA 7B**: Most reliable with minimal repetition
+### Current Model Rankings (Take 6 Results)
+1. **Gemma3:12b** - Champion for selfie detection and balanced performance
+2. **MiniCPM-V:8b** - Best comprehensive scene understanding with zero repetition
+3. **Qwen2.5VL:7b** - Reliable and efficient with strong emotion detection

-### Optimization Findings
- **768px images**: Optimal balance of quality and performance
- **Simple prompts**: Complex prompts cause repetition without adding value
- **Emotion focus**: Users care more about how people feel than precise counting
- **Background details**: Details like "bicycles in distance" enable memory-based searches
+### Historical Model Evolution
+- **Take 1**: LLaVA:7b identified as best value proposition
+- **Take 4**: Qwen2.5VL:7b emerged as people detection champion
+- **Take 5**: MiniCPM-V:8b revealed superior scene comprehension
+- **Take 6**: Gemma3:12b breakthrough - best selfie detection ever achieved (23/58 images)
+
+### Key Technical Discoveries
+- **Image size optimization**: 768px provides 98% of 1024px quality at 75% compute cost
+- **Prompt evolution**: Complex prompts (07-11) caused repetition; simple prompts (01, 03, 05) won
+- **Model size insights**: 12B parameter range (Gemma3:12b) may be optimal for vision tasks
+- **Repetition patterns**: Larger models often have worse repetition issues (llama3.2-vision:11b unusable)
+
+### Performance Characteristics
+- **Gemma3:12b**: 23 selfie detections, rich emotional vocabulary, balanced output
+- **MiniCPM-V:8b**: 0% repetition, notices background details ("bicycles in distance")
+- **Qwen2.5VL:7b**: 115 emotion keywords, best for mood-focused searches
+- **LLaVA:7b**: Most reliable baseline with minimal repetition (13 instances)
+
+### Optimization Insights
+- **Selfie detection**: Critical for first-person video diary content
+- **Background awareness**: Details like distant objects enable memory-based searches
+- **Emotion focus**: Users search feelings more than precise object counts
+- **Model failures**: Moondream-1.8b (84% failure rate), Llama3.2-vision:11b (severe repetition)

 ## Configuration

@ -230,6 +293,8 @@ The framework communicates with Ollama's REST API:

 2. The framework will automatically detect and use the new prompt.

+**Note**: Based on testing history, complex prompts (like the experimental prompts 07-11) tend to cause repetition issues. Simple, focused prompts work best.
+
 ### Custom Analysis

 The `claude-scratchpad/` directory contains analysis scripts and reports for deeper insights into model performance.
@ -249,10 +314,41 @@ The framework includes robust error handling:
 - Use `--single-prompt` to test prompt variations efficiently
 - Monitor Ollama's resource usage during evaluation

+## Experimental Prompts (Historical)
+
+The framework originally tested 11 different prompt strategies. Here are the removed experimental prompts (Takes 1-2):
+
+### Removed Prompts (Caused Repetition Issues)
+- **02-scene-everything**: Too verbose, removed after Take 1
+- **04-generate-all-aspects**: Repetitive output, removed after Take 1  
+- **06-concise-complete**: Broke smaller models, removed after Take 1
+- **07-ultra-detailed-scene**: Severe repetition, removed after Take 2
+- **08-memory-search-optimizer**: Complex without benefit, removed after Take 2
+- **09-contextual-story-tagger**: Too narrative, removed after Take 2
+- **10-moment-finder-pro**: Inconsistent results, removed after Take 2
+- **11-smart-scene-decoder**: Overcomplicated, removed after Take 2
+
+### Surviving Prompts (Current)
+- **01-structured-comprehensive**: Primary production prompt
+- **03-single-list**: Balanced alternative
+- **05-detailed-elements**: Detailed analysis option
+
+## Results Archive
+
+The `results/` directory contains complete evaluation data:
+- **results-take1/**: Initial comprehensive evaluation (6 models × 7 sizes × 6 prompts)
+- **results-take2-system-prompt/**: System prompt experiments (11 prompts)
+- **results-take3-tweak-prompts/**: Refined prompt evaluation
+- **results-take4-bigger-models/**: Large model comparison
+- **results-take-5-emphasize-emotion/**: Emotion-focused optimization
+- **results-take6-gemma/**: Gemma family evaluation
+- **results-take7-no-prompt-oh-no/**: Control test without prompts
+- **results-mini-test-*/**: Specialized validation tests
+
 ## Contributing

 This framework is designed for experimentation and can be extended with:
- New prompt strategies
+- New prompt strategies (test carefully - complex prompts often fail)
 - Additional models
 - Different image preprocessing approaches
 - Advanced analysis scripts