Mirror of https://github.com/1SecondEveryday/image-analysis-eval.git (synced 2026-04-26 14:57:39 +00:00)
Enhance README with comprehensive testing history and insights
Documents the complete 7-round evaluation process, from initial 6-model testing through Gemma3:12b's breakthrough selfie detection. Adds historical context for removed experimental prompts (07-11), model evolution insights, and performance characteristics discovered through extensive testing.

Key additions:
- Complete testing history (Takes 1-7 plus mini-tests)
- Model ranking evolution and breakthrough discoveries
- Experimental prompt history and removal rationale
- Technical insights from 768px optimization and repetition patterns
- Results archive documentation

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
parent 357018ee7b
commit 0848b43304

1 changed file with 110 additions and 14 deletions

README.md | 124
@@ -57,9 +57,54 @@ This framework is designed to enable natural language search for video diary content

 The framework tests different combinations of:
 
-- **Models**: Ollama-hosted vision models (llava:7b, qwen2.5vl:7b, minicpm-v:8b, gemma3:12b)
-- **Image sizes**: 768px and 1024px (optimized from testing 128px to 2048px)
-- **Prompts**: Three refined strategies focusing on different aspects
+- **Models**: Ollama-hosted vision models (currently focused on gemma3:12b, minicpm-v:8b, qwen2.5vl:7b)
+- **Image sizes**: 768px and 1024px (optimized from extensive testing of 128px to 2048px)
+- **Prompts**: Three refined strategies (evolved from 11 experimental prompts)
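The bullets above define a straightforward evaluation matrix. As a minimal sketch of how such a sweep could be driven (hypothetical code; the framework's real entry point and its `analyze` callback are not shown in this diff):

```python
# Hypothetical driver for the model x size x prompt matrix described above.
from itertools import product

MODELS = ["gemma3:12b", "minicpm-v:8b", "qwen2.5vl:7b"]
IMAGE_SIZES = [768, 1024]
PROMPTS = ["01-structured-comprehensive", "03-single-list", "05-detailed-elements"]

def run_matrix(images, analyze):
    """analyze(model, size, prompt, image) -> str is supplied by the caller."""
    rows = []
    for model, size, prompt in product(MODELS, IMAGE_SIZES, PROMPTS):
        for image in images:
            rows.append({
                "model": model,
                "size": size,
                "prompt": prompt,
                "image": image,
                "output": analyze(model, size, prompt, image),
            })
    return rows
```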
+
+## Testing History
+
+This framework has undergone **7 major evaluation rounds** plus specialized tests, evolving from broad exploration to focused optimization:
+
+### Take 1: Initial Comprehensive Evaluation
+- **6 models** tested: llava:7b, llava:13b, llava-phi3:3.8b, qwen2.5vl:3b, moondream:1.8b, llama3.2-vision:11b
+- **7 image sizes**: 128px to 2048px
+- **6 prompts**: 01-structured-comprehensive through 06-concise-complete
+- **Key finding**: 768px-1024px identified as the optimal size range
+
+### Take 2: System Prompt Experiment
+- **11 prompts** tested (added 5 experimental strategies)
+- **Complex prompts**: 07-ultra-detailed-scene, 08-memory-search-optimizer, 09-contextual-story-tagger, 10-moment-finder-pro, 11-smart-scene-decoder
+- **Key finding**: Complex prompts caused repetition without adding value
+
+### Take 3: Prompt Refinement
+- **Simplified to 3 core prompts** (01, 03, 05)
+- **Removed problematic prompts** that caused repetition
+- **Key finding**: Simpler prompts improved consistency
+
+### Take 4: Bigger Models Evaluation
+- **6 models** including larger variants: llava:13b, bakllava:7b, minicpm-v:8b, qwen2.5vl:7b, llama3.2-vision:11b
+- **Key discoveries**:
+  - Llama3.2-vision:11b had severe repetition issues (unusable)
+  - MiniCPM-V:8b showed zero repetition and excellent emotional descriptions
+  - Qwen2.5VL:7b emerged as the people-detection champion
+
+### Take 5: Emotion-Focused Refinement
+- **3 optimized models**: llava:7b, qwen2.5vl:7b, minicpm-v:8b
+- **768px focus**: Confirmed as the optimal image size
+- **Key insight**: MiniCPM-V excels at comprehensive scene understanding, catching background details other models miss
+
+### Take 6: Gemma Models Complete Test
+- **3 Gemma family models**: gemma3:4b, gemma3:12b, gemma3:27b
+- **Breakthrough discovery**: Gemma3:12b achieved the **best selfie detection** of any model (23 detections)
+- **Key finding**: The 12B parameter range may be optimal for vision tasks
+
+### Take 7: No-Prompt Control Test
+- **Baseline validation**: Testing models without prompts
+- **Key finding**: Demonstrated the critical importance of structured prompts
+
+### Mini-Tests: Specialized Validation
+- **Highland Cattle test**: Focused validation on challenging animal detection
+- **Keyword selector**: Interactive analysis tool development
+
 ### Evaluation Priorities

@@ -171,18 +216,36 @@ Results are saved as CSV files with the following columns:

 ## Model Performance Insights
 
-Based on extensive testing, here are key findings:
+Based on extensive testing across 7 evaluation rounds, here are key findings:
 
-### Model Strengths
-- **Qwen2.5VL**: Best for emotion keywords and mood detection
-- **MiniCPM-V**: Best for comprehensive scene understanding
-- **LLaVA 7B**: Most reliable with minimal repetition
+### Current Model Rankings (Take 6 Results)
+1. **Gemma3:12b** - Champion for selfie detection and balanced performance
+2. **MiniCPM-V:8b** - Best comprehensive scene understanding with zero repetition
+3. **Qwen2.5VL:7b** - Reliable and efficient with strong emotion detection
 
-### Optimization Findings
-- **768px images**: Optimal balance of quality and performance
-- **Simple prompts**: Complex prompts cause repetition without adding value
-- **Emotion focus**: Users care more about how people feel than precise counting
-- **Background details**: Details like "bicycles in distance" enable memory-based searches
+### Historical Model Evolution
+- **Take 1**: LLaVA:7b identified as the best value proposition
+- **Take 4**: Qwen2.5VL:7b emerged as the people-detection champion
+- **Take 5**: MiniCPM-V:8b revealed superior scene comprehension
+- **Take 6**: Gemma3:12b breakthrough, with the best selfie detection of any model tested (23/58 images)
+
+### Key Technical Discoveries
+- **Image size optimization**: 768px provides 98% of 1024px quality at 75% of the compute cost
+- **Prompt evolution**: Complex prompts (07-11) caused repetition; simple prompts (01, 03, 05) won out
+- **Model size insights**: The 12B parameter range (Gemma3:12b) may be optimal for vision tasks
+- **Repetition patterns**: Larger models often have worse repetition issues (llama3.2-vision:11b was unusable)
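The 768px finding corresponds to a simple preprocessing step. A sketch using Pillow (an assumed dependency; the framework's actual resize code is not part of this diff):

```python
# Downscale so the longest edge is at most 768px, preserving aspect ratio.
from PIL import Image

def resize_for_eval(path, max_edge=768):
    img = Image.open(path)
    img.thumbnail((max_edge, max_edge))  # in-place; no-op if already smaller
    return img
```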
+
+### Performance Characteristics
+- **Gemma3:12b**: 23 selfie detections, rich emotional vocabulary, balanced output
+- **MiniCPM-V:8b**: 0% repetition, notices background details ("bicycles in distance")
+- **Qwen2.5VL:7b**: 115 emotion keywords, best for mood-focused searches
+- **LLaVA:7b**: Most reliable baseline, with minimal repetition (13 instances)
+
+### Optimization Insights
+- **Selfie detection**: Critical for first-person video diary content
+- **Background awareness**: Details like distant objects enable memory-based searches
+- **Emotion focus**: Users search for feelings more often than for precise object counts
+- **Model failures**: moondream:1.8b (84% failure rate), llama3.2-vision:11b (severe repetition)
 
 ## Configuration

@@ -230,6 +293,8 @@ The framework communicates with Ollama's REST API:

 2. The framework will automatically detect and use the new prompt.
 
+**Note**: Based on testing history, complex prompts (like the experimental prompts 07-11) tend to cause repetition issues. Simple, focused prompts work best.
+
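For reference, a minimal call against Ollama's standard `/api/generate` endpoint with a base64-encoded image; the model and prompt values here are illustrative, not the framework's own code:

```python
import base64
import requests

def describe_image(image_path, model="gemma3:12b", prompt="Describe this scene."):
    # Ollama expects images as base64 strings in the "images" array.
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "images": [image_b64], "stream": False},
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["response"]
```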
 ### Custom Analysis
 
 The `claude-scratchpad/` directory contains analysis scripts and reports for deeper insights into model performance.
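In the same spirit as those scripts, a hypothetical helper for tallying keyword hits (such as "selfie") across a results CSV; the `description` column name is an assumption, not the documented schema:

```python
import csv

def keyword_hits(csv_path, keyword, column="description"):
    # Count rows whose description mentions the keyword (case-insensitive).
    with open(csv_path, newline="") as f:
        return sum(keyword.lower() in (row.get(column) or "").lower()
                   for row in csv.DictReader(f))
```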

@@ -249,10 +314,41 @@ The framework includes robust error handling:

 - Use `--single-prompt` to test prompt variations efficiently
 - Monitor Ollama's resource usage during evaluation
 
+## Experimental Prompts (Historical)
+
+The framework originally tested 11 different prompt strategies. Here are the experimental prompts removed after Takes 1 and 2:
+
+### Removed Prompts (Caused Repetition Issues)
+- **02-scene-everything**: Too verbose, removed after Take 1
+- **04-generate-all-aspects**: Repetitive output, removed after Take 1
+- **06-concise-complete**: Broke smaller models, removed after Take 1
+- **07-ultra-detailed-scene**: Severe repetition, removed after Take 2
+- **08-memory-search-optimizer**: Complex without benefit, removed after Take 2
+- **09-contextual-story-tagger**: Too narrative, removed after Take 2
+- **10-moment-finder-pro**: Inconsistent results, removed after Take 2
+- **11-smart-scene-decoder**: Overcomplicated, removed after Take 2
+
+### Surviving Prompts (Current)
+- **01-structured-comprehensive**: Primary production prompt
+- **03-single-list**: Balanced alternative
+- **05-detailed-elements**: Detailed analysis option
+
+## Results Archive
+
+The `results/` directory contains complete evaluation data:
+- **results-take1/**: Initial comprehensive evaluation (6 models × 7 sizes × 6 prompts)
+- **results-take2-system-prompt/**: System prompt experiments (11 prompts)
+- **results-take3-tweak-prompts/**: Refined prompt evaluation
+- **results-take4-bigger-models/**: Large model comparison
+- **results-take-5-emphasize-emotion/**: Emotion-focused optimization
+- **results-take6-gemma/**: Gemma family evaluation
+- **results-take7-no-prompt-oh-no/**: Control test without prompts
+- **results-mini-test-*/**: Specialized validation tests
+
 ## Contributing
 
 This framework is designed for experimentation and can be extended with:
-- New prompt strategies
+- New prompt strategies (test carefully; complex prompts often fail)
 - Additional models
 - Different image preprocessing approaches
 - Advanced analysis scripts