sjs/gh-1SecondEveryday-image-analysis-eval

mirror of https://github.com/1SecondEveryday/image-analysis-eval.git synced 2026-03-25 09:05:49 +00:00

No description

Find a file

Sami Samhuri 0848b43304 Enhance README with comprehensive testing history and insights Documents the complete 7-round evaluation process, from initial 6-model testing through Gemma3:12b's breakthrough selfie detection. Adds historical context for removed experimental prompts (07-11), model evolution insights, and performance characteristics discovered through extensive testing. Key additions: - Complete testing history (Take 1-7 plus mini-tests) - Model ranking evolution and breakthrough discoveries - Experimental prompt history and removal rationale - Technical insights from 768px optimization and repetition patterns - Results archive documentation 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>		2025-07-09 13:35:46 -07:00
photo-768	Add more results, keyword selector, and 768px photos	2025-06-26 12:46:13 -04:00
photos-original	Rename original photos	2025-06-24 09:30:03 -04:00
prompts	Tweak prompts to put the right emphasis on people (it's not counting them)	2025-06-25 09:24:18 -04:00
results-mini-test-16-highland-cattle	Improve prompt when no people are present	2025-07-09 11:49:04 -07:00
results-take-5-emphasize-emotion	Add more results, keyword selector, and 768px photos	2025-06-26 12:46:13 -04:00
results-take1	Add results-take1 summary	2025-06-24 23:51:21 -04:00
results-take2-system-prompt	Add results from runs 2 and 3	2025-06-25 01:08:16 -04:00
results-take3-tweak-prompts	Add results from runs 2 and 3	2025-06-25 01:08:16 -04:00
results-take4-bigger-models	Add results take 4 - bigger models	2025-06-25 08:38:37 -04:00
results-take6-gemma	Add more results, keyword selector, and 768px photos	2025-06-26 12:46:13 -04:00
results-take7-no-prompt-oh-no	Add take7 results with no user prompt, it's bad	2025-06-27 22:57:03 -04:00
aggregate_results.rb	Add scripts to work in parallel and aggregate results separately	2025-06-24 21:55:21 -04:00
auto_rename_photos.rb	First commit	2025-06-24 09:22:33 -04:00
CLAUDE.md	Tweak extract_tags.rb script	2025-06-26 09:42:11 -04:00
execute_rename.rb	Rename original photos	2025-06-24 09:30:03 -04:00
extract_tags.rb	Improve prompt when no people are present	2025-07-09 11:49:04 -07:00
image-data.js	Add more results, keyword selector, and 768px photos	2025-06-26 12:46:13 -04:00
keyword-selector.html	Add more results, keyword selector, and 768px photos	2025-06-26 12:46:13 -04:00
meta-prompt.txt	First commit	2025-06-24 09:22:33 -04:00
README.md	Enhance README with comprehensive testing history and insights	2025-07-09 13:35:46 -07:00
rename_photos.rb	First commit	2025-06-24 09:22:33 -04:00
resize_images.rb	First commit	2025-06-24 09:22:33 -04:00
results-take-5-keywords-comparison.csv	Add more results, keyword selector, and 768px photos	2025-06-26 12:46:13 -04:00

README.md

Image Analysis Evaluation Framework

An evaluation framework for testing Vision-Language Models (VLMs) on their ability to extract meaningful tags from images. This project helps identify the optimal combination of model, image size, and prompt strategy for generating searchable tags from video diary snippets.

Overview

This framework is designed to enable natural language search for video diary content by evaluating different VLMs' performance at extracting contextual tags from images. The goal is to find the best approach for generating tags that users can search with natural language queries like "when was I happy with friends" or "snowy mountain scenes."

Quick Start

Prerequisites

Ruby (standard library only - no gems required)
Ollama running locally with vision models installed
Image files to analyze

Basic Usage

Install Ollama models:

ollama pull llava:7b
ollama pull qwen2.5vl:7b
ollama pull minicpm-v:8b

Add your images:

# Place original images in photos-original/
cp your_images/* photos-original/

# Resize images for testing
./resize_images.rb

Run evaluation:

# Test all models with default settings
./extract_tags.rb

# Test specific models
./extract_tags.rb -m llava:7b,qwen2.5vl:7b

# Test with limited images
./extract_tags.rb --max-images 10

View results:

# Combine all results into master files
./aggregate_results.rb

# Check results/ directory for CSV outputs
ls results/

Architecture

The framework tests different combinations of:

Models: Ollama-hosted vision models (currently focused on gemma3:12b, minicpm-v:8b, qwen2.5vl:7b)
Image sizes: 768px and 1024px (optimized from extensive testing of 128px to 2048px)
Prompts: Three refined strategies (evolved from 11 experimental prompts)

Testing History

This framework has undergone 7 major evaluation rounds plus specialized tests, evolving from broad exploration to focused optimization:

Take 1: Initial Comprehensive Evaluation

6 models tested: llava:7b, llava:13b, llava-phi3:3.8b, qwen2.5vl:3b, moondream:1.8b, llama3.2-vision:11b
7 image sizes: 128px to 2048px
6 prompts: 01-structured-comprehensive through 06-concise-complete
Key finding: 768px-1024px identified as optimal size range

Take 2: System Prompt Experiment

11 prompts tested (added 5 experimental strategies)
Complex prompts: 07-ultra-detailed-scene, 08-memory-search-optimizer, 09-contextual-story-tagger, 10-moment-finder-pro, 11-smart-scene-decoder
Key finding: Complex prompts caused repetition without adding value

Take 3: Prompt Refinement

Simplified to 3 core prompts (01, 03, 05)
Removed problematic prompts that caused repetition
Key finding: Simpler prompts improved consistency

Take 4: Bigger Models Evaluation

6 models including larger variants: llava:13b, bakllava:7b, minicpm-v:8b, qwen2.5vl:7b, llama3.2-vision:11b
Key discoveries:
- Llama3.2-vision:11b had severe repetition issues (unusable)
- MiniCPM-V:8b showed zero repetition and excellent emotional descriptions
- Qwen2.5VL:7b emerged as people detection champion

Take 5: Emotion-Focused Refinement

3 optimized models: llava:7b, qwen2.5vl:7b, minicpm-v:8b
768px focus: Confirmed as optimal image size
Key insight: MiniCPM-V excels at comprehensive scene understanding, catching background details others miss

Take 6: Gemma Models Complete Test

3 Gemma family models: gemma3:4b, gemma3:12b, gemma3:27b
Breakthrough discovery: Gemma3:12b achieved best selfie detection of any model (23 detections)
Key finding: 12B parameter range may be optimal for vision tasks

Take 7: No-Prompt Control Test

Baseline validation: Testing models without prompts
Key finding: Demonstrated critical importance of structured prompts

Mini-Tests: Specialized Validation

Highland Cattle test: Focused validation on challenging animal detection
Keyword selector: Interactive analysis tool development

Evaluation Priorities

The framework prioritizes tags in this order:

People detection - Human presence, emotions, expressions, moods, activities, interactions
Overall mood/atmosphere - Emotional tone and feeling of scenes
Objects - Important items that provide context
Scene details - Colors, lighting, setting/location, time of day
Camera perspective - Selfies and POV (first-person) shots

Project Structure

├── photos-original/          # Source images (add your images here)
├── photo-768/               # Resized images (768px)
├── photo-1024/              # Resized images (1024px)
├── prompts/                 # Prompt strategies
│   ├── 01-structured-comprehensive.txt
│   ├── 03-single-list.txt
│   └── 05-detailed-elements.txt
├── results/                 # Evaluation outputs (CSV files)
├── expected-tags/           # Ground truth tags (optional)
├── claude-scratchpad/       # Analysis scripts and reports
└── *.rb                     # Ruby evaluation scripts

Scripts

Core Evaluation

`extract_tags.rb` - Main evaluation script

Usage: ./extract_tags.rb [options]

Options:
  -m, --models MODELS              Comma-separated list of models
  -t, --timeout SECONDS            Request timeout (default: 120)
      --max-images NUM             Limit number of images processed
      --no-unload                  Keep models loaded between tests
      --single-prompt NAME         Test only one prompt (01, 03, or 05)
  -v, --verbose                    Enable verbose output
  -h, --help                       Show this help message

Examples:
  ./extract_tags.rb                                    # Test all models
  ./extract_tags.rb -m llava:7b --max-images 5        # Quick test
  ./extract_tags.rb --single-prompt 03 --no-unload    # Test one prompt

Data Management

`resize_images.rb` - Image preparation

Resizes images from photos-original/ to different sizes for testing.

./resize_images.rb

`aggregate_results.rb` - Results compilation

Combines individual CSV results into master files for analysis.

./aggregate_results.rb

`rename_photos.rb` - Interactive photo management

Interactive tool for renaming photos with descriptive names.

./rename_photos.rb

`auto_rename_photos.rb` - Automated photo renaming

Uses AI to automatically generate descriptive names for photos.

./auto_rename_photos.rb

Prompts

The framework uses three refined prompt strategies:

01-structured-comprehensive.txt

List comma-separated keywords only. For this image include: people presence (if humans visible, include 'people' and describe as 'couple', 'group', or 'crowd'); their emotions, expressions, and mood (happy, relaxed, contemplative, excited, etc.); what they're doing and how they're interacting; camera perspective (include 'selfie' if self-portrait or 'pov' if first-person view); dominant colors; key objects; overall atmosphere and mood; lighting quality; and setting/location.

03-single-list.txt

Provide a single comma-separated list of keywords. If people are present, capture their emotions, expressions, moods, and what they're doing. Note if it's a selfie or POV shot. Include overall atmosphere, key objects, dominant colors, lighting, and setting/location—nothing else.

05-detailed-elements.txt

Create keywords as comma-separated values only. Focus on: people's emotions, expressions, moods, and activities if present; camera perspective (selfie or pov if applicable); the emotional atmosphere and overall mood; important objects; dominant colors; lighting quality; and location/setting details.

Output Format

Results are saved as CSV files with the following columns:

filename - Image filename
model - Model used for analysis
size - Image size (768 or 1024)
prompt - Prompt strategy used (01, 03, or 05)
tags - Extracted comma-separated tags
response_time - Processing time in seconds
success - Boolean indicating if request succeeded

Model Performance Insights

Based on extensive testing across 7 evaluation rounds, here are key findings:

Current Model Rankings (Take 6 Results)

Gemma3:12b - Champion for selfie detection and balanced performance
MiniCPM-V:8b - Best comprehensive scene understanding with zero repetition
Qwen2.5VL:7b - Reliable and efficient with strong emotion detection

Historical Model Evolution

Take 1: LLaVA:7b identified as best value proposition
Take 4: Qwen2.5VL:7b emerged as people detection champion
Take 5: MiniCPM-V:8b revealed superior scene comprehension
Take 6: Gemma3:12b breakthrough - best selfie detection ever achieved (23/58 images)

Key Technical Discoveries

Image size optimization: 768px provides 98% of 1024px quality at 75% compute cost
Prompt evolution: Complex prompts (07-11) caused repetition; simple prompts (01, 03, 05) won
Model size insights: 12B parameter range (Gemma3:12b) may be optimal for vision tasks
Repetition patterns: Larger models often have worse repetition issues (llama3.2-vision:11b unusable)

Performance Characteristics

Gemma3:12b: 23 selfie detections, rich emotional vocabulary, balanced output
MiniCPM-V:8b: 0% repetition, notices background details ("bicycles in distance")
Qwen2.5VL:7b: 115 emotion keywords, best for mood-focused searches
LLaVA:7b: Most reliable baseline with minimal repetition (13 instances)

Optimization Insights

Selfie detection: Critical for first-person video diary content
Background awareness: Details like distant objects enable memory-based searches
Emotion focus: Users search feelings more than precise object counts
Model failures: Moondream-1.8b (84% failure rate), Llama3.2-vision:11b (severe repetition)

Configuration

The framework uses these defaults:

Ollama URL: http://localhost:11434/api/generate
Default models: llava:7b, qwen2.5vl:7b, minicpm-v:8b, gemma3:12b
Image sizes: 768px and 1024px
Timeout: 120 seconds
Image formats: .jpg, .jpeg, .png, .gif, .bmp, .tiff, .tif

API Integration

The framework communicates with Ollama's REST API:

# Example API call structure
{
  "model": "llava:7b",
  "prompt": "Provide a single comma-separated list of keywords...",
  "images": ["base64_encoded_image_data"],
  "stream": false
}

Development

Adding New Models

Install the model with Ollama:
```
ollama pull model_name:tag
```

Add to the models list in extract_tags.rb:

DEFAULT_MODELS = ['llava:7b', 'qwen2.5vl:7b', 'minicpm-v:8b', 'your_model']

Adding New Prompts

Create a new prompt file in prompts/:

echo "Your prompt text here" > prompts/06-new-strategy.txt

The framework will automatically detect and use the new prompt.

Note: Based on testing history, complex prompts (like the experimental prompts 07-11) tend to cause repetition issues. Simple, focused prompts work best.

Custom Analysis

The claude-scratchpad/ directory contains analysis scripts and reports for deeper insights into model performance.

Error Handling

The framework includes robust error handling:

Continues processing on individual failures
Logs errors in result files
Provides verbose output for debugging
Includes timeout protection for long-running requests

Performance Tips

Use --no-unload to keep models loaded between tests (faster)
Start with --max-images 5 for quick testing
Use --single-prompt to test prompt variations efficiently
Monitor Ollama's resource usage during evaluation

Experimental Prompts (Historical)

The framework originally tested 11 different prompt strategies. Here are the removed experimental prompts (Takes 1-2):

Removed Prompts (Caused Repetition Issues)

02-scene-everything: Too verbose, removed after Take 1
04-generate-all-aspects: Repetitive output, removed after Take 1
06-concise-complete: Broke smaller models, removed after Take 1
07-ultra-detailed-scene: Severe repetition, removed after Take 2
08-memory-search-optimizer: Complex without benefit, removed after Take 2
09-contextual-story-tagger: Too narrative, removed after Take 2
10-moment-finder-pro: Inconsistent results, removed after Take 2
11-smart-scene-decoder: Overcomplicated, removed after Take 2

Surviving Prompts (Current)

01-structured-comprehensive: Primary production prompt
03-single-list: Balanced alternative
05-detailed-elements: Detailed analysis option

Results Archive

The results/ directory contains complete evaluation data:

results-take1/: Initial comprehensive evaluation (6 models × 7 sizes × 6 prompts)
results-take2-system-prompt/: System prompt experiments (11 prompts)
results-take3-tweak-prompts/: Refined prompt evaluation
results-take4-bigger-models/: Large model comparison
results-take-5-emphasize-emotion/: Emotion-focused optimization
results-take6-gemma/: Gemma family evaluation
results-take7-no-prompt-oh-no/: Control test without prompts
results-mini-test-*/: Specialized validation tests

Contributing

This framework is designed for experimentation and can be extended with:

New prompt strategies (test carefully - complex prompts often fail)
Additional models
Different image preprocessing approaches
Advanced analysis scripts
Ground truth comparison tools

License

This project is part of the 1 Second Everyday ecosystem and is designed for internal research and development.

README.md Unescape Escape