Mirror of https://github.com/1SecondEveryday/image-analysis-eval.git (synced 2026-03-25 09:05:49 +00:00)
# CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
## Overview
This is an image analysis evaluation framework designed to test Vision-Language Models (VLMs) on their ability to extract meaningful tags from images. The project's goal is to enable natural language search for video diary snippets by finding the optimal combination of model, image size, and prompt strategy.
## Architecture

The evaluation framework tests different combinations of:

- **Models**: Ollama-hosted vision models (currently focused on `llava:7b` and `qwen2.5vl:3b`)
- **Image sizes**: 768px and 1024px (chosen after testing sizes from 128px to 2048px)
- **Prompts**: Various strategies in the `prompts/` directory for extracting searchable tags
## Key Scripts

### Core Evaluation

`extract_tags.rb`: Main evaluation script.

```
./extract_tags.rb [options]
  -m, --models MODELS     # Comma-separated list of models
  -t, --timeout SECONDS   # Request timeout (default: 120)
  --max-images NUM        # Limit number of images
  --no-unload             # Keep models loaded between tests
  --single-prompt NAME    # Test only one prompt
```
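The option surface above can be sketched with Ruby's stdlib `OptionParser`. The flag names follow the usage text; the defaults and variable names here are illustrative, not the script's actual implementation:

```ruby
require 'optparse'

# Illustrative defaults; the real extract_tags.rb may differ.
options = { models: [], timeout: 120, max_images: nil, unload: true, single_prompt: nil }

parser = OptionParser.new do |opts|
  opts.banner = "Usage: ./extract_tags.rb [options]"
  opts.on("-m", "--models MODELS", "Comma-separated list of models") { |v| options[:models] = v.split(",") }
  opts.on("-t", "--timeout SECONDS", Integer, "Request timeout (default: 120)") { |v| options[:timeout] = v }
  opts.on("--max-images NUM", Integer, "Limit number of images") { |v| options[:max_images] = v }
  opts.on("--no-unload", "Keep models loaded between tests") { options[:unload] = false }
  opts.on("--single-prompt NAME", "Test only one prompt") { |v| options[:single_prompt] = v }
end

parser.parse!(ARGV)
```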
### Data Preparation

`resize_images.rb`: Resizes images from `photos-original/` to various sizes.

```
./resize_images.rb
```
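A minimal sketch of what the resize step might look like, assuming ImageMagick's `magick` CLI is available and the `photo-<size>/` naming shown in the project structure; the real `resize_images.rb` may use a different tool or options:

```ruby
require 'fileutils'

# Hypothetical sketch: shell out to ImageMagick to fit each photo
# within a square bounding box. The real resize_images.rb may differ.
SIZES = [768, 1024]

def resize_all(source_dir = "photos-original")
  SIZES.each do |size|
    dest_dir = "photo-#{size}"
    FileUtils.mkdir_p(dest_dir)
    Dir.glob(File.join(source_dir, "*.{jpg,jpeg,png}")).each do |path|
      dest = File.join(dest_dir, File.basename(path))
      # "768x768>" shrinks only images larger than the bounding box.
      system("magick", path, "-resize", "#{size}x#{size}>", dest)
    end
  end
end
```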
### Results Processing

`aggregate_results.rb`: Combines individual CSV results into master files.

```
./aggregate_results.rb
```
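Aggregating per-run CSVs needs nothing beyond the stdlib `CSV` class. This sketch assumes each result file carries the header documented under Technical Details; the `aggregate` method name and glob-based selection are illustrative:

```ruby
require 'csv'

# Sketch: merge per-run CSV files (assumed to share a header row)
# into one master CSV. The real aggregate_results.rb may differ.
HEADER = %w[filename model size prompt tags response_time success]

def aggregate(pattern, out_path)
  CSV.open(out_path, "w") do |out|
    out << HEADER
    Dir.glob(pattern).sort.each do |file|
      CSV.foreach(file, headers: true) do |row|
        out << HEADER.map { |col| row[col] }
      end
    end
  end
end
```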
### Photo Organization

- `rename_photos.rb`: Interactive photo renaming
- `auto_rename_photos.rb`: Automated renaming based on AI analysis
- `execute_rename.rb`: Applies rename operations
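As an illustration of the apply step, here is a hypothetical `execute_renames` helper that reads `old_name,new_name` pairs from a CSV plan and applies them with `File.rename`; the real `execute_rename.rb` may use a different plan format:

```ruby
require 'csv'

# Hypothetical sketch: apply rename operations from a CSV plan of
# old_name,new_name pairs. The real execute_rename.rb may differ.
def execute_renames(plan_csv, dir = ".")
  CSV.foreach(plan_csv) do |old_name, new_name|
    src = File.join(dir, old_name)
    dst = File.join(dir, new_name)
    next unless File.exist?(src)   # skip files that are already gone
    next if File.exist?(dst)       # never overwrite an existing file
    File.rename(src, dst)
  end
end
```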
## Development Workflow

1. **Prepare images**: Place original photos in `photos-original/` and run `resize_images.rb`
2. **Run evaluation**: Use `extract_tags.rb` with the desired models/prompts
3. **Aggregate results**: Run `aggregate_results.rb` to combine outputs
4. **Analyze**: Check the `results/` directory for CSV outputs and performance metrics
## Project Structure

```
├── photos-original/   # Source images
├── photo-768/         # Resized images (768px)
├── photo-1024/        # Resized images (1024px)
├── prompts/           # Prompt strategies (01-11)
├── results/           # Evaluation outputs (CSV)
├── expected-tags/     # Ground truth tags (optional)
└── *.rb               # Ruby scripts
```
## Technical Details

- **API**: Communicates with a local Ollama server at `localhost:11434`
- **Output Format**: CSV with columns `filename, model, size, prompt, tags, response_time, success`
- **No Dependencies**: Pure Ruby using the standard library only
- **Error Handling**: Continues on failures and logs errors in the results
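The request/response cycle can be sketched with `Net::HTTP` against Ollama's standard `/api/generate` endpoint (base64-encoded image, `stream: false`). The `tag_image` helper and its return shape are illustrative, not the script's actual code; errors are rescued and reported as `success = false`, matching the continue-on-failure behavior:

```ruby
require 'net/http'
require 'json'
require 'base64'

# Illustrative helper: send one image + prompt to a local Ollama server
# and return [tags, response_time, success].
OLLAMA_URI = URI("http://localhost:11434/api/generate")

def tag_image(model, image_path, prompt, timeout: 120)
  started = Time.now
  body = {
    model: model,
    prompt: prompt,
    images: [Base64.strict_encode64(File.binread(image_path))],
    stream: false                 # ask Ollama for a single JSON reply
  }
  http = Net::HTTP.new(OLLAMA_URI.host, OLLAMA_URI.port)
  http.read_timeout = timeout
  response = http.post(OLLAMA_URI.path, JSON.generate(body),
                       "Content-Type" => "application/json")
  tags = JSON.parse(response.body)["response"].to_s.strip
  [tags, Time.now - started, true]
rescue StandardError => e
  # Continue-on-failure: log the error and record an unsuccessful row.
  warn "#{image_path}: #{e.message}"
  [nil, Time.now - started, false]
end
```

Each returned triple maps onto the `tags`, `response_time`, and `success` columns of the output CSV.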
## Current Focus

Based on git history, the project has narrowed from broad testing to:

- **Models**: llava:7b, qwen2.5vl:7b, and minicpm-v:8b
- **Sizes**: 768px (optimal balance of quality and performance)
- **Prompts**: Simplified to 01, 03, and 05 (complex prompts removed)
- **Goal**: Optimal tag extraction for video diary search functionality
## Evaluation Priorities

When evaluating model performance, our priorities are, in order:

1. **People detection**: Human presence, emotions, expressions, moods, activities, and interactions
2. **Overall mood/atmosphere**: The feeling and emotional tone of scenes
3. **Objects**: Important items that provide context
4. **Scene details**: Colors, lighting, setting/location, time of day
5. **Camera perspective**: Identifying selfies and POV (first-person) shots
## Key Insights from Testing

- **Emotion focus**: We prioritize understanding how people feel over precisely counting them
- **Background matters**: Details like "bicycles in distance" enable memory-based searches
- **Simple prompts win**: Complex prompts cause repetition without adding value
- **Model strengths vary**:
  - Qwen2.5VL: Best for emotion keywords
  - MiniCPM-V: Best for comprehensive scene understanding
  - LLaVA:7b: Most reliable with minimal repetition