mirror of
https://github.com/zhayujie/chatgpt-on-wechat.git
synced 2026-03-03 17:05:04 +08:00
OpenAI Image Vision - Usage Examples
Setup
Set up your API credentials using the agent's env_config tool:
# Set your OpenAI API key
env_config(action="set", key="OPENAI_API_KEY", value="sk-your-api-key-here")
# Optional: Set custom API base URL (for proxy or compatible services)
env_config(action="set", key="OPENAI_API_BASE", value="https://api.openai.com/v1")
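Before running the examples, it can help to confirm that the credentials are visible to the shell. The following is a minimal sketch, assuming the values set through env_config end up as ordinary environment variables. The function name check_vision_env is hypothetical and not part of the skill, and the fallback to the public endpoint is an assumption about the default.

```shell
# Hypothetical helper (not part of the skill): verify the variables the
# skill expects are present before calling scripts/vision.sh.
check_vision_env() {
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set" >&2
    return 1
  fi
  # Assumption: the public endpoint is used when no custom base URL is set.
  echo "API base: ${OPENAI_API_BASE:-https://api.openai.com/v1}"
}
```

Calling check_vision_env returns a non-zero status when the key is missing, so it can guard a longer script.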
Example 1: Analyze a Local Image
bash scripts/vision.sh "/path/to/photo.jpg" "What's in this image?"
Expected Output:
{
"model": "gpt-4.1-mini",
"content": "The image shows a beautiful landscape with mountains in the background and a lake in the foreground. The sky is clear with some clouds, and there are trees along the shoreline.",
"usage": {
"prompt_tokens": 1234,
"completion_tokens": 45,
"total_tokens": 1279
}
}
Example 2: Analyze an Image from URL
bash scripts/vision.sh "https://example.com/image.jpg" "Describe this image in detail"
Example 3: Extract Text (OCR)
bash scripts/vision.sh "document.png" "Extract all text from this image"
Use Case: Extract text from screenshots, scanned documents, or photos of text.
Example 4: Identify Objects
bash scripts/vision.sh "scene.jpg" "List all objects you can identify in this image"
Example 5: Analyze Colors and Composition
bash scripts/vision.sh "artwork.jpg" "Describe the color palette and composition of this image"
Example 6: Count Items
bash scripts/vision.sh "crowd.jpg" "How many people are in this image?"
Example 7: Use Different Models
# Use gpt-4.1-mini (default, latest mini model)
bash scripts/vision.sh "image.jpg" "Analyze this" "gpt-4.1-mini"
# Use gpt-4.1 (most capable, best for complex analysis)
bash scripts/vision.sh "image.jpg" "Analyze this" "gpt-4.1"
# Use gpt-4o-mini (previous mini model)
bash scripts/vision.sh "image.jpg" "Analyze this" "gpt-4o-mini"
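To compare how the models above answer the same question, the three calls can be wrapped in a loop. This is only a sketch: compare_models and the VISION_CMD override are hypothetical names introduced here for illustration, and the model argument is passed in the third position as in the commands above.

```shell
# Assumption: VISION_CMD may be overridden (for example in tests);
# by default it runs the skill script from the current directory.
VISION_CMD="${VISION_CMD:-bash scripts/vision.sh}"

# Hypothetical helper: ask the same question of several models in turn.
compare_models() {
  image=$1
  question=$2
  shift 2
  for model in "$@"; do
    echo "== $model =="
    # Left unquoted on purpose so "bash scripts/vision.sh" splits into words.
    $VISION_CMD "$image" "$question" "$model"
  done
}
```

For example, `compare_models "image.jpg" "Analyze this" gpt-4.1-mini gpt-4.1` prints one labeled response per model.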
Example 8: Complex Analysis
bash scripts/vision.sh "product.jpg" "Analyze this product image. Describe the product, its features, colors, and suggest what kind of marketing copy would work well for it."
Example 9: Safety and Content Moderation
bash scripts/vision.sh "content.jpg" "Is there any inappropriate or unsafe content in this image?"
Example 10: Technical Analysis
bash scripts/vision.sh "diagram.png" "Explain what this technical diagram represents and how it works"
Integration with Agent
When the agent loads this skill, it appears in the <available_skills> section. The agent can then invoke it as follows:
bash "<base_dir>/scripts/vision.sh" "user_uploaded_image.jpg" "What's in this image?"
The <base_dir> will be automatically provided by the skill system.
Error Handling Examples
Missing API Key
$ bash scripts/vision.sh "image.jpg" "What is this?"
{"error": "OPENAI_API_KEY environment variable is not set", "help": "Visit https://platform.openai.com/api-keys to get an API key"}
File Not Found
$ bash scripts/vision.sh "nonexistent.jpg" "What is this?"
{"error": "Image file not found", "path": "nonexistent.jpg"}
Unsupported Format
$ bash scripts/vision.sh "file.bmp" "What is this?"
{"error": "Unsupported image format", "extension": "bmp", "supported": ["jpg", "jpeg", "png", "gif", "webp"]}
Missing Parameters
$ bash scripts/vision.sh
{"error": "Image path or URL is required", "usage": "bash vision.sh <image_path_or_url> <question> [model]"}
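Because failures are reported as JSON on standard output, a caller needs some way to tell an error object from a normal response. A minimal sketch follows, assuming that error objects always start with the "error" key, as in the samples above. The helper name is_error_response is hypothetical, and a real JSON parser would be more robust than this prefix check.

```shell
# Hypothetical helper: succeed when the response is a JSON object whose
# first key is "error". Whitespace is stripped first so pretty-printed
# output is handled the same way as single-line output.
is_error_response() {
  compact=$(printf '%s' "$1" | tr -d '[:space:]')
  case "$compact" in
    '{"error":'*) return 0 ;;
    *) return 1 ;;
  esac
}

response='{"error": "Image file not found", "path": "nonexistent.jpg"}'
if is_error_response "$response"; then
  echo "vision call failed: $response" >&2
fi
```

Checking only the first key avoids false positives when the word "error" appears inside a normal description of an image.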
Tips for Best Results
- Be Specific: Ask clear, specific questions about what you want to know
- Image Quality: Higher quality images generally produce better results
- Model Selection:
  - Use gpt-4.1 for complex analysis requiring the highest accuracy
  - Use gpt-4.1-mini (default) for most tasks; it is the latest mini model, with a good balance
- Text Extraction: For OCR tasks, ensure text is clearly visible and not too small
- Multiple Aspects: You can ask about multiple things in one question
- Context: Provide context in your question if needed (e.g., "This is a medical scan, what do you see?")
Performance Notes
- Local Files: Automatically base64-encoded, which adds ~33% to the request payload size
- URLs: Passed directly to API, no encoding overhead
- Timeout: 60 seconds for API calls
- Max Tokens: 1000 tokens for responses (configurable in script)
- Rate Limits: Subject to your OpenAI API plan
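The ~33% figure for local files follows from how base64 works: every 3 input bytes become 4 output characters, with the final group padded. As a rough sketch (the function name base64_size is hypothetical, and the estimate assumes unwrapped output with no line breaks):

```shell
# Estimate the base64-encoded size of a payload of N bytes:
# 4 output characters for every 3 input bytes, rounded up to a full group.
base64_size() {
  bytes=$1
  echo $(( (bytes + 2) / 3 * 4 ))
}

base64_size 3000000   # prints 4000000
```

A 3 MB photo therefore travels as roughly 4 MB of text, which is worth keeping in mind alongside the 60-second timeout.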
Supported Image Formats
✅ JPEG (.jpg, .jpeg)
✅ PNG (.png)
✅ GIF (.gif)
✅ WebP (.webp)
❌ BMP, TIFF, SVG, and other formats are not supported
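A caller can check the extension of a local file before invoking the script and avoid a wasted call. The sketch below mirrors the supported list above. The name is_supported_image is hypothetical, and treating extensions case-insensitively is an assumption that the script itself may not share. It is intended for local paths, not for URLs with query strings.

```shell
# Hypothetical pre-flight check based on the supported formats listed above.
is_supported_image() {
  ext="${1##*.}"
  # Assumption: lowercase the extension so "SCAN.PNG" is treated like "scan.png".
  ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    jpg|jpeg|png|gif|webp) return 0 ;;
    *) return 1 ;;
  esac
}

if ! is_supported_image "file.bmp"; then
  echo "file.bmp: convert to JPEG, PNG, GIF, or WebP first" >&2
fi
```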
Cost Considerations
Vision API calls cost more than text-only calls because they include image tokens. Costs vary by:
- Model used (gpt-4.1 vs gpt-4.1-mini)
- Image size and resolution
- Length of response
Check OpenAI's pricing page for current rates: https://openai.com/pricing