OpenAI Image Vision - Usage Examples

Setup

Set up your API credentials using the agent's env_config tool:

# Set your OpenAI API key
env_config(action="set", key="OPENAI_API_KEY", value="sk-your-api-key-here")

# Optional: Set custom API base URL (for proxy or compatible services)
env_config(action="set", key="OPENAI_API_BASE", value="https://api.openai.com/v1")
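
To run the examples below directly from a shell rather than through the agent, the same settings can presumably be supplied as environment variables. The script's own error message (see Error Handling Examples) refers to an OPENAI_API_KEY environment variable. That OPENAI_API_BASE is read from the environment in the same way is an assumption here. A minimal sketch:

```shell
# Assumption: vision.sh reads both variables from the environment.
export OPENAI_API_KEY="sk-your-api-key-here"

# Optional: only needed for a proxy or a compatible service.
export OPENAI_API_BASE="https://api.openai.com/v1"
```

Variables exported this way last only for the current shell session.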

Example 1: Analyze a Local Image

bash scripts/vision.sh "/path/to/photo.jpg" "What's in this image?"

Expected Output:

{
  "model": "gpt-4.1-mini",
  "content": "The image shows a beautiful landscape with mountains in the background and a lake in the foreground. The sky is clear with some clouds, and there are trees along the shoreline.",
  "usage": {
    "prompt_tokens": 1234,
    "completion_tokens": 45,
    "total_tokens": 1279
  }
}
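
The usage block in this output reports token counts, which is useful for tracking cost. The sketch below pulls total_tokens out of a response using only sed. It assumes the response has the shape shown above. The sample is pasted inline with the content shortened. A dedicated JSON parser would be more robust if one is available.

```shell
# Sample response in the shape shown above (content shortened).
response='{
  "model": "gpt-4.1-mini",
  "content": "The image shows a beautiful landscape.",
  "usage": {
    "prompt_tokens": 1234,
    "completion_tokens": 45,
    "total_tokens": 1279
  }
}'

# Print only the digits that follow the "total_tokens" key.
total_tokens=$(printf '%s\n' "$response" | sed -n 's/.*"total_tokens": *\([0-9][0-9]*\).*/\1/p')
echo "total tokens: $total_tokens"
```

In real use, the response variable would hold the captured output of the vision.sh call instead of the pasted sample.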

Example 2: Analyze an Image from URL

bash scripts/vision.sh "https://example.com/image.jpg" "Describe this image in detail"

Example 3: Extract Text (OCR)

bash scripts/vision.sh "document.png" "Extract all text from this image"

Use Case: Extract text from screenshots, scanned documents, or photos of text.

Example 4: Identify Objects

bash scripts/vision.sh "scene.jpg" "List all objects you can identify in this image"

Example 5: Analyze Colors and Composition

bash scripts/vision.sh "artwork.jpg" "Describe the color palette and composition of this image"

Example 6: Count Items

bash scripts/vision.sh "crowd.jpg" "How many people are in this image?"

Example 7: Use Different Models

# Use gpt-4.1-mini (default, latest mini model)
bash scripts/vision.sh "image.jpg" "Analyze this" "gpt-4.1-mini"

# Use gpt-4.1 (most capable, best for complex analysis)
bash scripts/vision.sh "image.jpg" "Analyze this" "gpt-4.1"

# Use gpt-4o-mini (previous mini model)
bash scripts/vision.sh "image.jpg" "Analyze this" "gpt-4o-mini"

Example 8: Complex Analysis

bash scripts/vision.sh "product.jpg" "Analyze this product image. Describe the product, its features, colors, and suggest what kind of marketing copy would work well for it."

Example 9: Safety and Content Moderation

bash scripts/vision.sh "content.jpg" "Is there any inappropriate or unsafe content in this image?"

Example 10: Technical Analysis

bash scripts/vision.sh "diagram.png" "Explain what this technical diagram represents and how it works"

Integration with Agent

Once the agent loads this skill, it appears in the <available_skills> section. The agent can then invoke it like this:

bash "<base_dir>/scripts/vision.sh" "user_uploaded_image.jpg" "What's in this image?"

The skill system supplies <base_dir> automatically.

Error Handling Examples

Missing API Key

$ bash scripts/vision.sh "image.jpg" "What is this?"
{"error": "OPENAI_API_KEY environment variable is not set", "help": "Visit https://platform.openai.com/api-keys to get an API key"}

File Not Found

$ bash scripts/vision.sh "nonexistent.jpg" "What is this?"
{"error": "Image file not found", "path": "nonexistent.jpg"}

Unsupported Format

$ bash scripts/vision.sh "file.bmp" "What is this?"
{"error": "Unsupported image format", "extension": "bmp", "supported": ["jpg", "jpeg", "png", "gif", "webp"]}

Missing Parameters

$ bash scripts/vision.sh
{"error": "Image path or URL is required", "usage": "bash vision.sh <image_path_or_url> <question> [model]"}
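
Since every failure above is reported as a JSON object whose first key is "error", a caller can branch on the output text. The helper below is a hypothetical sketch, not part of the skill. It assumes error responses always begin with {"error", as in the samples above. Whether the script also returns a non-zero exit status is not stated here, so the sketch does not rely on it.

```shell
# Hypothetical helper: succeeds when the output looks like an error object.
# Assumption: error responses always start with {"error".
is_vision_error() {
  case "$1" in
    '{"error"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample output copied from the "File Not Found" case.
output='{"error": "Image file not found", "path": "nonexistent.jpg"}'
if is_vision_error "$output"; then
  echo "vision.sh failed: $output" >&2
fi
```

In practice, output would be the captured result of a vision.sh call rather than a pasted string.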

Tips for Best Results

  1. Be Specific: Ask clear, specific questions about what you want to know
  2. Image Quality: Higher quality images generally produce better results
  3. Model Selection:
    • Use gpt-4.1 for complex analysis that requires the highest accuracy
    • Use gpt-4.1-mini (default) for most tasks; it offers a good balance of cost and quality
  4. Text Extraction: For OCR tasks, ensure text is clearly visible and not too small
  5. Multiple Aspects: You can ask about multiple things in one question
  6. Context: Provide context in your question if needed (e.g., "This is a medical scan, what do you see?")

Performance Notes

  • Local Files: Automatically base64-encoded, which adds ~33% size overhead
  • URLs: Passed directly to API, no encoding overhead
  • Timeout: 60 seconds for API calls
  • Max Tokens: 1000 tokens for responses (configurable in script)
  • Rate Limits: Subject to your OpenAI API plan
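
The ~33% figure for local files follows from base64 itself: every 3 input bytes become 4 output characters. The sketch below illustrates the ratio on random sample data. It shows the general encoding overhead, not necessarily the exact commands the script uses.

```shell
# Create 3000 bytes of sample data as a stand-in for an image file.
sample=$(mktemp)
head -c 3000 /dev/urandom > "$sample"

raw_bytes=$(( $(wc -c < "$sample") ))
# Strip line wraps so only the encoded characters are counted.
encoded_bytes=$(( $(base64 < "$sample" | tr -d '\n' | wc -c) ))

# Prints: raw: 3000 bytes, base64: 4000 bytes
echo "raw: $raw_bytes bytes, base64: $encoded_bytes bytes"
rm -f "$sample"
```

For large local images, this overhead increases the request size accordingly. Passing a URL avoids it.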

Supported Image Formats

  • JPEG (.jpg, .jpeg)
  • PNG (.png)
  • GIF (.gif)
  • WebP (.webp)

BMP, TIFF, SVG, and other formats are not supported.
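
When processing many local files, it can save an API round trip to check the extension before calling the script. The helper below is a hypothetical sketch based only on the list above. Case-insensitive matching is an assumption, and the script's own check may behave differently.

```shell
# Hypothetical helper: succeeds if the file extension is in the supported list.
# Assumption: matching is case-insensitive; the script's own check may differ.
is_supported_image() {
  ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$ext" in
    jpg|jpeg|png|gif|webp) return 0 ;;
    *) return 1 ;;
  esac
}

for f in photo.JPG diagram.png file.bmp; do
  if is_supported_image "$f"; then
    echo "$f: supported"
  else
    echo "$f: not supported"
  fi
done
```

The check looks only at the file name, not at the file contents, so a misnamed file would still pass.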

Cost Considerations

Vision API calls cost more than text-only calls because they include image tokens. Costs vary by:

  • Model used (gpt-4.1 vs gpt-4.1-mini)
  • Image size and resolution
  • Length of response

Check OpenAI's pricing page for current rates: https://openai.com/pricing