shen/chatgpt-on-wechat

mirror of https://github.com/zhayujie/chatgpt-on-wechat.git synced 2026-03-03 17:05:04 +08:00

zhayujie a8d5309c90 feat: add skills and upgrade feishu/dingtalk channel

2026-02-02 00:42:39 +08:00

4.9 KiB

OpenAI Image Vision - Usage Examples

Setup

Set up your API credentials using the agent's env_config tool:

# Set your OpenAI API key
env_config(action="set", key="OPENAI_API_KEY", value="sk-your-api-key-here")

# Optional: Set custom API base URL (for proxy or compatible services)
env_config(action="set", key="OPENAI_API_BASE", value="https://api.openai.com/v1")
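When calling the script from a plain shell, it can help to confirm first that the credentials are visible in the environment. The following is a minimal sketch under that assumption; `check_openai_env` is a hypothetical helper, not part of the skill, and whether values set through env_config reach your shell depends on how the agent exports them.

```shell
# Hypothetical pre-flight check; not part of the skill itself.
check_openai_env() {
  if [ -z "${OPENAI_API_KEY:-}" ]; then
    echo "OPENAI_API_KEY is not set" >&2
    return 1
  fi
  # Fall back to the public endpoint shown above when no custom base URL is set.
  echo "${OPENAI_API_BASE:-https://api.openai.com/v1}"
}
```

On success the function prints the base URL that would be used; on failure it returns a non-zero status, so it can guard a call such as `check_openai_env >/dev/null && bash scripts/vision.sh ...`.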

Example 1: Analyze a Local Image

bash scripts/vision.sh "/path/to/photo.jpg" "What's in this image?"

Expected Output:

{
  "model": "gpt-4.1-mini",
  "content": "The image shows a beautiful landscape with mountains in the background and a lake in the foreground. The sky is clear with some clouds, and there are trees along the shoreline.",
  "usage": {
    "prompt_tokens": 1234,
    "completion_tokens": 45,
    "total_tokens": 1279
  }
}
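To pick a single number out of this output in a shell pipeline, a rough sketch like the one below may be enough. `json_number` is a hypothetical helper that relies on the `"key": number` layout of the sample above; a real JSON parser such as jq is more robust if it is installed.

```shell
# Rough extraction of a numeric field from JSON laid out like the sample above.
# Assumes each "key": number pair sits on its own line; not a general JSON parser.
json_number() {
  # $1 = key name; JSON is read from stdin
  sed -n "s/.*\"$1\": *\([0-9][0-9]*\).*/\1/p" | head -n 1
}

# Trimmed copy of the sample output, used only for illustration.
sample='{
  "model": "gpt-4.1-mini",
  "usage": {
    "prompt_tokens": 1234,
    "completion_tokens": 45,
    "total_tokens": 1279
  }
}'

printf '%s\n' "$sample" | json_number total_tokens   # prints 1279
```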

Example 2: Analyze an Image from URL

bash scripts/vision.sh "https://example.com/image.jpg" "Describe this image in detail"

Example 3: Extract Text (OCR)

bash scripts/vision.sh "document.png" "Extract all text from this image"

Use Case: Extract text from screenshots, scanned documents, or photos of text.

Example 4: Identify Objects

bash scripts/vision.sh "scene.jpg" "List all objects you can identify in this image"

Example 5: Analyze Colors and Composition

bash scripts/vision.sh "artwork.jpg" "Describe the color palette and composition of this image"

Example 6: Count Items

bash scripts/vision.sh "crowd.jpg" "How many people are in this image?"

Example 7: Use Different Models

# Use gpt-4.1-mini (default, latest mini model)
bash scripts/vision.sh "image.jpg" "Analyze this" "gpt-4.1-mini"

# Use gpt-4.1 (most capable, best for complex analysis)
bash scripts/vision.sh "image.jpg" "Analyze this" "gpt-4.1"

# Use gpt-4o-mini (previous mini model)
bash scripts/vision.sh "image.jpg" "Analyze this" "gpt-4o-mini"
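If the model choice should depend on the kind of task, the third argument can be computed rather than hard-coded. The sketch below assumes a hypothetical `pick_model` helper and made-up task labels (`complex`, `legacy`); only the model names come from the examples above.

```shell
# Hypothetical helper: map a task label to one of the models listed above.
pick_model() {
  case "$1" in
    complex) echo "gpt-4.1" ;;        # most capable
    legacy)  echo "gpt-4o-mini" ;;    # previous mini model
    *)       echo "gpt-4.1-mini" ;;   # documented default
  esac
}

# Usage (requires the skill's script and a valid API key):
# bash scripts/vision.sh "image.jpg" "Analyze this" "$(pick_model complex)"
```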

Example 8: Complex Analysis

bash scripts/vision.sh "product.jpg" "Analyze this product image. Describe the product, its features, colors, and suggest what kind of marketing copy would work well for it."

Example 9: Safety and Content Moderation

bash scripts/vision.sh "content.jpg" "Is there any inappropriate or unsafe content in this image?"

Example 10: Technical Analysis

bash scripts/vision.sh "diagram.png" "Explain what this technical diagram represents and how it works"

Integration with Agent

Once the agent loads this skill, it appears in the <available_skills> section. The agent can then invoke it like this:

bash "<base_dir>/scripts/vision.sh" "user_uploaded_image.jpg" "What's in this image?"

The skill system fills in <base_dir> automatically.

Error Handling Examples

Missing API Key

$ bash scripts/vision.sh "image.jpg" "What is this?"
{"error": "OPENAI_API_KEY environment variable is not set", "help": "Visit https://platform.openai.com/api-keys to get an API key"}

File Not Found

$ bash scripts/vision.sh "nonexistent.jpg" "What is this?"
{"error": "Image file not found", "path": "nonexistent.jpg"}

Unsupported Format

$ bash scripts/vision.sh "file.bmp" "What is this?"
{"error": "Unsupported image format", "extension": "bmp", "supported": ["jpg", "jpeg", "png", "gif", "webp"]}

Missing Parameters

$ bash scripts/vision.sh
{"error": "Image path or URL is required", "usage": "bash vision.sh <image_path_or_url> <question> [model]"}
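A caller can branch on these error objects before using the result. The sketch below assumes that every failure is reported as a single-line JSON object starting with an "error" key, as in the four samples above; `is_error_response` is a hypothetical helper, not something the script provides.

```shell
# Returns 0 when the output starts with an "error" key, as in the samples above.
# Assumption: successful responses never begin with {"error".
is_error_response() {
  case "$1" in
    '{"error"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Sample taken from the "File Not Found" case above.
output='{"error": "Image file not found", "path": "nonexistent.jpg"}'
if is_error_response "$output"; then
  echo "vision call failed: $output" >&2
fi
```

Matching on the leading key rather than searching for the word "error" anywhere avoids false positives when a successful description happens to mention an error shown in the image.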

Tips for Best Results

- Be Specific: Ask clear, specific questions about what you want to know.
- Image Quality: Higher-quality images generally produce better results.
- Model Selection:
  - Use gpt-4.1 for complex analysis that requires the highest accuracy.
  - Use gpt-4.1-mini (default) for most tasks; it is the latest mini model and offers a good balance.
- Text Extraction: For OCR tasks, make sure the text is clearly visible and not too small.
- Multiple Aspects: You can ask about several things in one question.
- Context: Provide context in your question if needed (e.g., "This is a medical scan, what do you see?").

Performance Notes

- Local Files: Base64-encoded automatically, which adds about 33% to the payload size.
- URLs: Passed directly to the API with no encoding overhead.
- Timeout: API calls time out after 60 seconds.
- Max Tokens: Responses are limited to 1000 tokens (configurable in the script).
- Rate Limits: Subject to your OpenAI API plan.
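The roughly 33% figure follows from base64 mapping every 3 input bytes to 4 output characters. The sketch below works through that arithmetic and shows one way a caller might tell a URL from a local path; `base64_size` and `is_url` are hypothetical helpers, and the script's own detection logic may differ.

```shell
# Size in bytes of the padded base64 encoding (no line breaks) of $1 input bytes:
# every group of up to 3 bytes becomes 4 characters.
base64_size() {
  echo $(( ($1 + 2) / 3 * 4 ))
}

# True when the argument looks like an http(s) URL rather than a local path.
is_url() {
  case "$1" in
    http://*|https://*) return 0 ;;
    *) return 1 ;;
  esac
}

base64_size 3000000   # prints 4000000, about 33% larger than the input
```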

Supported Image Formats

✅ JPEG (.jpg, .jpeg)
✅ PNG (.png)
✅ GIF (.gif)
✅ WebP (.webp)

❌ BMP, TIFF, SVG, and other formats are not supported
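To avoid a wasted call, the extension can be checked before invoking the script. The sketch below mirrors the supported list above; `is_supported_format` is a hypothetical helper, and treating extensions case-insensitively is an assumption, since the script's own check may be stricter.

```shell
# Hypothetical pre-check mirroring the supported list above.
is_supported_format() {
  # Reject names without any dot, since they have no extension.
  case "$1" in
    *.*) ;;
    *) return 1 ;;
  esac
  # Take the text after the last dot and lowercase it (assumed case-insensitive).
  _ext=$(printf '%s' "${1##*.}" | tr '[:upper:]' '[:lower:]')
  case "$_ext" in
    jpg|jpeg|png|gif|webp) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `is_supported_format "file.bmp" || echo "convert to PNG first"` would print the hint, while `photo.JPG` passes the check.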

Cost Considerations

Vision API calls cost more than text-only calls because they include image tokens. Costs vary by:

- Model used (gpt-4.1 vs gpt-4.1-mini)
- Image size and resolution
- Length of response

Check OpenAI's pricing page for current rates: https://openai.com/pricing
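Given the token counts from a response's "usage" block, a per-call estimate is simple arithmetic. The sketch below assumes per-million-token pricing and that image tokens are counted within prompt_tokens; `estimate_cost` is a hypothetical helper, and the rates in the example call are placeholders, not actual prices.

```shell
# Estimate the cost in USD of one call from its "usage" block.
# Rates are USD per 1M tokens and must be taken from the pricing page.
estimate_cost() {
  # $1 prompt tokens, $2 completion tokens, $3 input rate, $4 output rate
  awk -v p="$1" -v c="$2" -v rin="$3" -v rout="$4" \
    'BEGIN { printf "%.6f\n", (p * rin + c * rout) / 1000000 }'
}

# Token counts from the sample output in Example 1; the rates are PLACEHOLDERS.
estimate_cost 1234 45 1.00 2.00
```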
