Batch AI Image Generation on Mac: Lessons Learned from Generating 10,000+ Illustrations

Batch AI image generation: lessons from local Mac generation to switching to cloud platforms
The author needed to batch-generate 10,000+ illustrations for a children's vocabulary app, going through the complete journey from local Mac mini generation using Draw Things to ultimately switching to Replicate's cloud platform. The article details prompt engineering iteration principles (simple to complex, concise negative prompts, weight syntax compatibility), common pitfalls (text appearing in images, bald characters, high-risk word replacement), and concludes that the local approach's slow speed and high resource consumption make cloud platforms more cost-effective.
Project Background: Why Batch AI Image Generation?
A children's vocabulary learning mini-app needed illustrations for over 10,000 words—cartoon-style, text-free images whose content relates to each word's meaning to aid memorization. This seemingly simple requirement turned out to be full of challenges in practice: the sheer volume of images, the strict no-text requirement, and the need for meaningful, stylistically consistent visuals.
The author chose a Mac mini local offline generation approach, using the Draw Things app combined with scripts for batch generation. Draw Things is a local AI image generation application optimized specifically for Apple Silicon (M-series chips), supporting local inference with Stable Diffusion series models. Unlike cloud APIs, local generation leverages Mac's Unified Memory Architecture, allowing the GPU to directly access model weights in system memory without data transfer overhead. Apple's M-series chips accelerate inference through the Metal Performance Shaders (MPS) framework—while not matching NVIDIA's CUDA ecosystem in performance, it's practical enough for lightweight models. Draw Things includes a built-in JS sandbox environment that allows users to write automation scripts calling the image generation pipeline, making batch generation possible.
The entire journey went from high confidence to ultimately "abandoning" the local approach in favor of a cloud platform, accumulating extensive practical experience along the way.
Complete Workflow: Four Steps
Step 1: Prepare Content
Taking Spanish vocabulary learning as an example, each word needs: the original word, translation, example sentence, and most critically—a scene description. For instance, a word meaning "name" might have the scene description "a person pointing at a name tag with a curious expression." This scene description becomes the core input for subsequent image generation.
Step 2: Generate Prompts
The scene description is only part of the prompt. A complete prompt needs to include:
- Scene description: Specific actions and objects (different for each word)
- Style specification: Illustration style, color tone definition
- Composition requirements: Subject occupying 80% of frame, pure white background, etc.
Step 3: Assemble and Optimize Prompts
Combine the universal style template with each word's scene description to form the final complete prompt.
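The assembly step can be sketched in a few lines of Python. This is a minimal illustration, not the author's actual script: the `WordEntry` type, the `build_prompt` helper, the Spanish example sentence, and the exact wording of the two template constants are assumptions pieced together from phrases quoted in this article.

```python
from dataclasses import dataclass

# Hypothetical shared template, assembled from phrases quoted in the article.
STYLE = "cartoon style, rounded corners, flat design, slight 3D effect"
COMPOSITION = "subject occupying 80% of frame, pure white background"


@dataclass(frozen=True)
class WordEntry:
    word: str          # original word
    translation: str   # translation shown to the learner
    example: str       # example sentence
    scene: str         # scene description, the only part that varies per image


def build_prompt(entry: WordEntry) -> str:
    """Join the per-word scene with the shared style and composition modules."""
    parts = [entry.scene.strip().rstrip("."), STYLE, COMPOSITION]
    return ", ".join(parts)


entry = WordEntry(
    word="nombre",
    translation="name",
    example="Mi nombre es Ana.",  # illustrative sentence, not from the app
    scene="a person pointing at a name tag with a curious expression",
)
print(build_prompt(entry))
```

Keeping the style and composition text in one place means a template change is applied to every word at once, while the scene description stays the only per-word input.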
Step 4: Batch Generation
Write scripts in Draw Things' JS sandbox environment to iterate through the prompt array and call the pipeline method for batch generation.
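The author's script runs as JavaScript inside Draw Things and is not reproduced here. The shape of such a loop can still be sketched in Python, with the actual image generation left as a stand-in: `generate` below is a hypothetical callable (prompt in, image bytes out), and `run_batch` is an illustrative name. The sketch skips files that already exist so that an interrupted multi-day run can be resumed.

```python
from pathlib import Path
from typing import Callable, Iterable, Tuple


def run_batch(
    jobs: Iterable[Tuple[str, str]],       # (word, prompt) pairs
    generate: Callable[[str], bytes],      # hypothetical: prompt -> image bytes
    out_dir: Path,
) -> Tuple[int, int]:
    """Generate one image per job, skipping outputs that already exist.

    Returns (generated, skipped) so a restarted run reports what it reused.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    generated = skipped = 0
    for word, prompt in jobs:
        target = out_dir / f"{word}.png"
        if target.exists():
            skipped += 1       # produced by an earlier run, do not regenerate
            continue
        target.write_bytes(generate(prompt))
        generated += 1
    return generated, skipped
```

Running the same job list twice generates nothing the second time, which matters when each image costs 30 seconds or more.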
Prompt Engineering: The Iterative Approach from Simple to Complex
Core Principle: Small Iterative Steps
The author's most important takeaway: start with a few prompt elements and add more gradually, rather than writing a massive prompt all at once. An AI assistant asked to draft the prompt will often produce something long and bloated, which makes later maintenance and modification costly.
The essence of prompt engineering is manipulating the embedding space of a text encoder such as CLIP. CLIP tokenizes the text and maps each token to a 768- or 1024-dimensional vector, depending on the variant; these vectors steer the diffusion model's denoising through cross-attention. Every token's position, weight, and semantic association in the prompt affects the final image. This helps explain why "less is more": with too many tokens, attention is spread thin and each individual description has less influence.
The actual iteration process:
- Version 1: Just one sentence—"cartoon style, rounded corners, flat design, slight 3D effect"—already basically met requirements
- Versions 2-9: Gradually adjusted details
- Final version: Clear structure, divided into three modules: scene description, style specification, and composition requirements
The key principle: solve problems as they appear, rather than preemptively adding constraints you might never need.
The Negative Prompt Trap
A major lesson learned: Writing too many negative prompts causes dilution. To prevent text from appearing in images, the author wrote thirty to forty words of negative prompts (no text, no letter, no word, etc.), which paradoxically made them ineffective. The recommendation is to keep it down to a few core terms.
The technical principle behind this relates to the Classifier-Free Guidance (CFG) mechanism: the model simultaneously computes conditional and unconditional (or negatively conditioned) noise predictions, then enhances guidance in the direction away from the negative prompt. When negative prompts contain too many tokens, each token's weight in the cross-attention computation gets diluted, causing the actual suppression effectiveness of individual constraints (like "no text") to drop significantly. It's like giving the model dozens of "don't do this" instructions simultaneously—the model ends up not knowing what to prioritize avoiding.
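The guidance step itself is a one-line formula. The sketch below uses the common formulation in Stable Diffusion pipelines, where the negative prompt takes the place of the empty unconditional prompt; the vectors are toy numbers rather than real noise predictions, and `cfg_combine` is an illustrative name.

```python
from typing import List


def cfg_combine(eps_pos: List[float], eps_neg: List[float], scale: float) -> List[float]:
    """Classifier-free guidance: start from the negative-prompt prediction and
    move `scale` times the difference toward the positive-prompt prediction."""
    return [n + scale * (p - n) for p, n in zip(eps_pos, eps_neg)]


eps_pos = [0.5, -0.25]   # toy noise prediction conditioned on the positive prompt
eps_neg = [0.25, 0.75]   # toy noise prediction conditioned on the negative prompt

print(cfg_combine(eps_pos, eps_neg, 7.0))  # pushed past eps_pos, away from eps_neg
print(cfg_combine(eps_pos, eps_neg, 1.0))  # equals eps_pos: the negative term cancels
```

Note that at a scale of 1.0 the negative prediction drops out entirely, so under this formulation a negative prompt has no effect at all, no matter how carefully it is written.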
Weight Syntax Compatibility Issues
Different models support different weight syntax. The SD Image Turbo model the author used doesn't support the "(xxx:1.5)" weighting syntax: including the numbers causes the model to render them into the image. Other models and front ends may behave differently, so always verify that your toolchain actually parses the syntax before relying on it.
In the Stable Diffusion ecosystem, the "(keyword:1.5)" weighting syntax originated in the Automatic1111 WebUI, which scales the text-encoder embeddings of the marked tokens by the given coefficient before they reach cross-attention, strengthening or weakening their influence. This isn't a capability of the model itself; it's a preprocessing feature of the inference front end, and different backends (ComfyUI, Draw Things, Replicate, etc.) parse the syntax differently or not at all. When nothing parses it, the parentheses and numbers reach the text encoder as literal text, which would explain digits showing up in the image. Turbo-series models run with very few steps and a CFG scale typically set to 1.0, so even where the syntax is parsed its effect is greatly reduced or anomalous. GPT Image uses an entirely different architecture: it interprets prompts as natural language and doesn't support special syntax markers, but it follows natural-language descriptions more precisely.
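When a prompt library already contains weighted terms and the target model or backend does not parse them, one option is to strip the syntax before sending the prompt. The helper below is a minimal sketch under that assumption; `strip_weights` is an illustrative name, and the pattern handles only the simple, non-nested "(keyword:1.5)" form.

```python
import re

# Matches A1111-style "(some words:1.5)"; nested parentheses are not handled.
_WEIGHTED = re.compile(r"\(([^():]+):\s*\d*\.?\d+\s*\)")


def strip_weights(prompt: str) -> str:
    """Replace "(keyword:1.5)" with "keyword" so no stray numbers reach the model."""
    return _WEIGHTED.sub(lambda m: m.group(1).strip(), prompt)


print(strip_weights("a girl waving, (flat design:1.3), (white background:1.5)"))
```

Plain parentheses without a weight are left untouched, so ordinary punctuation in a scene description survives.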
Pitfall Log: Those Absurd Bad Cases
Text Appearing in Images
This was the biggest pain point. Beyond streamlining negative prompts, the solution also required avoiding words that trigger text rendering in positive prompts, such as book, label, sign, etc. Diffusion models encounter massive amounts of text-containing images in training data (book covers, road signs, labels, etc.), and these words activate feature channels related to text rendering in the model. Even with "no text" in negative prompts, positive prompt guidance is typically stronger.
Bald Character Problem
Generated children's illustrations frequently featured bald babies/toddlers, requiring explicit emphasis on "no bald" in prompts. This likely relates to the distribution of cartoon/illustration-style children's images in training data—many simple drawings and early-childhood illustrations do tend to use bald heads to simplify character design.
High-Risk Word Replacement Strategy
For example, the word "phone"—directly describing it causes the model to render a phone screen with text on it. The solution is to change the description to "a person holding a phone to their ear" to avoid rendering the phone screen head-on. The core idea behind this strategy: change composition and perspective to avoid scenarios where the model easily generates text, rather than trying to forcibly prohibit it.
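With 10,000+ scene descriptions, risky wording is easier to catch with a script than by eye. The sketch below flags descriptions that contain the trigger words named in this article so they can be rewritten by hand; the `find_risky_words` helper and the simple plural handling are assumptions, not the author's tooling.

```python
import re
from typing import List

# Trigger words named in the article; extend the list as new bad cases appear.
RISKY_WORDS = ["book", "label", "sign", "phone"]


def find_risky_words(scene: str) -> List[str]:
    """Return trigger words that appear as whole words (singular or plural)."""
    return [
        w for w in RISKY_WORDS
        if re.search(rf"\b{re.escape(w)}s?\b", scene, re.IGNORECASE)
    ]


print(find_risky_words("a boy showing his phone next to a road sign"))
print(find_risky_words("a designer sketching at a desk"))
```

Whole-word matching keeps "designer" from being flagged for "sign". A flagged description is only a candidate for review: "a person holding a phone to their ear" still contains "phone" but is the safe rewrite described above.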
Don't Switch Models or Dimensions Mid-Project
The same prompt produces noticeably different results across SD Turbo, Flux, GPT Image, MiniMax, and other models. Once you've committed to a model, don't switch mid-project—otherwise previously tuned prompts may completely fail. This is because different models use different text encoders (CLIP ViT-L, T5-XXL, proprietary encoders, etc.), with completely different semantic understanding and feature mapping of the same text. Combined with differences in training data distribution across models, the same prompt may activate entirely different visual concepts in different models.
Local Generation: Performance Optimization and Limitations
Resource Allocation Strategy
On a Mac mini, local generation takes 30 seconds to 2 minutes per image. The author identified several factors affecting performance:
- Daytime is faster than late night: Because browsers, development tools, etc. consume GPU resources at night
- Screen resolution affects GPU: High-resolution display itself consumes GPU resources
- Thermal issues: Extended operation causes thermal throttling
Mac's Unified Memory Architecture means CPU, GPU, and Neural Engine share the same physical memory pool. macOS's WindowServer process handles all screen rendering, and high-resolution external displays (like 4K/5K) continuously occupy GPU rendering pipelines and memory bandwidth. While the Metal framework supports GPU task priority scheduling, resource competition between AI inference tasks and system UI rendering is unavoidable. Additionally, the M-series chip's power limit design means sustained high loads trigger thermal throttling, reducing GPU frequency to control temperature. The Mac mini's limited cooling surface area makes this problem particularly pronounced—sustained full-load operation can cause 20-30% performance degradation.
Recommendation: When batch generating, close all unnecessary software, IDEs, and video players, reduce screen resolution, and let the machine dedicate all resources to image generation.
Why the Local Approach Was Ultimately Abandoned
The core problems with the local approach: slow speed, machine resource occupation, and environment maintenance. For a 10,000+ image requirement, even at 1 minute per image, generation would take about a week of continuous operation (10,000 minutes is roughly 167 hours). More critically, the development machine would be virtually unusable for other work during this time, so the opportunity cost far exceeds cloud platform fees.
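The runtime estimate is simple arithmetic and easy to redo for other per-image times. The sketch below covers the 30-second to 2-minute range quoted earlier; `days_of_runtime` is an illustrative helper and assumes images are generated strictly one after another with no pauses.

```python
IMAGES = 10_000


def days_of_runtime(seconds_per_image: float, images: int = IMAGES) -> float:
    """Continuous wall-clock days needed to generate `images` sequentially."""
    return images * seconds_per_image / 86_400  # 86,400 seconds per day


for secs in (30, 60, 120):
    print(f"{secs:>3} s/image -> {days_of_runtime(secs):.1f} days")
```

At 60 seconds per image, 10,000 images come to just under seven days; at the slow end of the range the same batch takes almost two weeks.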
Cloud Platform: The Better Choice
The author ultimately switched to the Replicate platform, using the same SD Image Turbo open-source model:
- Extremely low cost: 1024×1024 images at just $0.0025/image, 512×512 even cheaper
- Fast: ~3 seconds per image (faster with caching)
- Zero maintenance: No local resource consumption
Replicate is a Serverless GPU inference platform that deploys open-source models in containers. Its pricing model is based on actual GPU compute time (billed per second) rather than per request. When a model container is in a "cold start" state, the first request needs to load model weights into GPU memory (typically 5-15 seconds); subsequent requests can directly reuse the loaded model (warm start, ~1-3 seconds). The $0.0025/image price corresponds to approximately 2-3 seconds of compute time on an NVIDIA A40 or similar GPU. Compared to self-hosted GPU servers, the Serverless approach eliminates operational overhead and supports auto-scaling, making it suitable for bursty batch workloads. Similar platforms include RunPod, Modal, Banana, etc., each with different pricing strategies and model ecosystems.
For illustration-style use cases where fine detail requirements are modest, a 6B parameter open-source model is more than sufficient. By reducing image dimensions and choosing open-source models, costs can be further compressed. For 10,000 images, the total cost is approximately $25—far less than the time cost and electricity of the local approach.
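The cloud figures can be checked the same way. The sketch below plugs in the per-image price and the approximate 3-second latency quoted in this article; `cloud_estimate` is an illustrative helper, and the time figure assumes requests are sent one at a time with no cold starts.

```python
from typing import Tuple

PRICE_PER_IMAGE_USD = 0.0025   # 1024x1024 price quoted in the article
SECONDS_PER_IMAGE = 3          # approximate per-image latency quoted in the article


def cloud_estimate(images: int) -> Tuple[float, float]:
    """Return (total cost in USD, sequential hours) for a batch of `images`."""
    cost = images * PRICE_PER_IMAGE_USD
    hours = images * SECONDS_PER_IMAGE / 3600
    return cost, hours


cost, hours = cloud_estimate(10_000)
print(f"${cost:.2f}, about {hours:.1f} hours if requests run one at a time")
```

Even without parallel requests, the batch finishes in a working day instead of a week of continuous local generation.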
Summary and Recommendations
Scenarios suited for local generation: Having a high-performance GPU, extreme cost sensitivity, truly massive image quantities (where cost differences become significant).
Scenarios suited for cloud platforms: Limited image quantities (cost-manageable under a few thousand), no desire to maintain local environments, prioritizing generation speed.
General recommendations:
- Iterate prompts in small steps, starting simple
- Keep negative prompts concise to avoid dilution
- Manual tuning → small batch validation → large-scale production
- Don't switch models mid-project once decided
- Avoid high-risk words that trigger text rendering