Batch AI Image Generation on Mac: Lessons Learned from Generating 10,000+ Illustrations

Project Background: Why Batch AI Image Generation?

A children's vocabulary learning mini-app needed illustrations for over 10,000 words—cartoon-style, text-free images whose content relates to each word's meaning to aid memorization. This seemingly simple requirement turned out to be full of challenges in practice: the sheer volume of images, the strict no-text requirement, and the need for meaningful, stylistically consistent visuals.

The author chose to generate the images locally and offline on a Mac mini, using the Draw Things app combined with scripts for batch generation. Draw Things is a local AI image generation application optimized for Apple Silicon (M-series chips) that supports local inference with Stable Diffusion series models. Unlike cloud APIs, local generation takes advantage of the Mac's Unified Memory Architecture: the GPU reads model weights directly from system memory, with no copy into separate video memory. Apple's M-series chips accelerate inference through the Metal Performance Shaders (MPS) framework. Performance does not match NVIDIA's CUDA ecosystem, but it is practical enough for lightweight models. Draw Things also includes a built-in JS sandbox in which users can write automation scripts that call the image generation pipeline, which is what makes batch generation possible.

The entire journey went from high confidence to ultimately "abandoning" the local approach in favor of a cloud platform, accumulating extensive practical experience along the way.

Complete Workflow: Four Steps

Step 1: Prepare Content

Taking Spanish vocabulary learning as an example, each word needs: the original word, translation, example sentence, and most critically—a scene description. For instance, a word meaning "name" might have the scene description "a person pointing at a name tag with a curious expression." This scene description becomes the core input for subsequent image generation.
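As a rough illustration of the per-word content described above, the record could be modeled as follows. This is a minimal sketch: the field names, the JSON layout, and the example sentence are assumptions made for illustration, not the author's actual data format.

```python
from dataclasses import dataclass
import json


@dataclass(frozen=True)
class WordEntry:
    """One vocabulary item; field names are hypothetical, not from the original project."""
    word: str          # the original word
    translation: str   # translation shown to the learner
    example: str       # example sentence
    scene: str         # scene description, the core input for image generation


def load_entries(raw_json: str) -> list[WordEntry]:
    """Parse a JSON array of objects into WordEntry records, rejecting empty scenes."""
    entries = [WordEntry(**item) for item in json.loads(raw_json)]
    for entry in entries:
        if not entry.scene.strip():
            raise ValueError(f"missing scene description for {entry.word!r}")
    return entries


sample = json.dumps([{
    "word": "nombre",
    "translation": "name",
    "example": "Mi nombre es Ana.",  # illustrative sentence, not from the source
    "scene": "a person pointing at a name tag with a curious expression",
}])
entries = load_entries(sample)
```

Rejecting entries with an empty scene up front is a design choice: a missing scene description would otherwise only surface later as a generic, style-only image.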

Step 2: Generate Prompts

The scene description is only part of the prompt. A complete prompt needs to include:

Scene description: Specific actions and objects (different for each word)
Style specification: Illustration style, color tone definition
Composition requirements: Subject occupying 80% of frame, pure white background, etc.

Step 3: Assemble and Optimize Prompts

Combine the universal style template with each word's scene description to form the final complete prompt.
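The assembly step can be sketched in a few lines. The style string is the Version 1 prompt quoted later in the article and the composition string comes from the list in Step 2; the function name and the comma-joined layout are assumptions, since the author's final template is not shown.

```python
# Shared modules; the exact wording of the author's final template is not given.
STYLE = "cartoon style, rounded corners, flat design, slight 3D effect"
COMPOSITION = "subject occupying 80% of frame, pure white background"


def build_prompt(scene: str, style: str = STYLE, composition: str = COMPOSITION) -> str:
    """Join the per-word scene with the shared style and composition modules."""
    parts = [scene.strip().rstrip("."), style, composition]
    return ", ".join(p for p in parts if p)


prompt = build_prompt("a person pointing at a name tag with a curious expression.")
```

Keeping the three modules as separate values means a style change touches one constant instead of 10,000 stored prompts.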

Step 4: Batch Generation

Write scripts in Draw Things' JS sandbox environment to iterate through the prompt array and call the pipeline method for batch generation.
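The author's script runs in JavaScript inside Draw Things and is not reproduced here. The Python sketch below only shows the shape of such a loop, under the assumption that some backend can turn a prompt into image bytes. `generate` and `fake_generate` are hypothetical stand-ins, not Draw Things APIs.

```python
from pathlib import Path
from typing import Callable
import tempfile


def run_batch(prompts: dict[str, str],
              generate: Callable[[str], bytes],
              out_dir: Path) -> list[str]:
    """Generate one image per word, skipping words whose file already exists.

    `generate` is a hypothetical stand-in for whatever backend renders a prompt
    to image bytes. Returns the words that failed so they can be retried.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for word, prompt in prompts.items():
        target = out_dir / f"{word}.png"
        if target.exists():          # resume support: do not redo finished work
            continue
        try:
            target.write_bytes(generate(prompt))
        except Exception:            # keep going; collect failures for a later retry
            failed.append(word)
    return failed


def fake_generate(prompt: str) -> bytes:
    """Placeholder backend used only to demonstrate the loop."""
    if "fail" in prompt:
        raise RuntimeError("simulated generation error")
    return prompt.encode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    failed_words = run_batch({"nombre": "a name tag scene", "x": "fail here"},
                             fake_generate, Path(tmp))
    saved = sorted(p.name for p in Path(tmp).iterdir())
```

For a run that lasts days, the skip-if-exists check and the failure list matter more than the loop itself: the job can be interrupted and restarted without regenerating finished images.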

Prompt Engineering: The Iterative Approach from Simple to Complex

Core Principle: Small Iterative Steps

The author's most important takeaway: start with a few prompt elements and add more gradually; don't write a massive prompt all at once. AI assistants asked to draft a prompt often produce long, bloated ones, which makes later maintenance and modification extremely costly.

Prompt engineering works by steering the embedding space of the text encoder (such as CLIP). CLIP tokenizes the text and maps each token to a 768- or 1024-dimensional vector, depending on the variant, and these vectors guide the diffusion model's denoising direction through cross-attention. Every token's position, weight, and semantic associations in the prompt affect the final image. This helps explain why "less is more": every added token competes with the others for attention, so a long prompt tends to reduce the actual influence of each individual description.
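The dilution intuition can be made concrete with a toy softmax. This is only an illustration under strong assumptions (a single query, one "focus" token with a slightly higher score than the rest). Real cross-attention uses many heads and layers with learned scores, so the numbers are not predictions about any actual model.

```python
import math


def attention_share(n_tokens: int, focus_score: float = 2.0, other_score: float = 1.0) -> float:
    """Softmax weight of one 'focus' token among n_tokens competing tokens.

    Toy model: a single query, one token scoring focus_score, the rest other_score.
    """
    focus = math.exp(focus_score)
    others = (n_tokens - 1) * math.exp(other_score)
    return focus / (focus + others)


# The focus token's share shrinks as more tokens are added to the prompt.
for n in (4, 10, 40):
    print(n, round(attention_share(n), 3))
```

Under these assumptions the focus token's share falls steadily as the prompt grows, which is the effect the "less is more" advice is guarding against.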

The actual iteration process:

Version 1: Just one sentence—"cartoon style, rounded corners, flat design, slight 3D effect"—already basically met requirements
Versions 2-9: Gradually adjusted details
Final version: Clear structure, divided into three modules: scene description, style specification, and composition requirements

The key principle: solve problems as they appear, rather than preemptively adding constraints you might never need.

The Negative Prompt Trap

A major lesson learned: Writing too many negative prompts causes dilution. To prevent text from appearing in images, the author wrote thirty to forty words of negative prompts (no text, no letter, no word, etc.), which paradoxically made them ineffective. The recommendation is to keep it down to a few core terms.

The technical principle behind this relates to the Classifier-Free Guidance (CFG) mechanism: at each step the model computes two noise predictions, one conditioned on the positive prompt and one on the negative prompt (or on an empty prompt if none is given), then pushes the result away from the negative prediction and toward the positive one. When the negative prompt contains too many tokens, each token's share of the cross-attention computation shrinks, so the suppression effect of an individual constraint (like "no text") likely drops significantly. It's like giving the model dozens of "don't do this" instructions at once: it ends up not knowing which to prioritize.
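The guidance step itself is a one-line formula. The sketch below uses the common formulation in which the negative-prompt prediction takes the place of the unconditional one. The vectors are made-up numbers standing in for noise predictions, and `cfg_combine` is a hypothetical name.

```python
def cfg_combine(eps_positive: list[float], eps_negative: list[float], scale: float) -> list[float]:
    """Classifier-free guidance: start at the negative prediction and move
    `scale` times the difference toward the positive prediction."""
    return [n + scale * (p - n) for p, n in zip(eps_positive, eps_negative)]


eps_pos = [0.5, -0.25]   # made-up noise prediction for the positive prompt
eps_neg = [0.25, 0.25]   # made-up noise prediction for the negative prompt

guided = cfg_combine(eps_pos, eps_neg, scale=7.0)    # pushed well past eps_pos, away from eps_neg
unguided = cfg_combine(eps_pos, eps_neg, scale=1.0)  # the negative term cancels out entirely
```

One consequence worth noting: at a scale of exactly 1.0 the result equals the positive prediction, so in this formulation the negative prompt has no effect at all. That is relevant for Turbo-style models, which are typically run at or near that setting.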

Weight Syntax Compatibility Issues

Different models support different weight syntax. The SD Image Turbo model the author used doesn't support the "(xxx:1.5)" weighting syntax: the numbers get rendered into the image instead. Other models and front ends may handle it differently, so always verify that your syntax works with the model before relying on it.

In the Stable Diffusion ecosystem, the "(keyword:1.5)" weighting syntax comes from the Automatic1111 WebUI implementation, which parses the prompt and scales the embedding vectors of the marked tokens by the given coefficient before they condition the diffusion model, strengthening or weakening their influence. This isn't a capability of the model itself; it's a preprocessing feature of the inference framework. Different front ends and hosting platforms (ComfyUI, Draw Things, Replicate, etc.) parse this syntax differently, or not at all. Turbo-series models run with very few steps and a CFG scale typically set to 1.0, so weighting syntax has greatly reduced or even anomalous effects on them. GPT Image, like DALL-E 3 before it, belongs to a different model family: it takes plain natural-language prompts with no special weighting markers, but tends to follow natural-language descriptions more precisely.
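When the target model or front end does not understand weighting syntax, one defensive option is to strip it before sending the prompt, so that no stray digits reach the model. The sketch below is an assumption about how that could be done, not part of the author's workflow, and it does not handle nested parentheses.

```python
import re

# Matches "(some words:1.5)" style weights; nested parentheses are not handled.
WEIGHTED = re.compile(r"\(([^():]+):\s*\d+(?:\.\d+)?\)")


def strip_weights(prompt: str) -> str:
    """Replace "(keyword:1.5)" with "keyword" so no digits reach the model."""
    return WEIGHTED.sub(lambda m: m.group(1).strip(), prompt)


cleaned = strip_weights("cartoon style, (white background:1.5), (no text: 2)")
```

Plain parentheses without a numeric weight are left untouched, so ordinary asides in a prompt survive the cleanup.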

Pitfall Log: Those Absurd Bad Cases

Text Appearing in Images

This was the biggest pain point. Beyond streamlining negative prompts, the solution also required avoiding words that trigger text rendering in positive prompts, such as book, label, sign, etc. Diffusion models encounter massive amounts of text-containing images in training data (book covers, road signs, labels, etc.), and these words activate feature channels related to text rendering in the model. Even with "no text" in negative prompts, positive prompt guidance is typically stronger.

Bald Character Problem

Generated children's illustrations frequently featured bald babies/toddlers, requiring explicit emphasis on "no bald" in prompts. This likely relates to the distribution of cartoon/illustration-style children's images in training data—many simple drawings and early-childhood illustrations do tend to use bald heads to simplify character design.

High-Risk Word Replacement Strategy

For example, the word "phone"—directly describing it causes the model to render a phone screen with text on it. The solution is to change the description to "a person holding a phone to their ear" to avoid rendering the phone screen head-on. The core idea behind this strategy: change composition and perspective to avoid scenarios where the model easily generates text, rather than trying to forcibly prohibit it.
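With 10,000+ scene descriptions, checking for text-triggering words by hand is impractical, so a small lint pass can flag candidates for review. The word list below contains only the examples named in this article. The function names and the simple plural handling are assumptions.

```python
import re

# Words the article names as text triggers.
RISKY_WORDS = {"book", "label", "sign", "phone"}

# Safer rewrites keyed by word; only the "phone" entry comes from the article.
SAFER_SCENES = {"phone": "a person holding a phone to their ear"}


def find_risky_words(scene: str) -> list[str]:
    """Return flagged words present as whole words (case-insensitive, singular or plural)."""
    tokens = set(re.findall(r"[a-z]+", scene.lower()))
    return sorted(w for w in RISKY_WORDS if w in tokens or w + "s" in tokens)


def safer_scene(word: str, original_scene: str) -> str:
    """Use a hand-written safer scene when one exists, else keep the original."""
    return SAFER_SCENES.get(word, original_scene)


hits = find_risky_words("A girl reading Books next to a street sign")
```

A hit is a prompt for human review rather than an automatic rejection: the safer "phone" scene still contains the word, but changes the composition so the screen is not shown head-on.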

Don't Switch Models or Dimensions Mid-Project

The same prompt produces noticeably different results across SD Turbo, Flux, GPT Image, MiniMax, and other models. Once you've committed to a model, don't switch mid-project—otherwise previously tuned prompts may completely fail. This is because different models use different text encoders (CLIP ViT-L, T5-XXL, proprietary encoders, etc.), with completely different semantic understanding and feature mapping of the same text. Combined with differences in training data distribution across models, the same prompt may activate entirely different visual concepts in different models.

Local Generation: Performance Optimization and Limitations

Resource Allocation Strategy

On a Mac mini, local generation takes 30 seconds to 2 minutes per image. The author identified several factors affecting performance:

Time of day matters: runs were faster during the day than late at night, when browsers, development tools, and other apps were consuming GPU resources
Screen resolution affects GPU: High-resolution display itself consumes GPU resources
Thermal issues: Extended operation causes thermal throttling

Mac's Unified Memory Architecture means CPU, GPU, and Neural Engine share the same physical memory pool. macOS's WindowServer process handles all screen rendering, and high-resolution external displays (like 4K/5K) continuously occupy GPU rendering pipelines and memory bandwidth. While the Metal framework supports GPU task priority scheduling, resource competition between AI inference tasks and system UI rendering is unavoidable. Additionally, the M-series chip's power limit design means sustained high loads trigger thermal throttling, reducing GPU frequency to control temperature. The Mac mini's limited cooling surface area makes this problem particularly pronounced—sustained full-load operation can cause 20-30% performance degradation.

Recommendation: When batch generating, close all unnecessary software, IDEs, and video players, reduce screen resolution, and let the machine dedicate all resources to image generation.

Why the Local Approach Was Ultimately Abandoned

The core problems with the local approach: slow speed, machine resource occupation, and environment maintenance. For a 10,000+ image requirement, even at 1 minute per image, generation alone would take about a week of continuous operation (10,000 minutes is roughly 6.9 days), and about two weeks at 2 minutes per image. More critically, the development machine would be virtually unusable for other work during this time, so the opportunity cost far exceeds cloud platform fees.
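The time estimate is simple arithmetic, sketched here for the 30-second to 2-minute range reported above. It assumes strictly sequential, nonstop generation with no retries or throttling.

```python
def batch_days(n_images: int, seconds_per_image: float) -> float:
    """Wall-clock days for a strictly sequential, nonstop run."""
    return n_images * seconds_per_image / 86_400  # 86,400 seconds per day


# 10,000 images at 30 s, 60 s, and 120 s per image
fast, typical, slow = (batch_days(10_000, s) for s in (30, 60, 120))
```

Rounded to two decimals this gives about 3.47, 6.94, and 13.89 days, before counting failed images that need regeneration.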

Cloud Platform: The Better Choice

The author ultimately switched to the Replicate platform, using the same SD Image Turbo open-source model:

Extremely low cost: 1024×1024 images at just $0.0025/image, 512×512 even cheaper
Fast: ~3 seconds per image (faster with caching)
Zero maintenance: No local resource consumption

Replicate is a serverless GPU inference platform that deploys open-source models in containers. Many models are billed on actual GPU compute time (per second), while others are billed per output, so check the model's page. When a model container is in a "cold start" state, the first request has to load model weights into GPU memory (typically 5-15 seconds); subsequent requests reuse the loaded model (warm start, ~1-3 seconds). The $0.0025/image price corresponds to roughly 2-3 seconds of compute time on an NVIDIA A40 or similar GPU. Compared to self-hosted GPU servers, the serverless approach eliminates operational overhead and supports auto-scaling, making it well suited to bursty batch workloads. Similar platforms include RunPod, Modal, Banana, etc., each with different pricing strategies and model ecosystems.

For illustration-style use cases where fine detail requirements are modest, a 6B parameter open-source model is more than sufficient. By reducing image dimensions and choosing open-source models, costs can be further compressed. For 10,000 images, the total cost is approximately $25—far less than the time cost and electricity of the local approach.
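The cloud figures quoted in this section can be checked the same way. The sketch assumes a flat $0.0025 per image and roughly 3 seconds per image with requests issued one at a time. Running requests in parallel would shorten the wall-clock time but not the cost.

```python
def cloud_estimate(n_images: int, usd_per_image: float, seconds_per_image: float) -> tuple[float, float]:
    """Return (total cost in USD, sequential runtime in hours)."""
    return n_images * usd_per_image, n_images * seconds_per_image / 3600


cost_usd, hours = cloud_estimate(10_000, 0.0025, 3)
```

That works out to about $25 and a little over 8 hours of sequential generation, compared with roughly a week on the local machine at 1 minute per image.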

Summary and Recommendations

Scenarios suited for local generation: Having a high-performance GPU, extreme cost sensitivity, truly massive image quantities (where cost differences become significant).

Scenarios suited for cloud platforms: Limited image quantities (cost-manageable under a few thousand), no desire to maintain local environments, prioritizing generation speed.

General recommendations:

Iterate prompts in small steps, starting simple
Keep negative prompts concise to avoid dilution
Manual tuning → small batch validation → large-scale production
Don't switch models mid-project once decided
Avoid high-risk words that trigger text rendering