I ran the same test on both: one real App Store screenshot, target language Japanese, with the prompt "translate the text in the image and return the same screenshot in Japanese."
What both got right
The translation itself is good. ChatGPT and Gemini both produce natural Japanese, with appropriate kanji density. For the copy alone, either one is enough.
Where they break
The image output is the problem.
- Resolution. ChatGPT returned 1024×1024. Gemini returned 1024×1536. Neither matches App Store specs (1320×2868 for iPhone 16 Pro Max).
- Phone shape. Both lost the rounded corners. Both shifted the safe area. The status bar drifted.
- Fonts. Both substituted a generic sans for San Francisco. Apple specifically watches for this in screenshots that show iOS UI.
- Text fit. Long Japanese phrases broke at the wrong character. CJK text does not break on spaces, so naive line-breaking produces unreadable output; see the sketch after this list.
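The right behavior for Japanese is to break at phrase boundaries, not wherever the pixel budget runs out. Here is a minimal sketch using BudouX, Google's open-source phrase segmenter for Japanese. The character budget is a stand-in for a real pixel measurement, and `wrap_japanese` plus the sample caption are illustrative, not part of any model API:

```python
# pip install budoux
import budoux

def wrap_japanese(text: str, max_chars: int) -> list[str]:
    """Wrap Japanese text at phrase boundaries, never mid-word.

    BudouX segments the string into semantic phrases; we then pack
    phrases greedily into lines that stay under max_chars.
    """
    parser = budoux.load_default_japanese_parser()
    lines: list[str] = []
    current = ""
    for phrase in parser.parse(text):
        if current and len(current) + len(phrase) > max_chars:
            lines.append(current)
            current = phrase
        else:
            current += phrase
    if current:
        lines.append(current)
    return lines

# Example: a caption a naive breaker would split mid-phrase.
print(wrap_japanese("写真を自動で整理して共有できます", max_chars=8))
```

A production pipeline would measure each candidate line against actual font metrics rather than counting characters, but the breaking rule is the same: segment first, then fit.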
Why this happens
Foundation image models are tuned for general image generation. They are not constrained to a specific output resolution or layout. You can ask for exact dimensions in the prompt; the model will not enforce them.
The fix is to wrap the model in a pipeline that sets the exact output dimensions, picks a font that matches the target locale, and applies layout rules per language family. That is what lokal does, on top of the same models.
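To make the idea concrete, here is a stripped-down sketch of such a wrapper using Pillow. This is not lokal's implementation; the function name, file paths, font choices, and layout constants are all assumptions standing in for a real template system:

```python
# pip install Pillow
from PIL import Image, ImageDraw, ImageFont

TARGET_SIZE = (1320, 2868)  # App Store spec for iPhone 16 Pro Max

# Assumption: licensed, locale-appropriate font files on disk.
FONT_BY_LOCALE = {
    "ja": "fonts/HiraginoSans-W4.otf",
    "en": "fonts/SF-Pro-Display.otf",
}

def render_localized_shot(background_path: str, caption_lines: list[str],
                          locale: str, out_path: str) -> None:
    # Enforce the output resolution instead of hoping the model respects it.
    canvas = Image.open(background_path).convert("RGB").resize(TARGET_SIZE)
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.truetype(FONT_BY_LOCALE[locale], size=96)

    y = 220  # caption block origin; tuned per template
    for line in caption_lines:
        # Center each line using real font metrics, not character counts.
        width = draw.textlength(line, font=font)
        draw.text(((TARGET_SIZE[0] - width) / 2, y), line,
                  font=font, fill="#111111")
        y += 128  # fixed leading; CJK needs looser leading than Latin

    canvas.save(out_path, format="PNG")
```

The `caption_lines` argument is where the phrase-aware wrapper from the earlier sketch plugs in: segment the translated copy first, then render it at the exact dimensions with the right font.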
Bottom line
Use ChatGPT or Gemini for the translation step alone; they are excellent at it. Do not use them as one-shot screenshot generators: the output will be rejected in App Store review.