Moondream3 Caption: Image Captioning in 6 Tests

Moondream3 Caption turns an image into a plain-language description. It is built for fast image-understanding workflows such as asset tagging, content moderation queues, and reviewing UI test screenshots.

Model link

https://wiro.ai/models/moondream3-preview/caption

Inputs and settings used

Inputs: 3 sample images (flower, coffee, dog on leash)
Length: short, normal, long
Temperature: 0.0, 0.2, 0.7, 1.1
Top P: 0.9 or 0.95
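For reference, the six runs reported below can be written down as a small test matrix. This is only a sketch for keeping the settings in one place: the field names follow the `length`, `temperature`, and `top_p` labels used in the settings lines of this post, while the `CaptionRun` class and `TEST_RUNS` list are hypothetical names, not part of any Wiro or Moondream client.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CaptionRun:
    """One captioning run: which image was used and with which settings."""
    test: int
    image: str
    length: str
    temperature: float
    top_p: float


# Settings copied from the six tests in this post.
TEST_RUNS = [
    CaptionRun(1, "flower", "short", 0.2, 0.9),
    CaptionRun(2, "flower", "long", 0.7, 0.95),
    CaptionRun(3, "coffee", "normal", 0.7, 0.95),
    CaptionRun(4, "coffee", "normal", 0.0, 0.95),
    CaptionRun(5, "dog on leash", "normal", 0.7, 0.95),
    CaptionRun(6, "dog on leash", "long", 1.1, 0.95),
]

for run in TEST_RUNS:
    print(
        f"Test {run.test}: {run.image} | length={run.length}, "
        f"temperature={run.temperature}, top_p={run.top_p}"
    )
```

Keeping the matrix as data makes it easy to rerun the same combinations later and compare captions side by side.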

Test results

Test 1: Short caption (flower)

Input image: yellow flower close-up. Settings: length=short, temperature=0.2, top_p=0.9

The image features a close-up of a single, vibrant yellow flower with a dark center. The flower has six petals radiating outward, and its petals appear slightly curved. The flower is surrounded by lush green leaves, creating a striking contrast between the bright yellow of the flower and the deep green of the foliage. The background is blurred, drawing focus to the flower and enhancing its prominence in the frame. The image captures the flower in clear detail, highlighting its unique shape and color.

Even with length=short, the caption stays specific and does not drift into unrelated details, although at five sentences it is not especially brief.

Test 2: Long caption (flower)

Input image: yellow flower close-up. Settings: length=long, temperature=0.7, top_p=0.95

A single, vibrant yellow flower with pointed petals and a dark center is the main focus of the image. The flower is surrounded by out-of-focus green leaves, creating a contrast in color and providing a natural backdrop. The lighting appears to be natural, highlighting the flower's bright yellow hue. The composition emphasizes the flower's simplicity and elegance, making it a visually appealing subject.

The longer setting adds composition language. It can help when captions feed search or tagging pipelines.

Test 3: Normal caption (coffee)

Input image: coffee cup with milk being poured. Settings: length=normal, temperature=0.7, top_p=0.95

A hand is pouring milk from a metallic pitcher into a white coffee cup. The cup is decorated with brown patterns resembling latte art. The background shows a blurred view of what appears to be a coffee machine with lights shining on it. The hand holding the cup is partially visible, wearing a dark blue sweater or jacket.

Good object naming and action description. This is the kind of caption that works well for auto-generated alt text.

Test 4: Normal caption with temperature=0.0 (coffee)

Input image: coffee cup with latte art. Settings: length=normal, temperature=0.0, top_p=0.95

A barista is skillfully pouring a frothy white milk into a cup, creating a beautiful latte art design. The cup has a small handle and is filled with espresso or coffee, with milk poured carefully on top to form the iconic pattern. The barista's hand is visible, holding the metal pitcher from which milk is being poured. The cup is held near a machine that appears to be part of a coffee machine or similar equipment. In the background, there are lights, adding to

This run shows that temperature=0.0, which is meant to keep phrasing stable across runs, does not prevent the output from ending mid-sentence. Downstream code should handle truncation.
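One way to handle this downstream is a small post-processing check. The sketch below is a heuristic, not part of the model or the Wiro API: it treats a caption as truncated when it does not end in sentence punctuation, and trims it back to the last complete sentence. The function names `looks_truncated` and `trim_to_last_sentence` are hypothetical, and the punctuation rule will misfire on abbreviations or captions that legitimately end without a period.

```python
TERMINATORS = (".", "!", "?")
CLOSERS = "\"')]"  # closing quotes/brackets that may follow the final period


def looks_truncated(caption: str) -> bool:
    """Return True if a non-empty caption does not end in sentence punctuation."""
    text = caption.strip().rstrip(CLOSERS)
    return bool(text) and not text.endswith(TERMINATORS)


def trim_to_last_sentence(caption: str) -> str:
    """Drop a trailing partial sentence; leave complete captions unchanged."""
    text = caption.strip()
    if not looks_truncated(text):
        return text
    cut = max(text.rfind(t) for t in TERMINATORS)
    # If no complete sentence exists at all, keep the text as-is.
    return text[: cut + 1] if cut != -1 else text


# Tail of the Test 4 output, which stops mid-sentence.
tail = (
    "The cup is held near a machine that appears to be part of a coffee "
    "machine or similar equipment. In the background, there are lights, adding to"
)
if looks_truncated(tail):
    print(trim_to_last_sentence(tail))
```

Trimming loses the partial sentence; a pipeline that needs the full description may prefer to flag the caption and rerun it instead.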

Test 5: Normal caption (dog on leash)

Input image: black and white dog on a leash. Settings: length=normal, temperature=0.7, top_p=0.95

A black and white dog, possibly a whippet or similar breed, stands alert on a paved surface, attached to a black leash. The dog is wearing a collar with a brass-colored tag. The background features blurred wooden barrels and a dark wooden table. A portion of a person's leg and clothing is visible behind the table.

The caption adds plausible context like barrels and a table. It is accurate enough for indexing, but it can add details that are not always useful.

Test 6: Long caption with higher temperature (dog on leash)

Input image: dog on a leash. Settings: length=long, temperature=1.1, top_p=0.95

A black and white dog, possibly a whippet or similar breed, stands on an asphalt surface. The dog is wearing a tan collar and is attached to a black leash that extends across the frame. The background features a brown wooden structure, likely a fence or part of a wall, along with some out-of-focus green elements that may be plants or decorations.

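Compared with Test 5, the higher temperature keeps the core description (a black and white dog, possibly a whippet, on a black leash) but shifts the background guesses: the barrels and table become a wooden fence or wall with green elements. The length setting also changed between the two runs, so temperature is not the only variable. For strict labeling, lower temperature tends to be safer.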
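To put a rough number on how much two captions of the same image differ, a simple word-overlap comparison can help. The sketch below uses Jaccard similarity over lowercase word sets, applied to the Test 5 and Test 6 captions. This is an illustrative metric chosen here, not something the model or platform provides, and the helper names are hypothetical. It ignores word order and synonyms, so "paved" and "asphalt" count as different.

```python
import re


def word_set(caption: str) -> set[str]:
    """Lowercase the caption and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", caption.lower()))


def jaccard(a: str, b: str) -> float:
    """Shared words divided by all words; 1.0 means identical vocabularies."""
    wa, wb = word_set(a), word_set(b)
    union = wa | wb
    return len(wa & wb) / len(union) if union else 1.0


def drift(a: str, b: str) -> tuple[set[str], set[str]]:
    """Words that appear only in the first caption, and only in the second."""
    wa, wb = word_set(a), word_set(b)
    return wa - wb, wb - wa


test5 = (
    "A black and white dog, possibly a whippet or similar breed, stands alert "
    "on a paved surface, attached to a black leash. The dog is wearing a collar "
    "with a brass-colored tag. The background features blurred wooden barrels "
    "and a dark wooden table. A portion of a person's leg and clothing is "
    "visible behind the table."
)
test6 = (
    "A black and white dog, possibly a whippet or similar breed, stands on an "
    "asphalt surface. The dog is wearing a tan collar and is attached to a "
    "black leash that extends across the frame. The background features a "
    "brown wooden structure, likely a fence or part of a wall, along with some "
    "out-of-focus green elements that may be plants or decorations."
)

only5, only6 = drift(test5, test6)
print(f"overlap={jaccard(test5, test6):.2f}")
print("only in Test 5:", sorted(only5))
print("only in Test 6:", sorted(only6))
```

The words unique to each caption are mostly background terms, which is where the two runs disagree.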

Where it works well

Fast captioning for image libraries and DAM systems
Generating starting-point alt text (with a human QA step)
Lightweight scene summaries for routing and tagging

Where it struggles

Output length can vary, and some runs can end abruptly
Background details can turn into guesses, especially at higher temperature

Try it

Run Moondream3 Caption on Wiro