Prompt tuning
A core intuition and opinion baked into the design of VLM Caption is that a multi-turn conversation can extract better information from the model: several different questions are asked in turn, then a final summary request brings the answers together. Other VLM scripts or apps are likely just trying to "one shot"