Audio & Voice

What languages do you support?
We support 30+ major languages worldwide, including but not limited to: English, Chinese, Spanish, Arabic, Portuguese, Russian, Japanese, Punjabi, German, French, Korean, Turkish, Tamil, Vietnamese, Hindi, Bengali, Urdu, Persian, Italian, Indonesian, Thai, Marathi, Telugu, Ukrainian, Malay, Romanian, Polish, Dutch, Gujarati, and Kannada.
How many voices are available in VisionStory's voice library, and can I customize them?
VisionStory offers over 200 voices in its library, which can be filtered by gender, age, and use case. If you can't find a suitable voice, you can also create a custom AI voice clone by uploading or recording audio.
Why are there fewer voice options available in my language?
Some languages show fewer voices because the list only includes voices that have been fine-tuned specifically for that language. However, the underlying language support lets other voices, such as the English ones, speak multiple languages, so you still have flexibility in voice selection.
What is voice cloning, and how can I clone a voice?
Voice cloning allows you to create a custom AI voice that mimics a specific voice by uploading or recording audio. To clone a voice, ensure the audio is recorded clearly in a quiet environment for optimal results.
Is voice cloning free?
Voice cloning is free for English, Spanish, Japanese, and Chinese, allowing you to test if the cloned voice resembles yours. However, to use the cloned voice in video generation, you must subscribe to the Pro Plan or higher. For voice cloning in languages other than these four, a Pro Plan or above is also required.
How many languages are supported in voice cloning?
Voice cloning is available for free in four languages: English, Spanish, Japanese, and Chinese. Additional languages are available but require a Pro Plan or higher. The list of supported languages is subject to change, so please check the voice cloning function for the most current options.
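The plan rules in the two answers above can be sketched as a small check. This is only an illustration of the rules as stated in this FAQ, not VisionStory's actual logic: the function names and the has_pro_or_higher flag are hypothetical, and the free-language list may change.

```python
# Languages in which cloning is free, as listed in this FAQ (subject to change).
FREE_CLONING_LANGUAGES = {"English", "Spanish", "Japanese", "Chinese"}


def can_clone_voice(language: str, has_pro_or_higher: bool) -> bool:
    """Cloning is free in four languages; other languages need Pro or higher."""
    return has_pro_or_higher or language in FREE_CLONING_LANGUAGES


def can_use_clone_in_video(has_pro_or_higher: bool) -> bool:
    """Using a cloned voice in video generation needs Pro or higher in every language."""
    return has_pro_or_higher


if __name__ == "__main__":
    print(can_clone_voice("Spanish", has_pro_or_higher=False))
    print(can_clone_voice("German", has_pro_or_higher=False))
    print(can_use_clone_in_video(has_pro_or_higher=False))
```

In other words, a free user can clone and test a voice in one of the four listed languages, but needs a Pro Plan or higher before that voice can be used in a generated video.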
What is preview audio, and what are its benefits?
Preview audio allows you to generate the speech for your talking video before the final video is created. This helps you check the voice, pronunciation, and pauses to ensure they meet your expectations, and make adjustments before you generate the video, which is the step that costs credits. Preview audio is free to use, though different plans offer different preview quotas.
What do the stopwatch icon and +0.5s mean?
The stopwatch icon and +0.5s feature allow you to insert a 0.5-second pause in the generated voice. You can add multiple stopwatch icons consecutively to create longer pauses as needed in your video.
What is URL import, and which URLs are supported?
URL import downloads the content at a link you provide and extracts its audio for use in video generation. Currently, it supports links from YouTube and TikTok. If you would like to see support for more sites, please contact us. Additionally, you can use the voice changer feature to modify the imported audio while keeping the original content.
What is the remove noise feature?
The remove noise feature helps eliminate background noise from audio when you import or record it, ensuring clearer audio quality for your videos. This feature is free to all users.
What is the voice changer feature?
The voice changer feature allows you to modify the voice in a speech, enabling you to create unique renditions of the audio while maintaining the original content. This feature is free to all users.
Can I control the emotion of the voice?
Emotion in the voice is conveyed through the text you provide. When you use different text, the text-to-speech (TTS) system naturally applies the appropriate emotion, so no additional control is needed.
What should I keep in mind when using the stopwatch (pause) feature?
When using the stopwatch feature, each stopwatch represents a 0.5-second pause, and you can use them consecutively to create longer pauses, up to a maximum of 3 seconds. However, avoid using more than two consecutive pauses within a single text segment, as this may cause the AI to produce unexpected sounds or artifacts.
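The pause arithmetic above can be worked through in a few lines. This is a minimal sketch that assumes only the figures stated in this FAQ (0.5 seconds per stopwatch, a 3-second maximum, and the advice to keep to two consecutive pauses); the function and constant names are hypothetical and are not part of any VisionStory API.

```python
import math

PAUSE_UNIT_SECONDS = 0.5         # each stopwatch icon adds 0.5 s (per this FAQ)
MAX_PAUSE_SECONDS = 3.0          # stated maximum for consecutive stopwatches
RECOMMENDED_MAX_CONSECUTIVE = 2  # more than this may cause artifacts (per this FAQ)


def stopwatches_for_pause(seconds: float) -> int:
    """Return how many stopwatch icons are needed for a pause of at least `seconds`."""
    if seconds <= 0:
        raise ValueError("pause length must be positive")
    if seconds > MAX_PAUSE_SECONDS:
        raise ValueError("pause exceeds the 3-second maximum")
    # Round up, because pauses only come in 0.5-second steps.
    return math.ceil(seconds / PAUSE_UNIT_SECONDS)


def exceeds_recommendation(count: int) -> bool:
    """True if `count` consecutive stopwatches goes beyond the recommended two."""
    return count > RECOMMENDED_MAX_CONSECUTIVE


if __name__ == "__main__":
    for target in (0.5, 1.0, 1.2, 3.0):
        n = stopwatches_for_pause(target)
        print(target, n, exceeds_recommendation(n))
```

For example, a 1-second pause needs two stopwatches and stays within the recommendation, while a 3-second pause needs six, which is allowed but more likely to produce unexpected sounds.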