Inquiries:
zaaktin.lam {at} outlook.com

On The Problem(s) Of Vocaloid

(And UTAU, and CeVIO, and Synthesizer V, ...)

Let's get this over with quickly.

Is Vocaloid AI? Is Vocaloid music AI music?

No (see below), and no. Vocaloid is not AI music; it is a vocal synthesizer. The easy explanation is that it's like a player piano, except that instead of a piano you have a synthesizer and a talkbox combined, and instead of a physical scroll of notes you have a file on the computer. Vocaloid is NOT writing the music for you: you have to write your own music, and all Vocaloid music is written by real human beings. You can refuse to acknowledge it, but it's still not a magic box that spits music out of nothing, for you are that box. The different voices of Vocaloid (technically, Vocaloid is a voice synthesis engine with different voices) are also not created from nothing: they are all based on the real voices of real human beings.

(Another way to think about this, probably the more correct and canonical one, is to see it as a virtual singer: it's a singer (not a composer), and you have to tell it what to sing. It may not sound human, but it has certain qualities of its own, which many music producers, such as wowaka, have made part of their aesthetic.)

It's understandable for people who lack prior exposure to speech synthesis to have this kind of misconception, but if you're still going to insist that Vocaloid is AI music and therefore bad, then I genuinely don't understand why you're being so stubborn about it.

The problem with the term AI

...is that a plethora of very different things have all been called AI, sometimes in different periods and sometimes in the same one. Everyone can see that the handwriting recognition input of Google Translate is not going to generate music or write your essay for you any time soon. But imagine: just because it is grouped together with diffusion models (i.e. the things that actually make your images) and LLMs (i.e. the things that actually write your essays) as AI, you now have to emphasize that what you're using is handwriting recognition software, because there are people out there saying that handwriting recognition can write essays on its own and that the essays you've written yourself are all machine-generated. Ridiculous, isn't it? But that's exactly what is happening to Vocaloid when people say that Vocaloid music is AI music.

Is Vocaloid anime?

Technically, it isn't. Voice banks do have their own characters, and most of the characters are drawn in an anime style, but their existence is closer to that of a mascot or 看板娘（かんばんむすめ） [kanban musume, a shop's "poster girl"]. The characters don't come from a set story and have barely any background information, because the point is to give the people using them as much freedom as possible; i.e. because there is basically no canon, anything goes. (But remember, this also means that what people have created around them is not canon.)

(One thing they do have is a canonical age, but in practice it's often ignored, and the actual "canonical age" is often whatever age the creator of a given work decides on, much like how one would write fanfics about characters set at some arbitrary time after the canonical events of their story.)

What is UTAU/CeVIO/Synthesizer V/...?

They are different vocal synthesizers made by different individuals or groups. Their inner workings may be very different, but to the end user they are the same kind of tool, each with its own advantages and disadvantages. UTAU is the second most famous after Vocaloid, mainly because it is freeware and it's easy to create a voice bank for it. (If I remember correctly, UTAU, at least in its earlier versions, uses concatenative synthesis, which is basically changing the pitch of different sound pieces and splicing them together.) Synthesizer V needs a bit of explanation: it does use very recent AI technologies and it generates very realistic voices, but you still need to write your own music.
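For the curious, here is a toy sketch of the "change the pitch of sound pieces and splice them together" idea mentioned for UTAU above. It is NOT UTAU's actual code or file format: every name in it (`make_unit`, `shift_pitch`, `splice`, `synthesize`) is made up for illustration, the "voice bank" is just sine tones standing in for recorded sound pieces, and the pitch shift is the crudest possible one (plain resampling, which also changes the length of the sound). Note that the notes still have to come from you.

```python
import math

SAMPLE_RATE = 8000  # samples per second (kept low so the example stays small)


def make_unit(freq_hz, seconds):
    """Stand-in for a recorded sound piece: a plain sine tone (assumption)."""
    n = int(SAMPLE_RATE * seconds)
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(n)]


def shift_pitch(samples, semitones):
    """Naive pitch shift by resampling with linear interpolation.

    Reading the samples faster raises the pitch but also shortens the
    sound; real engines compensate for that, this sketch does not.
    """
    ratio = 2 ** (semitones / 12)
    out_len = int(len(samples) / ratio)
    out = []
    for i in range(out_len):
        pos = i * ratio
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1 - frac) + samples[right] * frac)
    return out


def splice(a, b, overlap):
    """Join two pieces with a linear crossfade over `overlap` samples."""
    overlap = min(overlap, len(a), len(b))
    if overlap == 0:
        return a + b
    head = a[:len(a) - overlap]
    fade = [
        a[len(a) - overlap + i] * (1 - i / overlap) + b[i] * (i / overlap)
        for i in range(overlap)
    ]
    return head + fade + b[overlap:]


def synthesize(notes, bank, overlap=80):
    """notes: list of (unit_name, semitones) pairs written by the user."""
    result = []
    for name, semitones in notes:
        piece = shift_pitch(bank[name], semitones)
        result = splice(result, piece, overlap) if result else piece
    return result


# A two-entry hypothetical "voice bank" and a three-note "song".
bank = {"a": make_unit(220, 0.25), "i": make_unit(220, 0.25)}
song = synthesize([("a", 0), ("i", 4), ("a", 12)], bank)
```

The list passed to `synthesize` plays the role of the player piano's scroll: the program only pitches and glues together the pieces it is told to, in the order it is told to.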

Can it be so close to its real-person counterpart that one cannot tell the two apart?

With the newest technologies, it can be and it has been, but such voice banks will be stopped before they reach the public, because making one requires the consent of the voice provider. One such instance is the Synthesizer V version of the voice bank KAFU, which was cancelled because it sounded far too close to its real human counterpart; it was never shipped, and all the pre-orders were refunded.
