The developer of Modbox linked together Windows speech recognition, OpenAI's GPT-3 language model, and Replica's natural speech synthesis for a unique demo featuring what are arguably some of the first artificially intelligent virtual characters.
Modbox is a multiplayer game creation sandbox with SteamVR support. It officially launched late last year after years of public beta development, though it’s still marked as Early Access. We first tried it back in 2016 for HTC Vive. In some ways Modbox was, and is, ahead of its time.
The developer's recent test using two state-of-the-art machine learning services – OpenAI's GPT-3 language model and Replica's natural speech synthesis – is nothing short of mind-blowing. Start at roughly 4 minutes 25 seconds to see the conversations with two virtual characters.
Microsoft, which invested $1 billion in OpenAI, holds an exclusive license to GPT-3's source code & its commercial use, so this feature is unlikely to be added to Modbox itself. But this video demo is the best glimpse yet at the future of interactive characters. Future language models could change the very nature of game design and enable entirely new genres.
There is an uncomfortably long delay between asking a question and getting a response because GPT-3 and Replica are both cloud-based services. Future models running on-device may eliminate the delay. Google & Amazon already include custom chips in some smart home devices to cut the response delay for digital assistants.
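To make the chain of services concrete, here is a minimal sketch of how such a pipeline could be wired up in Python. The `listen` and `speak` functions are stand-ins for Windows speech recognition and Replica's text-to-speech (their APIs aren't shown in the demo), the "Rob the bartender" persona is our own invented example, and the GPT-3 call uses OpenAI's Completion endpoint as exposed by its Python client during the private beta – this is an illustration of the approach, not the developer's actual code.

```python
import openai  # OpenAI's Python client for the GPT-3 API (private beta at the time)

openai.api_key = "YOUR_API_KEY"

# A short prompt establishes the character's persona; GPT-3 continues the "script".
PERSONA = "The following is a conversation with Rob, a grumpy bartender in a VR saloon.\n"

def character_reply(player_line: str, history: str = "") -> str:
    prompt = PERSONA + history + f"Player: {player_line}\nRob:"
    response = openai.Completion.create(
        engine="davinci",      # the largest GPT-3 model available in the beta
        prompt=prompt,
        max_tokens=60,
        temperature=0.8,
        stop=["Player:"],      # stop before the model writes the player's next line
    )
    return response.choices[0].text.strip()

def listen() -> str:
    """Stand-in for Windows speech recognition (speech-to-text)."""
    raise NotImplementedError

def speak(text: str) -> None:
    """Stand-in for Replica's text-to-speech service."""
    raise NotImplementedError

# One turn of the loop: hear the player, ask GPT-3 for a reply, voice it.
# speak(character_reply(listen()))
```

Keeping a running `history` string of the conversation is what would let a character "remember" earlier lines – the model itself is stateless between calls, which is also why every turn costs a full cloud round trip.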
How Is This Possible?
Books, movies and television are character-centric. But in current video games & VR experiences you either can’t speak to characters at all, or can only navigate pre-written dialog trees.
Directly speaking to virtual characters – and getting convincing responses no matter what you ask – wasn't thought feasible until recently. A breakthrough in machine learning has changed that.
In 2017, Google's AI division revealed a new neural network architecture for processing language called the Transformer. State-of-the-art machine learning models were already using the concept of attention to get better results, but the Transformer is built entirely around it – Google titled the paper ‘Attention Is All You Need’.
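For readers curious what "attention" actually computes, here is a small NumPy sketch of the scaled dot-product attention at the heart of that paper; the toy matrices are arbitrary and purely for illustration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V – the core operation from 'Attention Is All You Need'."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # relevance of every key to every query
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability for the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the keys
    return weights @ V                              # weighted sum of the values

# Toy self-attention: 3 tokens, each represented by a 4-dimensional vector.
x = np.random.randn(3, 4)
print(scaled_dot_product_attention(x, x, x).shape)  # (3, 4)
```

Every token can weigh every other token when building its representation, which is what lets these models track context across long stretches of text.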
In 2018, the Elon Musk-backed startup OpenAI applied the Transformer architecture to a new general-purpose language model called Generative Pre-Training (GPT). The model could predict the next word in many sentences and even answer some multiple-choice questions.
In 2019, OpenAI scaled this model up by more than 10x with GPT-2, and found that the larger model's capabilities increased dramatically. Given a prompt of just a few sentences, it could write entire essays on almost any topic, or even crudely translate between languages. In some cases its output was indistinguishable from human writing. Citing the potential for misuse, OpenAI initially decided not to release the full model, leading to widespread media coverage & speculation about the societal impacts of advanced language models.
GPT-2 had 1.5 billion parameters; in June 2020, OpenAI again scaled up the idea, to 175 billion parameters, with GPT-3 (the model used in this demo). GPT-3's results are almost always indistinguishable from human writing.
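GPT-3 itself is only reachable through OpenAI's API, but the basic interaction – hand the model a prompt, get back a continuation – can be tried with the smaller GPT-2 weights, which were eventually released in full. Here's a sketch using the Hugging Face transformers library (our tooling choice, not something from the Modbox demo):

```python
# Prompting the publicly released GPT-2 model to continue a piece of text.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "The bartender leaned over the counter and said,"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Sample a continuation token by token (autoregressive generation).
output_ids = model.generate(
    input_ids,
    max_length=60,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```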
Technically, GPT-3 has no real “understanding” – though the philosophy behind that word is debated. It can sometimes produce nonsensical or bigoted results – even telling someone to kill themselves. Researchers will have to find solutions to these limitations, such as a “common sense check” mechanism, before models like this can be deployed in general consumer products.
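A “common sense check” isn't a defined technique, but one commonly discussed mitigation is to screen the model's output before it ever reaches the player. Here is a toy sketch of that idea, with a hypothetical blocklist standing in for a real trained classifier or moderation service:

```python
# Hypothetical post-generation check; a production system would use a trained
# classifier or moderation service rather than a hard-coded phrase list.
BLOCKED_PHRASES = ["kill yourself", "kill themselves"]
FALLBACK_LINE = "Sorry, I lost my train of thought. What were we saying?"

def safe_reply(generated_text: str) -> str:
    lowered = generated_text.lower()
    if any(phrase in lowered for phrase in BLOCKED_PHRASES):
        return FALLBACK_LINE   # never voice the unsafe line to the player
    return generated_text
```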