Deepfakes are media, often video but sometimes audio, that have been created, altered, or synthesized with the aid of deep learning in an attempt to deceive viewers or listeners into believing a false event or message.
The original example of a deepfake (by Reddit user /u/deepfakes) swapped the face of an actress onto the body of a porn performer in a video, which was, of course, completely unethical, although not initially illegal. Other deepfakes have changed what famous people were saying, or the language they were speaking.
Deepfakes extend the idea of video (or movie) compositing, which has been done for decades. Compositing takes significant video skills, time, and equipment; video deepfakes require much less skill and equipment, and much less time (assuming you have GPUs), although they are often unconvincing to careful observers.
How to create deepfakes
Originally, deepfakes relied on autoencoders, a type of unsupervised neural network, and many still do. Some people have refined that technique using GANs (generative adversarial networks). Other machine learning methods have also been used for deepfakes, sometimes in combination with non-machine learning methods, with varying results.
Autoencoders
Essentially, autoencoders for deepfake faces in images run a two-step process. Step one is to use a neural network to extract a face from a source image and encode that into a set of features and possibly a mask, typically using several 2D convolution layers, a couple of dense layers, and a softmax layer. Step two is to use another neural network to decode the features, upscale the generated face, rotate and scale the face as needed, and apply the upscaled face to another image.
Training an autoencoder for deepfake face generation requires a lot of images of the source and target faces from multiple points of view and in varied lighting conditions. Without a GPU, training can take weeks. With GPUs, it goes a lot faster.
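Here is a minimal PyTorch sketch of the classic face-swap arrangement: one shared encoder and one decoder per identity. The layer sizes are illustrative, not taken from any particular tool, and real applications such as Faceswap use considerably larger networks.

```python
# Minimal sketch of the shared-encoder, two-decoder face-swap layout.
# Layer sizes are illustrative; assumes 64x64 aligned face crops.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Conv2d(64, 128, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Conv2d(128, 256, 5, stride=2, padding=2), nn.LeakyReLU(0.1),
            nn.Flatten(),
            nn.Linear(256 * 8 * 8, 512),   # dense bottleneck
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(512, 256 * 8 * 8)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.1),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, z):
        x = self.fc(z).view(-1, 256, 8, 8)
        return self.net(x)

encoder = Encoder()
decoder_a = Decoder()   # learns to reconstruct face A
decoder_b = Decoder()   # learns to reconstruct face B

# Training: each decoder reconstructs its own face from the shared
# latent code (L1 reconstruction loss is common for this task).
loss_fn = nn.L1Loss()
faces_a = torch.rand(4, 3, 64, 64)   # stand-in for aligned face crops
loss = loss_fn(decoder_a(encoder(faces_a)), faces_a)

# The swap: encode a face of A, then decode with B's decoder.
swapped = decoder_b(encoder(faces_a))
```

Because the encoder is shared between both identities, it learns a face representation common to both, which is what lets decoder B render B's face in A's pose and expression.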
GANs
Generative adversarial networks can refine the results of autoencoders by pitting two neural networks against each other: the generative network tries to create examples that have the same statistics as the training data, while the discriminative network tries to detect deviations from the original data distribution.
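A minimal sketch of one training step makes the adversarial setup concrete. The `generator` and `discriminator` here are hypothetical stand-ins for real modules; production face-generation GANs such as StyleGAN are far more elaborate.

```python
# One standard GAN training step: the discriminator learns to tell
# real from generated samples, then the generator learns to fool it.
import torch
import torch.nn as nn

bce = nn.BCEWithLogitsLoss()

def gan_step(generator, discriminator, g_opt, d_opt, real, z):
    # 1. Train the discriminator: real samples labeled 1, fakes labeled 0.
    d_opt.zero_grad()
    fake = generator(z).detach()   # detach so only D updates here
    d_loss = bce(discriminator(real), torch.ones(real.size(0), 1)) + \
             bce(discriminator(fake), torch.zeros(fake.size(0), 1))
    d_loss.backward()
    d_opt.step()

    # 2. Train the generator: try to make D label fakes as real.
    g_opt.zero_grad()
    fake = generator(z)
    g_loss = bce(discriminator(fake), torch.ones(z.size(0), 1))
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```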
Training GANs is a time-consuming iterative technique that greatly increases the cost in compute time over autoencoders. Currently, GANs are more appropriate for generating realistic single image frames of imaginary people (e.g. StyleGAN) than for creating deepfake videos. That could change as deep learning hardware becomes faster.
How to detect deepfakes
Early in 2020, a consortium of AWS, Facebook, Microsoft, the Partnership on AI’s Media Integrity Steering Committee, and academics launched the Deepfake Detection Challenge (DFDC), which ran on Kaggle for four months.
The contest included two well-documented prototype solutions, an introduction and a starter kit. The winning solution, by Selim Seferbekov, also has a fairly good writeup.
The details of the solutions will make your eyes cross if you’re not into deep neural networks and image processing. Essentially, the winning solution performed frame-by-frame face detection and extracted Structural Similarity Index (SSIM) masks. The software extracted the detected faces plus a 30 percent margin, and used an EfficientNet B7 network pretrained on ImageNet for encoding (classification). The solution is now open source.
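Here is a rough sketch of the shape of that frame-level pipeline, assuming facenet-pytorch’s MTCNN for face detection and timm’s pretrained EfficientNet B7. Seferbekov’s actual code differs in detail, and the fresh one-logit classifier head created here would still need to be trained on deepfake data before the scores meant anything.

```python
# Rough sketch of a DFDC-style frame pipeline: detect a face, crop it
# with a 30 percent margin, and classify the crop with EfficientNet B7.
# The real solution adds SSIM-based masks, augmentation, and ensembling.
import numpy as np
import torch
import timm
from facenet_pytorch import MTCNN
from PIL import Image

detector = MTCNN(keep_all=False)
model = timm.create_model("tf_efficientnet_b7", pretrained=True, num_classes=1)
model.eval()   # head is randomly initialized; fine-tune before real use

def fake_probability(frame: Image.Image) -> float:
    boxes, _ = detector.detect(frame)
    if boxes is None:
        return 0.5                                # no face: uninformative
    x1, y1, x2, y2 = boxes[0]
    mx, my = 0.3 * (x2 - x1), 0.3 * (y2 - y1)     # 30 percent margin
    crop = frame.crop((int(x1 - mx), int(y1 - my),
                       int(x2 + mx), int(y2 + my))).resize((380, 380))
    x = torch.from_numpy(np.asarray(crop)).permute(2, 0, 1).float() / 255.0
    with torch.no_grad():
        return torch.sigmoid(model(x.unsqueeze(0))).item()
```

Per-frame scores like these are then aggregated (for example, averaged) across sampled frames to score a whole video.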
Sadly, even the winning solution could only catch about two-thirds of the deepfakes in the DFDC test database.
Deepfake creation and detection applications
One of the best open source video deepfake creation applications is currently Faceswap, which builds on the original deepfake algorithm. It took Ars Technica writer Tim Lee two weeks, using Faceswap, to create a deepfake that swapped the face of Lieutenant Commander Data (Brent Spiner) from Star Trek: The Next Generation into a video of Mark Zuckerberg testifying before Congress. As is typical for deepfakes, the result doesn’t pass the sniff test for anyone with significant graphics sophistication. So, the state of the art for deepfakes still isn’t very good, with rare exceptions that depend more on the skill of the “artist” than the technology.
That’s somewhat comforting, given that the winning DFDC detection solution isn’t very good, either. Meanwhile, Microsoft has announced, but has not released as of this writing, Microsoft Video Authenticator. Microsoft says that Video Authenticator can analyze a still photo or video to provide a percentage chance, or confidence score, that the media is artificially manipulated.
Video Authenticator was tested against the DFDC dataset; Microsoft hasn’t yet reported how much better it is than Seferbekov’s winning Kaggle solution. It would be typical for an AI contest sponsor to build on and improve on the winning solutions from the contest.
Facebook is also promising a deepfake detector, but plans to keep the source code closed. One problem with open-sourcing deepfake detectors such as Seferbekov’s is that deepfake generation developers can use the detector as the discriminator in a GAN to guarantee that the fake will pass that detector, eventually fueling an AI arms race between deepfake generators and deepfake detectors.
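A sketch shows why a published detector is so useful to an attacker: freeze a copy of the detector and use it as the discriminator while fine-tuning a generator against it. The `generator` and `detector` below are hypothetical modules, not Seferbekov’s code, and the label convention is assumed.

```python
# Sketch of the arms-race risk: fine-tune a generator against a
# frozen, publicly available detector used as a fixed discriminator.
import torch
import torch.nn as nn

def evade_step(generator, detector, g_opt, z):
    for p in detector.parameters():
        p.requires_grad_(False)        # the public detector stays frozen
    g_opt.zero_grad()
    fake = generator(z)
    # Push the detector's output toward the "real" label
    # (assumed here to be 0, with 1 meaning "fake").
    loss = nn.functional.binary_cross_entropy_with_logits(
        detector(fake), torch.zeros(fake.size(0), 1))
    loss.backward()
    g_opt.step()
    return loss.item()
```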
On the audio front, Descript Overdub and Adobe’s demonstrated but as-yet-unreleased VoCo can make text-to-speech close to realistic. You train Overdub for about 10 minutes to create a synthetic version of your own voice; once trained, you can edit your voiceovers as text.
A related technology is Google WaveNet. WaveNet-synthesized voices are more realistic than standard text-to-speech voices, although not quite at the level of natural voices, according to Google’s own testing. You’ve heard WaveNet voices if you have used voice output from Google Assistant, Google Search, or Google Translate recently.
Deepfakes and non-consensual pornography
As I mentioned earlier, the original deepfake swapped the face of an actress onto the body of a porn performer in a video. Reddit has since banned the /r/deepfakes sub-Reddit that hosted that and other pornographic deepfakes, since most of the content was non-consensual pornography, which is now illegal, at least in some jurisdictions.
Another sub-Reddit for non-pornographic deepfakes still exists at /r/SFWdeepfakes. While the denizens of that sub-Reddit claim they’re doing good work, you’ll have to judge for yourself whether, say, seeing Joe Biden’s face badly faked into Rod Serling’s body has any value — and whether any of the deepfakes there pass the sniff test for credibility. In my opinion, some come close to selling themselves as real; most can charitably be described as crude.
Banning /r/deepfakes does not, of course, eliminate non-consensual pornography, which may have multiple motivations, including revenge porn, which is itself a crime in most US states. Other sites that have banned non-consensual deepfakes include Gfycat, Twitter, Discord, Google, and Pornhub, and finally (after much foot-dragging) Facebook and Instagram.
In California, individuals targeted by sexually explicit deepfake content made without their consent have a cause of action against the content’s creator. Also in California, the distribution of malicious deepfake audio or visual media targeting a candidate running for public office within 60 days of their election is prohibited. China requires that deepfakes be clearly labeled as such.
Deepfakes in politics
Many other jurisdictions lack laws against political deepfakes. That can be troubling, especially when high-quality deepfakes of political figures make it into wide distribution. Would a deepfake of Nancy Pelosi be worse than the conventionally slowed-down video of Pelosi manipulated to make it sound like she was slurring her words? It could be, if produced well. For example, see this video from CNN, which concentrates on deepfakes relevant to the 2020 presidential campaign.
Deepfakes as excuses
“It’s a deepfake” is also a possible excuse for politicians whose real, embarrassing videos have leaked out. That recently happened (or allegedly happened) in Malaysia when a gay sex tape was dismissed as a deepfake by the Minister of Economic Affairs, even though the other man shown in the tape swore it was real.
On the flip side, the distribution of a probable amateur deepfake of the ailing President Ali Bongo of Gabon was a contributing factor to a subsequent military coup against Bongo. The deepfake video tipped off the military that something was wrong, even more than Bongo’s extended absence from the media.
More deepfake examples
A recent deepfake video of All Star, the 1999 Smash Mouth classic, is an example of manipulating video (in this case, a mashup from popular movies) to fake lip synching. The creator, YouTube user ontyj, notes he “Got carried away testing out wav2lip and now this exists...” It’s amusing, although not convincing. Nevertheless, it demonstrates how much better faking lip motion has gotten. A few years ago, unnatural lip motion was usually a dead giveaway of a faked video.
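For the curious, the Wav2Lip repository (https://github.com/Rudrabha/Wav2Lip) documents a one-line inference command for this kind of lip-sync manipulation; here it is wrapped in Python, with placeholder file paths.

```python
# Re-lip an existing video to a new audio track with Wav2Lip, using
# the inference command documented in the project's README.
# All file paths below are placeholders.
import subprocess

subprocess.run([
    "python", "inference.py",
    "--checkpoint_path", "checkpoints/wav2lip_gan.pth",  # pretrained model
    "--face", "input_video.mp4",                         # video to re-lip
    "--audio", "new_speech.wav",                         # speech to sync to
], check=True)
```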
It could be worse. Have a look at this deepfake video of President Obama as the target and Jordan Peele as the driver. Now imagine that it didn’t include any context revealing it as fake, and included an incendiary call to action.
Are you terrified yet?