Vishing Attacks Increase by 442% via AI Cloning
Three seconds. That’s all a modern voice cloning model needs—just three seconds of your voice—to produce a replica convincing enough to fool your CFO, your IT helpdesk, or your mom.
I’ve been tracking this space since my DreamFlare days, when we were building entertainment products with generative AI. The speed at which offensive tooling has matured? Genuinely unsettling.
The numbers back it up. CrowdStrike’s threat intelligence team documented a 442% increase in voice phishing (vishing) attacks between the first and second halves of 2024. Not a typo. Four hundred and forty-two percent.
The Shape of the Spike #
Before September 2024, CrowdStrike’s OverWatch team had tracked exactly two vishing-related intrusions across their entire client base. Two.
Then September hit: 33 incidents. October: 55. November: 64. December: 93. That’s not a trend line; it’s a cliff.
What changed wasn’t the concept—vishing has existed since rotary phones. What changed was the economics. ElevenLabs, Resemble AI, and a growing ecosystem of open-source models have made real-time voice cloning accessible to anyone with a laptop and an internet connection. Combine that with caller ID spoofing (which STIR/SHAKEN attestation has blunted but not eliminated) and you’ve got an attack surface that scales beautifully for the attacker.
The FBI issued a formal warning about AI-generated voice messages impersonating senior US officials. Not hypothetically. Actually happening, in the wild, targeting real people with real authority.
Why Social Engineering Won #
Security teams spent the last decade hardening perimeters, patching CVEs, and deploying zero-trust architectures. Good. Necessary. But while we were focused on technical exploits, threat actors pivoted to the softest target available: humans.
CrowdStrike tracks several groups exploiting this gap. CURLY SPIDER, CHATTY SPIDER, and PLUMP SPIDER—colorful names for groups doing genuinely nasty work—have all leveraged social engineering campaigns for credential theft. Their playbook isn’t sophisticated in the traditional sense. No zero-days. No custom malware. Just a convincing voice on the phone saying the right things to the right person at the right time.
I find that shift fascinating and terrifying in equal measure. We’ve built increasingly complex technical defenses, and the attackers responded by going around them entirely. It’s the security equivalent of building a ten-foot wall and watching someone walk through the unlocked gate.
The Economics of AI-Powered Vishing #
Here’s what makes this different from traditional social engineering: the marginal cost per attack approaches zero.
A human caller running vishing campaigns can make maybe 30-40 calls per day, needs training, gets tired, makes mistakes, and has a recognizable voice that can be traced. An AI-powered vishing operation can run thousands of simultaneous calls, each with a different cloned voice matched to whoever the target expects to hear from. The voice never gets tired. It doesn’t have off days. And it can be swapped out in seconds if burned.
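To make the asymmetry concrete, here’s a back-of-envelope comparison. Every figure except the human rate of roughly 30-40 calls per day is an illustrative assumption, not sourced data:

```python
# Back-of-envelope cost-per-call comparison.
# All figures below are illustrative assumptions, except the
# human rate of ~30-40 calls/day mentioned in the text.
human_calls_per_day = 35
human_cost_per_day = 200.0        # assumed daily wage for a scam caller

ai_concurrent_lines = 1000        # assumed concurrency of an automated rig
ai_calls_per_line_per_day = 100   # assumed short calls, round the clock
ai_cost_per_day = 50.0            # assumed compute + telephony spend

ai_calls = ai_concurrent_lines * ai_calls_per_line_per_day

print(f"human: ${human_cost_per_day / human_calls_per_day:.2f} per call")
print(f"ai:    ${ai_cost_per_day / ai_calls:.6f} per call")
```

Even with generous assumptions in the human caller’s favor, the automated operation comes in several orders of magnitude cheaper per attempt, which is the whole story of why this attack scales.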
The global projections are staggering. Deloitte estimates $40 billion in losses to deepfake-enabled fraud by 2027. That number sounds implausible until you consider how many organizations still rely on voice verification for high-value transactions. “Call the CFO to confirm the wire transfer” stops working when the CFO’s voice can be synthesized from a conference talk posted on YouTube.
What Actually Works (And What Doesn’t) #
I’ll be honest: the defense landscape for AI-powered vishing is immature. We don’t yet have reliable real-time deepfake detection for phone calls. Carrier-level authentication standards for AI-generated voices don’t exist. Most enterprise voice biometric systems weren’t designed to handle synthetic speech.
So what do you do?
Out-of-band verification. If someone calls requesting credentials, a wire transfer, or access to anything sensitive—hang up and call them back on a known number. Not the number they called from. A number you already have. This is clunky and slow and it works.
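The key property of out-of-band verification is that the callback number comes from a directory you already trust, never from the inbound call. A minimal sketch of that lookup rule, with hypothetical names and numbers:

```python
from typing import Optional

# Hypothetical directory of pre-verified numbers, maintained out of band.
# Names and numbers are illustrative placeholders.
KNOWN_NUMBERS = {
    "cfo@example.com": "+1-555-0100",
    "it-helpdesk@example.com": "+1-555-0101",
}

def callback_number(requester_id: str, caller_id_shown: str) -> Optional[str]:
    """Return the number to call back on.

    Always sourced from the trusted directory; the caller ID presented on
    the inbound call is deliberately ignored, because it can be spoofed.
    """
    trusted = KNOWN_NUMBERS.get(requester_id)
    if trusted is None:
        return None  # unknown requester: escalate, don't improvise
    return trusted   # note: caller_id_shown plays no role in the decision
```

The point of the sketch is the one-way data flow: the inbound caller ID never influences where you call back.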
Callback code words. Some organizations are implementing shared passphrases that change periodically. It feels almost absurdly low-tech, but it defeats voice cloning because the attacker doesn’t know the code word. (Unless they’ve also compromised your internal comms, in which case you have bigger problems.)
Skepticism training. Not the usual security awareness checkbox exercise; actual scenario-based training where employees hear AI-cloned voices and try to spot them. Most people are shockingly bad at this initially. They get better with practice.
Reducing public voice exposure. This one’s uncomfortable because it conflicts with the trend toward personal branding, podcasts, and video content. But every minute of your voice on YouTube is training data for an attacker. I’m not saying stop doing conference talks—I gave one at Google I/O in 2022 and I’d do it again. I’m saying be aware that your public voice is now an attack surface.
The Uncomfortable Reality #
The 442% increase isn’t the ceiling. It’s the floor.
We’re in the early innings of a fundamental shift in how social engineering attacks work. The tools will get cheaper. The voices will get more convincing. The attacks will get more targeted as threat actors combine voice cloning with social media scraping to build detailed impersonation pretexts.
I’ve been in security-adjacent roles long enough to know that the industry tends to react to these shifts about 18 months too late. We build detection for last year’s attacks. We write policies for yesterday’s threat model. And we consistently underestimate how quickly offensive tooling democratizes.
The organizations that will weather this well are the ones treating vishing defense as a process problem, not a technology problem. Technology will catch up eventually. In the meantime, the best defense is procedural: verify through separate channels, trust nothing that arrives uninvited, and train your people to be comfortable hanging up on someone who sounds exactly like their boss.
That last part is harder than it sounds. We’re socially wired to trust familiar voices. Attackers know this. At three seconds of audio and zero marginal cost, they’re going to exploit it relentlessly.