GPT-4: The New Frontier of Intelligent Reasoning

Felipe Hlibco

Three weeks ago, OpenAI released GPT-4. I’ve been using it almost daily since, and something feels different this time.

Not the “oh cool, it writes poems” kind of different. More like the “wait, it actually understood what I was asking” kind. I’ve spent years working with language models in various capacities — GPT-4 is the first one where I regularly forget I’m talking to a machine.

The Bar Exam Thing

Let’s start with the number everyone’s quoting: GPT-4 scored in the top 10% on a simulated bar exam. GPT-3.5 scored in the bottom 10%. Same test; same format. The jump happened in roughly four months of iteration.

That’s not incremental improvement. That’s a qualitative shift.

I ran my own (far less rigorous) tests: complex multi-step logic puzzles; code architecture decisions; nuanced writing feedback. GPT-3.5 would get halfway there, then hallucinate its way to a confident-sounding wrong answer. GPT-4 gets it right more often than not. When it doesn’t, it tends to acknowledge uncertainty — which is arguably more useful than a confident wrong answer.

The benchmarks tell a similar story across the board. AP exams; SAT; GRE quantitative. It’s not just memorizing patterns; the model demonstrates something that looks uncomfortably close to reasoning. Whether it actually is reasoning is a philosophical debate I’ll dodge for now.

Multimodal Input: Seeing Is Believing

GPT-4 accepts images alongside text. This is new for the GPT family, and it opens up use cases that pure text models can’t touch.

OpenAI’s demo showed the model explaining a photo of the inside of a fridge and suggesting recipes. Cute, sure. But the real implications are bigger. Think about accessibility tools that describe visual content; technical support workflows where a user just photographs their error screen; medical imaging triage; architecture review from diagrams.

I haven’t gotten deep access to the image capabilities yet (it’s rolling out slowly), but the technical report shows strong performance on visual understanding tasks. The model can parse charts; read handwritten notes; reason about spatial relationships in photographs. It’s not perfect — it sometimes misreads text in images or invents details that aren’t there — but the baseline capability is genuinely impressive.

The Alignment Investment

OpenAI spent six months on safety work before releasing GPT-4. Six months. In a field where “ship it and iterate” is the default posture, that’s a meaningful signal.

They used adversarial testing with domain experts (including red-teamers who tried to make the model produce harmful content) and applied RLHF at scale. The result: GPT-4 produces fewer hallucinations than GPT-3.5 and handles sensitive topics with more nuance. It’s still not reliable enough for high-stakes decisions without human oversight, but the gap between “fun toy” and “useful tool” narrowed considerably.

The honest thing OpenAI did in the technical report was acknowledge GPT-4’s limitations clearly. It still hallucinates. It still makes reasoning errors, sometimes with alarming confidence. It has a knowledge cutoff (September 2021 training data), so it doesn’t know about recent events. These aren’t footnotes in the report; they’re prominently stated.

I appreciate that transparency, even if I suspect it’s partly strategic. Better to set expectations than deal with the backlash of overpromising.

The Access Question

GPT-4 is available through ChatGPT Plus at $20/month and via the API (with a waitlist). This creates a two-tier access model that didn’t exist before.
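For builders, API access means GPT-4 plugs into the same chat-completions interface as GPT-3.5: you just pass a different model name. A minimal sketch using the 2023-era `openai` Python package's `ChatCompletion` interface — the helper name, system prompt, and temperature are my own choices, not anything prescribed:

```python
def build_gpt4_request(prompt: str) -> dict:
    """Assemble a chat-completions payload for a single user turn.

    Swapping models is just a matter of changing the "model" field;
    the message format is identical to GPT-3.5's.
    """
    return {
        "model": "gpt-4",
        "messages": [
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": prompt},
        ],
        "temperature": 0.2,  # lower = more deterministic answers
    }

# Sending it requires an API key and a spot off the waitlist:
#
#   import os, openai
#   openai.api_key = os.environ["OPENAI_API_KEY"]
#   response = openai.ChatCompletion.create(**build_gpt4_request("Hello"))
#   print(response["choices"][0]["message"]["content"])
```

The upshot is that anything already built against the GPT-3.5 chat API can try GPT-4 with a one-line change — once the waitlist clears.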

With GPT-3.5, everyone got the same model. Now there’s a free tier and a paid tier with a meaningfully better model behind it. I’m not sure how I feel about this. On one hand, API access and a subscription model make the economics sustainable. On the other hand, the best AI reasoning tool in the world being paywalled has implications for who gets to build with it.

Microsoft’s Bing Chat runs on GPT-4 and is free, which complicates the picture. Anthropic’s Claude (still in early access) offers a different approach to the same class of problems. The competitive landscape is heating up quickly; we’re past the era of one model to rule them all.

What Actually Changed

Here’s what I keep coming back to: GPT-4 doesn’t feel like a bigger GPT-3.5. It feels like a different category.

The reasoning benchmarks; the multimodal capability; the alignment improvements — taken individually, each is impressive but evolutionary. Taken together, they represent something else. A model that can read an image, reason about its contents, connect that reasoning to a broader knowledge base, and articulate its conclusions in natural language. That’s a capability stack that didn’t exist six months ago.

I’m not in the “AGI is here” camp. Not even close. But I am in the “this changes what’s possible for builders” camp. If you’re a product engineer, an educator, a researcher, or anyone whose work involves processing and synthesizing information (so, basically everyone), GPT-4 is worth paying attention to.

Not because it’s perfect. Because it’s the first model where the gap between “what it can do” and “what we need it to do” is small enough to build real things in the middle.