Gemini Ultra 1.0 vs ChatGPT 4 (My First Impression)

There’s been a lot of anticipation around whether Google could step up and compete with ChatGPT, the extremely capable AI assistant from OpenAI that took the world by storm in late 2022.

I finally got access to Google’s newest AI model, Gemini Ultra 1.0, and in this post I’ll share my first impressions after running it through the standard test suite I use whenever I get my hands on a new large language model.

We’ll take a look at its performance in logical reasoning, reasoning about the physical world, coding, and image captioning.

I’ll also compare some of the responses directly to what ChatGPT produces to get a sense of how they stack up.

My Initial Impressions of the Gemini UI

Taking a quick tour of the Gemini UI, I like what I see so far. It has a clean, dark mode design that will be familiar to ChatGPT users. The real-time response is great to see, and the ability to upload images directly for captioning is handy.

It does mention having extensions as well, though I won’t dive into those just yet.

Overall, my first impressions are positive – it looks well-designed and very capable as Google’s “most advanced AI assistant.” Now let’s see if it lives up to that billing.

Test #1: Logical Reasoning

I like to start my evaluations with some logical reasoning questions to test how well these models can take in a word problem, extract the key details, and reason step-by-step to an answer.

Here was the first one I tried:

I hang 5 shirts out to dry in the sun. After 10 hours all the shirts are dry. The next day I hang 10 shirts out with the same conditions. How long will it take to dry 10 shirts?

Let’s see what Gemini came up with:

It would be reasonable for 10 shirts to take a bit longer than 10 hours.

Unfortunately, this first try missed the mark. When I pushed it to explain its reasoning, it doubled down:

Drying time doesn’t scale in a straightforward way, so more shirts may take disproportionately longer.

I gave it one more chance to get this right:

Considering it took 10 hours to dry 5 shirts, it’s likely 10 shirts will also be within a similar timeframe.

Better, but still not precise enough to show it fully grasps the reasoning. As a comparison, here is ChatGPT’s response:

If the conditions are truly identical, then 10 shirts would dry in the same amount of time – 10 hours. The drying time depends on the external conditions, not the number of shirts if all other variables are constant.

ChatGPT nails the reasoning here. So point one goes to ChatGPT in this first round.

Test #2: Physical World Reasoning

For my next test, I wanted to see how Gemini handles a more complex word problem involving reasoning about objects interacting in the physical world:

I live in New York City. One morning I go into my garage and pick up a small ball and a bag with a hole in the bottom bigger than the ball…

I then describe placing the ball in the bag, walking to my office, dropping the bag in a box which I seal and mail to my friend in London. The question is:

Where is the ball now?

The right reasoning is that since the hole is bigger than the ball, it likely dropped out in my office before I mailed the box. Let’s see Gemini’s response:

The ball is now in London with your friend.

Unfortunately, Gemini fails to account for the hole in the bag and incorrectly assumes the ball made it to London. Here is ChatGPT’s response for comparison:

The most likely explanation is that the ball remained in your office on the floor, as the hole was bigger than the ball itself. So it would not have been shipped to your friend in London.

Another miss for Gemini. It seems like ChatGPT is doing better at this more complex line of deductive reasoning.

Test #3: Coding Test

I like to test these models on their ability to generate runnable code, so my next test was to prompt Gemini to:

Write Python code for a snake game with a GUI that the user can copy and run.

After a couple of failed tries, Gemini’s third attempt produced a snake game that actually ran. As a comparison, ChatGPT served up working code on the first attempt.

So there’s a consistency gap there, but it’s good to see that Gemini can generate runnable Python code after a few tries.
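
Since the exact code Gemini produced isn’t reproduced here, below is a minimal sketch of my own showing roughly what a working answer looks like, using tkinter from the standard library so it can be copied and run directly. It is illustrative only, not Gemini’s (or ChatGPT’s) actual output.

# Minimal illustrative snake game using tkinter (standard library).
# This is my own sketch, not the code either model generated.
import random
import tkinter as tk

CELL = 20      # pixel size of one grid cell
GRID = 20      # the board is GRID x GRID cells
TICK_MS = 120  # game speed in milliseconds

class SnakeGame:
    def __init__(self, root):
        self.root = root
        self.canvas = tk.Canvas(root, width=CELL * GRID, height=CELL * GRID, bg="black")
        self.canvas.pack()
        self.snake = [(GRID // 2, GRID // 2)]  # list of (x, y) cells, head first
        self.direction = (1, 0)                # start moving right
        self.food = self.place_food()
        root.bind("<Up>", lambda e: self.turn(0, -1))
        root.bind("<Down>", lambda e: self.turn(0, 1))
        root.bind("<Left>", lambda e: self.turn(-1, 0))
        root.bind("<Right>", lambda e: self.turn(1, 0))
        self.tick()

    def turn(self, dx, dy):
        # Ignore a turn that would reverse the snake onto itself.
        if (dx, dy) != (-self.direction[0], -self.direction[1]):
            self.direction = (dx, dy)

    def place_food(self):
        free = [(x, y) for x in range(GRID) for y in range(GRID) if (x, y) not in self.snake]
        return random.choice(free)

    def tick(self):
        head_x, head_y = self.snake[0]
        new_head = (head_x + self.direction[0], head_y + self.direction[1])
        # End the game on a wall hit or self-collision.
        if (not 0 <= new_head[0] < GRID or not 0 <= new_head[1] < GRID
                or new_head in self.snake):
            self.canvas.create_text(CELL * GRID // 2, CELL * GRID // 2,
                                    text="Game Over", fill="white", font=("Arial", 24))
            return
        self.snake.insert(0, new_head)
        if new_head == self.food:
            self.food = self.place_food()  # grow: keep the tail, spawn new food
        else:
            self.snake.pop()               # move: drop the tail cell
        self.draw()
        self.root.after(TICK_MS, self.tick)

    def draw(self):
        self.canvas.delete("all")
        fx, fy = self.food
        self.canvas.create_oval(fx * CELL, fy * CELL, (fx + 1) * CELL, (fy + 1) * CELL, fill="red")
        for x, y in self.snake:
            self.canvas.create_rectangle(x * CELL, y * CELL, (x + 1) * CELL, (y + 1) * CELL, fill="green")

if __name__ == "__main__":
    root = tk.Tk()
    root.title("Snake")
    SnakeGame(root)
    root.mainloop()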

Test #4: Explaining Python Code

I wanted to test how Gemini handles explaining Python code, so I threw this short function at it:

def factorial(n):
    if n == 0:
        return 1
    else: 
        return n * factorial(n-1)

Can you please explain what this recursive factorial function does step-by-step?

Surprisingly, Gemini refused:

I’m just a language model, so I can’t help you with that.

However, ChatGPT had no issues providing a concise explanation of how the recursive factorial function operates.
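
To make that kind of step-by-step explanation concrete, here is a small trace of my own (not ChatGPT’s verbatim answer) showing how the recursion unwinds; the extra depth parameter is only there for the printout.

# My own illustrative trace of the recursion, not ChatGPT's verbatim answer.
def traced_factorial(n, depth=0):
    print("  " * depth + f"factorial({n}) called")
    if n == 0:
        print("  " * depth + "base case reached: return 1")
        return 1
    result = n * traced_factorial(n - 1, depth + 1)
    print("  " * depth + f"return {n} * factorial({n - 1}) = {result}")
    return result

traced_factorial(3)  # prints the calls nesting down to 0, then unwinding: 3 * 2 * 1 * 1 = 6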

Gemini’s refusal was concerning; this is exactly the sort of task it should be capable of. I’ll have to dig into this more.

Test #5: Image Captioning

As one final test, I wanted to verify that Gemini could actually caption images as advertised.

I uploaded a picture of some Nvidia merch I’ll be giving away and asked it to describe the contents:

The image shows a purple t-shirt, pair of black socks, a white mug, and a white Nvidia graphic card placed on a wooden table. The socks and tshirt feature the Nvidia logo.

The caption is perfectly accurate, so the image captioning feature appears to be working. Thumbs up here for Gemini!
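
For anyone who wants to do the same captioning programmatically rather than through the web UI, here is a rough sketch using Google’s google-generativeai Python SDK. Note my assumptions: at the time of writing, “gemini-pro-vision” was the documented vision model for the public API (whether Ultra is exposed this way is unclear to me), and the API key and file name below are placeholders.

# Rough sketch of image captioning via the google-generativeai SDK.
# Assumption: "gemini-pro-vision" is the publicly documented vision model;
# whether Ultra is reachable this way via the API is unclear to me so far.
import google.generativeai as genai
from PIL import Image  # pip install pillow

genai.configure(api_key="YOUR_API_KEY")            # placeholder key
model = genai.GenerativeModel("gemini-pro-vision")

image = Image.open("nvidia_merch.jpg")             # placeholder local file
response = model.generate_content([image, "Describe the contents of this image."])
print(response.text)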

Conclusion and Next Steps

So those were my first impression tests of Google’s new Gemini Ultra AI assistant.

I saw some positives, like the fast response time, the ability to generate runnable code, and accurate image captioning. However, it doesn’t yet match the consistency and reasoning ability I’ve come to expect from ChatGPT.

I need to dig into Gemini a bit more, and hopefully get access to the actual API, which may reveal more capabilities.
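
As a sketch of what that might look like, here is how I would rerun the text prompts through the same SDK once the Ultra model is exposed. The model name below is a placeholder, since only “gemini-pro” was generally available through the API when I wrote this.

# Rough sketch of rerunning the text prompts via the API.
# The model name is a placeholder; swap in the Ultra identifier once it is exposed.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")       # placeholder key
model = genai.GenerativeModel("gemini-pro")   # placeholder model name

test_prompts = [
    "I hang 5 shirts out to dry in the sun. After 10 hours all the shirts are dry. "
    "The next day I hang 10 shirts out with the same conditions. "
    "How long will it take to dry 10 shirts?",
    "Write Python code for a snake game with a GUI that the user can copy and run.",
]

for prompt in test_prompts:
    response = model.generate_content(prompt)
    print(prompt[:60], "->", response.text[:200])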

Some of the refusal behaviors had me questioning whether I actually had access to Ultra, so I need to verify that as well.

But I hope this overview was helpful if you’ve been curious about how Google’s latest offering stacks up against the current AI leader. While promising in areas, I don’t think it has caught up just yet based on my early testing.

What other tests would you want to see run to continue evaluating Gemini Ultra? Let me know in the comments! I plan to revisit this and share more thorough benchmarking once I get API access.
