
You’ll Be Surprised: Bard vs. GPT-4 - Which LLM Gets You Better Results?

Matt Payne · September 8, 2023

Bard is Google's alternative to OpenAI's GPT-4 and ChatGPT services. But how well does it work for business use cases? Can you use it for your applications? Does it provide application programming interfaces (APIs)? And how does it fare against GPT-4 on language tasks?

Find out all that and more as we pit Bard vs. GPT-4 in this analysis.

Introduction to Bard and PaLM

Bard is Google's general-purpose chatbot powered by large language models (LLMs) and meant for all users. Like OpenAI's ChatGPT, it's a web application where users can enter complex prompts or questions on any subject and get back helpful results.

PaLM 2 is the LLM powering Bard. We explore some of their capabilities and differences below.

Bard vs. PaLM Capabilities

Both Bard and PaLM can search for images and interpret images you upload. This is a key difference from OpenAI's ChatGPT and GPT-4, which do not currently support images as inputs or outputs.

Bard finds and displays an image that matches the prompt

PaLM finds and returns an image link (in Markdown) that matches the prompt

Bard correctly identified most of the food items and ingredients in this photo we uploaded:

Bard describes photos (Original Photo by Thomas Park on Unsplash)

Compare Bard's results to the photo's description: "Top down view of a delicious grilled chicken fajita family mexican food dinner with peppers, onions, grated cheese, flour tortillas, limes and salsa on a table."

You can't upload photos to the PaLM API, but you can include links to images in your prompts. However, PaLM's results are not as accurate or specific as Bard's, and they are sometimes outright hallucinations (PaLM is probably just using the URL's text rather than the image at the URL):

response from Bard

Bard's and PaLM's results sometimes differ significantly. The differences suggest that the PaLM instance behind Bard is actively updated with recent information, while the PaLM instance available to you through the APIs is frozen with information from 2021 or 2022.

For example, asking Bard for the latest news from the EU (in July 2023) returns accurate current affairs:

Bard returns up-to-date information

In contrast, the PaLM API (accessed via Google Cloud's Generative AI Studio) responds with accurate but old news from two years earlier:

PaLM is not as actively updated and returns old information

A potential problem you should be aware of is hallucination. Neither Bard nor PaLM is immune to hallucinations, which may surprise some users since Bard is often used like a search engine and sometimes even provides source links. In the example below, both were asked to list the differences between two phones, one real and the other fake. We created this very cool doc that outlines some prompting frameworks that help remove hallucinations with LLMs (check it out!).

Both invented features for the latter instead of warning that the phone didn't exist or they didn't know anything about it:

Bard hallucinates the features of a non-existent Samsung S40 phone

PaLM also hallucinates the features of a non-existent Samsung S40 phone

For the rest of the article, we'll use Bard and PaLM interchangeably. Let's see how Bard performs on common language tasks.

Evaluation of Bard vs. GPT-4 on Language Tasks

In the following sections, we evaluate PaLM against GPT-4 on common production language use cases like:

  • Legal summarization
  • Ad copy generation

We first evaluate select examples manually and qualitatively.

After that, we run both through an extensive LLM benchmarking suite that evaluates the results using another LLM. Combined, these should give you a pretty good idea of what to expect from these two LLMs in production.

Use Case: Legal Clause Extractive Summarization

In this test, we asked Bard, the PaLM API, and GPT-4 to create extractive summaries of this snippet from a legal contract using the following prompt:

"Pick two key sentences from each section below:"

key sentence extraction
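For reference, here is how you could reproduce the PaLM API and GPT-4 runs programmatically. This is a minimal sketch, not the exact setup behind the screenshots: the model names, parameters, and the CONTRACT_TEXT placeholder are our assumptions.

import google.generativeai as palm
import openai

palm.configure(api_key="YOUR_PALM_API_KEY")
openai.api_key = "YOUR_OPENAI_API_KEY"

CONTRACT_TEXT = "..."  # paste the legal contract snippet here
prompt = f"Pick two key sentences from each section below:\n\n{CONTRACT_TEXT}"

# PaLM API ("text-bison-001" was the public PaLM 2 text model at the time)
palm_summary = palm.generate_text(
    model="models/text-bison-001",
    prompt=prompt,
    temperature=0.0,
).result

# GPT-4 via the chat completions endpoint (openai-python 0.x client)
gpt4_summary = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)["choices"][0]["message"]["content"]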

Bard came up with this:

BARD results on extraction

PaLM API produced this:

parameters with PALM API

And GPT-4 generated this summary:

GPT-4 results

Evaluation of Legal Summaries

Both LLMs created strictly extractive summaries in response to this prompt. In our experience, prompts like "pick sentences" or "select sentences" have worked better than instructions like "create an extractive summary."

Correctness aside, in terms of quality, Bard and PaLM did the better job here. Their selection of key sentences, as well as their presentation by section, is much better than GPT-4's. I absolutely love the reasoning section Bard provided, as that information is critical for evaluating outputs and iterating on the model's ability. They also chose sentences that fit the key ideas a bit better than GPT-4 did. While we could have provided a topic-based extractive summarization prompt (like the one I used in dialogue summarization), it's interesting to evaluate the outputs of a wide-data-variance prompt to see the model's ability to produce granular results on its own with little guidance.

Use Case: Bard vs. GPT-4 Ad Copy Generation

In this section, we try out both LLMs for ad copy generation and similar use cases.

Example #1: Catchy Headline for an Article

In the example below, we instructed both LLMs with this prompt:

"Help me construct a catchy, yet scientifically accurate, headline for an article on the latest discovery in renewable bio-energy, while carefully handling the ethical dilemmas surrounding bio-energy sources. Propose 4 options."

Although this specific example asks for headlines, the same prompts are ideal for generating ad copy too.

The LLMs generated the results shown below:

generated headlines for ads

Qualitative Evaluation

We observe that:

  • GPT-4 restricts itself to the concepts mentioned in the prompt — renewable bio-energy and ethics.
  • In contrast, PaLM seems to have made a conceptual leap to the climate crisis, but it's not entirely clear whether the ethical dilemmas are related to it.
  • The same problem is seen in the results of the follow-up prompt.

Example #2: Attention-Grabbing Ad Title for a Product

We asked both LLMs to generate appealing ad titles for a product with the following product information:

BARD generated results

The prompt given was:

"Write five jaw-dropping ad headlines based on the above product information that makes you want to buy this product."

Bard generated the following titles:

BARD generated titles results

GPT-4 came up with these:

GPT-4 results for headline generation

Qualitative Evaluation

We notice that:

  • PaLM tends towards rather abstract benefits like "taking your look to the next level" or "ultimate hair care package."
  • GPT-4 is much more specific about the benefits of this particular product.

GPT-4's headlines are much better. They show a stronger grasp of the product details and of the product's value to users. Phrases like "salon-worthy hair" show the model understands how to connect the product to customers' interests and draw them in.

Evaluate PaLM vs. GPT-4 Results Using MT-Bench

For production use cases, qualitative evaluation of ad hoc examples is not the best approach. It's neither scalable nor repeatable.

A better system is to maintain a suite of test prompts and contexts relevant to your business use case. Run them regularly against the LLMs under test. Then have another LLM compare the results from your LLM with those from a baseline LLM, or with manually curated results, as shown below. This makes the evaluation automated, repeatable, and scalable.

The FastChat MT-Bench web application is built for that very purpose. It comes with a suite of 80 prompts under eight categories, like these examples:

MT Bench results for PALM

You can also add your own custom prompts and contexts.
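For example, a custom two-turn prompt can be appended to the benchmark's question file. This sketch assumes FastChat's question.jsonl fields (question_id, category, turns) and default path, which may differ in your checkout:

import json

custom_question = {
    "question_id": 201,   # any id not already used in the file
    "category": "legal",  # your own category label
    "turns": [
        "Pick two key sentences from each section of the contract below: ...",
        "Now compress your selections into a three-sentence summary.",
    ],
}

# MT-Bench reads its prompts from a JSONL file, one question per line
with open("fastchat/llm_judge/data/mt_bench/question.jsonl", "a") as f:
    f.write(json.dumps(custom_question) + "\n")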

It submits each prompt to the selected LLMs and gets their results:

prompt results

The results are then evaluated by another LLM (in our case, GPT-4 itself, but it could also be a third LLM like Anthropic's Claude). It uses a set of judging prompts like these:

prompt outline with a system prompt
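Stripped of MT-Bench's machinery, the judging step boils down to something like the sketch below. The judging prompt here is paraphrased from MT-Bench's pairwise template rather than copied from it, and the openai-python 0.x client is assumed:

import openai

openai.api_key = "YOUR_OPENAI_API_KEY"

def judge(question: str, answer_a: str, answer_b: str) -> str:
    # The judge model sees both answers and must pick a winner or declare a tie
    system = (
        "You are an impartial judge. Compare the two AI assistants' answers "
        "to the user's question. Output '[[A]]' if assistant A is better, "
        "'[[B]]' if assistant B is better, or '[[C]]' for a tie."
    )
    user = (
        f"[User Question]\n{question}\n\n"
        f"[Assistant A's Answer]\n{answer_a}\n\n"
        f"[Assistant B's Answer]\n{answer_b}"
    )
    verdict = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
        temperature=0.0,  # deterministic judging
    )
    return verdict["choices"][0]["message"]["content"]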

The evaluation results for a prompt are shown below:

evaluation results with model judgement

MT-Bench Results

The final results, as evaluated by GPT-4, are as follows:

MT Bench results
  • PaLM was outright better than GPT-4 on only 5 of the 160 per-turn judgments (80 prompts, two turns each), or about 3% of the time.
  • It tied with GPT-4 on 28 judgments, meaning its results were as good as GPT-4's.
  • So, overall, PaLM gave good results (a win or a tie) about 20% of the time, while GPT-4 won outright the remaining 80% of the time.

In the next section, we bring out some technical differences between PaLM and GPT-4 that you must keep in mind if you decide to use them in production.

PaLM vs. GPT-4 Technical Differences

Both PaLM and GPT-4 are causal LLMs with decoder-only transformer architectures. We go into some of their details below.

PaLM vs. GPT-4 Models

Some of the model details that are publicly available are shown below:

palm vs gpt-4 model comparison

Surprisingly, according to some sources, PaLM 2 manages to be better than its predecessor PaLM with only 340 billion parameters against the latter's 540 billion.

PaLM vs. GPT-4 API Options

PaLM API options
PaLM API options

There's no direct API for Bard. However, the PaLM 2 LLM is accessible through two routes:

PaLM API

The PaLM API is part of the Google Generative AI suite. It's currently available as a beta preview on requesting access but isn't production-ready yet. The suite also includes a simple web application called MakerSuite for managing your prompts and few-shot examples.
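Getting started with it looks roughly like this (a minimal sketch using the google.generativeai Python client; which models you see depends on your access):

import google.generativeai as palm

palm.configure(api_key="YOUR_PALM_API_KEY")

# List the PaLM models your key can call and what each supports
for m in palm.list_models():
    print(m.name, m.supported_generation_methods)

# palm.chat() defaults to the public PaLM 2 chat model ("chat-bison-001" at the time)
response = palm.chat(messages="What can the PaLM API do?")
print(response.last)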

MakerSuite for prototyping with generative AI

Vertex AI

Vertex AI is a paid service that's part of Google Cloud and is available to you if you have a Google Cloud subscription. The PaLM LLM is available behind a Vertex AI API that is production-ready. Note that though the underlying LLM is the same, this API does not have the same structure and semantics as the PaLM API. Vertex AI also provides a Generative AI Studio playground where you can test your prompts and responses.
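Here's the equivalent call through the Vertex AI SDK, to illustrate the different structure (a sketch; the project, location, and parameters are placeholders):

import vertexai
from vertexai.language_models import TextGenerationModel

# Authentication goes through Google Cloud credentials, not an API key
vertexai.init(project="your-gcp-project", location="us-central1")

model = TextGenerationModel.from_pretrained("text-bison@001")
response = model.predict(
    "Pick two key sentences from each section below: ...",
    temperature=0.0,
    max_output_tokens=256,
)
print(response.text)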

Vertex AI's Generative AI Studio playground

PaLM vs. GPT-4 Fine-Tuning

As of August 2023, neither GPT-4 nor GPT-3.5 is fine-tunable. You can only fine-tune the legacy GPT-3 base models or GPT-3 Instruct models. GPT-3 fine-tuning is a simple workflow: you just upload a training file with pairs of prompts and completions, and call the fine-tuning API.
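In code, the whole flow is a few lines (a sketch against the openai-python 0.x client; train.jsonl is a hypothetical file of prompt/completion pairs):

import openai

openai.api_key = "YOUR_OPENAI_API_KEY"

# 1. Upload a JSONL training file of {"prompt": ..., "completion": ...} pairs
upload = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")

# 2. Start a fine-tune against a legacy base model such as davinci
job = openai.FineTune.create(training_file=upload["id"], model="davinci")

# 3. Once the job finishes, retrieve the tuned model's name and use it
#    like any other completion model:
# tuned = openai.FineTune.retrieve(job["id"])["fine_tuned_model"]
# openai.Completion.create(model=tuned, prompt="...")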

PaLM fine-tuning isn't available via the PaLM API, only via the Vertex AI fine-tuning API for enterprise users. It's also more complicated in terms of storage and authentication, as the steps below (and the sketch that follows them) show:

  • First, upload your training file to a Cloud Storage bucket.
  • Next, create a fine-tuning job using the Vertex AI pipelineJobs API and wait for it to finish.
  • Load the tuned model and use the same text generation or chat APIs as PaLM.
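Here's what those steps look like through the Vertex AI SDK, which wraps the pipelineJobs API for you. This is a sketch: the bucket URI, step count, and regions are placeholders, and method names may vary across SDK versions.

import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="your-gcp-project", location="us-central1")

model = TextGenerationModel.from_pretrained("text-bison@001")

# Kicks off a Vertex AI pipeline job; the training data must already be
# uploaded to a Cloud Storage bucket as JSONL
model.tune_model(
    training_data="gs://your-bucket/train.jsonl",
    train_steps=100,
    tuning_job_location="europe-west4",
    tuned_model_location="us-central1",
)

# Load the tuned model and call it with the same predict()/chat APIs
tuned_name = model.list_tuned_model_names()[-1]
tuned_model = TextGenerationModel.get_tuned_model(tuned_name)
print(tuned_model.predict("Pick two key sentences from each section below: ...").text)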

Overall, GPT-3 seems easier to fine-tune than PaLM.

Want to productionize GPT-4 or Bard?

Fortunately, both LLM families offer models that API users can fine-tune. Contact us to help you integrate LLMs that are carefully fine-tuned for your specific business needs.