This past week has been a remarkable week for AI announcements.
- On Monday, March 13th, Stanford released Alpaca 7B. On the same day, Google announced Med-PaLM 2, a new medical large language model (LLM)
- On Tuesday, March 14th, OpenAI released GPT-4 (more on this later). Anthropic released Claude, its GPT competitor. Google announced the PaLM API, MakerSuite, and its plan to add generative AI to its productivity suite (Google Docs, Gmail, Sheets, etc.)
- On Wednesday, March 15th, Midjourney released its latest model, the V5, which outputs hyper-realistic images
- On Thursday, March 16th, Microsoft announced its GPT-4 integration into the Microsoft 365 productivity suite
- On Friday, March 17th, Baidu showcased its ChatGPT competitor, Ernie
There was a big announcement every day of the week. In this article, we discuss how Microsoft and OpenAI have a lead on capabilities and cost and how the very strong demos of these two firms may represent the next evolution in human-computer interaction. We also discuss the reaction from competitors.
High-level learning from the announcements
OpenAI and Microsoft are still the leaders in the field
Of all the announcements this past week, GPT-4 and Copilot (GPT-4’s implementation in the Office 365 suite) stand out. The collaboration between the two companies continues to showcase what is possible. The Microsoft demo, in particular, showed a whole platform being built around this AI (it is easily one of the coolest demos I’ve seen).
A more symbiotic era of computing
Over the past century, human-computer interaction, whether to provide input or to read output, has slowly become more natural.
First, it was the punch cards. In the late 1800s, Herman Hollerith invented a contraption that was arguably the first computing machine. The machine was purpose-built to process census data. At the time, the US Census Bureau had just finished tabulating the latest census data, which took a whopping eight years. Since the US Constitution requires the government to conduct a census once every 10 years, the bureau knew it needed a new approach to the problem.
PS: Hollerith founded the Tabulating Machine Company, which, after a series of mergers, eventually became IBM
Then came the wires. Fast forward a few decades to 1946: the ENIAC came along, weighing in at an astounding 30 tons and taking up a large room. This beast was used to calculate ballistic trajectories and to study the feasibility of thermonuclear weapons. To enter programs into the ENIAC, programmers manipulated switches and cables, a process that could take days.
PS: These early programmers were drawn from about two hundred women at the Moore School of Electrical Engineering at the University of Pennsylvania. At the time, programming was dominated by women. By the end of the century, the field had become male-dominated (more so in the US), for many reasons outlined here.
Enter the era of graphical user interfaces (GUIs). The world’s first glimpse of the GUI was shown in a demo event called the mother of all demos (you can watch it here) by Douglas Engelbart (1968). In the demo, Engelbart showed, for the first time ever, all the elements of a modern computer in a single system: windows, hypertext, graphics, the computer mouse, word processing, dynamic file linking, and many more.
Xerox PARC then took Engelbart’s work and developed it into the modern GUI (windows and icon system). The system was called Alto (1973), but it was never commercialized. It wasn’t until Apple’s Lisa (1983) and Macintosh (1984) came onto the scene that the GUI became popular.
With GUIs came new applications: Microsoft Word emerged in 1983, followed by Adobe Photoshop in 1988.
PS: The Lisa would’ve cost more than $27,000 in 2021 dollars, the price of a brand-new Honda Civic.
The Internet and Web Browsers. In 1991, Tim Berners-Lee unleashed the World Wide Web upon humanity. Web browsers unlocked new possibilities in computer-to-computer connectivity and abstracted the PC hardware away from the user experience. Web browsers such as Mosaic (1993) and Netscape Navigator (1994) came to the market, which led to the browser wars with Internet Explorer, culminating in the dot-com bubble.
Mobile Devices and Touch Interfaces. With the introduction of the iPhone (2007) came the multitouch UI (swipe, pinch, and zoom). This transformed how users interacted with their computers.
Then came voice commands (or the era of “dumb” AI assistants). Once again, Apple was the pioneer. It introduced Siri in 2011, three years before Amazon’s Alexa and Microsoft’s Cortana. Despite the initial hype, these voice assistants didn’t really change how we worked or interacted with computers. They are limited in functionality: they work by transcribing your voice into commands, but you often have to say specific words to trigger different functions. Furthermore, they lack context awareness. Your conversations are limited to one or two exchanges deep.
The era of natural language computing. Fast forward to today. Microsoft’s Copilot and GPT-4 showcased the next (big) step in this evolution.
Microsoft finally announced its long-awaited integration of GPT-4 with its Office suite. You can watch the demo in full here. In the demo, the company introduced Copilot, its AI-powered productivity system.
Through Copilot, the company is transforming Office 365 into a unified productivity platform that can understand natural language. Here are a few examples to show you what this means.
You can create one Office document from another. In the example below, they demonstrated the creation of a Word document by synthesizing content from two other documents.
You can then command the AI to update the styling using a chat-like interface.
The system that underpins this setup is Copilot, a sophisticated processing and orchestration engine that connects three platforms (see Figure 4 below):
- Office 365 apps
- Microsoft Graph (all the data Microsoft has about your work: meetings, emails, contacts, calendars, files, and chats)
- LLM AI (GPT-4) that is capable of parsing all the information through natural language
In Figure 4, we illustrate how Microsoft’s Copilot works:
- Step 1: The user enters a prompt in natural language through a Microsoft application. Copilot modifies and enriches this prompt with Microsoft’s data about you (your emails, files, meetings, chats, contacts, etc.) so that the input to the LLM is relevant. This preprocessing step is called grounding. Remember that LLM output is probabilistic, so the model can give you incorrect answers; grounding reduces the chances of the model hallucinating
- Step 2: The modified prompt is sent to the LLM (GPT-4 in this case)
- Step 3: The LLM response is sent back to Copilot to be verified against Microsoft’s data about you (to check for security, compliance, privacy, and command generation – this is an enterprise tool, after all)
- Step 4: The response is then sent back to the user, along with commands to the apps to do useful things
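To make the flow concrete, here is a toy sketch of the four steps in Python. It is not Microsoft’s implementation: the function names (`ground_prompt`, `call_llm`, `verify_response`, `copilot`), the `user_data` dictionary standing in for Microsoft’s data about you, the keyword matching, and the stubbed model call are all assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class CopilotResult:
    reply: str          # text shown to the user
    app_commands: list  # commands sent to the Office apps


def ground_prompt(prompt: str, user_data: dict) -> str:
    """Step 1: enrich the prompt with the user's own data (grounding)."""
    # Keep only records whose key appears in the prompt.
    # This keyword match is a crude stand-in for real retrieval.
    relevant = [f"{key}: {value}" for key, value in user_data.items()
                if key.lower() in prompt.lower()]
    context = "\n".join(relevant) if relevant else "(no matching data)"
    return f"Context:\n{context}\n\nRequest: {prompt}"


def call_llm(grounded_prompt: str) -> str:
    """Step 2: send the grounded prompt to the model (stubbed out here)."""
    return f"DRAFT based on -> {grounded_prompt}"


def verify_response(response: str, blocked_terms: list) -> str:
    """Step 3: post-process the response (here, a simple compliance filter)."""
    for term in blocked_terms:
        response = response.replace(term, "[redacted]")
    return response


def copilot(prompt: str, user_data: dict, blocked_terms: list) -> CopilotResult:
    """Run steps 1 to 4 and return the reply plus commands for the apps."""
    grounded = ground_prompt(prompt, user_data)
    raw = call_llm(grounded)
    safe = verify_response(raw, blocked_terms)
    # Step 4: hand the reply to the user and a (hypothetical) command to the app.
    return CopilotResult(reply=safe, app_commands=[("insert_text", safe)])


result = copilot(
    "Summarize my meeting notes",
    {"meeting": "Q1 review with salary data", "calendar": "Friday 10am"},
    blocked_terms=["salary data"],
)
print(result.reply)
```

In this sketch, only the meeting record is pulled into the prompt (the calendar entry is not relevant to the request), and the sensitive phrase is redacted before anything reaches the user or the app.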
Microsoft’s demo offers a glimpse into the promise of productivity gains from this new breed of AI. Beyond a simple chat application, it showcases how users can more easily interact with computers using natural language.
OpenAI also had a separate demo showcasing the standalone GPT-4. It comes in the form of Chat and API (although, as of this writing, the API version is not yet available). You can watch the demo here.
Having used both GPT-3.5 and GPT-4, I can say that the capability gap between the two models is large. GPT-4 can carry much more context throughout the conversation. This is likely because its context window is twice as long, at 8,192 tokens, which means it can receive much larger input for processing. There’s also a version with a 32,768-token maximum, which can receive even more input (equivalent to roughly 50 pages of prompts!).
PS: A token is a fragment of a word. OpenAI measures both its pricing and its models’ limits in tokens.
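If you want to sanity-check the “roughly 50 pages” figure, the arithmetic is simple. The sketch below relies on two rules of thumb that are my assumptions rather than exact figures: about 0.75 English words per token and about 500 words per dense page. The function name `tokens_to_pages` is just for illustration.

```python
# Rules of thumb (assumptions, not exact figures):
WORDS_PER_TOKEN = 0.75  # rough average for English text
WORDS_PER_PAGE = 500    # a dense, single-spaced page


def tokens_to_pages(tokens: int) -> float:
    """Convert a token budget into an approximate page count."""
    words = tokens * WORDS_PER_TOKEN
    return words / WORDS_PER_PAGE


# The three context sizes implied by the text: half of 8,192, then 8,192 and 32,768.
for context_window in (4_096, 8_192, 32_768):
    pages = tokens_to_pages(context_window)
    print(f"{context_window:>6} tokens ~ {pages:.1f} pages")
```

Under these assumptions, the 32,768-token version works out to about 49 pages, in line with the estimate above. Real text will vary, since code, numbers, and non-English languages tend to use more tokens per word.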
To illustrate the capabilities of GPT-4, over the weekend, I asked ChatGPT to teach me how to create a ChatGPT-powered chat application.
PS: The irony is not lost on me. I’m asking an AI chatbot to help me create a replica of itself.
Mind you, I’m not a programmer. I have enough basic knowledge of Python to write scripts and do data analysis, but I have never created a web application before. In less than an hour, I completed the web app. Throughout the process, ChatGPT:
- Provided me with a technology stack that is beginner-friendly (Python and Flask)
- Gave me the code to complete the installation and create the necessary files. It even altered the code it gave me based on my setup (I had Anaconda installed for Python)
- Was able to walk me through step by step, introducing new concepts just at the right time
A screenshot of the final application is shown in Figure 5 above. Ugly, right? So I asked ChatGPT to “give it a more premium feel, similar to iMessage.” It interpreted my request by outputting an HTML file that makes the following changes: message bubbles of different colors (depending on the sender), blue color text bubbles, rounder corners, and automatic scroll to the bottom when new messages are added.
I cannot overstate how powerful it is to interact with the computer in a natural language. Everything is 100x faster.
Ok, let’s shift gears and discuss the other announcements.
The large language AI model is unlikely to be a long-term moat
Alpaca 7B is a model created by researchers at Stanford. Rather than creating a model from scratch, they took LLaMA 7B, an openly released LLM from Meta (Facebook), and fine-tuned it.
They showcased that models can be reverse-engineered extremely cheaply. To illustrate this, the Stanford researchers did the following:
- Grabbed an open-source high-quality model to be used as a baseline
- Fine-tuned the baseline model on synthetic data generated by OpenAI’s generative AI (text-davinci-003, to be specific)
The flowchart of operation looks like this:
This means that the researchers: (1) didn’t have to acquire their own data, which is hard, and (2) didn’t have to spend a lot of money. Overall, it cost the researchers $500 on the OpenAI API to generate the synthetic data and $100 for training on Nvidia GPUs in the cloud. This totals $600, a far cry from the $4.6 million it cost to train the original GPT-3 several years ago.
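To put the gap in perspective, here is the back-of-the-envelope arithmetic using only the figures quoted above. The variable names are mine, and keep in mind that the comparison is not like for like: the $600 does not include what Meta spent to train the baseline model in the first place.

```python
# Figures quoted in the text above (US dollars).
data_generation_cost = 500      # OpenAI API calls for the synthetic data
training_cost = 100             # cloud GPU time for fine-tuning
gpt3_training_cost = 4_600_000  # cost to train the original GPT-3

alpaca_total = data_generation_cost + training_cost
cost_ratio = gpt3_training_cost / alpaca_total

print(f"Alpaca total: ${alpaca_total:,}")
print(f"GPT-3 training cost is roughly {cost_ratio:,.0f}x higher")
```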
The outcome is a generative text model that performs similarly to GPT-3. This suggests that, while OpenAI currently has the most advanced model, over time, progress will likely slow down, and others will catch up. And the cost of catching up might be lower than one might think.
High-quality baseline models can be translated into other domains
Med-PaLM is Google’s large language model for the medical domain, built on its general-purpose PaLM model and trained specifically on medical content.
There have been two Med-PaLM models. The first model, announced in December 2022, performs at the level of the average medical professional. The second version, Med-PaLM 2, announced last week, is much improved.
To benchmark Med-PaLM’s medical know-how, Google used the US Medical Licensing Examination (USMLE). This exam measures a doctor’s ability to recall knowledge and apply medical logic. The passing score is typically around 60%. Figure 8 below summarizes the performances of different AI models against this exam.
As you can see above, Med-PaLM 2 scored 85.4%, while the first version, which came out just three months ago, scored 67.2%. According to Google, this second model achieved an “expert” level outcome.
Google is approaching the deployment of its medical AI via the partnership model. It deploys its AI model and works with partners to solve specific problems in different verticals. For example:
- It is partnering with medical professionals to help them read ultrasound images, augmenting care in areas where there’s a shortage of experts
- The company is also partnering with Mayo Clinic to help plan cancer treatment, specifically, combining AI’s ability to process image segmentation (separating healthy tissues from cancerous tissue on CT scans) and to provide general medical knowledge to plan the treatment course
Overall, the company believes that, to graduate beyond a toy model, applications of generative AI in the real world will encompass a multi-modal approach.
That said, many companies have failed to penetrate the AI healthcare space. In the 2010s, IBM Watson collaborated with multiple large hospitals to diagnose and recommend cancer treatments in real time. After years of development and billions of dollars in investment, the projects were shelved, and Watson was sold to a private equity firm. The issue was that IBM used a narrow technology (this was before the advent of LLMs) and pursued too difficult a problem (complex cancer diagnosis).
Perhaps this time, it will be different. After all, Google’s LLM is more broad-based and able to tackle easier problems.
Others are playing catch up
Two other companies announced their ChatGPT competitors.
Claude by Anthropic
Anthropic, a company last valued at $3 billion after a $300 million investment from Google, announced Claude. Similar to ChatGPT, the service is offered through an API service and a chat application. The company believes that Claude is more steerable and is less likely to produce harmful content. It has been piloted by Quora (for its chatbot Poe), Notion, and DuckDuckGo.
The jury is still out on whether or not Claude is better than GPT-3.5, which is more than a year old now. It’s unlikely to be better than GPT-4, however.
One thing to note is Claude’s pricing, which is based on character count. This is different from that of ChatGPT, which uses tokens (fragments of a word). Perhaps this is intentional, to obscure the fact that Claude is far more expensive. Here’s the price comparison (from Ars Technica):
While it’s difficult to make an apples-to-apples comparison on the pricing between the two, it seems that Claude is more expensive to generate output with. This is a testament to OpenAI’s and Microsoft’s advantage in capabilities and cost.
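One way to get closer to an apples-to-apples comparison is to convert a per-character price into an approximate per-token price. The sketch below assumes roughly four characters per token for English text, and the example price is hypothetical, chosen only to show the arithmetic; it is not Anthropic’s or OpenAI’s actual price. The function name and the per-million-characters unit are likewise my assumptions.

```python
CHARS_PER_TOKEN = 4  # rough average for English text (an assumption)


def per_million_chars_to_per_1k_tokens(price_per_million_chars: float) -> float:
    """Convert a per-character price into an approximate per-1,000-token price."""
    price_per_char = price_per_million_chars / 1_000_000
    return price_per_char * CHARS_PER_TOKEN * 1_000


# Hypothetical price, used only to demonstrate the conversion.
example_price = 10.00  # dollars per million characters
converted = per_million_chars_to_per_1k_tokens(example_price)
print(f"${example_price:.2f} per million characters ~ ${converted:.4f} per 1,000 tokens")
```

The result is only as good as the characters-per-token assumption, which shifts with the language and the type of text, so treat any comparison made this way as a rough guide.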
Ernie by Baidu
Baidu, China’s search engine giant, also introduced its AI chat. Access is still invite-only, so it’s difficult to assess its capabilities. The presentation itself was not a live demo but rather a pre-recording of answers. Given that ChatGPT has been out for more than three months, this was disappointing. To make matters worse, the company had the unfortunate timing of announcing Ernie after GPT-4’s strong demo. Investors were disappointed, and Baidu’s shares dropped 10% after the announcement (but have since recovered).