
Published: Oct 10, 2024

Application of foundation models in robotics


Introduction

The need for adaptive solutions that can handle dynamic environments has led to a paradigm shift towards robotic systems with greater autonomy and the capability to thrive in unpredictable settings. This shift is most evident in the growing popularity of autonomous mobile robots (AMRs) over automated guided vehicles (AGVs), which typically operate on fixed routes and struggle to adjust to changing environments. The Robot Report[1] forecasts that, by 2025, AMRs will account for 50% of all mobile robot shipments while AGVs will account for less than 25%.

Figure 1. Forecast Share Shipments of AGVs vs AMRs. Source: https://www.therobotreport.com/mobile-robots-rapidly-mainstreaming-by-2025-agvs-and-amrs-could-be-deployed-in-53k-facilities/

Despite this push towards autonomy, robotics task planning is still dominated by classical methods based on rigid, hard-coded instructions. However, there is growing recognition of the limitations inherent in this approach given the demands of modern use cases.

Narrow AI

The use of narrow deep learning models in robotics has gained momentum, with use cases ranging from perception tasks that rely on object detection and segmentation models to find objects of interest, to training specific robot skills using deep reinforcement learning techniques. In the context of robot task planning involving human-robot interactions, the most common approach typically pairs hand-engineered rules with deep learning models.

These models are trained to find contextual relationships between the user’s input (the task) and pre-determined keywords (commonly referred to as ‘intents’) that act as triggers for numerous pre-programmed sequences. While this offers determinism (all behaviours are known to the designer beforehand), the approach often struggles to generalise when locations, schedules, or logistics change, or when new robot functions are introduced. Scaling also proves challenging, particularly for multi-step instructions, because of the intricacy of crafting nested rules for every possible permutation.
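To make this concrete, below is a minimal sketch of the intent-matching pattern described above. The intent names, keyword lists, and handler functions are hypothetical stand-ins for a robot's pre-programmed sequences, not any particular vendor's implementation.

```python
# Minimal sketch of a rule-based intent pipeline (hypothetical intents and handlers).
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Intent:
    name: str
    keywords: List[str]          # pre-determined trigger keywords
    handler: Callable[[], None]  # pre-programmed robot sequence


def deliver_item() -> None:
    print("Executing hard-coded delivery sequence...")


def go_to_charger() -> None:
    print("Executing hard-coded docking sequence...")


INTENTS: Dict[str, Intent] = {
    "deliver": Intent("deliver", ["bring", "deliver", "take"], deliver_item),
    "charge": Intent("charge", ["charge", "dock", "battery"], go_to_charger),
}


def dispatch(user_input: str) -> None:
    """Match the user's request against keyword triggers and run the first hit."""
    text = user_input.lower()
    for intent in INTENTS.values():
        if any(keyword in text for keyword in intent.keywords):
            intent.handler()
            return
    print("No matching intent: novel or multi-step requests fall through.")


dispatch("Please deliver this parcel to room 3")   # triggers 'deliver'
dispatch("Tidy the room, then charge yourself")    # only one rule fires
```

Because every behaviour must be enumerated as a keyword rule, a compound request such as "Tidy the room, then charge yourself" collapses to whichever rule happens to match, which is exactly the scaling problem described above.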

Foundation models

Foundation models, on the other hand, are pre-trained on massive internet-scale data and can be adapted to suit diverse tasks. Notable examples include OpenAI’s DALL-E[2], an AI model that can generate realistic images and art, and GPT-3[3], the language model that serves as the foundation for ChatGPT. The advent of foundation models has created new options, as researchers start to explore their use in equipping robots with common-sense knowledge and the ability to interpret human instructions in a more intuitive manner. A simple task such as “Get me a can of soda” may appear trivial to humans but has proven challenging for robots thus far. This challenge arises from the need for robots not only to execute complex algorithms to complete the task, but also to comprehend human instructions and translate them into actionable steps that fall within the robot’s capabilities.

In this article, we will discuss recent advancements in the application of foundation models and how these models could transform the way robots are programmed, moving towards a new paradigm of vision and text-based training.

GPT for robotics

In the SayCan paper published by Google AI[4] in 2022, researchers used a large language model (LLM) to interpret user instructions and rank the likelihood of success for each available skill that the robot would eventually use to accomplish a given task. Their proposed decision-making method achieved an 84% success rate in overall planning.

Figure 2. Images of SayCan scoring relevant functions needed to accomplish the task. Source: https://sites.research.google/palm-saycan
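The scoring idea can be sketched as follows: an LLM estimates how useful each skill is for the instruction, an affordance model estimates how likely the skill is to succeed from the current state, and the robot executes the skill with the highest combined score. This is an illustrative simplification rather than Google's implementation; the skill list, the scores, and the two helper functions are stand-ins.

```python
# Illustrative sketch of SayCan-style skill selection (not Google's code).
from typing import Dict

SKILLS = ["find a coke can", "pick up the coke can", "go to the user", "put down the coke can"]


def llm_usefulness(instruction: str, skill: str) -> float:
    """Stand-in for an LLM scoring P(skill is a useful next step | instruction)."""
    scores = {"find a coke can": 0.45, "pick up the coke can": 0.30,
              "go to the user": 0.15, "put down the coke can": 0.10}
    return scores[skill]


def affordance(skill: str, state: Dict[str, bool]) -> float:
    """Stand-in for a value function estimating P(skill succeeds | current state)."""
    if skill == "pick up the coke can" and not state["can_visible"]:
        return 0.05  # the robot cannot grasp what it has not yet found
    return 0.9


def select_skill(instruction: str, state: Dict[str, bool]) -> str:
    """Combine usefulness and feasibility, then pick the highest-scoring skill."""
    combined = {s: llm_usefulness(instruction, s) * affordance(s, state) for s in SKILLS}
    return max(combined, key=combined.get)


print(select_skill("Get me a can of soda", {"can_visible": False}))  # -> "find a coke can"
```

With the can not yet visible, grasping is rated useful by the LLM but heavily down-weighted by the affordance term, so the search skill wins, which mirrors the ranking behaviour shown in Figure 2.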

The release of open-source foundation models gives robots advantages beyond task planning: these models could empower robots to seamlessly engage in natural language understanding, generation, and interaction, broadening the scope for communication and collaboration on various general-purpose tasks. Peter Chen of Covariant AI referred to the application of ChatGPT-like principles as “GPT for robotics”[5], signalling the upcoming frontier for foundation models. These models, designed to tackle broader tasks instead of domain-specific problems, are trained on extensive datasets and provide robots with intelligence that generalises across multiple tasks.

Beyond text inputs, robots could also benefit from multi-modal foundation models that not only facilitate word comprehension, but also combine various senses as inputs to enhance a robot’s understanding of its environment. In September 2023, OpenAI released an update for ChatGPT that allows the model to “see, hear and speak.” This new feature set allowed ChatGPT users to upload photos and ask the model questions related to the image.

Vision-language-action model

Recently, researchers have been experimenting with the same multi-modal approach (particularly using images) to give robots the visual perspective needed for situational awareness and to learn new generalisable skills. In a recent TechCrunch interview[6], Ken Goldberg, a professor at UC Berkeley and Chief Scientist of the parcel-handling robotics startup Ambidextrous, remarked that “2023 will be remembered as the year when Generative AI transformed Robotics”, as roboticists discover that large vision-language-action (VLA) models can be trained to allow robots to see and control their motions.

Google DeepMind researchers recently trained RT-2[7], a VLA model, on web-scale data and a robot’s past experience to predict the robot’s actions, representing those actions as text strings. The key objective of this work was to develop an AI model that can learn to map what the robot sees (its camera view) into robot-specific actions. Compared to SayCan, which acts as a decision maker by ranking the most suitable action based on contextual reasoning and feasibility, RT-2 directly controls the robot based on its visual and language interpretations.
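The idea of representing actions as strings can be illustrated with a small decoding sketch. This is not the RT-2 implementation: the bin count, dimension ordering, and value ranges below are simplified assumptions made for illustration only.

```python
# Minimal sketch of decoding an RT-2-style action string into robot commands.
# The token layout ('terminate dx dy dz droll dpitch dyaw gripper') and the
# value ranges are assumptions, not the paper's exact scheme.
import numpy as np

NUM_BINS = 256                      # assume each dimension is discretised into 256 bins
POSITION_RANGE = (-0.1, 0.1)        # assumed end-effector delta range in metres
ROTATION_RANGE = (-0.5, 0.5)        # assumed rotation delta range in radians


def unbin(token: int, low: float, high: float) -> float:
    """Map an integer token back to a continuous value in [low, high]."""
    return low + (token / (NUM_BINS - 1)) * (high - low)


def decode_action(action_string: str) -> dict:
    """Decode a whitespace-separated token string into a command dictionary."""
    tokens = [int(t) for t in action_string.split()]
    terminate, *deltas, gripper = tokens
    position = [unbin(t, *POSITION_RANGE) for t in deltas[:3]]
    rotation = [unbin(t, *ROTATION_RANGE) for t in deltas[3:]]
    return {
        "terminate": bool(terminate),
        "delta_position_m": np.round(position, 4).tolist(),
        "delta_rotation_rad": np.round(rotation, 4).tolist(),
        "gripper_closed_fraction": gripper / (NUM_BINS - 1),
    }


# A VLA model would emit a string like this alongside its language output.
print(decode_action("0 128 91 241 5 101 127 217"))
```

Because the actions live in the same token space as text, the same transformer that reads the instruction and the camera tokens can emit them directly, which is what lets RT-2 control the robot end to end.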

The Google DeepMind team’s work could unlock reasoning abilities for robots that had previously been challenging or nonexistent. For instance, in ‘worst-case’ situations when a robot encounters navigation errors, or gets stuck after avoiding an obstacle, traditional methods often struggle to recover, especially on edge cases that were not accounted for during development. In fact, there are companies that provide human-in-the-loop (HITL) services to remotely enable robots to resume operating in these situations, in order to increase the robot’s operating time to meet service level agreements (SLAs). With VLAs, operators could teach robots safe and effective recovery behaviours through hours of tele-operation videos and past HITL data.

Tasks such as navigating through an HDB (Housing & Development Board) void deck to locate lifts, which usually requires an intuitive understanding based on past experience, can also be made easier for robots. Robots could be trained to make decisions about where to turn, and to choose paths that have a higher probability of leading to a lift lobby, based on visual cues. This skill would be extremely useful for last-mile delivery robots when transitioning from outdoor environments to indoor HDB units.

Research performed by a team at UC Berkeley (“Navigation with large language models”)[8] demonstrated that robots can leverage the semantic information in large language models (LLMs) for navigating unfamiliar environments using the images from the robot’s camera. The LLM’s reasoning skills were used as the decision-making mechanism for determining the direction the robot should take to reach the desired goal or location, or to find the object of interest, based on what the robot perceived.

Figure 3. “Navigation with Large Language Models”: the LLM directs the robot towards the refrigerator and microwave, as it has a higher likelihood of finding the gas stove nearby. Source: https://sites.google.com/view/lfg-nav/
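A simplified sketch of this decision mechanism appears below, in the spirit of the paper rather than the authors' code: the LLM is replaced by a stand-in function that returns a semantic likelihood for each candidate direction, and the robot heads towards the highest-scoring one.

```python
# Illustrative sketch of using an LLM's semantic prior as a navigation heuristic
# (stand-in functions, not the UC Berkeley team's implementation).
from typing import Dict, List


def query_llm(goal: str, objects_near_frontier: List[str]) -> float:
    """Stand-in: return the LLM's estimate of P(goal is near these objects)."""
    semantic_prior = {"refrigerator": 0.8, "microwave": 0.7, "sofa": 0.1, "bookshelf": 0.05}
    return max(semantic_prior.get(obj, 0.02) for obj in objects_near_frontier)


def choose_frontier(goal: str, frontiers: Dict[str, List[str]]) -> str:
    """Pick the unexplored direction the LLM thinks is most likely to contain the goal."""
    scores = {name: query_llm(goal, objects) for name, objects in frontiers.items()}
    return max(scores, key=scores.get)


# Objects detected by the robot's camera near each unexplored direction.
frontiers = {
    "left_hallway": ["sofa", "bookshelf"],
    "right_doorway": ["refrigerator", "microwave"],
}
print(choose_frontier("gas stove", frontiers))  # -> "right_doorway"
```

The semantic prior does the work that a human's intuition would: kitchens cluster appliances together, so the doorway near the refrigerator is the better bet even though the stove itself is not yet visible.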

Simulation for data generation

To meet the huge data demands of training AI models, Generative AI could serve a unique function by generating data and simulating robot experiences. Dhruv Batra, Research Director at Meta’s FAIR (Fundamental AI Research) lab, highlighted a compelling application of Generative AI[10]: producing 2D images, videos, and 3D scenes. This could be done within simulators to accelerate scene building and asset generation, echoing one of Deepu Talla’s use cases[11] for how Generative AI could contribute to the future of robotics.

A great practical example of this is the AI Room Generator extension for NVIDIA’s Isaac Sim[9], developed by NVIDIA engineers, which enables users to create digital twin environments from a simple prompt describing the desired 3D scene and the assets of interest. The extension leverages ChatGPT to identify suitable furniture available in the simulator’s asset database and to arrange that furniture in a way that fulfils the desired simulation setup.

Figure 4. Furniture that was selected and arranged by NVIDIA’s LLM-based Isaac Sim room generator extension to create a simulated environment of a living room. Source: https://github.com/NVIDIA-Omniverse/kit-extension-sample-airoomgenerator
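The pattern behind this kind of extension can be sketched as follows. This is not the NVIDIA extension's actual code: the asset database, the stand-in LLM call, and the randomised placement are simplifications (the real extension also asks the LLM to propose the arrangement).

```python
# Simplified sketch of the LLM-driven scene-generation pattern: a prompt becomes
# a structured furniture list restricted to assets that exist in a database,
# which is then placed into a scene.
import json
import random
from typing import Dict, List

ASSET_DATABASE = ["sofa", "coffee_table", "tv_stand", "floor_lamp", "bookshelf"]


def query_llm_for_layout(prompt: str) -> str:
    """Stand-in for a ChatGPT call that returns a JSON furniture list for the prompt."""
    return json.dumps({"room": "living room",
                       "furniture": ["sofa", "coffee_table", "tv_stand", "floor_lamp"]})


def build_scene(prompt: str) -> List[Dict]:
    """Keep only assets the simulator actually has, then place them in the scene."""
    layout = json.loads(query_llm_for_layout(prompt))
    scene = []
    for item in layout["furniture"]:
        if item not in ASSET_DATABASE:
            continue  # the LLM may only use assets available in the simulator
        scene.append({"asset": item,
                      "position_m": [round(random.uniform(-2, 2), 2),
                                     round(random.uniform(-2, 2), 2), 0.0]})
    return scene


for placed in build_scene("A cosy living room with a TV corner"):
    print(placed)
```

Constraining the LLM's output to a structured list drawn from a known asset library is what makes the generated scene directly loadable by the simulator.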

Humanoid race

For years, Boston Dynamics’ Atlas has been at the forefront of humanoid robotics, showcasing impressive feats of agility and balance. However, the field is rapidly evolving, with newer companies such as 1X and Figure, along with Tesla’s Optimus project, joining the race to develop advanced humanoid robots. Sharing the goal of building general-purpose robots, these companies have been pushing the boundaries of what’s possible to create machines that can seamlessly integrate into human environments and perform a wide variety of tasks. Figure[13] and 1X[14] have recently partnered with OpenAI, underscoring the growing importance of foundation models in accelerating humanoid development.

Figure 5. NVIDIA GTC GR00T release. Source: https://www.1x.tech/discover/1x-humanoid-robot-neo-featured-in-nvidia-gtc-keynote

Adding to this momentum, NVIDIA announced Project GR00T[15] at GTC in March 2024. This initiative aims to create a general-purpose foundation model for humanoid robots, capable of processing multi-modal instructions and previous interactions. Leveraging NVIDIA’s cloud-based GPU infrastructure, optimised for the computer-graphics workloads used in simulation and training, Project GR00T enables robots to perform diverse tasks through both high-level reasoning and low-level motion control. The model’s ability to integrate self-observation with multi-modal learning techniques allows humanoid robots to react dynamically to their environment, potentially accelerating skill acquisition and development.

The use of foundation models in humanoid robotics presents major advantages, as these AI models have been trained on vast amounts of internet data, allowing them to map the videos and observations they have seen into useful robot actions for performing tasks. By leveraging the diverse and extensive data available on the internet, foundation models can “understand” a wide range of scenarios, objects, and actions. When applied to humanoid robots, this understanding translates into more adaptable and capable machines. For instance, a robot equipped with a foundation model might observe a human performing a new task – like folding a particular type of clothing or operating an unfamiliar device – and be able to replicate that action without explicit programming for that specific task.

Moreover, this ability to map diverse data to physical actions enhances the robot’s interaction with its environment and with humans. It can lead to more intuitive human-robot communication, as the robot can better interpret natural language instructions or even non-verbal cues, translating them into appropriate actions. This paves the way for humanoid robots that can assist in various settings—from homes to hospitals to factories—with greater flexibility and understanding of context.

Multi-robot coordination

Early in 2024, Google DeepMind released AutoRT[12], a system that combines large foundation models such as LLMs and VLMs with robot control models like RT-2. This integration allows the system to deploy robots equipped with cameras and manipulators for various tasks in diverse environments. The VLM interprets the robot’s surroundings and identifies objects, while the LLM suggests creative tasks for the robot to perform. AutoRT can control multiple robots simultaneously, orchestrating them safely in real-world settings. The system was designed mainly to collect data for robot training, assessing the robot’s performance on executed tasks against previously recorded videos. Over seven months of extensive evaluations, it successfully managed up to 20 robots at once and a total of 52 unique robots, with DeepMind conducting 77,000 trials across 6,650 different tasks in office buildings.

Figure 6. AutoRT’s 5-step robot episode data collection process. Source: https://auto-rt.github.io
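At a high level, the orchestration loop can be sketched as follows. This is an illustrative simplification, not DeepMind's implementation: the VLM and LLM are replaced with stand-in functions, and the rule filter is a toy stand-in for the kind of safety and feasibility checks such a system needs before dispatching tasks.

```python
# Illustrative sketch of an AutoRT-style orchestration loop: for each robot,
# a VLM describes the scene, an LLM proposes candidate tasks, unsafe tasks are
# filtered out, and the chosen task is logged as a training episode.
from typing import List


def vlm_describe(camera_image: str) -> List[str]:
    """Stand-in for a VLM listing objects visible to the robot."""
    return ["sponge", "soda can", "laptop"]


def llm_propose_tasks(objects: List[str]) -> List[str]:
    """Stand-in for an LLM suggesting tasks involving the observed objects."""
    return [f"wipe the table with the {objects[0]}",
            f"throw away the {objects[1]}",
            f"unplug the {objects[2]}"]  # includes a task that should be rejected


def passes_rules(task: str) -> bool:
    """Reject tasks that violate simple safety/feasibility rules."""
    forbidden = ["unplug", "human", "sharp"]
    return not any(word in task for word in forbidden)


def orchestrate(fleet: List[str]) -> List[dict]:
    """Assign one vetted task per robot and record it as an episode."""
    episodes = []
    for robot_id in fleet:
        objects = vlm_describe(f"{robot_id}_camera_frame")
        candidate_tasks = [t for t in llm_propose_tasks(objects) if passes_rules(t)]
        if candidate_tasks:
            episodes.append({"robot": robot_id, "task": candidate_tasks[0]})
    return episodes


print(orchestrate(["robot_01", "robot_02", "robot_03"]))
```

The same propose-filter-assign loop scales from one robot to a fleet, which is why this style of system is attractive for coordinating many robots in shared spaces.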

Although AutoRT is a data collection system, it demonstrates the potential of foundation models to coordinate robot fleets and offers a powerful tool for efficiently managing robots that can adapt in dynamic environments. The utilisation of this system could greatly benefit platforms like Robotmanager by allowing users the flexibility to transition from manual task assignments to an automated, intelligence-driven approach. Such an approach promises to help organisations optimise resources, streamline operations, and reduce the amount of human involvement currently required to control and manage robot fleets.

Summary

The application of Generative AI in robotics holds immense potential for revolutionising the field by addressing key current challenges in human-robot interaction, task planning, multi-robot coordination, and data augmentation. The use of foundation models, such as vision-language models (VLMs) for enhanced perception, and large language models (LLMs) for task planning, will enable robots to better comprehend human instructions and to generate diverse plans for accomplishing user tasks. The multi-modal capabilities of VLMs, in particular, can play a pivotal role in augmenting robot perception, allowing for a more comprehensive understanding of the environment. Additionally, the integration of systems like AutoRT will facilitate effective multi-robot coordination and enhance collaborative efforts in complex scenarios.


References

[1]  Ash Sharma. Mobile Robots Rapidly Mainstreaming; by 2025, AGVs and AMRs Could Be Deployed in 53K Facilities. https://www.therobotreport.com/mobile-robots-rapidly-mainstreaming-by-2025-agvs-and-amrs-could-be-deployed-in-53k-facilities

[2]  Aditya Ramesh, Mikhail Pavlov, Gabriel Goh, Scott Gray, Chelsea Voss, Alec Radford, Mark Chen, Ilya Sutskever. Zero-Shot Text-to-Image Generation. https://arxiv.org/abs/2102.12092

[3]  Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, Dario Amodei. Language Models are Few-Shot Learners. https://arxiv.org/abs/2005.14165

[4]  Michael Ahn, Anthony Brohan, Noah Brown, Yevgen Chebotar, Omar Cortes, Byron David, Chelsea Finn, Chuyuan Fu, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Daniel Ho, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Eric Jang, Rosario Jauregui Ruano, Kyle Jeffrey, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Kuang-Huei Lee, Sergey Levine, Yao Lu, Linda Luu, Carolina Parada, Peter Pastor, Jornell Quiambao, Kanishka Rao, Jarek Rettinghouse, Diego Reyes, Pierre Sermanet, Nicolas Sievers, Clayton Tan, Alexander Toshev, Vincent Vanhoucke, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Mengyuan Yan, Andy Zeng. Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. https://sites.research.google/palm-saycan

[5]  Peter Chen. AI Robotics’ “GPT Moment” is near. https://techcrunch.com/2023/11/10/ai-robotics-gpt-moment-is-near/

[6]  Brian Heater. Robotics Q&A with UC Berkeley’s Ken Goldberg. https://techcrunch.com/2023/12/16/robotics-qa-with-with-uc-berkeleys-ken-goldberg/

[7]  Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alexander Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Lisa Lee, Tsang-Wei Edward Lee, Sergey Levine, Yao Lu, Henryk Michalewski, Igor Mordatch, Karl Pertsch, Kanishka Rao, Krista Reymann, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Pierre Sermanet, Jaspiar Singh, Anikait Singh, Radu Soricut, Huong Tran, Vincent Vanhoucke, Quan Vuong, Ayzaan Wahid, Stefan Welker, Paul Wohlhart, Jialin Wu, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, Brianna Zitkovich. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. https://deepmind.google/discover/blog/rt-2-new-model-translates-vision-and-language-into-action/

[8]  Dhruv Shah, Michael Equi, Blazej Osinski, Fei Xia, Brian Ichter, Sergey Levine. Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning. https://sites.google.com/view/lfg-nav/

[9]  Mario Viviani. AI Room Generator Extension Sample. https://github.com/NVIDIA-Omniverse/kit-extension-sample-airoomgenerator

[10]  Brian Heater. Robotics Q&A with Meta’s Dhruv Batra. https://techcrunch.com/2023/12/02/robotics-qa-with-metas-dhruv-batra/

[11]  Brian Heater. Robotics Q&A with Nvidia’s Deepu Talla. https://techcrunch.com/2023/12/16/robotics-qa-with-nvidias-deepu-talla/

[12]  Michael Ahn, Debidatta Dwibedi, Chelsea Finn, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Karol Hausman, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Sean Kirmani, Isabel Leal, Edward Lee, Sergey Levine, Yao Lu, Sharath Maddineni, Kanishka Rao, Dorsa Sadigh, Pannag Sanketi, Pierre Sermanet, Quan Vuong, Stefan Welker, Fei Xia, Ted Xiao, Peng Xu, Steve Xu, Zhuo Xu. AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents. https://auto-rt.github.io

[13]  Figure AI Inc. Figure Raises $675M at $2.6B Valuation and Signs Collaboration Agreement with OpenAI. https://www.prnewswire.com/news-releases/figure-raises-675m-at-2-6b-valuation-and-signs-collaboration-agreement-with-openai-302074897.html

[14]  1x. 1X Raises $23.5M in Series A2 Funding led by OpenAI. https://www.1x.tech/discover/1x-rasies-23-5m-in-series-a2-funding-led-by-open-ai

[15]  Nvidia. Nvidia Announces Project GR00T Foundation Model for Humanoid Robots and Major Isaac Robotics Platform Update. https://nvidianews.nvidia.com/news/foundation-model-isaac-robotics-platform

