Neural networks learn to see, hear, and feel. AI is becoming multimodal

Encord's EBIND enables multimodal AI with five data types.

The EBIND model enables AI teams to use multimodal data. Source: StockBuddies, AI, via Adobe Stock

Imagine that your self-driving car not only sees a pedestrian on the road, but also hears an ambulance siren approaching from around a bend, and combines this data to make a decision. Or that a warehouse robot can find a part not only by its appearance, but also by the sound it makes in operation. This is no longer fiction but a reality being brought closer by the startup Encord with EBIND, a new multimodal embedding model that processes text, images, video, audio, and 3D lidar data.

This breakthrough is interesting not only for its capabilities but also for its approach: the company claims that its methodology makes it possible to train powerful multimodal models on a single GPU, challenging the well-established assumption that huge computing resources are needed to build strong AI. It seems a new era is beginning in artificial intelligence, one in which the quality of data may matter more than its quantity, and powerful tools become available not only to giants like Google and OpenAI.

What is EBIND and why is it important?

Multimodal AI is the next big milestone for the entire industry. Unlike standard text-only chatbots or computer vision models that understand only images, multimodal systems are capable of processing different types of data simultaneously, which allows them to solve more complex tasks and draw more subtle conclusions. Simply put, they learn to perceive the world the way people do — through a set of sensations.

EBIND is an embedding model, a kind of "universal translator" for different types of data. Its main task is to transform information from different "universes" (text, sound, image, 3D model) into a single numerical space where a computer can see the semantic connections between them. Using EBIND, you can, for example, find a video from a spoken description, infer the three-dimensional shape of a car from a flat photo, or relate the sound of an airplane to its position relative to the listener.
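
To make the idea concrete, here is a minimal, purely illustrative sketch of cross-modal retrieval in such a shared space: queries and items become vectors of the same dimension, and "finding a video from a description" reduces to ranking by cosine similarity. The encoder here is a random-vector placeholder, not EBIND's real API, and the file names and embedding dimension are assumptions.

```python
# Illustrative sketch of cross-modal retrieval in a shared embedding space.
# A model like EBIND would map each modality (text, image, video, audio, 3D)
# to vectors of the same dimension; here a random placeholder stands in.
import numpy as np

rng = np.random.default_rng(0)
DIM = 512  # assumed embedding dimension, for illustration only

def fake_embed(_: str) -> np.ndarray:
    """Placeholder for a real modality encoder; returns a unit vector."""
    v = rng.normal(size=DIM)
    return v / np.linalg.norm(v)

# Embeddings for a small video library (placeholders for real encoder output).
video_library = {name: fake_embed(name) for name in
                 ["ambulance_siren.mp4", "warehouse_robot.mp4", "airport.mp4"]}

def search_by_text(query: str, library: dict[str, np.ndarray]) -> list[tuple[str, float]]:
    """Rank library items by cosine similarity to the query embedding."""
    q = fake_embed(query)
    scores = {name: float(q @ vec) for name, vec in library.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

print(search_by_text("an ambulance approaching with its siren on", video_library))
```

With real encoders the ranking would reflect semantic closeness; with the random placeholders above it is meaningless and only shows the mechanics.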

Eric Landau, founder and CEO of Encord, described the difficulty of creating such a system: "Data from the internet is often paired, such as text and images, or perhaps with some sensory data. It was quite difficult to find such sets of five in the wild, so we had to go through a very painstaking process of creating the dataset that feeds EBIND." The result of this work was the E-MM1 dataset, which Encord calls the world's largest open multimodal dataset, roughly 100 times larger than the next largest.

A technological miracle: How did they do it?

The key magic of EBIND lies in its ability to create a common embedding space (vector representations) for five different modalities. This allows downstream models, for example, to understand an object's position, its shape, and its relationship to other objects in the physical environment.

  • Data is everything: Encord's approach focuses on data quality rather than computing power. Ulrik Stig Hansen, the company's president, stated: "The speed, performance and functionality of the model are made possible by the high-quality E-MM1 dataset ... it proves once again that AI teams do not need to be limited by computing power to push the boundaries of what is possible in this area."
  • Democratization of AI: The company's most impressive claim concerns the efficiency of its methodology. According to internal research, Encord trained a model with 1.8 billion parameters that outperformed competitor models with 17 times more parameters, and did so in a few hours on a single GPU (a sketch of one common alignment approach follows this list). If these results are confirmed, it will radically change the rules of the game for startups and research labs that do not have access to giant server clusters.
  • Open source: In the spirit of true democratization, Encord has released EBIND as an open-source model. This allows universities, startups, and even large companies to quickly and cost-effectively extend the capabilities of their multimodal systems.
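
The article does not spell out EBIND's training objective, but a common way to align paired modalities in one embedding space on modest hardware is a CLIP-style symmetric contrastive (InfoNCE) loss over batches of matched pairs. The sketch below shows that generic technique, not Encord's published code; the batch size, dimension, and temperature are illustrative assumptions.

```python
# Generic CLIP-style contrastive alignment of two modalities (not Encord's code).
# Paired data (e.g. matched image/audio items from a dataset like E-MM1) is used
# to pull matching embeddings together and push mismatched ones apart.
import torch
import torch.nn.functional as F

def contrastive_loss(emb_a: torch.Tensor, emb_b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired embeddings."""
    a = F.normalize(emb_a, dim=-1)          # (batch, dim), unit-length vectors
    b = F.normalize(emb_b, dim=-1)
    logits = a @ b.t() / temperature        # pairwise similarity matrix
    targets = torch.arange(a.size(0))       # the i-th A matches the i-th B
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random tensors standing in for encoder outputs.
batch, dim = 8, 512
loss = contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```

For five modalities the same idea is typically applied across several pairings of encoders; whether EBIND does exactly this is not stated in the article.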

Where can this be applied? From drones to quality control

EBIND's capabilities open the door to a variety of applications, especially in the field of Physical AI, where algorithms interact with the real world.

  • Autonomous vehicles and robots: The model can give robots and self-driving cars the ability not only to see but also to hear. "You want your autonomous car to not only see and feel through lidar, but also hear if there is a siren in the background... so that your car knows that a police car is approaching, which may not be visible," says Eric Landau.
  • Cross-modal learning: The technology allows examples of one type of data (for example, images) to help models recognize patterns in another (for example, audio). This can speed up training and improve the generalization ability of AI.
  • Quality control and safety: EBIND can detect cases where audio does not match the generated video, or find biases in datasets (a simple check of this kind is sketched below). In the future, such systems may be able to automatically detect deepfakes or inconsistencies in multimedia content.
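
As a rough illustration of such a consistency check (not Encord's implementation): if audio and video live in the same embedding space, a clip can be flagged when the cosine similarity between its two embeddings falls below a chosen threshold. The placeholder vectors stand in for real encoder outputs, and the threshold is an arbitrary assumption.

```python
# Sketch of an audio/video consistency check in a shared embedding space.
# Real encoders would produce the vectors; random placeholders are used here.
import numpy as np

def embeddings_match(video_vec: np.ndarray, audio_vec: np.ndarray,
                     threshold: float = 0.3) -> bool:
    """Flag a clip as consistent if its audio and video embeddings are close."""
    v = video_vec / np.linalg.norm(video_vec)
    a = audio_vec / np.linalg.norm(audio_vec)
    return float(v @ a) >= threshold  # cosine similarity against a threshold

rng = np.random.default_rng(1)
video_vec, audio_vec = rng.normal(size=512), rng.normal(size=512)
print("consistent" if embeddings_match(video_vec, audio_vec) else "possible mismatch")
```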

Interestingly, as such technologies become more accessible, a new class of business challenges emerges: how do you find a specialist who can manage this complex multimodal "beast"? What is needed is not just an engineer, but a kind of data "conductor" who understands how to coordinate the operation of all the sensors. Perhaps in the near future, talent platforms such as jobtorob.com will increasingly see requests for "multimodal perception architects" for robots, and this will become the new norm in the high-tech labor market.

The bottom line: the future belongs to multimodality

Encord already works with companies such as Toyota, Zipline, and Synthesia, which points to serious industrial interest in its technology. The company's vision for the future is clear: "Our point of view is that all physical systems will be multimodal in some sense in the future."

It remains to be seen whether this approach, with its focus on data quality, will become a new paradigm in contrast to the race for computing power. If so, it could lead to explosive growth in AI innovation, making cutting-edge technology available to a much wider range of creators. And then, perhaps, in a couple of years we will look back in surprise at a time when robots could perceive the world through only one "window."
