Many recent breakthroughs in AI have been driven by the availability of internet-scale datasets, which enabled the development of powerful generative models behind applications like ChatGPT and Midjourney. Inspired by this success, the robotics community is now investing in large-scale data collection to train end-to-end foundation models that can generalize across tasks and robot embodiments.
Industry leaders like Physical Intelligence, Figure, and Hugging Face have already shown promising results with data-driven approaches, and academic initiatives like Open X-Embodiment are laying the groundwork with open-source, large-scale robotics datasets.
While these advances suggest a "ChatGPT moment" for robotics may be approaching, key challenges remain for real-world deployments.
Data curation: The success of large language and vision models has relied not just on scale, but on carefully curated, high-quality data. Robotics data, however, presents much greater complexity due to the wide range of robot embodiments, sensors, action/state spaces, environments, and file formats. This heterogeneity significantly complicates the creation of clean, consistent datasets essential for training generalist robot policies.
Unexpected edge cases in the real world: Unlike hallucinations in text models, errors in robotics can directly cause real-world harm, including injury or physical damage. These failures often stem from noisy or insufficient training data. It’s essential to identify edge cases not only during training but also in live operation, surface similar instances, and assess how widespread an issue is.
Addressing these challenges requires better tools to search, analyze, and curate robotics data. At Roboto, we're developing methods to identify patterns and edge cases to achieve this. One promising approach that we're exploring is signal similarity search. In this blog post, we'll demonstrate how it can be used to find and retrieve similar events in large datasets.
Imagine you're investigating model performance and you need to identify specific high-quality manipulation sequences within thousands of episodes. Or perhaps you've spotted a concerning failure pattern where the gripper slips at a critical moment. In both cases, you face the same fundamental challenge: how do you efficiently find other episodes with similar events across massive datasets?
Signal search transforms this challenge from tedious manual review into an automated discovery process, helping you to:
(1) Identify patterns: Find all events containing the same motion signature; whether it's a successful maneuver worth replicating or a problematic failure mode that needs investigation.
(2) Analyze frequency: Understand how often specific behaviors appear in your dataset, informing decisions about data curation, model training, and deployment safety.
(3) Automate labeling: Systematically tag episodes by behavioral patterns, whether you're building curated training sets from high-quality demonstrations or flagging anomalous events for further review.
In the example below, we'll label a single manipulation sequence in Roboto, and then use the Roboto SDK to find similar movements in other episodes that we've imported. Each dataset contains actuator data in Parquet files and camera data in MP4 videos.
First, we'll identify and label a representative event in the aloha_mobile_cabinet dataset.
Using Roboto, we can mark the precise timeframe when the ALOHA robot picks up a pot and places it in a cabinet. This multidimensional sequence will serve as our query event: the pattern that we want to find in other episodes.
Next, we’ll use the Roboto SDK to fetch all topic records from every episode in the dataset. We’ll use the find_topics method of RobotoSearch with RoboQL query syntax to do this. The search could be further refined by filtering for specific metadata, such as task type or robot embodiment, as sketched below.
from roboto import RobotoSearch

rs = RobotoSearch()
# Retrieve every topic named 'data' from datasets tagged 'aloha'
topics_to_search = rs.find_topics("topic.name = 'data' AND dataset.tags CONTAINS 'aloha'")
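As a hypothetical refinement (the metadata key below is illustrative and assumes your datasets carry a task field; it is not from the original workflow), the RoboQL query could also filter on dataset metadata:

# Illustrative only: additionally filter on a hypothetical 'task' metadata field
topics_to_search = rs.find_topics(
    "topic.name = 'data' AND dataset.tags CONTAINS 'aloha' AND dataset.metadata.task = 'stow'"
)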
Now that we've defined our search space (the topics we just retrieved), we can perform our signal search using the event we created above.
To do this, we'll fetch the signals that were captured by the event:
from roboto import analytics, Event

# Load the labeled event and pull its underlying signal data into a DataFrame
event = Event.from_id("ev_4verio1wphkcabpw")
query_signals = event.get_data_as_df()
We can confirm we're using the right event by calling .plot() on query_signals. This one-liner will bring up a plot of the multi-dimensional event data in a notebook environment:
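# Plot the multi-dimensional query event (query_signals is a pandas DataFrame)
query_signals.plot()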
That certainly looks like the data we annotated in Roboto earlier! It's the time-series of observed joint angles and gripper states from both the left and right robot arms.
Next, to run the signal search, we can call find_similar_signals as follows:
matches = analytics.find_similar_signals(
    query_signals,      # signals from the query event
    topics_to_search,   # topics retrieved with find_topics above
    max_matches_per_topic=1,
    normalize=False
)
With this input, the function retrieves the most similar events from the other episodes in our search space, returning at most one match per topic since we set max_matches_per_topic=1.
We could also experiment with the signals we provide to the query by using the message_paths_include parameter of get_data_as_df. However, in this case, we'll just use all 10 of the action signals from both the left and right robot arms.
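As a rough sketch of what that could look like (the message path below is illustrative; the actual paths depend on how the episode was recorded), the query could be restricted to a subset of signals:

# Illustrative only: restrict the query to specific message paths from the recording
query_signals = event.get_data_as_df(
    message_paths_include=["observation.state"]
)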
Once the results are returned, we can run a utility function to visualize them:
from viz_tools_aloha import print_match_results
print_match_results(matches[:5], image_topic="image")
The first column shows the distance scores of the matches. The second column shows the plots of the matched events, and the third column shows the corresponding video for each match.
Example 1: Find stowing events in the aloha_mobile_cabinet dataset
In this example, we’re analyzing data from a Mobile ALOHA platform (14 actuators total) tasked with stowing an object in a cabinet. The first row represents our query event, and the subsequent rows represent the top matches found. We limit the results to the top 4 for brevity in this post. The results are great! We managed to find the other times when the robot was picking up objects and putting them in a cabinet.
Example 2: Find folding events in the unitreeh1_fold_clothes dataset
In this example, we're analyzing data from a different dataset and robot: a Unitree H1 humanoid (40 actuators total) performing a sweater-folding task. We're searching for the specific folding sequence where the robot brings the bottom of the sweater up and over to the top. The first row represents our new query event, and the subsequent rows represent the top matches found. The results also look great; the movements are similar and the action performed is the same.
Example 3: Find lid opening events in the aloha_static_cups_open dataset
In this example, we’re analyzing data from a different, static ALOHA platform (14 actuators total) tasked with opening the lid of a small sauce cup. The first row represents another new query event, and the subsequent rows represent the top matches found. Notice how some of the attempts succeed and others fail, even though the movement patterns are almost identical. It would now be easy to review these short clips (instead of the full episodes) and label them as success or failure.
In this post, we explored how signal search automates the discovery of similar manipulation events in robotics data. From surfacing high-quality demonstrations to pinpointing subtle failure modes, it turns days of manual review into minutes of insight.
While this post focused on manipulation data, the underlying primitives apply broadly to finding patterns in time-series data from any sensor or actuator, from IMUs and joint encoders to motors and more.
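To give a flavor of the underlying primitive, here is a minimal, single-channel sketch of sliding-window similarity search in plain NumPy. This is an illustration only, not Roboto's implementation, which handles multivariate signals, normalization, and search at scale:

import numpy as np

def sliding_window_distances(query: np.ndarray, series: np.ndarray) -> np.ndarray:
    """Euclidean distance between the query and every same-length window of the series."""
    windows = np.lib.stride_tricks.sliding_window_view(series, len(query))
    return np.linalg.norm(windows - query, axis=1)

# Toy example: locate a short ramp pattern inside a longer signal
series = np.concatenate([np.zeros(50), np.linspace(0, 1, 20), np.zeros(50)])
query = np.linspace(0, 1, 20)
distances = sliding_window_distances(query, series)
print("Best match starts at index", int(np.argmin(distances)))  # prints 50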
As robotics begins to leverage foundation models trained on massive datasets, tools like signal search become essential for scalable data curation and quality control. We’re already working with leading robotics companies and labs that use signal search to refine their datasets and automate event discovery. If you’re interested in trying it on your own data, we’d love to collaborate. Reach out!
Big thanks to the research teams who collected the data used in this post and published it to Hugging Face: aloha_mobile_cabinet, unitreeh1_fold_clothes, aloha_static_cups_open