Tesla's FSD V12 has potential for hierarchical planning and predictive capabilities, but there is a need for improvement in the 20-30 second planning range for full self-driving
Questions to inspire discussion
-
What is the problem with Tesla's FSD V12 planning?
—The Tesla FSD V12 has a significant architectural problem with its midterm planning, particularly in the 20-30 second planning range for full self-driving.
-
Why is hierarchical planning necessary for complex actions?
—Hierarchical planning is necessary for complex actions, such as traveling from New York to Paris, and requires a specific architecture to allow for it.
-
What is the new architecture for self-supervised learning in Tesla's FSD V12?
—The Tesla FSD V12 has a new architecture for self-supervised learning, pre-trained on video data to interpret a three-dimensional world and create a world model, despite the limitations of only having access to 2D data.
-
How does the predictive architecture in Tesla's FSD V12 work?
—Tesla's FSD V12 has a new predictive architecture that focuses on understanding rather than generating, which is a crucial step in advancing machine intelligence.
-
What is the potential improvement being discussed for Tesla's FSD V12?
—Tesla engineers are discussing a potential improvement to the 20-30 second planning range for full self-driving, which is currently the Achilles heel of the system.
Key Insights
- 🚗 Tesla's full self-driving 12.3 still has an Achilles heel in the 20 to 30 second planning regime, which is a significant architectural problem.
- 🍺 The challenge lies in streamlining high-level commands like "grab me a beer" down to the micro-level activity of the joints, a problem that Tesla keeps coming up against.
- 🤖 New architecture for self-supervised learning aims to create highly intelligent machines that can learn as efficiently as humans, pre-trained on video data to interpret a three-dimensional world.
- 🧠 "Create a latent space that will allow you to understand these videos and many many many of them right not just a few but create this Laten space where you're able to then fill in the blanks and predict what's going to happen."
- 🚗 Complex tasks like driving to the airport require multiple levels of hierarchical planning, which current self-driving technology struggles to efficiently handle.
- 🧠 The loss and gradient descent require that everything be differentiable, an interesting aspect of V-JEPA that Yann and Lex go into in some detail in the discussion.
- 🚗 The abstract latent space approach could be incredibly important for Tesla's full self-driving and humanoid robots, potentially revolutionizing the planning process.
- 🚗 The Tesla FSD V12 could excel at hierarchically planning actions to set itself up for success 20 or 30 seconds from now, bridging the gap between Google Maps and millisecond-by-millisecond planning.
#Tesla #FSD12.3 #DrKnowItAll
Clips
-
00:00 🚗 Tesla's FSD V12 has a significant architectural problem with its midterm planning, but there may be a solution in the video joint embedding predictive architecture (V-JEPA) discussed on the Lex Fridman podcast.
- Tesla's full self-driving 12.3 has a significant architectural problem with its midterm planning, but there may be a solution in the video joint embedding predictive architecture (V-JEPA).
- The speaker discusses the release of Tesla FSD V12 and the paper Meta published a month ago, explaining initial hesitation to talk about it but now feeling comfortable doing so.
- Hierarchical planning is necessary for complex actions, such as traveling from New York to Paris, and requires a specific architecture to allow for it.
- The speaker discusses minimizing the distance to Paris by decomposing the trip into sub-goals and achieving them step by step (a toy decomposition is sketched after this clip summary).
- Planning every action in detail is not feasible, so hierarchical planning is necessary, but AI systems still struggle to learn and implement this effectively.
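To make the sub-goal idea concrete, here is a minimal toy sketch of recursive goal decomposition. The plan table, goal names, and `decompose` helper are illustrative assumptions for this summary, not code from the video, Tesla, or Meta.

```python
# Hypothetical sketch of hierarchical goal decomposition: a high-level goal is
# recursively split into sub-goals until each one is primitive enough to hand
# to a low-level controller.

HIGH_LEVEL_PLAN = {
    "travel New York -> Paris": ["get to JFK airport", "fly JFK -> CDG", "get to hotel"],
    "get to JFK airport": ["walk to car", "drive to JFK", "park and enter terminal"],
    "walk to car": [],              # empty list = primitive action
    "drive to JFK": [],
    "park and enter terminal": [],
    "fly JFK -> CDG": [],
    "get to hotel": [],
}

def decompose(goal: str) -> list[str]:
    """Expand a goal depth-first into a flat sequence of primitive sub-goals."""
    subgoals = HIGH_LEVEL_PLAN.get(goal, [])
    if not subgoals:                # already primitive
        return [goal]
    plan: list[str] = []
    for sub in subgoals:
        plan.extend(decompose(sub))
    return plan

if __name__ == "__main__":
    for step in decompose("travel New York -> Paris"):
        print(step)
```

Each intermediate goal expands until only primitives remain, which is the pattern the New-York-to-Paris example relies on.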
-
04:29 🚗 Tesla's FSD V12 has a problem with 20-30 second planning, as it struggles with the hierarchical level of planning and abstraction needed for full self-driving.
- Planning a trip involves many steps and uncertainties, such as transportation, lodging, and unexpected expenses.
- Tesla's FSD 12.3 has top-down planning and second-to-second planning for driving, but it cannot plan for every detail at a millisecond level.
- The Tesla FSD V12 has a problem with the 20 to 30 second planning section, which is not being addressed effectively due to computational expense.
- Mapping out movements millisecond by millisecond is feasible for industrial robots, but not for tasks that demand adaptability in unpredictable environments, such as driving a car or grocery shopping.
- The hierarchical level of planning and abstraction needed for full self-driving and humanoid robots is an unsolved problem that Tesla and other companies keep running up against; the three planning horizons involved are sketched after this clip summary.
- JEPA and LLMs can provide detailed plans for tasks, but may struggle with new or unique situations.
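The gap this clip describes sits between very different planning horizons. The toy sketch below only illustrates those timescales; the layer names and numbers are my assumptions for illustration, not Tesla's actual planner.

```python
# Illustrative comparison of the three planning horizons contrasted in the
# discussion: route-level navigation, the 20-30 second maneuver layer called
# out as the weak spot, and millisecond-level control.

from dataclasses import dataclass

@dataclass
class PlanningLayer:
    name: str
    horizon_s: float       # how far ahead this layer plans
    replan_every_s: float  # how often it re-plans

LAYERS = [
    PlanningLayer("route (navigation, Google-Maps-style)", horizon_s=3600.0, replan_every_s=60.0),
    PlanningLayer("maneuver (lane choice, setting up turns)", horizon_s=25.0, replan_every_s=1.0),
    PlanningLayer("control (steering/accel commands)", horizon_s=0.5, replan_every_s=0.01),
]

for layer in LAYERS:
    steps = int(layer.horizon_s / layer.replan_every_s)
    print(f"{layer.name}: plans {layer.horizon_s}s ahead, "
          f"re-plans every {layer.replan_every_s}s (~{steps} steps per plan)")
```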
-
11:05 🚗 Tesla's FSD V12 has a problem with fine-tuning and non-generative masking in the training process, highlighting the importance of embodied AI for achieving artificial general intelligence.
- LLMs may be able to solve complex problems with training, but physical actions require experience in the physical world beyond what can be expressed in words.
- The speaker discusses the need for interaction with physical reality in robotics and the importance of embodied AI for achieving artificial general intelligence.
- Embodied AI is seen as the way to achieve artificially intelligent agents, with the Transformer architecture and large language models being criticized for being disembodied and requiring large amounts of data.
- Tesla's FSD V12 has a new architecture for self-supervised learning, pre-trained on video data to interpret a three-dimensional world and create a world model, despite the limitations of only having access to 2D data.
- The Tesla FSD V12 has a problem with fine-tuning and non-generative masking in the training process.
- Andrej Karpathy discussed using generative AI to fill in missing parts of images, such as stop signs, by training the AI to regenerate the original image.
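For contrast with the non-generative approach discussed later, here is a minimal PyTorch sketch of the generative masking idea described in this clip: hide patches of an image and train the network to regenerate the original pixels. The tiny model, patch sizes, and masking ratio are placeholder assumptions, not the actual FSD or masked-autoencoder code.

```python
# Generative masked-image objective: reconstruct the hidden pixels directly.

import torch
import torch.nn as nn

class TinyMaskedReconstructor(nn.Module):
    def __init__(self, patch_dim: int = 16 * 16 * 3, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(patch_dim, hidden), nn.ReLU())
        self.decoder = nn.Linear(hidden, patch_dim)  # predicts raw pixels back

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(patches))

model = TinyMaskedReconstructor()
patches = torch.rand(8, 196, 16 * 16 * 3)      # batch of images as flattened patches
mask = torch.rand(8, 196, 1) < 0.75             # hide ~75% of patches
masked_input = patches.masked_fill(mask, 0.0)   # zero out the hidden patches

recon = model(masked_input)
# Pixel-level reconstruction loss, computed only on the masked patches.
loss = ((recon - patches) ** 2)[mask.expand_as(patches)].mean()
loss.backward()
```

The cost of this objective is that every irrelevant pixel detail has to be regenerated, which is the expense the latent-space approach in the next clip avoids.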
-
18:57 🚗 Tesla's FSD V12 has a problem with over-focusing on small details, but it excels at detecting and understanding detailed interactions between objects, which is crucial for advancing machine intelligence.
- Humans interact with the world by focusing on the overall picture rather than individual details, whereas Tesla's FSD V12 has a problem with over-focusing on small details.
- Filling in missing parts of a video pixel by pixel is computationally expensive; doing it in an abstract representation space matters because that latent space captures the structure of reality.
- The speaker discusses compressing physical space into a representation and being able to reconstruct it in a general sense but not in every detail.
- Creating a latent space to understand and predict videos means that, unlike generative approaches, V-JEPA can discard irrelevant information for more efficient training; a minimal version of this objective is sketched after this clip summary.
- Tesla's FSD V12 has a new predictive architecture that focuses on understanding rather than generating, which is a crucial step in advancing machine intelligence.
- Machine learning is trying to regenerate physical world models, which is a difficult task that requires immense training, but Tesla's FSD V12 excels at detecting and understanding detailed interactions between objects.
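Here is a minimal PyTorch sketch of the non-generative objective this clip contrasts with pixel generation: predict the representation of the hidden region rather than its pixels. The encoder shapes, the fixed target encoder, and the single-layer predictor are simplifying assumptions, not Meta's actual V-JEPA implementation.

```python
# JEPA-style objective: the loss is computed between predicted and target
# embeddings, never on raw pixels.

import torch
import torch.nn as nn

dim = 128
context_encoder = nn.Sequential(nn.Linear(768, dim), nn.ReLU(), nn.Linear(dim, dim))
target_encoder = nn.Sequential(nn.Linear(768, dim), nn.ReLU(), nn.Linear(dim, dim))
predictor = nn.Linear(dim, dim)

# The target encoder is held fixed here (in practice an EMA copy), so only the
# context encoder and predictor receive gradients.
for p in target_encoder.parameters():
    p.requires_grad_(False)

visible_patches = torch.rand(8, 768)   # unmasked part of the video clip
masked_patches = torch.rand(8, 768)    # the part the model never sees in pixel space

z_context = context_encoder(visible_patches)
with torch.no_grad():
    z_target = target_encoder(masked_patches)

# The loss lives in the abstract latent space, so irrelevant detail that the
# encoder discards never needs to be reconstructed.
loss = ((predictor(z_context) - z_target) ** 2).mean()
loss.backward()
```

Because the comparison happens in latent space, details like leaf positions or pixel noise never need to be regenerated, which is the efficiency argument made above.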
-
24:22 🚗 Tesla FSD V12 struggles with dynamically adapting to changing circumstances, while Meta releases open-source material and Yann discusses the goal of building advanced machine intelligence.
- Meta is releasing a lot of open-source material, which is a big advantage to the world, and Lex and Yann discuss the business case for it in the interview.
- Yann discusses the goal of building advanced machine intelligence that can learn like humans, but points out that Tesla's full self-driving is currently poor at dynamically adapting to changing circumstances.
- V-JEPA is a non-generative model that learns by predicting missing parts of a video in an abstract representation space, leading to improved training and sample efficiency.
- The Tesla FSD V12 uses labeled examples for fine-tuning and reinforcement learning, but the architecture is more efficient in terms of the number of labeled examples needed and the total effort put into learning (a frozen-backbone sketch follows this clip summary).
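A rough sketch of the label-efficiency point: freeze the large pretrained backbone and train only a small task head on a handful of labeled examples. The layer sizes, class count, and optimizer below are illustrative assumptions, not the actual FSD or V-JEPA fine-tuning setup.

```python
# Frozen-backbone fine-tuning: gradients only flow into the small task head.

import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Linear(768, 512), nn.ReLU(), nn.Linear(512, 512))  # stands in for a pretrained video encoder
for p in backbone.parameters():
    p.requires_grad_(False)             # frozen: the expensive pretraining is reused as-is

head = nn.Linear(512, 10)               # small head, e.g. 10 action classes
optimizer = torch.optim.Adam(head.parameters(), lr=1e-3)

features = torch.rand(64, 768)          # a small labeled batch
labels = torch.randint(0, 10, (64,))

with torch.no_grad():
    z = backbone(features)              # cheap frozen features
loss = nn.functional.cross_entropy(head(z), labels)
loss.backward()
optimizer.step()
```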
-
28:01 🚗 Tesla FSD V12 training involves leaving masked parts in videos, using self-supervised learning and neural networks, with a focus on spatiotemporal regions, but the number of demonstrations is unclear.
- The process involves training a predictor to fill in missing information in a representation space, using joint embedding and neural network models, and fine-tuning to communicate with humans, with a focus on spatiotemporal regions in videos.
- Training the Tesla FSD V12 involves leaving a masked part as a giant hole throughout training, which requires differentiability for the loss and gradient descent; the discussion goes into detail about how this is achieved.
- Self-supervised training on videos teaches a model about the world, and considering the masking strategy is important to avoid making the task too easy, especially for videos with quick cuts.
- Masking out portions of the video in both space and time forces the model to learn and develop an understanding of the scene, allowing for efficient predictions in the abstract representation space (a tube-masking sketch follows this clip summary).
- Neural networks are pre-trained and then adapted to learn new skills by adding small, efficient layers on top of the pre-trained model, allowing for rapid training with relatively few labeled examples.
- The number of demonstrations for Tesla's FSD V12 is unclear, as the table does not show that information.
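The masking strategy described in this list can be pictured as a "tube" through the clip: the same spatial block is hidden in every frame, so the model cannot recover it by copying from a neighbouring frame. The grid and block sizes below are illustrative assumptions.

```python
# Spatiotemporal ("tube") masking over a grid of video patch tokens.

import torch

T, H, W = 16, 14, 14                    # frames x patch-grid height x width
mask = torch.zeros(T, H, W, dtype=torch.bool)

# Pick a random rectangular block in space and hide it for the whole clip.
h0 = torch.randint(0, H - 6, (1,)).item()
w0 = torch.randint(0, W - 6, (1,)).item()
mask[:, h0:h0 + 6, w0:w0 + 6] = True    # the hidden "tube" through time

print(f"masked {mask.float().mean():.0%} of the video tokens")
```

Because the hole persists across every frame, the model has to reason about the scene rather than exploit quick cuts or adjacent-frame copies, which is the point made above about not making the task too easy.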
-
33:05 🚗 Tesla's FSD V12 has potential for hierarchical planning and predictive capabilities, but there is a need for improvement in the 20-30 second planning range for full self-driving.
- A self-supervised approach for learning representations from video can be applied to various downstream image and video tasks without adaptation of the model parameters, outperforming previous video representation learning approaches in frozen evaluation on image classification, action classification, and spatiotemporal action detection tasks.
- Yann and Meta have developed a predictive architecture that is claimed to be better at hierarchical planning than current state-of-the-art models, which could be important for Tesla's full self-driving and humanoid robots.
- Tesla's FSD V12 has the potential to plan actions in advance, allowing for hierarchical planning that sets it up for success in the near future (a latent-space planning sketch follows this clip summary).
- Tesla's FSD V12 can learn from billions of miles of data and predict human behavior in complex corner cases, allowing the car to behave more like a human by utilizing data rather than generating scenes.
- Tesla's FSD V12 has the ability to predict the future and take action, but generative capabilities are not necessary for full self-driving.
- Tesla engineers are discussing a potential improvement to the 20-30 second planning range for full self-driving, which is currently the Achilles heel of the system.
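As a toy illustration of what "setting itself up for success 20 or 30 seconds from now" could look like with a learned world model, the sketch below rolls candidate action sequences through a latent predictor and keeps the one that ends closest to a sub-goal. The predictor, the sub-goal encoding, and the random-shooting search are my illustrative assumptions, not Tesla's or Meta's planner.

```python
# Planning in a learned latent space: score candidate plans by how close the
# predicted future state lands to a sub-goal embedding.

import torch
import torch.nn as nn

dim, action_dim, horizon = 64, 2, 30    # ~30 one-second steps, roughly the 20-30 s regime

predictor = nn.Linear(dim + action_dim, dim)   # stands in for a learned world model

def rollout(z0: torch.Tensor, actions: torch.Tensor) -> torch.Tensor:
    """Predict the latent state after applying a sequence of actions."""
    z = z0
    for a in actions:
        z = predictor(torch.cat([z, a]))
    return z

current_state = torch.rand(dim)
subgoal = torch.rand(dim)               # e.g. "be in the right-turn lane" as a latent target

# Sample a few candidate plans and keep the one that ends closest to the sub-goal.
candidates = [torch.rand(horizon, action_dim) for _ in range(16)]
with torch.no_grad():
    costs = [torch.dist(rollout(current_state, plan), subgoal) for plan in candidates]
best_plan = candidates[int(torch.stack(costs).argmin())]
print("chose plan with cost", min(costs).item())
```

In the podcast's framing, each sub-goal would itself come from a higher planning level, the same hierarchy the New-York-to-Paris example decomposes.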
-
38:28 🚗 Tesla's FSD V12 has a new architecture with potential, and the speaker asks for feedback and likes.