This project integrates object detection, 3D reconstruction, and GPT-4-based scene reasoning to allow natural language interaction with 3D driving environments. Built as an expert-level AI Engineering portfolio project.
- 3D Scene Reconstruction from RGB-D (TUM Dataset)
- Object Detection using Detectron2
- Scene Graph Creation for spatial reasoning
- Natural Language Question Answering via GPT-4
- Real-Time 3D Visualization with Panda3D
- Object Highlighting based on user queries such as:
  - "Which cars are behind the truck?"
  - "Where is the pedestrian near the stop sign?"

Minimal sketches of each stage follow below.
```
VisualGroundingAutonomy/
├── data/             # RGB-D images and depth maps
├── reconstruction/   # Backproject RGB-D → 3D point cloud
├── scene_graph/      # Scene graph builder
├── grounding/        # LLM interface (GPT or CLIP)
├── utils/            # Mapping objects to points
├── visualizer/       # Panda3D real-time viewer
├── main.py           # Full pipeline
├── requirements.txt
└── README.md
```
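`main.py` ties the stages together. A hypothetical end-to-end flow; every imported name below mirrors the layout above but is an assumption about the actual module APIs:

```python
# Hypothetical glue code; real function and class names may differ.
from reconstruction import backproject_frames   # RGB-D -> point cloud + per-object points
from scene_graph import build_scene_graph       # detections -> nodes and relations
from grounding import answer_query              # GPT-4 interface
from visualizer import Viewer                   # Panda3D real-time viewer

points, colors, objects = backproject_frames("data/")
graph = build_scene_graph(objects)
viewer = Viewer(points, colors)

answer, matched_ids = answer_query(graph, "Which objects are behind car_0?")
viewer.highlight(matched_ids)                   # tint matches red
viewer.run()                                    # Panda3D main loop
```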
User: "Which objects are behind car_0?"
🧠 GPT-4: "Pedestrian_2 and car_3 are behind car_0 based on spatial relationships."
✅ Viewer: Highlights those objects in red
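Highlighting amounts to tinting the matched nodes in the viewer. A minimal Panda3D sketch, assuming each detected object is attached under `render` as a NodePath named after its object ID (`Viewer` and `highlight` are illustrative names):

```python
from direct.showbase.ShowBase import ShowBase

class Viewer(ShowBase):
    def highlight(self, object_ids, color=(1.0, 0.0, 0.0, 1.0)):
        """Tint the nodes for the given object IDs (red RGBA by default)."""
        for obj_id in object_ids:
            node = self.render.find(f"**/{obj_id}")  # search scene graph by name
            if not node.isEmpty():
                node.setColor(*color)                # flat color override

viewer = Viewer()
viewer.highlight(["pedestrian_2", "car_3"])
viewer.run()
```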
- 🔴 `scene.ply` – Reconstructed scene
- 📄 `scene_graph.json` – Full graph with relations
- 🎥 `demo.gif` – Panda3D video output (recorded)
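Both artifacts can be inspected offline with libraries already in the stack; a quick sketch (the `edges` key is an assumption about the JSON layout):

```python
import json
import trimesh

scene = trimesh.load("scene.ply")   # point cloud / mesh from reconstruction
print(scene)                        # vertex and face summary

with open("scene_graph.json") as f:
    graph = json.load(f)
print(graph["edges"][:5])           # hypothetical key: first few spatial relations
```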
Python · PyTorch · Detectron2 · Panda3D · GPT-4 · NumPy · Trimesh · LangChain
MIT License
