
🚗 Real-Time LLM-based Visual Grounding in 3D Driving Scenes

This project integrates object detection, 3D reconstruction, and GPT-4-based scene reasoning to allow natural language interaction with 3D driving environments. Built as an expert-level AI Engineering portfolio project.


🧠 Features

  • 3D Scene Reconstruction from RGB-D (TUM Dataset), sketched in code after this list
  • Object Detection using Detectron2
  • Scene Graph Creation for spatial reasoning
  • Natural Language Question Answering via GPT-4
  • Real-Time 3D Visualization with Panda3D
  • Object Highlighting based on user queries such as:
    • "Which cars are behind the truck?"
    • "Where is the pedestrian near the stop sign?"

🖼️ Pipeline Diagram

See pipeline.png for the end-to-end pipeline diagram.


📂 Project Structure

```
VisualGroundingAutonomy/
├── data/                 # RGB-D images and depth maps
├── reconstruction/       # Backproject RGB-D → 3D point cloud
├── scene_graph/          # Scene graph builder
├── grounding/            # LLM interface (GPT or CLIP)
├── utils/                # Mapping objects to points
├── visualizer/           # Panda3D real-time viewer
├── main.py               # Full pipeline
├── requirements.txt
└── README.md
```
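
To make the scene_graph/ stage concrete, here is a minimal sketch of deriving pairwise relations from per-object 3D centroids. The relation definitions ("behind" = larger camera-frame depth with a 0.5 m margin, "near" = within a distance threshold) are illustrative assumptions, not the repository's actual rules; serializing the returned dict with json.dump would produce a file in the spirit of scene_graph.json below.

```python
import numpy as np

def build_scene_graph(objects, near_thresh=2.0):
    """Derive pairwise spatial relations from 3D object centroids.

    `objects` maps ids like "car_0" to (x, y, z) centroids in camera
    coordinates (z pointing forward). Relation definitions are assumptions:
    "behind" = larger z by at least 0.5 m, "near" = within `near_thresh` metres.
    """
    relations = []
    ids = list(objects)
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:
            pa, pb = np.asarray(objects[a]), np.asarray(objects[b])
            if pa[2] > pb[2] + 0.5:
                relations.append((a, "behind", b))
            elif pb[2] > pa[2] + 0.5:
                relations.append((b, "behind", a))
            if np.linalg.norm(pa - pb) < near_thresh:
                relations.append((a, "near", b))
    return {"objects": objects, "relations": relations}
```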

🧪 Example Query

User: "Which objects are behind car_0?"
🧠 GPT-4: "Pedestrian_2 and car_3 are behind car_0 based on spatial relationships."
✅ Viewer: Highlights those objects in red
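
A sketch of how the grounding step might pose such a query to GPT-4, assuming the OpenAI Python client (v1 style, with OPENAI_API_KEY set in the environment) and a JSON-serializable scene graph; the prompt wording and the answer_spatial_query name are illustrative, not the repository's actual grounding/ interface.

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_spatial_query(scene_graph: dict, question: str) -> str:
    """Serialize the scene graph into the prompt and let GPT-4 reason over it."""
    prompt = (
        "You are reasoning over a 3D driving scene.\n"
        "Scene graph (objects with 3D centroids and spatial relations):\n"
        f"{json.dumps(scene_graph)}\n"
        f"Question: {question}\n"
        "Answer with the matching object ids and a one-sentence justification."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# e.g. answer_spatial_query(graph, "Which objects are behind car_0?")
```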

📦 Output Samples

  • 🔴 scene.ply — Reconstructed scene
  • 📄 scene_graph.json — Full graph with relations
  • 🎥 demo.gif — Panda3D video output (recorded)
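
A quick way to inspect the first two outputs, assuming scene.ply loads as a point cloud with Trimesh and scene_graph.json follows an objects/relations layout like the sketch above; both file layouts are assumptions about the repository's format.

```python
import json
import trimesh

scene = trimesh.load("scene.ply")            # PointCloud/Trimesh with .vertices
with open("scene_graph.json") as f:
    graph = json.load(f)

print(len(scene.vertices), "reconstructed points")
print(len(graph.get("relations", [])), "spatial relations")
```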

💻 Technologies

Python · PyTorch · Detectron2 · Panda3D · GPT-4 · NumPy · Trimesh · LangChain


🧾 License

MIT License
