Velocity is an adaptive LLM inference and serving engine that dynamically optimizes Hugging Face models using JAX, Flash Attention 2, and Q-learning. Given a user-provided Hugging Face model card, it automatically builds an optimized inference pipeline, with progress displayed in a Streamlit UI.
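The exact interface is not documented yet, but as a rough illustration of the model-card-driven setup, the sketch below reads a user-supplied Hugging Face model ID with `AutoConfig` from `transformers` and derives simple sizing hints. The model ID, size threshold, and dtype heuristic are illustrative assumptions, not Velocity's actual logic.

```python
# Hypothetical sketch: inspect a user-supplied Hugging Face model ID to drive
# pipeline decisions. The threshold and dtype heuristic are illustrative only.
from transformers import AutoConfig

def inspect_model(model_id: str) -> dict:
    """Read the model's config from the Hub and derive rough sizing hints."""
    config = AutoConfig.from_pretrained(model_id)
    hidden = getattr(config, "hidden_size", None)
    layers = getattr(config, "num_hidden_layers", None)
    return {
        "model_id": model_id,
        "hidden_size": hidden,
        "num_layers": layers,
        # Crude heuristic: larger models start in a lower-precision mode.
        "suggested_dtype": "bfloat16" if (hidden or 0) >= 4096 else "float32",
    }

if __name__ == "__main__":
    print(inspect_model("gpt2"))  # any public model ID works here
```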
- Enable high-performance LLM inference on edge devices & Apple Silicon.
- Reduce latency and memory footprint using advanced optimizations.
- Automate inference pipeline creation from Hugging Face model cards.
- Leverage RL-based Q-learning for dynamic batch size & precision tuning (see the sketch after this list).
- Provide a Streamlit UI for real-time progress tracking.
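The Q-learning-based tuning is still under development; the sketch below only illustrates the general idea with a single-state Q-table over (batch size, precision) actions and an epsilon-greedy policy. The action set, hyperparameters, and the `measure_latency` callback are assumptions for illustration, not the project's actual implementation.

```python
# Minimal Q-learning sketch: pick a (batch_size, precision) pair, measure
# latency, and reward faster settings. All values here are placeholders.
import random

ACTIONS = [(b, p) for b in (1, 4, 8, 16) for p in ("float32", "bfloat16")]
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

q_table = {a: 0.0 for a in ACTIONS}  # single-state Q-table for simplicity

def select_action() -> tuple:
    """Epsilon-greedy choice over (batch_size, precision) pairs."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(q_table, key=q_table.get)

def update(action: tuple, reward: float) -> None:
    """Standard Q-learning update, collapsed to a single state."""
    best_next = max(q_table.values())
    q_table[action] += ALPHA * (reward + GAMMA * best_next - q_table[action])

def tuning_step(measure_latency) -> tuple:
    """Run one tuning iteration: pick settings, measure, learn."""
    action = select_action()
    latency = measure_latency(*action)   # assumed callback into the engine
    update(action, reward=-latency)      # faster runs earn higher reward
    return action
```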
- Dynamic Model Optimization: Accepts any Hugging Face model card as input.
- JAX-based Inference Engine: Uses JIT compilation for accelerated execution (see the JAX sketch below).
- Flash Attention 2 Acceleration: Reduces memory load & improves speed.
- Q-learning for Adaptive Optimization: Dynamically selects the best batch size & precision.
- FastAPI Backend: Optimized model serving via API (see the FastAPI sketch below).
- Streamlit UI: Displays pipeline progress and shows inference results.
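To make the JIT-compilation point concrete, here is a toy JAX example of a jitted forward step. The tiny embedding-plus-projection "model" is a stand-in and does not reflect Velocity's real pipeline.

```python
# Illustrative only: how a JIT-compiled decode step might look in JAX.
import jax
import jax.numpy as jnp

@jax.jit  # compiled once per input shape, then reused for fast execution
def forward(params, token_ids):
    """Toy embedding + projection standing in for a transformer forward pass."""
    hidden = params["embedding"][token_ids]   # (batch, seq, hidden)
    logits = hidden @ params["projection"]    # (batch, seq, vocab)
    return jnp.argmax(logits, axis=-1)

if __name__ == "__main__":
    key = jax.random.PRNGKey(0)
    params = {
        "embedding": jax.random.normal(key, (1000, 64)),
        "projection": jax.random.normal(key, (64, 1000)),
    }
    tokens = jnp.array([[1, 2, 3, 4]])
    print(forward(params, tokens).shape)  # first call compiles, later calls are fast
```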
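The serving layer could look roughly like the minimal FastAPI sketch below. The `/generate` route, request fields, and `run_inference` placeholder are assumptions, not the project's actual API.

```python
# Hedged sketch of the FastAPI serving layer with a single /generate endpoint.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI(title="Velocity inference API (sketch)")

class GenerateRequest(BaseModel):
    model_id: str
    prompt: str
    max_new_tokens: int = 64

def run_inference(req: GenerateRequest) -> str:
    """Placeholder for the JAX-optimized generation call."""
    return f"[generated text for: {req.prompt[:30]}...]"

@app.post("/generate")
def generate(req: GenerateRequest) -> dict:
    """Serve a single generation request."""
    return {"model_id": req.model_id, "output": run_inference(req)}

# Run locally (assuming this file is saved as server_sketch.py):
#   uvicorn server_sketch:app --reload
```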
Roadmap
- Dynamic Model Card Inference
- Flash Attention 2 Integration
- Q-Learning for Adaptive Optimization
- GPU/TPU Support for Faster Execution
- Real-time Monitoring & Metrics in Streamlit
- Docker & Cloud Deployment
Contact
For questions or collaborations, reach out to [email protected]
Project Status: Very Early Development
This repository is in the early stages of development. Features are subject to change, and some functionality may not be fully implemented yet.