Extracts Human Services Data Specification (HSDS) structured data from community services flyer images using BAML and the GPT-4 Vision API.
This tool automatically converts flyer images into structured, machine-readable HSDS-compliant data. Simply provide an image of a community services flyer, and the system will extract:
- Organization information
- Service details and descriptions
- Location data with addresses
- ServiceAtLocation relationships
Perfect for digitizing community resource information and making it accessible through standardized APIs.
- Python 3.10+
- OpenAI API key - Set as the `OPENAI_API_KEY` environment variable
- Dependencies - Listed in `requirements.txt`
Using GPT-4 Vision API:
- Typical cost: ~$0.01-0.05 per image
- Based on image size and complexity
Looking for a free alternative? Check out the `deepseekOCR` branch, which uses fully local, open-source models with no API costs.
```bash
git clone https://github.com/yourusername/image-to-hsds.git
cd image-to-hsds
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
pip install -r requirements.txt
```

Option A: Using a `.env` file (recommended)

```bash
echo "OPENAI_API_KEY=your_key_here" > .env
```

Option B: Export in your shell

```bash
export OPENAI_API_KEY=your_key_here
```

Get your API key at platform.openai.com.
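Before running the script, it can help to fail fast if the key is missing. A minimal sketch (the `require_api_key` helper is illustrative and not part of the repo):

```python
import os

def require_api_key(env=os.environ):
    """Return the OpenAI key, or raise a clear error before any API call is made."""
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; add it to .env or export it in your shell"
        )
    return key
```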
Run with the default example image:

```bash
python extract_hsds.py
```

Run with your own flyer:

```bash
python extract_hsds.py path/to/your_flyer.jpg
```

The script performs the following steps:
- Loads the image from the specified path
- Calls the GPT-4 Vision API via the BAML function `ExtractHSDSFromImage`
- Extracts structured data following the HSDS specification
- Prints a summary to the console for quick review
- Saves JSON output to `hsds_outputs/extracted_hsds_data.json`
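The first step, loading an image for a vision-model call, typically amounts to base64-encoding the file with its media type. A sketch of that piece (the `load_image_b64` helper name is hypothetical, not the script's actual code):

```python
import base64
import mimetypes
from pathlib import Path

def load_image_b64(path: str) -> tuple[str, str]:
    """Read an image file and return (media type, base64 payload) for a vision API request."""
    media_type = mimetypes.guess_type(path)[0] or "image/jpeg"
    payload = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return media_type, payload
```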
- Console: Human-readable summary of extracted data
- JSON File: `hsds_outputs/extracted_hsds_data.json` - Complete HSDS-compliant structured data
```json
{
  "organization": {
    "name": "Teen Feed",
    "description": "Meals, connections & resources for youth..."
  },
  "services": [
    {
      "name": "U-District Dinner",
      "description": "365 days a year..."
    }
  ],
  "locations": [...],
  "service_at_locations": [...]
}
```

```
image-to-hsds/
├── extract_hsds.py               # Main extraction script
├── baml_src/                     # BAML definitions
│   ├── clients.baml              # OpenAI GPT-4 Vision client config
│   ├── hsds_types.baml           # HSDS type definitions
│   ├── extraction_function.baml  # Extraction function and prompts
│   └── generators.baml           # Python/Pydantic code generation
├── baml_client/                  # Auto-generated Python client
├── assets/                       # Branding and documentation images
│   ├── LOGO.png
│   └── README_IMAGE.png
├── images/                       # Sample input flyers
├── hsds_outputs/                 # Extracted JSON output
├── requirements.txt              # Python dependencies
├── .env                          # API keys (create this)
└── README.md                     # This file
```
- `extract_hsds.py` - Main script that orchestrates the extraction
- `baml_src/extraction_function.baml` - Contains the prompt and extraction logic
- `baml_src/hsds_types.baml` - Defines the HSDS data structure
- `baml_client/` - Auto-generated from BAML files (don't edit directly)
The extraction produces HSDS-compliant JSON objects including:
- Organization - Details about the service provider
- Service - Specific programs or services offered
- Location - Physical addresses and accessibility info
- ServiceAtLocation - Relationships linking services to locations
Note: Some fields may be null if the information is not present in the flyer or is ambiguous.
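The saved JSON can then be consumed programmatically. A minimal, null-safe sketch using the field names from the example output above (the `summarize` helper is illustrative, not part of the repo):

```python
import json
from pathlib import Path

def summarize(data: dict) -> str:
    """One-line, null-safe summary of an extracted HSDS record."""
    org = (data.get("organization") or {}).get("name") or "unknown"
    n_services = len(data.get("services") or [])
    n_locations = len(data.get("locations") or [])
    return f"{org}: {n_services} service(s) at {n_locations} location(s)"

# e.g. summarize(json.loads(Path("hsds_outputs/extracted_hsds_data.json").read_text()))
```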
BAML generates a Python client (`baml_client/`) from the definitions in `baml_src/`. After editing any `.baml` files, regenerate the client:

```bash
# Install the BAML CLI (if not already installed)
npm install -g @boundaryml/baml

# Regenerate the Python client
baml-cli generate
```

- Documentation: docs.boundaryml.com
- BAML Language Guide: Learn about types, prompts, and clients
- Examples: Check `baml_src/` for working examples
Edit `baml_src/extraction_function.baml` to:
- Adjust extraction instructions
- Add/remove HSDS fields
- Change the model (e.g., `gpt-4o`, `gpt-4-turbo`)
- Modify temperature or other parameters
Before running on production data:
- Test with sample images in the `images/` directory
- Verify the API key is set correctly
- Check output in `hsds_outputs/` for quality
- Iterate on prompts if extraction quality needs improvement
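Since some fields may legitimately be null, a quick count of what's missing in the output can guide prompt iteration. A hedged sketch (the `missing_fields` helper is not part of the repo):

```python
def missing_fields(record, prefix=""):
    """Recursively collect dotted paths of null fields in an extracted record."""
    paths = []
    if isinstance(record, dict):
        for key, value in record.items():
            paths += missing_fields(value, f"{prefix}{key}.")
    elif isinstance(record, list):
        for i, value in enumerate(record):
            paths += missing_fields(value, f"{prefix}{i}.")
    elif record is None:
        paths.append(prefix.rstrip("."))
    return paths
```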
Check out the `deepseekOCR` branch, which uses:
- ✅ DeepSeek OCR - State-of-the-art document understanding
- ✅ Ollama (gpt-oss-20b) - Local LLM for extraction
- ✅ No API costs - Runs completely offline
- ✅ Privacy-first - Data never leaves your machine
Trade-offs:
- Slower processing (~1-2 minutes vs 5-10 seconds)
- Requires more local resources (16GB+ RAM recommended)
- Slightly lower accuracy on complex layouts
Contributions are welcome! Please feel free to:
- 🐛 Report bugs or issues
- 💡 Suggest new features or improvements
- 📝 Improve documentation
- 🔧 Submit pull requests
This project is licensed under the MIT License - see the LICENSE file for details.
- BAML - For structured LLM outputs
- OpenAI - For GPT-4 Vision API
- HSDS - Human Services Data Specification
- Connect 211 - For supporting community resource data digitization
Built by Connect211 with ❤️ for making community services data more accessible

