Use the BAAI/bge-small-zh-v1.5 model locally through an OpenAI-compatible /v1/embeddings endpoint powered by FastAPI.
```bash
python -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install -r requirements.txt
```

The service defaults to the Hugging Face China mirror (https://hf-mirror.com). Override it before the first run if you prefer a different mirror.
```bash
export HF_ENDPOINT=https://hf-mirror.com        # or your preferred mirror URL
export EMBEDDING_CACHE_DIR=$(pwd)/model_cache   # persistent local cache
```

You can pre-download the model once (optional but recommended):
```bash
python - <<'PY'
import os
from sentence_transformers import SentenceTransformer

# The quoted heredoc ('PY') suppresses shell expansion, so read the cache
# directory from the environment instead of a literal "$EMBEDDING_CACHE_DIR".
SentenceTransformer(
    "BAAI/bge-small-zh-v1.5",
    cache_folder=os.environ.get("EMBEDDING_CACHE_DIR"),
    device="cpu",
)
PY
```

Then start the server:

```bash
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
```

FastAPI will serve interactive docs at http://localhost:8000/docs.
```bash
curl -X POST "http://localhost:8000/v1/embeddings" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "bge-small-zh-v1.5",
    "input": ["今天天气很好", "自然语言处理"],
    "user": "demo-user"
  }'
```

Response excerpt:
```json
{
  "object": "list",
  "data": [
    {"object": "embedding", "index": 0, "embedding": [...]},
    {"object": "embedding", "index": 1, "embedding": [...]}
  ],
  "model": "bge-small-zh-v1.5",
  "usage": {"prompt_tokens": 9, "total_tokens": 9}
}
```

Configuration is driven by environment variables:

- `EMBEDDING_MODEL_NAME`: switch to a different SentenceTransformer checkpoint.
- `EMBEDDING_DEVICE`: set to `cuda`, `mps`, etc. Defaults to CPU.
- `EMBEDDING_BATCH_SIZE`: control the batch size for `encode()`.
- `EMBEDDING_CACHE_DIR`: persistent model/cache directory (also reused for the Hugging Face cache when provided).
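Because the endpoint is OpenAI-compatible, the curl request above can also be issued from Python. A minimal sketch using only the standard library (the helper names are illustrative, and it assumes the server is running on localhost:8000):

```python
import json
from urllib.request import Request, urlopen


def build_payload(texts, model="bge-small-zh-v1.5", user="demo-user"):
    """Build an OpenAI-style /v1/embeddings request body."""
    return {"model": model, "input": texts, "user": user}


def fetch_embeddings(texts, base_url="http://localhost:8000"):
    """POST to the local service and return embeddings ordered by index."""
    req = Request(
        f"{base_url}/v1/embeddings",
        data=json.dumps(build_payload(texts)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urlopen(req) as resp:
        body = json.load(resp)
    # The response carries one item per input; sort by "index" to be safe.
    return [item["embedding"]
            for item in sorted(body["data"], key=lambda d: d["index"])]
```

With the server up, `fetch_embeddings(["今天天气很好", "自然语言处理"])` returns one vector per input string.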
⚠️ Token usage in the response is a simple heuristic (character-count based). Integrate your own tokenizer if you require exact counts.
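To make the warning above concrete, a character-count heuristic might look like the sketch below. This is illustrative only; it is not necessarily the exact rule the service implements.

```python
def approximate_token_usage(texts):
    """Approximate token usage as the total character count across inputs.

    A real tokenizer (e.g. the one belonging to the underlying model) will
    generally produce different numbers, which is why the counts in the
    response should be treated as rough estimates.
    """
    prompt_tokens = sum(len(text) for text in texts)
    return {"prompt_tokens": prompt_tokens, "total_tokens": prompt_tokens}
```

For embeddings there is no completion, so `total_tokens` simply mirrors `prompt_tokens`.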