Concrete motivation: if we are injecting things into the system prompt that severely reduce performance, we should have tests that catch that regression.
These should be small but domain-specific semantic eval tests that compare the behaviour of the effectful LLM API against the raw litellm/openai API.
These should test specific semantic features/functionalities that our downstream projects rely on, for example:
- generating instances of custom classes
- generating terms in effectful sublanguages
- making use of lexical variables or tools, and so on.
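One way to structure such a test is a small harness that runs the same prompt through both backends and applies a domain-specific grader. A minimal sketch of the first example (generating an instance of a custom class) is below; the names (`compare_backends`, `Point`, the grader) are illustrative assumptions, not the project's real API. In a real test the `baseline` callable would wrap `litellm.completion(...)` and `effectful` would wrap our API.

```python
# Hypothetical eval harness: backend names and the Point class are
# assumptions for illustration, not the project's actual interfaces.
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Point:
    x: int
    y: int


def parses_as_point(reply: str) -> bool:
    """Grader for one semantic feature: did the model emit a valid Point?"""
    try:
        Point(**json.loads(reply))
        return True
    except (ValueError, TypeError):
        return False


def compare_backends(
    effectful: Callable[[str], str],
    baseline: Callable[[str], str],
    prompt: str,
    grader: Callable[[str], bool],
) -> dict:
    """Run the same prompt through both backends and grade each reply.

    A regression only counts when the baseline passes but the
    effectful wrapper fails, so flaky-on-both prompts don't alert.
    """
    base_ok = grader(baseline(prompt))
    eff_ok = grader(effectful(prompt))
    return {"baseline": base_ok, "effectful": eff_ok,
            "regression": base_ok and not eff_ok}
```

Each downstream semantic feature (effectful sublanguage terms, lexical variables, tool use) would get its own grader, keeping individual tests small and failures attributable to one feature.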