UTF-8 and / or quality filter chunks

```
def should_embed(chunk):
    # Skip low-content chunks
    if punct_ratio > 0.4: return False  # Lists
    if alpha_ratio < 0.3: return False  # Not enough text
    if len(chunk) < 20: return False    # Too short
    return True
```

Agent thinks UTF-8 was less than 0.1% cause, but we could still unicodedata NLTK filter those.