def should_embed(chunk):
# Skip low-content chunks
if punct_ratio > 0.4: return False # Lists
if alpha_ratio < 0.3: return False # Not enough text
if len(chunk) < 20: return False # Too short
return True
Agent thinks UTF-8 was less than 0.1% cause, but we could still unicodedata NLTK filter those.
Agent thinks UTF-8 was less than 0.1% cause, but we could still unicodedata NLTK filter those.