60 changes: 57 additions & 3 deletions pr_agent/algo/utils.py
@@ -945,12 +945,66 @@ def clip_tokens(text: str, max_tokens: int, add_three_dots=True, num_input_token
"""
Clip the number of tokens in a string to a maximum number of tokens.

This function limits text to a specified token count by calculating the approximate
character-to-token ratio and truncating the text accordingly. A safety factor of 0.9
(10% reduction) is applied to ensure the result stays within the token limit.

Args:
text (str): The string to clip.
text (str): The string to clip. If empty or None, returns the input unchanged.
max_tokens (int): The maximum number of tokens allowed in the string.
add_three_dots (bool, optional): A boolean indicating whether to add three dots at the end of the clipped
If negative, returns an empty string.
add_three_dots (bool, optional): Whether to add "\\n...(truncated)" at the end
of the clipped text to indicate truncation.
Defaults to True.
num_input_tokens (int, optional): Pre-computed number of tokens in the input text.
If provided, skips token encoding step for efficiency.
If None, tokens will be counted using TokenEncoder.
Defaults to None.
delete_last_line (bool, optional): Whether to remove the last line from the
clipped content before adding truncation indicator.
Useful for ensuring clean breaks at line boundaries.
Defaults to False.

Returns:
str: The clipped string.
str: The clipped string. Returns original text if:
- Text is empty/None
- Token count is within limit
- An error occurs during processing

Returns empty string if max_tokens <= 0.

Examples:
Basic usage:
>>> text = "This is a sample text that might be too long"
>>> result = clip_tokens(text, max_tokens=10)
>>> print(result)
This is a sample...
(truncated)

Without truncation indicator:
>>> result = clip_tokens(text, max_tokens=10, add_three_dots=False)
>>> print(result)
This is a sample

With pre-computed token count:
>>> result = clip_tokens(text, max_tokens=5, num_input_tokens=15)
>>> print(result)
This...
(truncated)

With line deletion:
>>> multiline_text = "Line 1\\nLine 2\\nLine 3"
>>> result = clip_tokens(multiline_text, max_tokens=3, delete_last_line=True)
>>> print(result)
Line 1
Line 2
...
(truncated)

Notes:
The function uses a safety factor of 0.9 (10% reduction) to ensure the
result stays within the token limit, as character-to-token ratios can vary.
If token encoding fails, the original text is returned with a warning logged.
"""
if not text:
return text
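For orientation, here is a minimal sketch of the control flow the docstring above describes. This is an illustrative reconstruction, not the verbatim pr_agent implementation: the name clip_tokens_sketch is made up, a whitespace split stands in for the real TokenEncoder token count, and the error handling is condensed.

def clip_tokens_sketch(text, max_tokens, add_three_dots=True,
                       num_input_tokens=None, delete_last_line=False):
    """Illustrative sketch of the documented clip_tokens behavior."""
    if not text:
        return text  # empty or None comes back unchanged
    try:
        if num_input_tokens is None:
            # The real function counts tokens with TokenEncoder; a whitespace
            # split keeps this sketch dependency-free.
            num_input_tokens = len(text.split())
        if num_input_tokens <= max_tokens:
            return text  # already within budget
        if max_tokens < 0:
            return ""
        chars_per_token = len(text) / num_input_tokens
        num_output_chars = int(0.9 * chars_per_token * max_tokens)  # 10% safety margin
        if num_output_chars <= 0:
            return ""  # budget too small to keep anything
        clipped = text[:num_output_chars]
        if delete_last_line:
            clipped = clipped.rsplit("\n", 1)[0]  # clean break at a line boundary
        if add_three_dots:
            clipped += "\n...(truncated)"
        return clipped
    except Exception:
        return text  # on any failure (e.g. encoding errors), fall back to the input

The 0.9 multiplier is the safety factor the Notes section mentions: character-to-token ratios vary across text, so the clipping deliberately undershoots the character budget rather than risk exceeding max_tokens.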
301 changes: 295 additions & 6 deletions tests/unittest/test_clip_tokens.py
@@ -1,13 +1,302 @@

# Generated by CodiumAI

import pytest
Member
Takser..God, reading the code left me with a personal question :)
When I look at meticulous test suites like test_clip_tokens.py,
I'm curious how you plan them — what you start from and how you flesh the cases out.

Do you work systematically, nailing down the function's core behavior first,
then thinking up input values, then per-option behavior and error handling?
Or do you have a personal routine you usually follow, etc...
Could you share a bit of the flow of your thinking when building these up?
(I have little hands-on experience myself, so I'm curious about your train of thought!)

I always learn a lot from you πŸ™πŸ”₯πŸ”₯πŸ”₯

Author
@TaskerJang TaskerJang May 23, 2025
1. First, look at the function: "what does this do?"

def clip_tokens(text: str, max_tokens: int, add_three_dots=True...)

So it clips tokens. OK, let's start by testing the basics.

2. Checked the existing test file — there were only two tests

def test_clip(self):
    text = "line1\nline2..."
    # Is this all there is?

That felt far too thin, so I reread the Issue.

3. Checked what the Issue asked for

  • "edge cases" → ah, so empty strings, negative numbers, that kind of thing
  • "parameter combinations" → combinations of add_three_dots and delete_last_line
  • "error handling" → exception scenarios

4. In practice, I added them one by one, each time going "ah, this needs a test too"

def test_empty_input_text(self):  # start with the empty string
def test_negative_max_tokens(self):  # what happens with a negative value?

Writing them one at a time, things kept occurring to me — "what about Unicode?", "what about division by zero?" — so I kept adding.

5. Honestly, I had no idea it would end up at 21 tests lol

I figured on about 10 at first, but it kept growing with every "this needs a test too."

The mock tests in particular... I struggled a bit, honestly. I went through several attempts while figuring out how to mock TokenEncoder.

So it wasn't a perfectly laid-out plan — I kept tacking things on as I built, "ah, this too, that too" πŸ˜…
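The pattern I eventually settled on (it shows up all through the tests below) is to patch TokenEncoder.get_token_encoder and hand back a stub whose encode returns a token list of whatever length the test needs — roughly:

with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
    mock_tokenizer = MagicMock()
    mock_tokenizer.encode.return_value = [1] * 20  # pretend the input is 20 tokens
    mock_encoder.return_value = mock_tokenizer
    # inside this block, clip_tokens sees the stubbed token count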

Member
Following along 1->2->3->4->5, move by move like a replay in Hikaru no Go, it made sense..
The "check what the Issue asked for" step matters too,
and in the end the method really was building it up one piece at a time!
(Killer line lol) -> "Honestly, I had no idea it would end up at 21 tests"

Tests that text comes through intact when it shouldn't be clipped, that text is clipped and the suffix appended, the division-by-zero error case, and so on...
None of it came out in one shot — it took drop after drop of deliberation and repeated attempts.
I'm learning a lot from this, thanks! ✍️

(I'll be quietly checking your blog posts even after the program ends..)


from unittest.mock import patch, MagicMock
from pr_agent.algo.utils import clip_tokens
from pr_agent.algo.token_handler import TokenEncoder


class TestClipTokens:
    """Comprehensive test suite for the clip_tokens function."""

    def test_empty_input_text(self):
        """Test that empty input returns empty string."""
        assert clip_tokens("", 10) == ""
        assert clip_tokens(None, 10) is None

    def test_text_under_token_limit(self):
        """Test that text under the token limit is returned unchanged."""
        text = "Short text"
        max_tokens = 100
        result = clip_tokens(text, max_tokens)
        assert result == text

    def test_text_exactly_at_token_limit(self):
        """Test text that is exactly at the token limit."""
        text = "This is exactly at the limit"
        # Mock the token encoder to return exactly the limit
        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = [1] * 10  # Exactly 10 tokens
            mock_encoder.return_value = mock_tokenizer

            result = clip_tokens(text, 10)
            assert result == text

    def test_text_over_token_limit_with_three_dots(self):
        """Test text over token limit with three dots addition."""
        text = "This is a longer text that should be clipped when it exceeds the token limit"
        max_tokens = 5

        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = [1] * 20  # 20 tokens
            mock_encoder.return_value = mock_tokenizer

            result = clip_tokens(text, max_tokens)
            assert result.endswith("\n...(truncated)")
            assert len(result) < len(text)

    def test_text_over_token_limit_without_three_dots(self):
        """Test text over token limit without three dots addition."""
        text = "This is a longer text that should be clipped"
        max_tokens = 5

        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = [1] * 20  # 20 tokens
            mock_encoder.return_value = mock_tokenizer

            result = clip_tokens(text, max_tokens, add_three_dots=False)
            assert not result.endswith("\n...(truncated)")
            assert len(result) < len(text)

    def test_negative_max_tokens(self):
        """Test that negative max_tokens returns empty string."""
        text = "Some text"
        result = clip_tokens(text, -1)
        assert result == ""

        result = clip_tokens(text, -100)
        assert result == ""

    def test_zero_max_tokens(self):
        """Test that zero max_tokens returns empty string."""
        text = "Some text"
        result = clip_tokens(text, 0)
        assert result == ""

    def test_delete_last_line_functionality(self):
        """Test the delete_last_line parameter functionality."""
        text = "Line 1\nLine 2\nLine 3\nLine 4"
        max_tokens = 5

        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = [1] * 20  # 20 tokens
            mock_encoder.return_value = mock_tokenizer

            # Without delete_last_line
            result_normal = clip_tokens(text, max_tokens, delete_last_line=False)

            # With delete_last_line
            result_deleted = clip_tokens(text, max_tokens, delete_last_line=True)

            # The result with delete_last_line should be shorter or equal
            assert len(result_deleted) <= len(result_normal)

    def test_pre_computed_num_input_tokens(self):
        """Test using pre-computed num_input_tokens parameter."""
        text = "This is a test text"
        max_tokens = 10
        num_input_tokens = 15

        # Should not call the encoder when num_input_tokens is provided
        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_encoder.return_value = None  # Should not be called

            result = clip_tokens(text, max_tokens, num_input_tokens=num_input_tokens)
            assert result.endswith("\n...(truncated)")
            mock_encoder.assert_not_called()

    def test_pre_computed_tokens_under_limit(self):
        """Test pre-computed tokens under the limit."""
        text = "Short text"
        max_tokens = 20
        num_input_tokens = 5

        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_encoder.return_value = None  # Should not be called

            result = clip_tokens(text, max_tokens, num_input_tokens=num_input_tokens)
            assert result == text
            mock_encoder.assert_not_called()

    def test_special_characters_and_unicode(self):
        """Test text with special characters and Unicode content."""
        text = "Special chars: @#$%^&*()_+ Ñéíóú δΈ­λ¬Έ πŸš€ emoji"
        max_tokens = 5

        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = [1] * 20  # 20 tokens
            mock_encoder.return_value = mock_tokenizer

            result = clip_tokens(text, max_tokens)
            assert isinstance(result, str)
            assert len(result) < len(text)

    def test_multiline_text_handling(self):
        """Test handling of multiline text."""
        text = "Line 1\nLine 2\nLine 3\nLine 4\nLine 5"
        max_tokens = 5

        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = [1] * 20  # 20 tokens
            mock_encoder.return_value = mock_tokenizer

            result = clip_tokens(text, max_tokens)
            assert isinstance(result, str)

    def test_very_long_text(self):
        """Test with very long text."""
        text = "A" * 10000  # Very long text
        max_tokens = 10

        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = [1] * 5000  # Many tokens
            mock_encoder.return_value = mock_tokenizer

            result = clip_tokens(text, max_tokens)
            assert len(result) < len(text)
            assert result.endswith("\n...(truncated)")

    def test_encoder_exception_handling(self):
        """Test handling of encoder exceptions."""
        text = "Test text"
        max_tokens = 10

        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_encoder.side_effect = Exception("Encoder error")

            # Should return original text when encoder fails
            result = clip_tokens(text, max_tokens)
            assert result == text

    def test_zero_division_scenario(self):
        """Test scenario that could lead to division by zero."""
        text = "Test"
        max_tokens = 10

        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = []  # Empty tokens (could cause division by zero)
            mock_encoder.return_value = mock_tokenizer

            result = clip_tokens(text, max_tokens)
            # Should handle gracefully and return original text
            assert result == text

    def test_various_edge_cases(self):
        """Test various edge cases."""
        # Single character
        assert clip_tokens("A", 1000) == "A"

        # Only whitespace
        text = " \n \t "
        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = [1] * 10
            mock_encoder.return_value = mock_tokenizer

            result = clip_tokens(text, 5)
            assert isinstance(result, str)

        # Text with only newlines
        text = "\n\n\n\n"
        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = [1] * 10
            mock_encoder.return_value = mock_tokenizer

            result = clip_tokens(text, 2, delete_last_line=True)
            assert isinstance(result, str)

    def test_parameter_combinations(self):
        """Test different parameter combinations."""
        text = "Multi\nline\ntext\nfor\ntesting"
        max_tokens = 5

        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = [1] * 20
            mock_encoder.return_value = mock_tokenizer

            # Test all combinations
            combinations = [
                (True, True),    # add_three_dots=True, delete_last_line=True
                (True, False),   # add_three_dots=True, delete_last_line=False
                (False, True),   # add_three_dots=False, delete_last_line=True
                (False, False),  # add_three_dots=False, delete_last_line=False
            ]

            for add_dots, delete_line in combinations:
                result = clip_tokens(text, max_tokens,
                                     add_three_dots=add_dots,
                                     delete_last_line=delete_line)
                assert isinstance(result, str)
                if add_dots and len(result) > 0:
                    assert result.endswith("\n...(truncated)") or result == text

    def test_num_output_chars_zero_scenario(self):
        """Test scenario where num_output_chars becomes zero or negative."""
        text = "Short"
        max_tokens = 1

        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = [1] * 1000  # Many tokens for short text
            mock_encoder.return_value = mock_tokenizer

            result = clip_tokens(text, max_tokens)
            # When num_output_chars is 0 or negative, should return empty string
            assert result == ""

    def test_logging_on_exception(self):
        """Test that exceptions are properly logged."""
        text = "Test text"
        max_tokens = 10

        # Patch the logger at the module level where it's imported
        with patch('pr_agent.algo.utils.get_logger') as mock_logger:
            mock_log_instance = MagicMock()
            mock_logger.return_value = mock_log_instance

            with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
                mock_encoder.side_effect = Exception("Test exception")

                result = clip_tokens(text, max_tokens)

                # Should log the warning
                mock_log_instance.warning.assert_called_once()
                # Should return original text
                assert result == text

    def test_factor_safety_calculation(self):
        """Test that the 0.9 factor (10% reduction) works correctly."""
        text = "Test text that should be reduced by 10 percent for safety"
        max_tokens = 10

        with patch.object(TokenEncoder, 'get_token_encoder') as mock_encoder:
            mock_tokenizer = MagicMock()
            mock_tokenizer.encode.return_value = [1] * 20  # 20 tokens
            mock_encoder.return_value = mock_tokenizer

            result = clip_tokens(text, max_tokens)

            # The result should be shorter due to the 0.9 factor
            # Characters per token = len(text) / 20
            # Expected chars = int(0.9 * (len(text) / 20) * 10)
            expected_chars = int(0.9 * (len(text) / 20) * 10)

            # Result should be around expected_chars length (plus truncation text)
            if result.endswith("\n...(truncated)"):
                actual_content = result[:-len("\n...(truncated)")]
                assert len(actual_content) <= expected_chars + 5  # Some tolerance

    # Test the original basic functionality to ensure backward compatibility
    def test_clip_original_functionality(self):
        """Test original functionality from the existing test."""
        text = "line1\nline2\nline3\nline4\nline5\nline6"
        max_tokens = 25
        result = clip_tokens(text, max_tokens)
@@ -16,4 +305,4 @@ def test_clip(self):
        max_tokens = 10
        result = clip_tokens(text, max_tokens)
        expected_results = 'line1\nline2\nline3\n\n...(truncated)'
        assert result == expected_results
@Kkan9ma Kkan9ma May 25, 2025

A small thing, but please take care of the newline at EOF too.. just leaving a comment :)