Commit 918ec00

Merge pull request #569 from huggingface/inference-client-update
Updated InferenceClient calls
2 parents f3b43f0 + cd531d3 commit 918ec00

File tree

6 files changed: +55 −78 lines changed

units/en/unit1/dummy-agent-library.mdx

Lines changed: 50 additions & 73 deletions
@@ -12,7 +12,7 @@ You probably wouldn't use these in production, but they will serve as a good **s

After this section, you'll be ready to **create a simple Agent** using `smolagents`

-And in the following Units we will also use other AI Agent libraries like `LangGraph`, `LangChain`, and `LlamaIndex`.
+And in the following Units we will also use other AI Agent libraries like `LangGraph` and `LlamaIndex`.

To keep things simple we will use a simple Python function as a Tool and Agent.

@@ -29,45 +29,13 @@ import os
from huggingface_hub import InferenceClient

## You need a token from https://hf.co/settings/tokens, ensure that you select 'read' as the token type. If you run this on Google Colab, you can set it up in the "settings" tab under "secrets". Make sure to call it "HF_TOKEN"
-os.environ["HF_TOKEN"]="hf_xxxxxxxxxxxxxx"
+# HF_TOKEN = os.environ.get("HF_TOKEN")

-client = InferenceClient(provider="hf-inference", model="meta-llama/Llama-3.3-70B-Instruct")
-# if the outputs for next cells are wrong, the free model may be overloaded. You can also use this public endpoint that contains Llama-3.2-3B-Instruct
-# client = InferenceClient("https://jc26mwg228mkj8dw.us-east-1.aws.endpoints.huggingface.cloud")
+client = InferenceClient(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
```

-```python
-output = client.text_generation(
-    "The capital of France is",
-    max_new_tokens=100,
-)
-
-print(output)
-```
-output:
-```
-Paris. The capital of France is Paris. Paris, the City of Light, is known for its stunning architecture, art museums, fashion, and romantic atmosphere. It's a must-visit destination for anyone interested in history, culture, and beauty. The Eiffel Tower, the Louvre Museum, and Notre-Dame Cathedral are just a few of the many iconic landmarks that make Paris a unique and unforgettable experience. Whether you're interested in exploring the city's charming neighborhoods, enjoying the local cuisine.
-```
-As seen in the LLM section, if we just do decoding, **the model will only stop when it predicts an EOS token**, and this does not happen here because this is a conversational (chat) model and **we didn't apply the chat template it expects**.
-
-If we now add the special tokens related to the <a href="https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct">Llama-3.3-70B-Instruct model</a> that we're using, the behavior changes and it now produces the expected EOS.
+We use the `chat` method since it is a convenient and reliable way to apply chat templates:

-```python
-prompt="""<|begin_of_text|><|start_header_id|>user<|end_header_id|>
-The capital of France is<|eot_id|><|start_header_id|>assistant<|end_header_id|>"""
-output = client.text_generation(
-    prompt,
-    max_new_tokens=100,
-)
-
-print(output)
-```
-output:
-```
-The capital of France is Paris.
-```
-
-Using the "chat" method is a much more convenient and reliable way to apply chat templates:
```python
output = client.chat.completions.create(
    messages=[

@@ -78,11 +46,14 @@ output = client.chat.completions.create(
)
print(output.choices[0].message.content)
```
+
output:
+
```
-The capital of France is Paris.
+Paris.
```
-The chat method is the RECOMMENDED method to use in order to ensure a smooth transition between models, but since this notebook is only educational, we will keep using the "text_generation" method to understand the details.
+
+The chat method is the RECOMMENDED method to use in order to ensure a smooth transition between models.

## Dummy Agent

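Editor's note: taken together, the call pattern this PR standardizes on looks like the minimal, self-contained sketch below. It assumes `huggingface_hub` is installed and a valid `HF_TOKEN` is available in the environment; the model ID is the one introduced by the diff.

```python
import os

from huggingface_hub import InferenceClient

# InferenceClient reads HF_TOKEN from the environment by default;
# passing it explicitly here just makes that assumption visible.
client = InferenceClient(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    token=os.environ.get("HF_TOKEN"),
)

output = client.chat.completions.create(
    messages=[{"role": "user", "content": "The capital of France is"}],
    max_tokens=100,
)
print(output.choices[0].message.content)  # e.g. "Paris."
```

Because `chat.completions.create` applies the model's chat template itself, no hand-written `<|begin_of_text|>`-style special tokens are needed, which is exactly what the removed `text_generation` examples were doing manually.
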
@@ -133,29 +104,19 @@ Final Answer: the final answer to the original input question
Now begin! Reminder to ALWAYS use the exact characters `Final Answer:` when you provide a definitive answer. """
```

-Since we are running the "text_generation" method, we need to apply the prompt manually:
-```python
-prompt=f"""<|begin_of_text|><|start_header_id|>system<|end_header_id|>
-{SYSTEM_PROMPT}
-<|eot_id|><|start_header_id|>user<|end_header_id|>
-What's the weather in London ?
-<|eot_id|><|start_header_id|>assistant<|end_header_id|>
-"""
-```
+We need to append the user instruction after the system prompt. This happens inside the `chat` method. We can see this process below:

-We can also do it like this, which is what happens inside the `chat` method :
```python
-messages=[
+messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
-    {"role": "user", "content": "What's the weather in London ?"},
-]
-from transformers import AutoTokenizer
-tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.3-70B-Instruct")
+    {"role": "user", "content": "What's the weather in London?"},
+]

-tokenizer.apply_chat_template(messages, tokenize=False,add_generation_prompt=True)
+print(messages)
```

-The prompt now is :
+The prompt now is:
+
```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Answer the following questions as best you can. You have access to the following tools:
@@ -196,15 +157,17 @@ What's the weather in London ?
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```

-Let's decode!
+Let's call the `chat` method!
+
```python
-output = client.text_generation(
-    prompt,
-    max_new_tokens=200,
+output = client.chat.completions.create(
+    messages=messages,
+    stream=False,
+    max_tokens=200,
)
-
-print(output)
+print(output.choices[0].message.content)
```
+
output:

````
@@ -222,19 +185,22 @@ Final Answer: The current weather in London is partly cloudy with a temperature
````

Do you see the issue?
+
> At this point, the model is hallucinating, because it's producing a fabricated "Observation" -- a response that it generates on its own rather than being the result of an actual function or tool call.
> To prevent this, we stop generating right before "Observation:".
> This allows us to manually run the function (e.g., `get_weather`) and then insert the real output as the Observation.

```python
-output = client.text_generation(
-    prompt,
-    max_new_tokens=200,
+# The answer was hallucinated by the model. We need to stop to actually execute the function!
+output = client.chat.completions.create(
+    messages=messages,
+    max_tokens=150,
    stop=["Observation:"] # Let's stop before any actual function is called
)

-print(output)
+print(output.choices[0].message.content)
```
+
output:

````
@@ -249,8 +215,9 @@ Action:
Observation:
````

-Much Better!
-Let's now create a dummy get weather function. In a real situation, you would likely call an API.
+Much Better!
+
+Let's now create a **dummy get weather function**. In a real situation you could call an API.

```python
# Dummy function
@@ -259,23 +226,33 @@ def get_weather(location):

get_weather('London')
```
+
output:
+
```
'the weather in London is sunny with low temperatures. \n'
```

-Let's concatenate the base prompt, the completion until function execution and the result of the function as an Observation and resume generation.
+Let's concatenate the system prompt, the base prompt, the completion until function execution and the result of the function as an Observation and resume generation.

```python
-new_prompt = prompt + output + get_weather('London')
-final_output = client.text_generation(
-    new_prompt,
-    max_new_tokens=200,
+messages=[
+    {"role": "system", "content": SYSTEM_PROMPT},
+    {"role": "user", "content": "What's the weather in London ?"},
+    {"role": "assistant", "content": output.choices[0].message.content + get_weather('London')},
+]
+
+output = client.chat.completions.create(
+    messages=messages,
+    stream=False,
+    max_tokens=200,
)

-print(final_output)
+print(output.choices[0].message.content)
```
+
Here is the new prompt:
+
```text
<|begin_of_text|><|start_header_id|>system<|end_header_id|>
Answer the following questions as best you can. You have access to the following tools:
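
Editor's note: for reviewers who want the end-to-end shape of the flow this file now teaches (generate until `Observation:`, run the real tool, splice its output back in, resume), here is an illustrative sketch. It reuses `client`, `SYSTEM_PROMPT`, and `get_weather` as defined in the updated file; it is not itself part of the diff.

```python
# Illustrative reconstruction of the stop-and-resume tool cycle taught by
# the updated page. Assumes `client`, `SYSTEM_PROMPT`, and `get_weather`
# are defined as in units/en/unit1/dummy-agent-library.mdx.
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": "What's the weather in London ?"},
]

# 1. Generate, stopping before the model can fabricate an Observation.
first = client.chat.completions.create(
    messages=messages,
    max_tokens=150,
    stop=["Observation:"],
)
thought_and_action = first.choices[0].message.content

# 2. Run the tool for real and splice its output in as the Observation.
messages.append(
    {"role": "assistant", "content": thought_and_action + get_weather("London")}
)

# 3. Resume generation so the model can produce the Final Answer.
final = client.chat.completions.create(messages=messages, max_tokens=200)
print(final.choices[0].message.content)
```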

units/es/unit1/dummy-agent-library.mdx

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Probablemente no usarías estos en producción, pero servirán como un buen **pu

Después de esta sección, estarás listo para **crear un Agente simple** usando `smolagents`

-Y en las siguientes Unidades también utilizaremos otras bibliotecas de Agentes de IA como `LangGraph`, `LangChain` y `LlamaIndex`.
+Y en las siguientes Unidades también utilizaremos otras bibliotecas de Agentes de IA como `LangGraph` y `LlamaIndex`.

Para mantener las cosas simples, utilizaremos una función simple de Python como Herramienta y Agente.

units/ko/unit1/dummy-agent-library.mdx

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@

이 섹션을 마치면 `smolagents`를 사용하여 **간단한 에이전트를 만들** 준비가 될 것입니다.

-이어지는 Unit에서는 `LangGraph`, `LangChain`, `LlamaIndex`와 같은 다른 AI 에이전트 라이브러리도 사용해 볼 예정입니다.
+이어지는 Unit에서는 `LangGraph`, `LlamaIndex`와 같은 다른 AI 에이전트 라이브러리도 사용해 볼 예정입니다.

간단하게 하기 위해 도구와 에이전트로 단순한 Python 함수를 사용할 것입니다.

units/ru-RU/unit1/dummy-agent-library.mdx

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@

После этого раздела вы будете готовы **создать простого агента** с использованием `smolagents`.

-В следующих разделах мы также будем использовать другие библиотеки AI Агентов, такие как `LangGraph`, `LangChain` и `LlamaIndex`.
+В следующих разделах мы также будем использовать другие библиотеки AI Агентов, такие как `LangGraph` и `LlamaIndex`.

Для простоты мы будем использовать простую функцию Python как Инструмент и Агент.

units/vi/unit1/dummy-agent-library.mdx

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@ Những công cụ này có thể không dùng cho production, nhưng sẽ là *

Sau phần này, bạn sẽ sẵn sàng **tạo Agent đơn giản** bằng `smolagents`.

-Ở các chương tiếp theo, ta cũng sẽ dùng các thư viện AI agent khác như `LangGraph`, `LangChain` và `LlamaIndex`.
+Ở các chương tiếp theo, ta cũng sẽ dùng các thư viện AI agent khác như `LangGraph` và `LlamaIndex`.

Để đơn giản hóa, ta sẽ dùng hàm Python cơ bản làm Tool và Agent.

units/zh-CN/unit1/dummy-agent-library.mdx

Lines changed: 1 addition & 1 deletion
@@ -12,7 +12,7 @@

在本节之后,你将准备好**使用 `smolagents` 创建一个简单的智能体**

-在接下来的单元中,我们还将使用其他 AI 智能体库,如 `LangGraph`、`LangChain` 和 `LlamaIndex`
+在接下来的单元中,我们还将使用其他 AI 智能体库,如 `LangGraph` 和 `LlamaIndex`

为了保持简单,我们将使用一个简单的 Python 函数作为工具和智能体。
