Boomm-shakalaka
diff --git a/Dockerfile b/Dockerfile
Lines changed: 20 additions & 11 deletions
diff --git a/README.md b/README.md
Lines changed: 41 additions & 8 deletions
diff --git a/__pycache__/crawler_modules.cpython-39.pyc b/__pycache__/crawler_modules.cpython-39.pyc
100 Bytes
diff --git a/crawler_modules.py b/crawler_modules.py
Lines changed: 15 additions & 5 deletions
diff --git a/requirements.txt b/requirements.txt
Lines changed: 6 additions & 2 deletions
diff --git a/web_pages/__pycache__/chat_page.cpython-39.pyc b/web_pages/__pycache__/chat_page.cpython-39.pyc
-259 Bytes
diff --git a/web_pages/__pycache__/online_chat_page.cpython-39.pyc b/web_pages/__pycache__/online_chat_page.cpython-39.pyc
-59 Bytes
diff --git a/web_pages/__pycache__/pdf_page.cpython-39.pyc b/web_pages/__pycache__/pdf_page.cpython-39.pyc
-69 Bytes
diff --git a/web_pages/__pycache__/url_page.cpython-39.pyc b/web_pages/__pycache__/url_page.cpython-39.pyc
-128 Bytes
diff --git a/web_pages/about_page.py b/web_pages/about_page.py
Lines changed: 2 additions & 2 deletions
@@ -1,25 +1,34 @@
-# Set the base image; Python 3.8 is used here
-FROM python:3.8.19
+# Use Ubuntu 22.04 as the base image
+FROM ubuntu:22.04
 
 # Set the working directory
 WORKDIR /app
 
 # Copy the project files into the container's working directory
 COPY . /app
 
-# Install Python dependencies
-RUN pip install --no-cache-dir -r requirements.txt   
+# Install system dependencies
+RUN apt-get update && \
+    apt-get install -y libgl1-mesa-glx libpython3-dev
 
-# Install Node.js and npm
-RUN curl -fsSL https://deb.nodesource.com/setup_14.x | bash - \
-    && apt-get install -y nodejs \
-    && rm -rf /var/lib/apt/lists/*
+# Install Python 3.9
+RUN apt-get install -y python3.9 
 
-# Install npm dependencies and the Playwright browsers
-RUN npm install && npx playwright install
+# Install pip
+RUN apt-get install -y python3-pip
+
+# Install Python dependencies
+RUN pip3 install --no-cache-dir -r requirements.txt
+
+# Install Playwright and its dependencies
+RUN playwright install --with-deps chromium 
 
 # Expose the port
 EXPOSE 8501
 
+# Set an environment variable that identifies the operating system
+ENV OS_TYPE="linux"
+
 # Run the Streamlit app
-CMD ["streamlit", "run", "web_ui.py", "--server.port", "8501"]
+CMD ["python3", "-m", "streamlit", "run", "web_ui.py", "--server.port", "8501"]
+
@@ -1,4 +1,4 @@
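-### AI Bot Based on Large Language Models (LLMs)
+# AI Bot Based on Large Language Models (LLMs)
 An open-source AI language-model bot that integrates human-machine conversation, information retrieval and generation, and chat over parsed PDFs and URLs. Its main advantage is that it relies entirely on free, open APIs, delivering customized LLM features at minimal cost.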
 
 ## Tools and Platforms
@@ -7,8 +7,6 @@ Langchain, Streamlit, Oracle Cloud, Groq,Google cloud, Baidu Cloud, Docker
 ## Demo Link
 [Link](http://168.138.28.54:8501)
 
-## Demo Link
-[Link](http://168.138.28.54:8501)
 ## File Structure
 <pre>
 .
@@ -39,12 +37,12 @@ Langchain, Streamlit, Oracle Cloud, Groq,Google cloud, Baidu Cloud, Docker
 ├── web_ui.py   # main interface
 </pre>
 
-Below is the optimized Markdown:
 
 ## Features
 
 ### Crawler Module
 
+
 *  This module provides three crawling methods: [Selenium](https://selenium-python.readthedocs.io/), [Playwright](https://playwright.dev/python/docs/intro), and [LangChain-based DuckDuckGo](https://api.python.langchain.com/en/latest/tools/langchain_community.tools.ddg_search.tool.DuckDuckGoSearchResults.html).
 
 *  In testing, Playwright took less than half as long as Selenium:
@@ -53,7 +51,8 @@ Langchain, Streamlit, Oracle Cloud, Groq,Google cloud, Baidu Cloud, Docker
     | selenium_url_crawler   | 27s       |
     | playwright_url_crawler | 11s       |
 
-*  Streamlit conflicts with Playwright's synchronous API, so the asynchronous API should be used. [Reference](https://discuss.streamlit.io/t/using-playwright-with-streamlit/28380/5)
+*  Streamlit conflicts with Playwright's synchronous API, so the asynchronous API should be used.[Reference](https://discuss.streamlit.io/t/using-playwright-with-streamlit/28380/5)
+
 
 ### Chat Module (Online and Offline)
 
@@ -92,6 +91,13 @@ Langchain, Streamlit, Oracle Cloud, Groq,Google cloud, Baidu Cloud, Docker
     3. Retrieve the top_k documents most relevant to the question.
     4. Answer the question based on the document contents.
 
+### PDF Parsing Module
+1. Built on [Streamlit-PDF-API](https://discuss.streamlit.io/t/display-pdf-in-streamlit/62274) and [Langchain-PDFMinerLoader](https://api.python.langchain.com/en/latest/document_loaders/langchain_community.document_loaders.pdf.PDFMinerLoader.html)
+2. Workflow:
+    1. Upload a PDF
+    2. Parse the PDF content; the LLM then summarizes the PDF based on a prompt
+    3. Answer questions based on the question and the PDF content
+
 ## Usage Guide
 
 ### Local Deployment
@@ -125,11 +131,37 @@ Langchain, Streamlit, Oracle Cloud, Groq,Google cloud, Baidu Cloud, Docker
     streamlit run web_ui.py
     ```
 ### Server Deployment
-1. [Docker link](https://hub.docker.com/repository/docker/jiyuanc1/aibot/general)
-2. Server deployment guide: [wiki link](https://github.com/Boomm-shakalaka/AIBot-LLM/wiki/Oracle%E6%9C%8D%E5%8A%A1%E5%99%A8%E6%90%AD%E5%BB%BA%E6%95%99%E7%A8%8B)
+Method 1: Install and run Docker locally in a Linux environment
+* Clone the GitHub repository on the server
+* Build the image
+
+Method 2: Pull and run the image from Docker Hub
+* [Docker Hub link](https://hub.docker.com/repository/docker/jiyuanc1/aibot/general)
+
+Deployment Guide
+* Server deployment guide: [wiki link](https://github.com/Boomm-shakalaka/AIBot-LLM/wiki/Oracle%E6%9C%8D%E5%8A%A1%E5%99%A8%E6%90%AD%E5%BB%BA%E6%95%99%E7%A8%8B)
+
+## Known Issues When Building the Docker Image
+1. Installing Google-genai fails during the build; the cause has not been identified
+    ```bash
+    ERROR: Could not find a version that satisfies the requirement langchain-google-genai (from -r requirements.txt (line 11)) (from >versions: none) 
+    ERROR: No matching distribution found for langchain-google-genai (from -r requirements.txt (line 11))
+    ```
+2. The asyncio event loop to use differs between Windows and Linux [Reference](https://stackoverflow.com/questions/67964463/what-are-selectoreventloop-and-proactoreventloop-in-python-asyncio)
+    ```python
+    if sys.platform == "win32":
+        loop = asyncio.ProactorEventLoop()  # Windows
+    else:
+        loop = asyncio.SelectorEventLoop()  # Linux
+    ```
+3. Playwright cannot be packaged directly into Docker; it needs an Ubuntu-based image. [Reference](https://stackoverflow.com/questions/72181737/issue-running-playwright-python-in-docker-container)
+
 
 ## Changelog
-v1.0.0 (oracle cloud)
+v1.0.1 (oracle)
+1. Fixed the Docker image build issues and the OS-specific async event loop handling
+
+v1.0.0 
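 1. Improved the résumé evaluation feature in PDF chat and added conversation support
 2. Added the Playwright crawler module and improved the async calls
 3. Added crawler module invocation and source-retrieval selection to URL chat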
@@ -139,6 +171,7 @@ v1.0.0 (oracle cloud)
 7. Consolidated the prompt configuration
 8. Polished the page styling
 9. Added the about page
+10. Updated the Dockerfile
 
 v0.0.5
 1. Added Baidu Qianfan models (ERNIE-Lite-8K and ERNIE-Speed-128K, both free to use)
 
@@ -1,5 +1,6 @@
 import asyncio
 import re
+import sys
 from bs4 import BeautifulSoup
 import requests
 from langchain_core.documents import Document
@@ -169,10 +170,13 @@ async def playwright_crawler_async(url):
 
 def selenium_url_crawler(url):
     options = Options()
-    options.add_argument('--headless')
+    options.add_argument("--headless")  # Run Chrome in headless mode
+    options.add_argument("--no-sandbox")  # Bypass OS security model
+    options.add_argument("--disable-dev-shm-usage")  # Overcome limited resource problems
     # options.add_argument('--window-size=1920x1080')
 
-    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
+    # driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
+    driver = webdriver.Chrome(options=options)
     driver.get(url)
     # time.sleep(2) 
 
@@ -241,16 +245,22 @@ def duckduck_search(question):
     # print(data_playwright)
 
     '''playwright_async'''
-    # loop = asyncio.ProactorEventLoop()
+    # if sys.platform == "win32":
+    #     loop = asyncio.ProactorEventLoop()
+    # else:
+    #     loop = asyncio.SelectorEventLoop()
     # data_playwright_async = loop.run_until_complete(playwright_crawler_async('https://www.google.com/search?q=墨尔本天气'))
     # print(data_playwright_async)
 
     '''google_search_sync'''
     # data_sync = google_search_sync(question)
     # print(data_sync)
 
-    '''google_search_async'''
-    loop = asyncio.ProactorEventLoop()
+    '''google_search_async'''                
+    if sys.platform == "win32":
+        loop = asyncio.ProactorEventLoop()
+    else:
+        loop = asyncio.SelectorEventLoop()
     data_async = loop.run_until_complete(google_search_async(question))
     print(data_async)
 
 
@@ -8,6 +8,10 @@ BeautifulSoup4
 langchain_cohere
 chromadb
 duckduckgo-search
-langchain-google-genai
+qianfan
+asyncio
+webdriver-manager
+# langchain-google-genai
 pdfminer.six
-selenium
+selenium
+playwright
@@ -40,9 +40,9 @@ def about_page():
         * [Bootstrap website](https://getbootstrap.com/)
 
         Author: Boomm-shakalaka  
-        Version: 1.0  
+        Version: 1.1  
         GitHub repository: [AIBot-LLM](https://github.com/Boomm-shakalaka/AIBot-LLM)  
-        'Report a bug': [Github Issues](https://github.com/Boomm-shakalaka/AIBot-LLM/issues)
+        Report a bug: [Github Issues](https://github.com/Boomm-shakalaka/AIBot-LLM/issues)
         """
     )
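
Note on the event-loop change in `crawler_modules.py`: the platform check could be factored into a small helper so that every call site picks the right loop and also closes it afterwards. The sketch below is only an illustration under stated assumptions; `make_event_loop`, `run_async`, and `fake_search` are hypothetical names that do not appear in this commit, and `fake_search` merely stands in for a coroutine such as `google_search_async`.

```python
import asyncio
import sys


def make_event_loop() -> asyncio.AbstractEventLoop:
    """Create a new event loop suited to the current platform (hypothetical helper)."""
    if sys.platform == "win32":
        # ProactorEventLoop only exists on Windows, so it is referenced only here.
        return asyncio.ProactorEventLoop()
    return asyncio.SelectorEventLoop()


def run_async(coro):
    """Run a coroutine to completion on a fresh loop, then close the loop."""
    loop = make_event_loop()
    try:
        return loop.run_until_complete(coro)
    finally:
        loop.close()


async def fake_search(question: str) -> str:
    """Placeholder coroutine standing in for a real async crawler call."""
    await asyncio.sleep(0)
    return f"results for {question}"


if __name__ == "__main__":
    print(run_async(fake_search("weather")))
```

Since Python 3.8, `asyncio.new_event_loop()` already returns a proactor loop on Windows and a selector-based loop on Unix, so it may be a simpler alternative to the explicit check if no other loop type is needed.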