
Commit 6dafebe

Update the README and add English version
1 parent 3205d24 commit 6dafebe

File tree

4 files changed: +238 −62 lines

.github/workflows/nuitka.yml

Lines changed: 2 additions & 0 deletions
@@ -25,6 +25,7 @@ jobs:
       run: |
         python -m pip install --upgrade pip
         pip install -r requirements.txt
+        pip install imageio

     - name: Install Nuitka
       run: pip install nuitka
@@ -57,6 +58,7 @@ jobs:
       run: |
         python -m pip install --upgrade pip
         pip install -r requirements.txt
+        pip install imageio

     - name: Install Nuitka
       run: pip install nuitka

README.adoc

Lines changed: 71 additions & 62 deletions
@@ -1,25 +1,25 @@
 = Pubmedsoso =
 :toc:

-image:assets/icon.png[Pubmedsoso]
-
-一个自动批量提取pubmed文献信息和下载免费文献的小工具
+*Language*: link:README.adoc[English] | link:README_CN.adoc[简体中文]

-== 主要功能 ==
+image:assets/icon.png[Pubmedsoso]

-自己写的基于aiohttp pandas和xpath 的pubmed文献信息爬取和下载的工具,按照你给定的参数,按你的要求获取相关的文献信息,并且下载对应文献的PDF原文
+A small tool for automatically extracting PubMed literature information and downloading free literature in bulk.

-下载速度大概是1s一篇,同时能够提取文献的大部分信息,并自动生成excel文件,包括文献标题,摘要,关键词,作者名单,作者单位,是否免费,是不是review类型等信息。
+== Features ==
+A tool for crawling and downloading PubMed literature information, built on `aiohttp`, `pandas`, and `xpath`. It retrieves the relevant literature information according to the parameters you provide and downloads the corresponding PDF originals.

-自动下载后,会将部分信息储存在本地的文本文件中,供参考,检索数据会储存在sqlite3数据库中,最后执行完成后,自动导出所有信息,生成一个Excel文件。
+The download speed is roughly one second per article. The tool extracts most of the literature metadata and automatically generates an Excel file, including the title, abstract, keywords, author list, author affiliations, whether the article is free, and whether it is a review.

-== 依赖模块 ==
+After downloading, some information is stored in local text files for reference, and the search data is stored in an `sqlite3` database. When execution finishes, all information is automatically exported to an Excel file.
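
To make the xpath-based extraction step described above concrete, here is a minimal, hypothetical sketch of how a result page could be parsed with `lxml`; the selectors and field names are assumptions for illustration, not the project's actual parsing code.

[source, python]
----
# Hypothetical sketch of xpath-based extraction with lxml; the real parser
# in this project may use different selectors and field names.
from lxml import etree

def parse_search_page(html: str) -> list[dict]:
    tree = etree.HTML(html)
    articles = []
    # One result block per article; the selector below is an assumption.
    for node in tree.xpath('//article[contains(@class, "full-docsum")]'):
        title = "".join(node.xpath('.//a[@class="docsum-title"]//text()')).strip()
        pmid = node.xpath('.//span[@class="docsum-pmid"]/text()')
        articles.append({"title": title, "pmid": pmid[0] if pmid else None})
    return articles
----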

-* 基于python3.9,主要使用了pandas, xpath, asyncio, aiohttp
+== Dependencies ==
+_Based on Python 3.9; higher versions are also supported. Mainly uses `pandas`, `xpath`, `asyncio`, and `aiohttp`._

-* 项目页面右边的Release当中有Nuitka打包成的exe执行文件,自带所有依赖,可以直接下载后在windows下命令行执行使用。
+The Releases section of the project page provides an executable packaged with `Nuitka` that bundles all dependencies; it can be downloaded and run directly from the Windows command line.

-* 如果需要自己在python环境运行请根据requirements.txt文件安装对应的依赖
+_If you want to run it in your own Python environment, install the dependencies listed in the `requirements.txt` file._

 [source, bash, indent=2]
 ----
@@ -31,9 +31,8 @@ pandas~=2.2.3
 openpyxl~=3.1.5
 ----

-== 使用方法 ==
-
-. 克隆项目,在命令行环境下安装项目所需的依赖文件
+== Usage ==
+. Clone the project and install the required dependencies from the command line. Using a Python virtual environment tool such as `anaconda` or `miniconda3` is recommended.

 [source, bash]
 ----
@@ -42,96 +41,106 @@ cd Pubmedsoso
 pip install -r requirements.txt
 ----

-当然,不方便安装git工具的话,直接直接下载Release当中的ZIP解压执行
+If it is inconvenient to install the `git` tool, you can download the ZIP archive from the Releases section on the right side of the page and unzip it before running.
+
 image:assets/pubmed_release.png[Pubmedsoso, 600]

-. 在Windows terminal中切换到项目文件夹,执行 `python main.py` 带上关键词参数 或者在直接执行exe可执行文件 `pubmedsoso.exe` + 关键词
+. Switch to the project folder in the terminal and run `python main.py` with keyword parameters, or run the executable `pubmedsoso.exe` + keywords directly. For example:

-比如
-[souce, bash]
+[source, bash]
 ----
 python main.py headache -n 5 -d 10
 ----

-`headache` 参数是此次运行输入的检索关键词
-
-`-n` 参数后面数字指的是需要检索的页数, 每页会有50篇
+`headache` is the search keyword for this run. If your keyword contains spaces, enclose the whole keyword in double quotes. PubMed advanced queries such as "headache AND toothache", with logical operators like AND, NOT, and OR, are supported.

-`-d` 参数后面表示需要下载的文献的份数
+The number after the `-n` parameter specifies how many result pages to search; each page holds 50 articles.

-`-y` 可选参数,后面表示你想检索的信息的年份范围,以年为单位,比如 -y 5表示近五年的文献
+The number after the `-d` parameter specifies how many articles to download.

-* 输入页数时,每页会检索50个,数字不用设置得太大,否则需要较长时间执行
+The optional `-y` parameter specifies the year range to search, in years; for example, -y 5 means literature from the last five years.

-* 然后输入需要下载的文献数量,程序会从搜索结果中找到free pmc 免费文献,自动下载,这里下载速度取决你的网络状况。
+When entering the number of pages, note that each page holds 50 search results, so the number does not need to be large; otherwise execution will take a long time.

-* 每个文献下载超过30s自动超时跳过,下载下一个。
+Then enter the number of articles to download. The program finds the free PMC articles among the search results and downloads them automatically; the download speed depends on your network condition. Any single download that exceeds 30 seconds times out and is skipped, and the next one starts.
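
The timeout-and-skip behaviour described above can be sketched with `aiohttp` roughly as follows. This is illustrative only; the project's actual downloader lives in its `WebHelper` module and differs in detail.

[source, python]
----
import asyncio
import aiohttp

async def download_pdf(session: aiohttp.ClientSession, url: str, path: str) -> bool:
    """Download one PDF; give up after 30 seconds so the caller can move on."""
    try:
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
            resp.raise_for_status()
            with open(path, "wb") as f:
                f.write(await resp.read())
        return True
    except (asyncio.TimeoutError, aiohttp.ClientError):
        return False  # timed out or failed: skip and continue with the next article
----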

 [source, bash]
 ----
 PS > python main.py --help
-
 usage: python main.py keyword
-pubmedsoso is a python program for crawler article information and download pdf file
+pubmedsoso is a python program for crawling article information and downloading pdf file

 positional arguments:
-  keyword            specify the keywords to search pubmed For example "headache"
+  keyword            specify the keywords to search pubmed e.g. "headache"

 optional arguments:
-  --help, -h         show this help message and exit
-  --version, -v      use --version to show the version
-  --page_num, -n     add --number or -n to specify the page number to
-  --year -y          add --year or -y to specify year scale you would to
-  --download_num, -d add --download_num or -d to specify the number to download
-
+  --help, -h         show this help message and exit
+  --version, -v      use --version to show the version
+  --pagenum, -n      add --pagenum or -n to specify the page number to search
+  --year, -y         add --year or -y to specify the year scale you would like
+  --downloadnum, -d  a digit number to specify the number to download
+  --directory, -D    a valid directory path to specify the pdf save directory
 ----
89-
_如果你熟悉IDE的话,可以在pycharm或者vscode等python环境下运行main.py_
85+
_If you are familiar with IDEs, you can run `main.py` in Python environments such as `pycharm` or `vscode`._
9086

91-
. 根据提示输入 `y` 或者 `n` 决定是否以给定的参数执行程序
87+
. According to the prompts, input `y` or `n` to decide whether to execute the program with the given parameters.
9288

9389
image:assets/pubmedsoso_teminal.png[comfirm picture, 600]
9490

95-
**pubmedsoso会按照你正常搜索的顺序进行爬取下载**
91+
**pubmedsoso will crawl and download according to the normal search order.**
9692

9793
image:assets/pic_keyword.png[Pubmedsoso, 600]
9894

99-
. 文献会自动下载到之前说的"document/pub/"下,同时会生成原始遍历信息的txt文件,程序最终执行完成会生成excel文件。
95+
. The literature will be automatically downloaded to the "document/pub/" folder mentioned earlier, and a txt file with the original traversal information will be generated. The program will finally generate an Excel file after execution.
10096

10197
image::assets/pic_result.png[Pubmedsoso, 600]
10298

103-
WARNING::请勿过分爬取pubmed网站
99+
WARNING:: Please do not excessively crawl the PubMed website. Since this project uses asynchronous mechanisms, it has high concurrency capabilities. Parameters related to access speed can be set in `config.py`, and the default values are not too large.
104100

105-
== ExcelHelper 模块 ==
106-
107-
这个是方便大家在爬取之后,将历史信息导出到excel的模块,可以单独执行。比如在IDE或者命令行中执行 `python ExcelHelper.py`
101+
== ExcelHelper Module ==
102+
This module is convenient for exporting historical information to Excel after crawling. It can be executed separately, such as in an IDE or command line, by executing `python ExcelHelper.py`.
108103

109104
image::assets/pic_save.png[Pubmedsoso]
110105

111-
出现如上提示,可以选择sqlite3数据中的历史记录进行导出,会自动在本地生成一个导出的文件。**不能有重复命名的excel文件,需要按提示删除**
106+
When the above prompt appears, you can choose to export historical records from the `sqlite3` data and an exported file will be automatically generated locally. **Duplicate-named Excel files are not allowed and need to be deleted as prompted.**
112107

 == TO DO List ==
+* [ ] Precise search and download, this is still a bit difficult
+
+* [x] Custom keyword download, waiting for me to figure out the PubMed search parameter URL generation rules (already implemented)
+* [ ] Automatic completion of non-free literature downloads via SciHub, perhaps letting users write adapters themselves
+* [ ] A usable GUI interface
+* [ ] Ideally, a free Baidu translation plugin, which might come in handy
+* [x] Refactor the project using OOP and more modern tools
+* [x] Refactor the code using asynchronous methods to improve execution efficiency
+* [ ] A passable logging system, which is probably still needed
+* [ ] An active literature subscription feature based on an email subscribe-and-push mechanism, pushing the latest literature to users
+
+== Debugging Guide ==
+Because of the peculiarities of the `asyncio` module, some special issues can occur when debugging on Windows.
+
+If you need to develop and debug the code, two places need to be modified.
+
+In `GetSearchResult.py`:
+
+[source, python]
+----
+try:
+    if platform.system() == "Windows":
+        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
+    html_list = asyncio.run(WebHelper.getSearchHtmlAsync(param_list))
+----

-* [ ] 精确地搜索下载,这个还有点难
-* [x] 自定义关键词下载,等我有空弄明白pubmed的检索参数url生成规则就行(已经实现)
-* [ ] 对非免费文献的scihub自动补全下载,或许可以让用户写adapter自己实现
-* [ ] 能用的gui界面
-* [ ] 最好附带一个免费的百度翻译插件,有时候大家可能用得上
-* [x] 采用OOP和更加现代化的工具重构项目
-* [x] 使用异步的方式重构代码,提高执行的效率
-* [ ] 可能还需要一个堪用的日志系统
-* [ ] 可以做一个基于邮件的订阅-主动推送机制的主动的文献订阅功能,为用户推送最新文献
+If you debug on Windows, comment out the conditional statement above; otherwise it takes effect during debugging and causes errors.

-'''
+Additionally, `asyncio.run()` is used in several places in the project; the debug parameter needs to be enabled while debugging, otherwise the run gets stuck and reports a `TypeError: 'Task' object is not callable` error.
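
Both debugging changes can be summarized in one sketch; this is illustrative, to be applied at the places the text above names.

[source, python]
----
import asyncio
import platform

# 1. When debugging on Windows, comment out the selector-policy switch:
# if platform.system() == "Windows":
#     asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

async def main():
    ...  # the coroutine under test

# 2. Enable asyncio's debug mode at every asyncio.run() call site:
asyncio.run(main(), debug=True)
----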

-2022.5.16 更新
-更新了自动创建document/pub文件夹功能,不需要手动创建文件夹了,会自动检查和创建。
+== Update Log ==
+2022.5.16 Added automatic creation of the `document/pub` folder; there is no need to create it manually, it is checked and created automatically.

-2023.08.05 更新
-更新修复了abstract爬取失败的bug,同时不再需要用户手动复制粘贴网页的参数了。
+2023.08.05 Fixed the bug where abstract crawling failed; users no longer need to manually copy and paste web page parameters.

-2024.11.23 更新
-作者竟然想起了这个黑历史一般的项目,偷偷更新一下,“这TM是我写的代码? 怎么这么烂"
+2024.11.23 The author somehow remembered this embarrassing old project and quietly updated it: "Is this really the code I wrote? How can it be this bad?"

-2024.12.02 更新
-已经用基于OOP和asyncio异步机制重构了整个代码,去掉了运行速度的限制,速度大概是原来的100倍
+2024.12.02 Refactored the entire codebase based on OOP, `xpath`, and `asyncio`; removed the runtime speed limit, making it roughly 100 times faster. "Writing this was so exhausting."

README_CN.adoc

Lines changed: 165 additions & 0 deletions
@@ -0,0 +1,165 @@
+= Pubmedsoso =
+:toc:
+
+*Language*: link:README.adoc[English] | link:README_CN.adoc[简体中文]
+
+image:assets/icon.png[Pubmedsoso]
+
+一个自动批量提取pubmed文献信息和下载免费文献的小工具
+
+== 主要功能 ==
+
+自己写的基于aiohttp pandas和xpath 的pubmed文献信息爬取和下载的工具,按照你给定的参数,按你的要求获取相关的文献信息,并且下载对应文献的PDF原文
+
+下载速度大概是1s一篇,同时能够提取文献的大部分信息,并自动生成excel文件,包括文献标题,摘要,关键词,作者名单,作者单位,是否免费,是不是review类型等信息。
+
+自动下载后,会将部分信息储存在本地的文本文件中,供参考,检索数据会储存在sqlite3数据库中,最后执行完成后,自动导出所有信息,生成一个Excel文件。
+
+== 依赖模块 ==
+
+* 基于python3.9,也支持更高版本,主要使用了pandas, xpath, asyncio, aiohttp
+
+* 项目页面右边的Release当中有Nuitka打包成的exe执行文件,自带所有依赖,可以直接下载后在windows下命令行执行使用。
+
+* 如果需要自己在python环境运行请根据 `requirements.txt` 文件安装对应的依赖
+
+[source, bash, indent=2]
+----
+asyncio~=3.4.3
+aiohttp~=3.11.8
+requests~=2.32.3
+lxml~=5.3.0
+pandas~=2.2.3
+openpyxl~=3.1.5
+----
+
+== 使用方法 ==
+
+. 克隆项目,在命令行环境下安装项目所需的依赖文件
+
+_建议使用anaconda或者miniconda3等python虚拟环境工具_
+
+[source, bash]
+----
+git clone https://github.com/hiddenblue/Pubmedsoso.git
+cd Pubmedsoso
+pip install -r requirements.txt
+----
+
+不方便安装git工具时,直接下载页面右边Release当中的ZIP解压执行
+
+image:assets/pubmed_release.png[Pubmedsoso, 600]
+
+. 在terminal中切换到项目文件夹,执行 `python main.py` 带上关键词参数,或者直接执行exe可执行文件 `pubmedsoso.exe` + 关键词
+
+比如
+[source, bash]
+----
+python main.py headache -n 5 -d 10
+----
+
+`headache` 是此次运行输入的检索关键词(keyword)
+
+如果你使用的关键字当中包含空格,请使用双引号将关键词整体围起来。支持pubmed advance query box,比如 "headache AND toothache" 这样的包含AND NOT OR等逻辑表达的query
+
+`-n` 参数后面数字指的是需要检索的页数,每页会有50篇
+
+`-d` 参数后面表示需要下载的文献的份数
+
+`-y` 可选参数,后面表示你想检索的信息的年份范围,以年为单位,比如 -y 5表示近五年的文献
+
+* 输入页数时,每页会包含50个检索结果,数字不用设置得太大,否则需要较长时间执行
+
+* 然后输入需要下载的文献数量,程序会从搜索结果中找到free pmc 免费文献,自动下载,这里下载速度取决于你的网络状况。
+
+* 每个文献下载超过30s自动超时跳过,下载下一个。
+
+[source, bash]
+----
+PS > python main.py --help
+
+usage: python main.py keyword
+pubmedsoso is a python program for crawling article information and downloading pdf file
+
+positional arguments:
+  keyword            specify the keywords to search pubmed e.g. "headache"
+
+optional arguments:
+  --help, -h         show this help message and exit
+  --version, -v      use --version to show the version
+  --pagenum, -n      add --pagenum or -n to specify the page number to search
+  --year, -y         add --year or -y to specify the year scale you would like
+  --downloadnum, -d  a digit number to specify the number to download
+  --directory, -D    a valid directory path to specify the pdf save directory
+----
+
+_如果你熟悉IDE的话,可以在pycharm或者vscode等python环境下运行main.py_
+
+. 根据提示输入 `y` 或者 `n` 决定是否以给定的参数执行程序
+
+image:assets/pubmedsoso_teminal.png[comfirm picture, 600]
+
+**pubmedsoso会按照你正常搜索的顺序进行爬取下载**
+
+image:assets/pic_keyword.png[Pubmedsoso, 600]
+
+. 文献会自动下载到之前说的"document/pub/"下,同时会生成原始遍历信息的txt文件,程序最终执行完成会生成excel文件。
+
+image::assets/pic_result.png[Pubmedsoso, 600]
+
+WARNING:: 请勿过分爬取pubmed网站
+
+因为本项目使用异步机制,具有很高的并发能力,访问速度等相关参数可以在 `config.py` 当中设置,默认数值不算太大。
+
+== ExcelHelper 模块 ==
+
+这个是方便大家在爬取之后,将历史信息导出到excel的模块,可以单独执行。比如在IDE或者命令行中执行 `python ExcelHelper.py`
+
+image::assets/pic_save.png[Pubmedsoso]
+
+出现如上提示,可以选择sqlite3数据中的历史记录进行导出,会自动在本地生成一个导出的文件。**不能有重复命名的excel文件,需要按提示删除**
+
+== TO DO List ==
+
+* [ ] 精确地搜索下载,这个还有点难
+* [x] 自定义关键词下载,等我有空弄明白pubmed的检索参数url生成规则就行(已经实现)
+* [ ] 对非免费文献的scihub自动补全下载,或许可以让用户写adapter自己实现
+* [ ] 能用的gui界面
+* [ ] 最好附带一个免费的百度翻译插件,有时候大家可能用得上
+* [x] 采用OOP和更加现代化的工具重构项目
+* [x] 使用异步的方式重构代码,提高执行的效率
+* [ ] 可能还需要一个堪用的日志系统
+* [ ] 可以做一个基于邮件的订阅-主动推送机制的主动的文献订阅功能,为用户推送最新文献
+
+== 调试指南 ==
+
+因为asyncio异步模块的特殊性,在windows下调试时会出现一些特殊的问题。
+如果你需要对代码进行开发调试,需要对两处进行修改。
+
+`GetSearchResult.py` 当中的
+[source, python]
+----
+try:
+    if platform.system() == "Windows":
+        asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())
+
+    html_list = asyncio.run(WebHelper.getSearchHtmlAsync(param_list))
+----
+
+如果在windows下进行调试请注释上面的条件执行语句,否则调试时生效将出错
+
+此外项目当中多处使用的 `asyncio.run()` 调试时需要启用debug参数
+
+否则会出现运行卡住,并报 `TypeError: 'Task' object is not callable` 错误
+
+== Update log ==
+2022.5.16 更新了自动创建document/pub文件夹功能,不需要手动创建文件夹了,会自动检查和创建。
+
+2023.08.05 更新修复了abstract爬取失败的bug,同时不再需要用户手动复制粘贴网页的参数了。
+
+2024.11.23 作者竟然想起了这个黑历史一般的项目,偷偷更新一下,“这TM是我写的代码? 怎么这么烂"
+
+2024.12.02 基于OOP xpath asyncio异步重构了整个代码,去掉运行速度限制,速度大概是原来的100倍,"写完好累好累"

assets/pubmedsoso_teminal.png

76.6 KB
