A tool for crawling and downloading PubMed literature information, built on `aiohttp`, `pandas`, and `xpath`. It retrieves relevant literature information according to the parameters you provide and downloads the corresponding original PDFs.

The download speed is approximately 1 second per article. It extracts most of the literature information and automatically generates an Excel file, including details such as the title, abstract, keywords, author list, author affiliations, whether the article is free, and whether it is a review.
After the automatic download, some information is stored in local text files for reference, and the search data is stored in an `sqlite3` database. When the run finishes, all information is exported automatically and an Excel file is generated.

== Dependencies ==

On the project page, there is an executable file packed with `Nuitka` in the Releases section, which includes all dependencies and can be directly downloaded and executed in the Windows command line.
_If you need to run it in a Python environment yourself, install the corresponding dependencies listed in the `requirements.txt` file._
[source, bash, indent=2]
----
pandas~=2.2.3
openpyxl~=3.1.5
----
== Usage ==
. Clone the project and install the required dependencies from the command line. A Python virtual environment tool such as `anaconda` or `miniconda3` is recommended.
[source, bash]
----
cd Pubmedsoso
pip install -r requirements.txt
----
If it is inconvenient to install the `git` tool, you can simply download the ZIP file from the Releases section on the right side of the page, unzip it, and run from there.

. Switch to the project folder in the terminal and run `python main.py` followed by the keyword arguments, or run the executable `pubmedsoso.exe` followed by the keywords. For example:
[source, bash]
----
python main.py headache -n 5 -d 10
----
`headache` is the search keyword for this run. If your keyword contains spaces, enclose it in double quotes. PubMed advanced query expressions such as "headache AND toothache", which use logical operators like AND, NOT, and OR, are also supported.
The number after the `-n` parameter specifies how many result pages to search; each page contains 50 articles.
The number after the `-d` parameter specifies how many articles to download.
The `-y` parameter is optional; the number after it specifies the year range to search, in years. For example, `-y 5` means literature from the last five years.

Each page holds 50 search results, so the page number does not need to be large; otherwise execution will take a long time.
Then specify the number of articles to download. The program finds the free PMC articles among the search results and downloads them automatically; the download speed depends on your network condition. Each article download times out after 30 seconds and is skipped, and the program moves on to the next one.
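As a rough illustration only (not the project's actual code), this timeout-and-skip behaviour can be expressed with `asyncio.wait_for`; the `download_pdf` helper, URLs, and file paths below are hypothetical.

[source, python]
----
import asyncio
import aiohttp

# Hypothetical helper: fetch a single PDF and write it to disk.
async def download_pdf(session: aiohttp.ClientSession, url: str, path: str) -> None:
    async with session.get(url) as resp:
        resp.raise_for_status()
        with open(path, "wb") as f:
            f.write(await resp.read())

async def download_all(urls: list[str]) -> None:
    async with aiohttp.ClientSession() as session:
        for i, url in enumerate(urls):
            try:
                # Skip the article if the download takes longer than 30 seconds.
                await asyncio.wait_for(
                    download_pdf(session, url, f"document/pub/{i}.pdf"), timeout=30
                )
            except (asyncio.TimeoutError, aiohttp.ClientError) as exc:
                print(f"skipped {url}: {exc}")  # move on to the next article

# asyncio.run(download_all(["https://example.org/sample.pdf"]))
----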
[source, bash]
----
PS > python main.py --help
usage: python main.py keyword

pubmedsoso is a python program for crawling article information and downloading pdf file

positional arguments:
  keyword              specify the keywords to search pubmed e.g. "headache"

optional arguments:
  --help, -h           show this help message and exit
  --version, -v        use --version to show the version
  --pagenum, -n        add --pagenum or -n to specify the page number to
  --year -y            add --year or -y to specify year scale you would to
  --downloadnum, -d    a digit number to specify the number to download
  --directory -D       use a valid directory path specify the pdf save directory.
----
. The literature is downloaded automatically to the `document/pub/` folder mentioned earlier, and a txt file with the original crawl information is generated. After execution, the program finally generates an Excel file.
image::assets/pic_result.png[Pubmedsoso, 600]
WARNING:: Please do not crawl the PubMed website excessively. Because this project uses asynchronous mechanisms, it is capable of high concurrency. Parameters related to access speed can be set in `config.py`, and the default values are moderate.
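The exact option names in `config.py` depend on the project version, but throttling an `aiohttp`-based crawler generally comes down to a concurrency cap and a delay between requests. The sketch below is only an illustration; all names in it are hypothetical.

[source, python]
----
import asyncio
import aiohttp

# Hypothetical throttling settings of the kind a config module might expose.
MAX_CONCURRENCY = 5   # how many requests may run at the same time
REQUEST_DELAY = 0.5   # pause after each request, in seconds

semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

async def polite_get(session: aiohttp.ClientSession, url: str) -> str:
    async with semaphore:                   # cap concurrent requests
        async with session.get(url) as resp:
            text = await resp.text()
        await asyncio.sleep(REQUEST_DELAY)  # slow down between requests
        return text
----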
The `ExcelHelper` module makes it convenient to export historical information to Excel after crawling. It can be run separately, for example from an IDE or the command line, by executing `python ExcelHelper.py`.

When the prompt appears, you can choose to export historical records from the `sqlite3` database, and an export file is generated locally. **Excel files with duplicate names are not allowed and must be deleted as prompted.**
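Internally, an export of this kind boils down to reading the `sqlite3` database with `pandas` and writing an Excel file through `openpyxl`. The sketch below is only illustrative; the database file name and table name are assumptions, not the project's real schema.

[source, python]
----
import sqlite3
import pandas as pd

# Hypothetical database file and table name; the real ones are defined by the project.
DB_PATH = "pubmedsoso.db"
TABLE = "article_info"

with sqlite3.connect(DB_PATH) as conn:
    df = pd.read_sql_query(f"SELECT * FROM {TABLE}", conn)

# openpyxl (listed in requirements.txt) serves as the Excel writer backend.
df.to_excel("pubmed_export.xlsx", index=False)
print(f"exported {len(df)} records")
----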
== TO DO List ==
* [ ] Precise search and download; this is still a bit difficult
* [x] Custom keyword download, waiting for me to figure out the PubMed search parameter URL generation rules (already implemented)
* [ ] Automatic completion of non-free literature download via SciHub, perhaps allowing users to write adapters themselves
* [ ] A usable GUI interface
* [ ] Ideally, a free Baidu translation plugin; it might be useful sometimes
* [x] Refactor the project using OOP and more modern tools
* [x] Refactor the code using asynchronous methods to improve execution efficiency
* [ ] A potentially necessary logging system
* [ ] Implement an active literature subscription feature based on an email subscription and push mechanism, pushing the latest literature to users
== Debugging Guide ==
Due to the particularities of the `asyncio` asynchronous module, some special issues may occur when debugging on Windows.
If you need to develop and debug the code, two places need to be modified.

If you are debugging on Windows, comment out the conditional execution statement above; otherwise it will take effect during debugging and cause errors.
Additionally, `asyncio.run()` is used in several places in the project. During debugging, its `debug` parameter needs to be enabled; otherwise the run will get stuck and report a `TypeError: 'Task' object is not callable` error.
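The exact statements vary with the project version, but the two adjustments typically look like the sketch below: a Windows-specific event loop policy line is the kind of conditional statement to comment out while debugging, and `asyncio.run()` accepts a `debug=True` flag. The `main()` coroutine here is only a placeholder, not the project's real entry point.

[source, python]
----
import asyncio
import sys

async def main():
    ...  # placeholder for the program's real entry coroutine

# 1. A Windows-specific event loop policy is often set conditionally like this;
#    comment it out while debugging on Windows if it causes errors.
# if sys.platform == "win32":
#     asyncio.set_event_loop_policy(asyncio.WindowsSelectorEventLoopPolicy())

# 2. Enable asyncio's debug mode while debugging.
asyncio.run(main(), debug=True)
----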
== Update Log ==
2022.5.16 Added automatic creation of the `document/pub` folder; there is no longer any need to create it manually, as it is checked and created automatically.
2023.08.05 Fixed a bug where abstract crawling failed; users no longer need to manually copy and paste web page parameters.
2024.11.23 The author somehow remembered this embarrassing old project and quietly updated it: "Is this really the code I wrote? How is it so bad?"

2024.12.02 Refactored the entire codebase around OOP, `xpath`, and asynchronous `asyncio`; removed the runtime speed limit, making it roughly 100 times faster than before. "Writing this was so exhausting."