A small tool for automatically extracting PubMed literature information and downloading free literature in bulk.

== Main Features

A tool for crawling and downloading PubMed literature information, built on `aiohttp`, `pandas`, and `xpath`. It retrieves relevant literature information according to the parameters you provide and downloads the corresponding PDF originals. Downloading takes approximately 1 second per article. The tool extracts most of the literature information and automatically generates an Excel file, including the title, abstract, keywords, author list, author affiliations, whether the article is free, and whether it is a review.

After automatic downloading, some information is stored in local text files for reference, and search data is stored in an `sqlite3` database. After execution, all information is automatically exported to an Excel file.
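
To give a flavor of the approach, here is a minimal sketch of fetching a search page with `aiohttp` and extracting titles with an XPath query; the query parameter and XPath selector are illustrative assumptions, not the project's actual code:

[source, python]
----
import asyncio

import aiohttp
from lxml import etree

# Sketch of the aiohttp + xpath approach; the "term" parameter and the
# XPath selector below are illustrative assumptions.
async def fetch_titles(keyword: str) -> list[str]:
    async with aiohttp.ClientSession() as session:
        async with session.get("https://pubmed.ncbi.nlm.nih.gov/",
                               params={"term": keyword}) as resp:
            html = await resp.text()
    tree = etree.HTML(html)
    # Collect the text of each result title on the page.
    return tree.xpath('//a[@class="docsum-title"]//text()')

print(asyncio.run(fetch_titles("headache")))
----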

== Dependencies

*Based on Python 3.9; higher versions are also supported. Mainly uses `pandas`, `xpath`, `asyncio`, and `aiohttp`.*

In the Releases section on the right side of the project page, there is an executable packed with `Nuitka` that includes all dependencies and can be downloaded and run directly from the Windows command line.

*If you need to run it in a Python environment, please install the corresponding dependencies according to the `requirements.txt` file.*

[source, bash, indent=2]
----
requests~=2.32.3
lxml~=5.3.0
pandas~=2.2.3
openpyxl~=3.1.5
colorlog~=6.9.0
----

== Usage

. Clone the project and install the required dependencies in the command line environment. It is recommended to use Python virtual environment tools such as `anaconda` or `miniconda3`; a minimal environment setup is sketched below.
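+
For example, with `miniconda3` (a sketch; the environment name is arbitrary):
+
[source, bash]
----
conda create -n pubmedsoso python=3.9
conda activate pubmedsoso
----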
+
[source, bash]
----
cd Pubmedsoso
pip install -r requirements.txt
----

If it is inconvenient to install the `git` tool, you can directly download the ZIP file from the Releases section on the right side of the page and unzip it for execution.

image:assets/pubmed_release.png[Pubmedsoso, 600]

. Switch to the project folder in the terminal and execute `python main.py` with keyword parameters, or directly execute the executable file `pubmedsoso.exe` followed by keywords. For example:
+
[source, bash]
----
python main.py -k headache -n 5 -d 10 -y 5
----

*`-k`* specifies the search keyword for this run. If your keyword contains spaces, please enclose the entire keyword in double quotes. PubMed advanced queries such as "headache AND toothache" are supported, including the logical operators AND, NOT, and OR.

*`-n`* is followed by a number specifying how many pages to search, with 50 articles per page.

*`-d`* is followed by a number specifying how many articles to download.

*`-y`* is an optional parameter followed by the year range to search, in years; for example, `-y 5` means literature from the last five years. A combined example follows below.
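
For instance, a quoted advanced query can be combined with the other flags (the parameter values here are only illustrative):

[source, bash]
----
python main.py -k "headache AND toothache" -n 2 -d 5 -y 3
----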

When entering the number of pages, keep in mind that each page contains 50 search results. There is no need to set a large number; otherwise, execution will take a long time.

Then enter the number of articles you need to download. The program will find free PMC articles from the search results and download them automatically. The download speed depends on your network conditions.

Each article download automatically times out after 30 seconds and skips to the next one; a sketch of this pattern follows below.
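
In `asyncio` terms, that per-article timeout can be pictured roughly as follows. This is a minimal sketch, not the project's actual code, and the function names are hypothetical:

[source, python]
----
import asyncio

import aiohttp

# Hypothetical illustration: each download is capped at 30 seconds,
# and a timeout simply moves on to the next article.
async def download_one(session: aiohttp.ClientSession, url: str) -> bytes:
    async with session.get(url) as resp:
        return await resp.read()  # PDF bytes; saving to disk omitted

async def download_all(urls: list[str]) -> None:
    async with aiohttp.ClientSession() as session:
        for url in urls:
            try:
                await asyncio.wait_for(download_one(session, url), timeout=30)
            except asyncio.TimeoutError:
                print(f"Timed out, skipping: {url}")
----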

[source, bash]
----
PS > python main.py --help
usage: python main.py -k keyword

pubmedsoso is a python program for crawling article information and downloading pdf files

optional arguments:
  --help, -h         show this help message and exit
  --version, -v      use --version to show the version
  --keyword, -k      specify the keywords to search pubmed
  --pagenum, -n      add --pagenum or -n to specify the number of pages to search
  --year, -y         add --year or -y to specify the year scale you would like to search
  --downloadnum, -d  a digit number to specify the number to download
  --directory, -D    use a valid directory path to specify the pdf save directory
  --output, -o       add --output filename to appoint the name of the pdf file
  --loglevel, -l     set the console log level, e.g. debug
----

*If you are familiar with IDEs, you can run `main.py` in Python environments such as `pycharm` or `vscode`.*

. According to the prompt, enter `y` or `n` to decide whether to execute the program with the given parameters.
**pubmedsoso will crawl and download according to the normal search order.**

image:assets/pic_keyword.png[Pubmedsoso, 600]

. The literature will be automatically downloaded to the "document/pub/" folder mentioned earlier, and a txt file with the original traversal information will be generated. Finally, an Excel file is generated after execution completes.

image::assets/pic_result.png[Pubmedsoso, 600]

WARNING: Please do not crawl the PubMed website excessively. Since this project uses asynchronous mechanisms, it has high concurrency capability. Parameters related to access speed can be set in `config.py`, and the default values are moderate; an illustrative example follows below.
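
Purely as an illustration, such speed-related settings typically look like the following; the option names here are hypothetical, and the real ones in `config.py` may differ:

[source, python]
----
# Hypothetical names; check config.py for the project's actual options.
MAX_CONCURRENT_REQUESTS = 5    # upper bound on simultaneous connections
REQUEST_DELAY_SECONDS = 0.5    # pause between successive requests
DOWNLOAD_TIMEOUT_SECONDS = 30  # per-article download timeout
----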

== ExcelHelper Module

This module makes it convenient to export historical information to Excel after crawling. It can be executed separately, for example in an IDE or from the command line with `python ExcelHelper.py`.

image::assets/pic_save.png[Pubmedsoso]

When the above prompt appears, you can choose to export historical records from the `sqlite3` database, and an export file will be automatically generated locally. **Duplicate-named Excel files are not allowed and need to be deleted as prompted.**
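
Conceptually, the export resembles the following minimal sketch; the database file and table names are hypothetical, not the module's actual ones:

[source, python]
----
import sqlite3

import pandas as pd

# Hypothetical database file and table name.
conn = sqlite3.connect("pubmedsoso.db")
df = pd.read_sql_query("SELECT * FROM literature", conn)
conn.close()

# openpyxl (listed in requirements.txt) does the .xlsx writing.
df.to_excel("history_export.xlsx", index=False)
----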

== TO DO List

* [ ] Precise search and download; this is still a bit difficult
* [x] Custom keyword download, waiting for me to figure out the PubMed search parameter URL generation rules (already implemented)
* [ ] Automatic download of non-free literature via SciHub, perhaps allowing users to write adapters themselves
* [ ] A usable GUI interface
* [ ] Ideally, a free Baidu translation plugin; it might be useful sometimes
* [x] Refactor the project using OOP and more modern tools
* [x] Refactor the code using asynchronous methods to improve execution efficiency
* [x] A usable logging system
* [ ] An email-subscription-based mechanism to proactively push the latest literature to users

== Debugging Guide

Due to the peculiarities of the `asyncio` asynchronous module, some special issues may arise during debugging on Windows.

If you need to develop and debug the code, you need to modify two places:

If debugging on Windows, please comment out the conditional execution statement.

Additionally, `asyncio.run()` is used in multiple places in the project. During debugging, the debug parameter needs to be enabled; otherwise, the run will get stuck and report a `TypeError: 'Task' object is not callable` error.
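
For example, enabling debug mode looks like this; `main()` stands in for whichever coroutine is being run:

[source, python]
----
import asyncio

async def main():
    ...  # the coroutine being debugged

# Enable asyncio's debug mode while debugging; without it the run can
# hang and report "TypeError: 'Task' object is not callable".
asyncio.run(main(), debug=True)
----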

== Update Log

2022.5.16 Added automatic creation of the `document/pub` folder; there is no need to create the folder manually, as it is checked and created automatically.

2023.08.05 Fixed the bug where abstract crawling failed; users no longer need to manually copy and paste webpage parameters.

2024.11.23 The author unexpectedly remembered this somewhat embarrassing project and quietly updated it: "Is this really the code I wrote? How could it be so bad?"

2024.12.02 Refactored the entire codebase based on OOP, `xpath`, and `asyncio`; removed the runtime speed limit, making it roughly 100 times faster than before. "Writing this was exhausting."