docs: add OpenAlex search skill #2

Open
XiaokunDuan wants to merge 4 commits into modelscope:main from XiaokunDuan:add-openalex-search-skill

XiaokunDuan commented Jun 15, 2026

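Chinese Description

This PR adds an entry, OpenAlex Search Skill, to the section 2 📚 Literature Research: Retrieval, Reading, Review & Citation Networks.

It is an OpenAlex literature search skill for Codex / command-line environments. Its repository is at:

This PR only updates the README documentation. It contains no keys, configuration files, or private runtime information.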

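What OpenAlex is

OpenAlex is an open global scholarly graph maintained by OurResearch; it can be thought of as an open scholarly graph / bibliographic index. It covers entities such as works, authors, institutions, sources/venues, topics, publishers, and funders, and it links them through citation, authorship, affiliation, and journal/conference relationships into a heterogeneous scholarly graph.

According to the official OpenAlex help center, the catalog currently holds about 474 million scholarly works and links them to authors, institutions, funders, and other entities. The OpenAlex developer documentation likewise states that it covers hundreds of millions of scholarly works, authors, institutions, and other entities, along with billions of connections. For the OpenAlex paper, see Priem, Piwowar & Orr (2022), OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts.

References: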

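Why it belongs in the literature research stage

OpenAlex holds a very large amount of data, and using the API directly often runs into two problems:

  1. The search scope is too broad: a broad keyword search easily returns a large number of noisy results.
  2. Parameter choices matter a great deal: options such as title/abstract search, semantic search, XPAC, stemming, citation sort, and year filtering significantly change the coverage and precision of the results.

The goal of this skill is not to replace OpenAlex but to capture a reusable "default literature discovery strategy" that agents can call reliably during literature review, related work discovery, and citation-aware paper search.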

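What this skill does

openalex_search wraps OpenAlex works search and defaults to settings better suited to scholarly retrieval:

  • Entity: works
  • Default mode: Boolean keyword search
  • Default fields: title + abstract
  • Default API filter: title_and_abstract.search:<query>
  • XPAC enabled by default: include_xpac=true
  • Stemming enabled by default
  • Default sort: OpenAlex relevance

Use cases include:

  • Finding the core papers on a topic
  • Initial screening for related work
  • Expanding a natural-language research question into searchable keywords
  • Supplementing the candidate papers using strategies such as citation count, year, and semantic search
  • Providing a candidate set for later reading, review writing, and citation graph analysis

Recommended usage

For exact terms, known phrases, or review searches, prefer Boolean title-and-abstract search: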

python3 openalex_search/scripts/openalex_search.py '"large language model" AND interpretability'

For exploratory questions, or when the keywords are not yet known, start with semantic search:

python3 openalex_search/scripts/openalex_search.py --mode semantic "papers about how large language models make internal decisions"

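Recommended workflow:

  1. Start with semantic search to find core papers and domain vocabulary.
  2. Then use Boolean title + abstract search to widen coverage.
  3. Use citation sorting to surface high-impact papers.
  4. Add year filters for questions tied to a specific period.
  5. Output JSON when downstream processing is needed.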

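Parameter meanings

  • --mode boolean: the default mode. Searches the OpenAlex title-and-abstract field; suited to explicit keywords, phrases, and review searches.
  • --mode semantic: uses OpenAlex search.semantic; suited to natural-language exploration, and not equivalent to title/abstract field search.
  • --per-page 20: controls the number of results returned.
  • --sort cited_by_count:desc: sorts by citation count; suited to finding high-impact papers.
  • --from-year 2020 / --to-year 2025: filters by publication year.
  • --no-xpac: turns off the OpenAlex expansion pack; results may be cleaner, but coverage drops.
  • --no-stem: uses stricter word-form matching; suited to exact searches where stem expansion is unwanted.
  • --json: outputs raw JSON for later RAG, table processing, or citation graph analysis.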

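Limits and caveats

  • OpenAlex coverage is very broad, but metadata quality is uneven across fields, languages, and publication sources.
  • Semantic search is suited to exploration and should not be treated as a substitute for strictly reproducible Boolean search.
  • XPAC widens coverage but may bring in records with weaker metadata; turn it off when high precision is needed.
  • This skill contains no API key and does not commit any local private configuration to this repository.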

English Version

This PR adds OpenAlex Search Skill to the 2 📚 Literature Research: Retrieval, Reading, Review & Citation Networks section.

It is a Codex / command-line skill for reusable OpenAlex literature search workflows:

This is a documentation-only PR. It does not include API keys, local configuration files, or private runtime data.

What OpenAlex is

OpenAlex is an open global scholarly graph maintained by OurResearch. It indexes and connects scholarly entities such as works, authors, institutions, sources/venues, topics, publishers, and funders.

According to the OpenAlex help center, OpenAlex currently catalogs about 474 million scholarly works, linking them to authors, institutions, funders, and more. The OpenAlex developer documentation describes it as a fully open catalog of the global research system, covering hundreds of millions of scholarly entities and billions of connections. The associated paper is Priem, Piwowar & Orr (2022), OpenAlex: A fully-open index of scholarly works, authors, venues, institutions, and concepts.

References:

Why this belongs in the literature research section

OpenAlex is extremely large, which is useful for literature discovery but also makes search strategy important. Broad keyword queries can easily return noisy results, and parameters such as title/abstract search, semantic search, XPAC, stemming, citation sorting, and year filtering can substantially change precision and recall.

This skill does not try to replace OpenAlex. Instead, it captures a reusable default search strategy for agents doing literature review, related-work discovery, citation-aware paper search, and early-stage topic exploration.

What the skill does

openalex_search wraps OpenAlex works search with research-oriented defaults:

  • Entity: works
  • Default mode: Boolean keyword search
  • Default field scope: title + abstract
  • Default API filter: title_and_abstract.search:<query>
  • XPAC enabled by default: include_xpac=true
  • Stemming enabled by default
  • Default sort: OpenAlex relevance
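
As a rough illustration of how the defaults listed above might translate into a request against the OpenAlex works endpoint, here is a minimal sketch. The function name `build_works_params` and the overall mapping are hypothetical and are not taken from the skill's source. The `title_and_abstract.search` filter and `include_xpac=true` come from the description above; the `.no_stem` variant, the date filters, and passing `search.semantic` as a top-level query parameter are assumptions that should be checked against the OpenAlex documentation.

```python
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_works_params(query, mode="boolean", xpac=True, stem=True,
                       from_year=None, to_year=None, sort=None, per_page=20):
    """Build query parameters for a works search (hypothetical mapping)."""
    params = {"per-page": str(per_page)}
    filters = []

    if mode == "boolean":
        # Assumption: the unstemmed variant is spelled with a ".no_stem" suffix.
        field = "title_and_abstract.search"
        if not stem:
            field += ".no_stem"
        # Note: a query containing commas would need escaping, which is skipped here.
        filters.append(f"{field}:{query}")
    elif mode == "semantic":
        # Assumption: semantic search is passed as its own query parameter.
        params["search.semantic"] = query
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    # Assumption: year bounds are expressed through the date convenience filters.
    if from_year is not None:
        filters.append(f"from_publication_date:{from_year}-01-01")
    if to_year is not None:
        filters.append(f"to_publication_date:{to_year}-12-31")

    if filters:
        params["filter"] = ",".join(filters)
    if xpac:
        params["include_xpac"] = "true"
    if sort:
        # When no sort is given, the parameter is omitted and relevance ordering applies.
        params["sort"] = sort
    return params


params = build_works_params('"large language model" AND interpretability')
print(f"{OPENALEX_WORKS_URL}?{urlencode(params)}")
```

Returning a plain dictionary keeps the sketch free of network calls, so the mapping can be inspected or tested before any request is sent.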

Typical use cases:

  • Finding core papers for a research topic
  • Building an initial related-work candidate set
  • Expanding a natural-language research question into searchable terms
  • Complementing search with citation count, year filters, semantic search, or JSON export
  • Feeding downstream reading, review, RAG, or citation graph workflows

Recommended usage

For precise terms, known phrases, and review-style searches, start with Boolean title-and-abstract search:

python3 openalex_search/scripts/openalex_search.py '"large language model" AND interpretability'

For exploratory questions where the vocabulary is not yet clear, start with semantic search:

python3 openalex_search/scripts/openalex_search.py --mode semantic "papers about how large language models make internal decisions"

Recommended workflow:

  1. Use semantic search to discover core papers and domain vocabulary.
  2. Expand coverage with Boolean title + abstract search.
  3. Use citation sorting when looking for influential papers.
  4. Add year filters for time-bounded topics.
  5. Export JSON when the results will feed RAG, tables, or citation graph analysis.
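
The workflow above can be sketched as a small post-processing step over results that have already been fetched. This is an illustrative sketch only: `merge_candidates` is a hypothetical helper, the sample records are made up, and the only assumption about the data is that each work carries the `id`, `cited_by_count`, and `publication_year` fields found in OpenAlex work records.

```python
import json


def merge_candidates(*result_sets, from_year=None, to_year=None):
    """Merge result lists, drop duplicates by id, filter by year, sort by citations."""
    unique = {}
    for results in result_sets:
        for work in results:
            # Keep the first occurrence of each work id.
            unique.setdefault(work["id"], work)

    def in_range(work):
        year = work.get("publication_year")
        if year is None:
            # Works without a year pass only when no year bound was requested.
            return from_year is None and to_year is None
        if from_year is not None and year < from_year:
            return False
        if to_year is not None and year > to_year:
            return False
        return True

    kept = [w for w in unique.values() if in_range(w)]
    return sorted(kept, key=lambda w: w.get("cited_by_count", 0), reverse=True)


# Made-up records standing in for a semantic pass (step 1) and a Boolean pass (step 2).
semantic_hits = [
    {"id": "W1", "cited_by_count": 10, "publication_year": 2021},
    {"id": "W2", "cited_by_count": 50, "publication_year": 2018},
]
boolean_hits = [
    {"id": "W1", "cited_by_count": 10, "publication_year": 2021},
    {"id": "W3", "cited_by_count": 30, "publication_year": 2023},
]

candidates = merge_candidates(semantic_hits, boolean_hits, from_year=2020)
# Step 5: serialize the candidate set for downstream tools.
print(json.dumps(candidates, indent=2))
```

Deduplicating by `id` matters because the semantic and Boolean passes are expected to overlap on the core papers of a topic.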

Parameter guide

  • --mode boolean: default mode. Searches OpenAlex title-and-abstract fields; best for explicit keywords, phrases, and review searches.
  • --mode semantic: uses OpenAlex search.semantic; best for exploratory natural-language queries and not equivalent to fielded title/abstract search.
  • --per-page 20: controls the number of returned results.
  • --sort cited_by_count:desc: prioritizes highly cited works.
  • --from-year 2020 / --to-year 2025: filters by publication year.
  • --no-xpac: disables OpenAlex expansion-pack works, often cleaner but with lower coverage.
  • --no-stem: uses stricter word-form matching when stemming is undesirable.
  • --json: emits raw JSON for downstream processing.
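
For readers who want to see how such a flag set could be declared, the following is a minimal `argparse` sketch built only from the flags listed above. It is not the skill's actual parser: the defaults (for example 20 for `--per-page`) and the help strings are assumptions.

```python
import argparse


def build_parser():
    """Declare the documented flags (hypothetical reconstruction, not the real script)."""
    parser = argparse.ArgumentParser(
        prog="openalex_search.py",
        description="Search OpenAlex works with research-oriented defaults.",
    )
    parser.add_argument("query", help="Boolean expression or natural-language question")
    parser.add_argument("--mode", choices=["boolean", "semantic"], default="boolean")
    # Assumption: 20 is used as the default page size.
    parser.add_argument("--per-page", type=int, default=20)
    parser.add_argument("--sort", default=None, help="for example cited_by_count:desc")
    parser.add_argument("--from-year", type=int, default=None)
    parser.add_argument("--to-year", type=int, default=None)
    parser.add_argument("--no-xpac", action="store_true", help="exclude expansion-pack works")
    parser.add_argument("--no-stem", action="store_true", help="match exact word forms")
    parser.add_argument("--json", action="store_true", help="emit raw JSON")
    return parser


# Parse an explicit argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--sort", "cited_by_count:desc", "--from-year", "2020", "--no-xpac",
     '"large language model" AND interpretability']
)
print(args.mode, args.per_page, args.sort, args.from_year, args.no_xpac)
```

Modelling XPAC and stemming as `--no-*` switches mirrors the documented behavior in which both features are on unless explicitly disabled.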

Limitations and safety

  • OpenAlex is broad, but metadata quality varies across fields, languages, and publication sources.
  • Semantic search is good for exploration, but should not be treated as a strict replacement for reproducible Boolean queries.
  • XPAC improves coverage but can include records with weaker metadata; disable it when precision is more important.
  • No API keys, credentials, or local private configuration are included in this PR.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request adds the "OpenAlex Search Skill" to the documentation tables in both README.md and README_en.md. The reviewer suggested removing the redundant English description from the Chinese README.md to maintain consistency with other entries, as a dedicated English README is already provided.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread README.md Outdated
@VoyagerXvoyagerx

Collaborator

Thanks for your contribution! The project looks good. Would you please add a LICENSE to your project?

XiaokunDuan force-pushed the add-openalex-search-skill branch from 176de05 to 1aaf23d on June 21, 2026 02:18
@XiaokunDuan

Author

Thanks! I added an MIT LICENSE to the linked openalex_search project: https://github.com/XiaokunDuan/openalex_search/blob/main/LICENSE

I also rebased this PR onto the latest upstream main and resolved the README.md / README_en.md conflicts. The Chinese README now keeps the Chinese-only description, and both README entries use the repository's current static star-marker format.

@XiaokunDuan

Author

@VoyagerXvoyagerx Sorry to interrupt, please check.

Updated project entries with new descriptions and star counts.
@VoyagerXvoyagerx

Collaborator

@XiaokunDuan Thank you for your contribution! To keep the README.md concise, I've removed the details of parameters from the project description. Please feel free to leave a comment if you have any questions.
