Releases · allozaur/llama.cpp

16 Oct 22:03

1bb4f43

b6782 Latest

mtmd : support home-cooked Mistral Small Omni (#14928)

Assets 15

cudart-llama-bin-win-cuda-12.4-x64.zip
sha256:8c79a9b226de4b3cacfd1f83d24f962d0773be79f1e7b75c6af4ded7e32ae1d6
373 MB 2025-10-16T22:03:19Z

llama-b6782-bin-macos-arm64.zip
sha256:6c01bf0745f74f6bea865678b5688f937cd5a254431ee7948733409d7ec11413
10.4 MB 2025-10-16T22:03:37Z

llama-b6782-bin-macos-x64.zip
sha256:12e0afe624715d31f8696c579c2f7d5de17f19d157f1e1bb07f9b7cf27b2c625
27 MB 2025-10-16T22:03:38Z

llama-b6782-bin-ubuntu-vulkan-x64.zip
sha256:55f0d64718b916c0dde3068521e07d22def9ff312f77deac46c05667745146c3
25.8 MB 2025-10-16T22:03:40Z

llama-b6782-bin-ubuntu-x64.zip
sha256:7d510d8b92f36be47b0e2460f4279b2daf37a02c82c08bd808b8a1b79960f033
12.5 MB 2025-10-16T22:03:41Z

llama-b6782-bin-win-cpu-arm64.zip
sha256:be6c489933456bd26319a50cb5ccabd3fc43b3119f717f79e4d01b61601fc91b
10.6 MB 2025-10-16T22:03:42Z

llama-b6782-bin-win-cpu-x64.zip
sha256:5770969f26941ae145ab91e28a5c40a9324950faefc1d239a982a85629c4c548
13.7 MB 2025-10-16T22:03:43Z

llama-b6782-bin-win-cuda-12.4-x64.zip
sha256:148e1fde13144c28d838374f0f907250077d0e2a1c61d55f02b1680aea7ab7c6
169 MB 2025-10-16T22:03:45Z

llama-b6782-bin-win-hip-radeon-x64.zip
sha256:10fa14cc01e92bdda1d2e596d709635f3a90821a64022583a80ad907a0672854
321 MB 2025-10-16T22:03:52Z

llama-b6782-bin-win-opencl-adreno-arm64.zip
sha256:ac905bb86b9570c9dc14075cd32be1849f2f325a833ee63853d01163e900ab67
11 MB 2025-10-16T22:04:04Z

Source code (zip)
2025-10-16T17:00:31Z

Source code (tar.gz)
2025-10-16T17:00:31Z
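
Each binary asset above is listed with a `sha256:` digest. As a minimal sketch of how a downloaded archive could be checked against that digest, the snippet below uses only Python's standard `hashlib`. The helper names `sha256_of` and `matches_release_checksum` are illustrative and are not part of llama.cpp or its release tooling.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Read fixed-size chunks so large archives are never fully in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_release_checksum(path, listed):
    """Compare a file against a 'sha256:<hex>' string as shown in the asset list."""
    algo, _, expected = listed.partition(":")
    if algo != "sha256" or not expected:
        raise ValueError(f"unsupported checksum format: {listed!r}")
    return sha256_of(path) == expected.lower()
```

For example, after downloading `llama-b6782-bin-ubuntu-x64.zip`, calling `matches_release_checksum("llama-b6782-bin-ubuntu-x64.zip", "sha256:7d510d8b92f36be47b0e2460f4279b2daf37a02c82c08bd808b8a1b79960f033")` should return `True` only if the file is intact.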

16 Oct 10:22

github-actions

b6779

7a50cf3

CANN: format code using .clang-format (#15863)

This commit applies .clang-format rules to all source files under the
ggml-cann directory to ensure consistent coding style and readability.
The .clang-format option `SortIncludes: false` has been set to disable
automatic reordering of include directives.
No functional changes are introduced.

Co-authored-by: hipudding <[email protected]>

Assets 15

15 Oct 10:24

github-actions

b6765

fa882fd

metal : avoid using Metal's gpuAddress property (#16576)

* metal : avoid using Metal's gpuAddress property

* metal : fix rope kernels buffer check

Assets 15

13 Oct 09:17

github-actions

b6749

1fb9504

fix: add remark plugin to render raw HTML as literal text (#16505)

* fix: add remark plugin to render raw HTML as literal text

Implemented a missing MDAST stage to neutralize raw HTML, as major LLM WebUIs
do, ensuring consistent and safe Markdown rendering

Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the
Markdown AST into plain-text equivalents while preserving indentation and
line breaks. This ensures consistent rendering and prevents unintended HTML
execution, without altering valid Markdown structure

Kept 'remarkRehype' in the pipeline since it performs the required conversion
from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization

Refined the link-enhancement logic to skip unnecessary DOM rewrites,
fixing a subtle bug where extra paragraphs were injected after the first
line due to full innerHTML reconstruction, and ensuring links open in new
tabs only when required

Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml
-> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify

* fix: address review feedback from allozaur

* chore: update webui build output
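
The core idea behind 'remarkLiteralHtml', as described above, is a tree rewrite: raw `html` nodes in the Markdown AST become plain `text` nodes, so later stages treat the markup as literal characters. The real plugin is part of the webui's JavaScript pipeline; what follows is only a language-neutral sketch in Python over dict-based MDAST nodes. The function name `literal_html` is hypothetical, and the sketch skips details the actual plugin may handle, such as wrapping block-level HTML in a paragraph.

```python
def literal_html(node):
    """Return a copy of an MDAST tree with 'html' nodes turned into 'text' nodes."""
    if node.get("type") == "html":
        # Keep the raw source verbatim so indentation and line breaks survive.
        return {"type": "text", "value": node["value"]}
    if "children" in node:
        # Rebuild the parent with transformed children; other fields are kept.
        return {**node, "children": [literal_html(c) for c in node["children"]]}
    # Leaf nodes that are not raw HTML pass through unchanged.
    return node


tree = {
    "type": "root",
    "children": [
        {
            "type": "paragraph",
            "children": [
                {"type": "text", "value": "Click "},
                {"type": "html", "value": "<button onclick='x()'>"},
                {"type": "text", "value": "here"},
            ],
        }
    ],
}

safe_tree = literal_html(tree)
```

Because no `html` nodes remain after this pass, a later MDAST to HAST conversion has no raw markup to pass through, which is one way to read the "prevents unintended HTML execution" claim.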

Assets 15

13 Oct 08:28

github-actions

b6746

f9bc66c

CANN: Update several operators to support FP16 data format (#16251)

Many Ascend operators internally use FP16 precision for computation.
If input data is in FP32, it must first be cast to FP16 before
computation and then cast back to FP32 afterwards, which introduces
unnecessary cast operations. Moreover, FP16 computation involves
significantly less work than FP32, leading to noticeable efficiency
improvements.

In this change, `get_rows`, `rms_norm`, and `flash_attn_ext` are extended
to support multiple data types. Validation on the Qwen2 0.5b model shows
correct accuracy and about 10% performance gain in concurrent scenarios.

Co-authored-by: noemotiovon <[email protected]>
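
To make the cast round trip in the description above concrete, here is a small sketch in pure Python. It is not CANN or ggml code: it only simulates the FP16 and FP32 storage formats with the standard `struct` module, and the arithmetic itself runs in Python floats. The `rms_norm` function is the usual textbook definition, and the default `eps` value is an assumption made for illustration.

```python
import math
import struct


def to_fp16(x):
    """Round a float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x):
    """Round a float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def rms_norm(row, eps=1e-6):
    """Scale a row by the reciprocal of its root mean square."""
    mean_sq = sum(v * v for v in row) / len(row)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale for v in row]


def rms_norm_fp32_io(row_fp32):
    """FP32 in and out around an FP16 operator: two extra cast passes."""
    row_fp16 = [to_fp16(v) for v in row_fp32]          # cast in
    out_fp16 = [to_fp16(v) for v in rms_norm(row_fp16)]
    return [to_fp32(v) for v in out_fp16]              # cast back


def rms_norm_fp16_io(row_fp16):
    """FP16 in and out: the operator consumes and produces FP16 directly."""
    return [to_fp16(v) for v in rms_norm(row_fp16)]
```

When the input is already representable in FP16, both paths return the same values, so in this simulation the two cast passes in `rms_norm_fp32_io` add work without changing the result.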

Assets 15

10 Oct 06:58

github-actions

b6725

1faa13a

webui: updated the chat service to only include max_tokens in the req…

Assets 15

03 Oct 11:11

github-actions

b6681

84c8e30

Fix missing messages on sibling navigation (#16408)

* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs

* chore: update webui build output

* chore: update webui build output
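
The fix above is described only in one line, so the following is a hedged sketch of the general idea rather than the webui's actual code. It assumes a hypothetical conversation model in which each message is a dict with `id`, `parent`, and `timestamp` fields. When switching to a sibling, the visible thread is rebuilt from that sibling's current leaf instead of from a previously cached ID.

```python
def children_of(messages, parent_id):
    """Return all messages whose parent is the given id."""
    return [m for m in messages if m["parent"] == parent_id]


def leaf_under(messages, node_id):
    """Follow the newest child at each level until a leaf is reached."""
    current = node_id
    while True:
        kids = children_of(messages, current)
        if not kids:
            return current
        current = max(kids, key=lambda m: m["timestamp"])["id"]


def path_to_root(messages, leaf_id):
    """Return the ids from the root down to the given leaf."""
    by_id = {m["id"]: m for m in messages}
    path = []
    current = leaf_id
    while current is not None:
        path.append(current)
        current = by_id[current]["parent"]
    return path[::-1]


def visible_thread(messages, sibling_id):
    """Messages to display after navigating to a sibling branch."""
    return path_to_root(messages, leaf_under(messages, sibling_id))
```

Resolving the leaf at navigation time means follow-up messages added under a sibling after any cache was filled are still included in the displayed thread.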

Assets 15

03 Oct 10:02

github-actions

b6677

7723327

Capture model name only after first token (streaming) or completed re…

Assets 15

03 Oct 06:30

github-actions

b6673

d64c810

test-barrier : do not use more threads than physically available (#16…

Assets 15

30 Sep 22:59

github-actions

b6653

e74c92e

model : support GLM 4.6 (make a few NextN/MTP tensors not required) (…

Assets 15
