Garbage output on ARMv8.0 (Cortex-A53/A73) — NEON-only fallback path produces incorrect results

Environment
Hardware: Amlogic A311D SoC
CPU: 4× Cortex-A73 + 2× Cortex-A53 (ARMv8.0, NEON only, no dotprod)
RAM: 4GB
OS: Ubuntu 20.04 aarch64
Compiler: Clang 18.1.8
BitNet commit: 1f86f058
Model: BitNet-b1.58-2B-4T (official GGUF from microsoft/BitNet-b1.58-2B-4T-gguf)

Problem:-
The model loads correctly but produces garbage output on ARMv8.0 CPUs that lack the dotprod extension. The #else (non-dotprod) NEON path in src/ggml-bitnet-mad.cpp (ggml_vec_dot_i2_i8_s_1x1 using vmlal_s8) appears to compute incorrect results.

Steps to Reproduce:-

git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

# setup_env.py fails with NotImplementedError on this hardware
# Manual build:
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DCMAKE_C_FLAGS="-mcpu=cortex-a53 -march=armv8-a+crc" \
  -DCMAKE_CXX_FLAGS="-mcpu=cortex-a53 -march=armv8-a+crc" \
  -DGGML_NATIVE=OFF \
  -DBITNET_ARM_TL1=OFF

cmake --build build --config Release -j4

./build/bin/llama-cli \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Hello, tell me about yourself." \
  -n 128 -t 4

Expected Output:-
Coherent English text.

Actual Output:-
"Dank spraw/sl-ingimum avReactSM may ad bUISapp.ed,reuschPIrcMarketcryptampo
reco initially CoronaMISS speculate441 Geme Eduardo untilidor.wordpresstor
narrowln halband fast standuce Best Diveamma occasionally Mines choppeddates
raspbreaking-NEL inst de(def caves ¬slaBU platformsapiro uncert stumpilla..."

Issues Found During Investigation:-
1. setup_env.py fails with NotImplementedError in gen_code(): the TL1 kernel generator doesn't support the 2B model dimensions (2560/6912), only the 3B ones (3200/8640).

2. BITNET_ARM_TL1=ON has no effect for the 2B model because ggml_preprocessor() and ggml_qgemm_lut() in the generated bitnet-lut-kernels.h only handle dimensions 3200/8640/3200, not 2560/6912.

3. Hardcoded dotprod inline assembly in ggml-aarch64.c (168 .inst-encoded sdot instructions in the Q4_0 kernels) will cause SIGILL if that code path is reached on ARMv8.0. It is currently dead code for i2_s, but risky.

4. The official GGUF (microsoft/BitNet-b1.58-2B-4T-gguf) uses general.architecture = bitnet-b1.58, but the fork's llama.cpp expects bitnet, which causes tokenizer warnings: GENERATION QUALITY WILL BE DEGRADED.

5. The non-dotprod NEON fallback in ggml_vec_dot_i2_i8_s_1x1 (lines ~340-400 of ggml-bitnet-mad.cpp) uses vmlal_s8 accumulation into int16x8_t, which may overflow for large inner dimensions (2560/6912): an int16 lane can only hold values up to ±32767, and each multiply-accumulate adds a value of up to ±127×2 = ±254.
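
Since item 3 only matters on CPUs without dotprod, it helps to confirm the feature set first. On Linux aarch64 the dot-product extension shows up as the asimddp flag on the Features line of /proc/cpuinfo. The sketch below parses that text; cpuinfo_has_feature is a hypothetical helper written for this report, not something from the BitNet tree.

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `flag` appears as a whole token on any
// "Features" line of the given /proc/cpuinfo text. Whole-token matching
// matters because "asimd" is a prefix of "asimddp".
bool cpuinfo_has_feature(const std::string& cpuinfo_text, const std::string& flag) {
    std::istringstream lines(cpuinfo_text);
    std::string line;
    while (std::getline(lines, line)) {
        if (line.compare(0, 8, "Features") != 0) continue;  // not a Features line
        const std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream tokens(line.substr(colon + 1));
        std::string token;
        while (tokens >> token) {
            if (token == flag) return true;
        }
    }
    return false;
}
```

To use it, read /proc/cpuinfo into a std::string (for example with std::ifstream) and call cpuinfo_has_feature(text, "asimddp"). For an illustrative line such as "Features : fp asimd aes crc32" it returns false, which would mean only the NEON fallback path is safe to run.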

Suspected Root Cause:-
The non-dotprod path accumulates the vmlal_s8 results in an int16x8_t (named accu32) for 32 iterations × 8 multiply-adds per iteration = 256 signed int8 products before widening to int32. Each product can be as large as 127×2 = 254, so 256 of them can sum to 65,024, which exceeds the int16 maximum of 32,767. The dotprod path (vdotq_s32) accumulates directly into int32 and doesn't have this problem.
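
The arithmetic above can be checked with a small scalar model. This is a sketch, not the kernel code: accumulate_lane_i16 and accumulate_i32 are hypothetical helpers that model a single accumulator lane. They assume, as the estimate above does, that one lane receives all 256 products, and that the int16 lane wraps modulo 2^16 on overflow, which to my understanding is what the non-saturating vmlal_s8 does.

```cpp
#include <cassert>
#include <cstdint>

// Models one int16 accumulator lane that receives `n_products` identical
// products. Unsigned arithmetic keeps the modulo-2^16 wrap well defined.
std::int16_t accumulate_lane_i16(int n_products, std::int8_t activation, std::int8_t weight) {
    assert(n_products >= 0);
    const int product = static_cast<int>(activation) * static_cast<int>(weight);
    std::uint16_t acc = 0;
    for (int i = 0; i < n_products; ++i) {
        acc = static_cast<std::uint16_t>(acc + product);  // wraps at 65536
    }
    // Reinterpret the low 16 bits as signed (two's complement on all
    // mainstream compilers, guaranteed since C++20).
    return static_cast<std::int16_t>(acc);
}

// Reference: the same sum accumulated in int32, as the dotprod path does.
std::int32_t accumulate_i32(int n_products, std::int8_t activation, std::int8_t weight) {
    assert(n_products >= 0);
    std::int32_t acc = 0;
    for (int i = 0; i < n_products; ++i) {
        acc += static_cast<std::int32_t>(activation) * static_cast<std::int32_t>(weight);
    }
    return acc;
}
```

With activation 127 and weight 2, accumulate_i32(256, 127, 2) gives 65,024 while accumulate_lane_i16(256, 127, 2) wraps to -512. In this worst case the lane stays exact only up to 129 products (129 × 254 = 32,766).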

Suggested Fix:-
In the non-dotprod path, either widen to int32 after fewer inner-loop iterations, or accumulate directly into int32:
// Instead of accumulating 32 iterations in int16:
// Widen to int32 every 8 iterations instead of every 32
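
A scalar sketch of the periodic-widening option follows. The names dot_with_periodic_widening and flush_every are hypothetical, not identifiers from ggml-bitnet-mad.cpp, and the uint16_t partial sum only models one wrapping int16 lane; a real fix would do the equivalent widening with NEON intrinsics. Under this issue's own count of 8 products per iteration, widening every 8 iterations corresponds to flush_every = 64, and 64 × 254 = 16,256 stays inside the int16 range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical scalar model: products are summed in a 16-bit partial
// accumulator and flushed into an int32 total every `flush_every` products.
std::int32_t dot_with_periodic_widening(const std::vector<std::int8_t>& weights,
                                        const std::vector<std::int8_t>& activations,
                                        std::size_t flush_every) {
    assert(weights.size() == activations.size());
    assert(flush_every > 0);
    std::int32_t total = 0;
    std::uint16_t partial = 0;  // models one wrapping int16 lane
    std::size_t pending = 0;    // products added since the last flush
    for (std::size_t i = 0; i < weights.size(); ++i) {
        const int product = static_cast<int>(weights[i]) * static_cast<int>(activations[i]);
        partial = static_cast<std::uint16_t>(partial + product);
        if (++pending == flush_every) {
            total += static_cast<std::int16_t>(partial);  // widen to int32
            partial = 0;
            pending = 0;
        }
    }
    total += static_cast<std::int16_t>(partial);  // flush the remainder
    return total;
}
```

For 256 products of 127 × 2, flush_every = 64 returns the exact 65,024, whereas flush_every = 256 reproduces the wrapped result of -512. Any interval of at most 127 products is safe even for the extreme product of -128 × 2, since 127 × 256 = 32,512.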


Additional Notes:-
Output is deterministic (same garbage every run)
Single-thread (-t 1) produces identical garbage
Both QK_I2_S=64 and QK_I2_S=128 produce garbage
The model metadata loads correctly (210 i2_s tensors, 121 f32, 1 f16)


Garbage output on ARMv8.0 (Cortex-A53/A73) — NEON-only fallback path produces incorrect results #411