
Garbage output on ARMv8.0 (Cortex-A53/A73) — NEON-only fallback path produces incorrect results #411

@RKAiCodes05


Environment
Hardware: Amlogic A311D SoC
CPU: 4× Cortex-A73 + 2× Cortex-A53 (ARMv8.0, NEON only, no dotprod)
RAM: 4GB
OS: Ubuntu 20.04 aarch64
Compiler: Clang 18.1.8
BitNet commit: 1f86f058
Model: BitNet-b1.58-2B-4T (official GGUF from microsoft/BitNet-b1.58-2B-4T-gguf)

Problem:-
The model loads correctly but produces garbage output on ARMv8.0 CPUs that lack the dotprod extension. The #else (non-dotprod) NEON path in src/ggml-bitnet-mad.cpp (ggml_vec_dot_i2_i8_s_1x1 using vmlal_s8) appears to compute incorrect results.

Steps to Reproduce:-

git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

setup_env.py fails with NotImplementedError on this hardware

Manual build:

cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DCMAKE_C_FLAGS="-mcpu=cortex-a53 -march=armv8-a+crc" \
  -DCMAKE_CXX_FLAGS="-mcpu=cortex-a53 -march=armv8-a+crc" \
  -DGGML_NATIVE=OFF \
  -DBITNET_ARM_TL1=OFF

cmake --build build --config Release -j4

./build/bin/llama-cli \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Hello, tell me about yourself." \
  -n 128 -t 4

Expected Output:-
Coherent English text.

Actual Output:-
"Dank spraw/sl-ingimum avReactSM may ad bUISapp.ed,reuschPIrcMarketcryptampo
reco initially CoronaMISS speculate441 Geme Eduardo untilidor.wordpresstor
narrowln halband fast standuce Best Diveamma occasionally Mines choppeddates
raspbreaking-NEL inst de(def caves ¬slaBU platformsapiro uncert stumpilla..."

Issues Found During Investigation:-
1. setup_env.py fails with NotImplementedError in gen_code(): the TL1 kernel generator doesn't support the 2B model dimensions (2560/6912), only 3B (3200/8640).

2. BITNET_ARM_TL1=ON has no effect for the 2B model because ggml_preprocessor() and ggml_qgemm_lut() in the generated bitnet-lut-kernels.h only handle dimensions 3200/8640/3200, not 2560/6912.

3. Hardcoded dotprod inline assembly in ggml-aarch64.c (168 .inst-encoded sdot instructions in Q4_0 kernels): these cause SIGILL if the code path is reached on ARMv8.0. Currently dead code for i2_s, but risky.

4. The official GGUF (microsoft/BitNet-b1.58-2B-4T-gguf) uses general.architecture = bitnet-b1.58, but the fork's llama.cpp expects bitnet, causing tokenizer warnings: GENERATION QUALITY WILL BE DEGRADED.

5. The non-dotprod NEON fallback in ggml_vec_dot_i2_i8_s_1x1 (lines ~340-400 of ggml-bitnet-mad.cpp) accumulates with vmlal_s8 into int16x8_t, which may overflow for large inner dimensions (2560/6912): int16 is limited to -32768..32767, vmlal_s8 wraps on overflow rather than saturating, and each accumulation adds values up to ±127×2 = ±254.

6. Suspected Root Cause
The vmlal_s8 accumulation in the non-dotprod path uses int16x8_t accu32 which accumulates 32 iterations × 8 multiply-adds per iteration = 256 signed int8 products before widening to int32. Each product can be up to 127×2 = 254. Sum of 256 such values = up to 65,024, which overflows int16 (max 32,767). The dotprod path (vdotq_s32) accumulates directly into int32 and doesn't have this problem.
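
The arithmetic above can be checked with a small scalar model. This is only a sketch under the accounting used in this issue (256 products of up to 2×127 landing in one int16 accumulator before widening); it does not reproduce the real NEON kernel, and the function names dot_int32_acc and dot_int16_lane are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Reference: every product is accumulated directly in int32,
// which is how the issue describes the dotprod path.
int32_t dot_int32_acc(const std::vector<int8_t>& w, const std::vector<int8_t>& x) {
    assert(w.size() == x.size());
    int32_t acc = 0;
    for (size_t i = 0; i < w.size(); ++i) {
        acc += static_cast<int32_t>(w[i]) * x[i];
    }
    return acc;
}

// Model of a single int16 lane: all products are added into one 16-bit
// accumulator that wraps modulo 2^16, and the value is widened only once
// at the very end.
int32_t dot_int16_lane(const std::vector<int8_t>& w, const std::vector<int8_t>& x) {
    assert(w.size() == x.size());
    uint16_t acc = 0;  // unsigned so that wraparound is well defined in C++
    for (size_t i = 0; i < w.size(); ++i) {
        const int prod = static_cast<int>(w[i]) * x[i];
        acc = static_cast<uint16_t>(acc + prod);
    }
    // Reinterpret the 16-bit pattern as a signed two's-complement value.
    return acc >= 0x8000 ? static_cast<int32_t>(acc) - 0x10000
                         : static_cast<int32_t>(acc);
}
```

With 256 elements of weight 2 and activation 127, dot_int32_acc returns 65024 while dot_int16_lane returns -512, so the sign and magnitude are both wrong. With 128 such elements the sum is 32512, which still fits in int16.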

7. Suggested Fix
In the non-dotprod path, reduce the inner loop count before widening to int32, or accumulate directly into int32:
// Instead of accumulating 32 iterations in int16:
// Widen to int32 every 8 iterations instead of every 32
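
A minimal sketch of what "widen every 8 iterations" could look like follows. It is not a patch against ggml-bitnet-mad.cpp: the function name dot_i8_widen_every_8 is hypothetical, it works on plain unpacked int8 arrays rather than the packed i2_s layout, and it assumes the weights satisfy |w| <= 2 so that 8 chained vmlal_s8 steps add at most 8 × 2 × 128 = 2048 per int16 lane. On targets without AArch64 NEON the scalar loop handles everything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#if defined(__aarch64__) && defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: dot product of two int8 vectors that uses only
// ARMv8.0 NEON (no dotprod) and flushes the int16 accumulator into int32
// after at most 8 vmlal_s8 steps. Assumes |w[i]| <= 2.
int32_t dot_i8_widen_every_8(const std::vector<int8_t>& w, const std::vector<int8_t>& x) {
    assert(w.size() == x.size());
    const size_t n = w.size();
    int32_t total = 0;
    size_t i = 0;
#if defined(__aarch64__) && defined(__ARM_NEON)
    constexpr size_t kStepsPerWiden = 8;
    int32x4_t acc32 = vdupq_n_s32(0);
    while (i + 8 <= n) {
        int16x8_t acc16 = vdupq_n_s16(0);
        // Short int16 accumulation: bounded by 8 * 256 per lane.
        for (size_t s = 0; s < kStepsPerWiden && i + 8 <= n; ++s, i += 8) {
            acc16 = vmlal_s8(acc16, vld1_s8(w.data() + i), vld1_s8(x.data() + i));
        }
        // Pairwise add the eight int16 lanes into the four int32 lanes.
        acc32 = vpadalq_s16(acc32, acc16);
    }
    total = vaddvq_s32(acc32);  // horizontal sum of the int32 lanes
#endif
    // Scalar tail (and the whole computation on non-NEON builds).
    for (; i < n; ++i) {
        total += static_cast<int32_t>(w[i]) * x[i];
    }
    return total;
}
```

For an inner dimension of 6912 with every weight at 2 and every activation at 127, the expected result is 6912 × 254 = 1755648, far outside the int16 range but comfortably inside int32. The extra vpadalq_s16 per 8 steps is the cost of the fix; whether that matters on Cortex-A53/A73 would need measuring.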

Additional Notes:-
Output is deterministic (same garbage every run)
Single-thread (-t 1) produces identical garbage
Both QK_I2_S=64 and QK_I2_S=128 produce garbage
The model metadata loads correctly (210 i2_s tensors, 121 f32, 1 f16)
