Description
Environment
Hardware: Amlogic A311D SoC
CPU: 4× Cortex-A73 + 2× Cortex-A53 (ARMv8.0, NEON only, no dotprod)
RAM: 4GB
OS: Ubuntu 20.04 aarch64
Compiler: Clang 18.1.8
BitNet commit: 1f86f058
Model: BitNet-b1.58-2B-4T (official GGUF from microsoft/BitNet-b1.58-2B-4T-gguf)
Problem
The model loads correctly but produces garbage output on ARMv8.0 CPUs that lack the dotprod extension. The #else (non-dotprod) NEON path in src/ggml-bitnet-mad.cpp (ggml_vec_dot_i2_i8_s_1x1 using vmlal_s8) appears to compute incorrect results.
Steps to Reproduce
```
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
```

setup_env.py fails with NotImplementedError on this hardware (see issue 1 below), so the build was done manually:

```
cmake -B build \
  -DCMAKE_BUILD_TYPE=Release \
  -DCMAKE_C_COMPILER=clang \
  -DCMAKE_CXX_COMPILER=clang++ \
  -DCMAKE_C_FLAGS="-mcpu=cortex-a53 -march=armv8-a+crc" \
  -DCMAKE_CXX_FLAGS="-mcpu=cortex-a53 -march=armv8-a+crc" \
  -DGGML_NATIVE=OFF \
  -DBITNET_ARM_TL1=OFF
cmake --build build --config Release -j4
```

```
./build/bin/llama-cli \
  -m models/BitNet-b1.58-2B-4T/ggml-model-i2_s.gguf \
  -p "Hello, tell me about yourself." \
  -n 128 -t 4
```
Expected Output
Coherent English text.
Actual Output

```
Dank spraw/sl-ingimum avReactSM may ad bUISapp.ed,reuschPIrcMarketcryptampo
reco initially CoronaMISS speculate441 Geme Eduardo untilidor.wordpresstor
narrowln halband fast standuce Best Diveamma occasionally Mines choppeddates
raspbreaking-NEL inst de(def caves ¬slaBU platformsapiro uncert stumpilla...
```
Issues Found During Investigation
1. setup_env.py fails with NotImplementedError in gen_code() — the TL1 kernel generator doesn't support the 2B model dimensions (2560/6912), only the 3B dimensions (3200/8640).
2. BITNET_ARM_TL1=ON has no effect for the 2B model because ggml_preprocessor() and ggml_qgemm_lut() in the generated bitnet-lut-kernels.h only handle dimensions 3200/8640/3200, not 2560/6912.
3. Hardcoded dotprod inline assembly in ggml-aarch64.c (168 .inst-encoded sdot instructions in the Q4_0 kernels) — these cause SIGILL if that code path is reached on ARMv8.0. Currently dead code for i2_s, but risky.
4. The official GGUF (microsoft/BitNet-b1.58-2B-4T-gguf) uses general.architecture = bitnet-b1.58, but the fork's llama.cpp expects bitnet, causing tokenizer warnings: GENERATION QUALITY WILL BE DEGRADED.
5. The non-dotprod NEON fallback in ggml_vec_dot_i2_i8_s_1x1 (lines ~340-400 of ggml-bitnet-mad.cpp) uses vmlal_s8 accumulation into int16x8_t, which can overflow for the large inner dimensions here (2560/6912): int16 wraps past ±32,767, while each accumulation adds products of magnitude up to 127×2 = 254.
Suspected Root Cause
The vmlal_s8 accumulation in the non-dotprod path uses int16x8_t accu32 which accumulates 32 iterations × 8 multiply-adds per iteration = 256 signed int8 products before widening to int32. Each product can be up to 127×2 = 254. Sum of 256 such values = up to 65,024, which overflows int16 (max 32,767). The dotprod path (vdotq_s32) accumulates directly into int32 and doesn't have this problem.
Suggested Fix
In the non-dotprod path, reduce the inner loop count before widening to int32, or accumulate directly into int32:
```
// Instead of accumulating 32 iterations in int16,
// widen to int32 every 8 iterations.
```
Additional Notes
Output is deterministic (same garbage every run)
Single-thread (-t 1) produces identical garbage
Both QK_I2_S=64 and QK_I2_S=128 produce garbage
The model metadata loads correctly (210 i2_s tensors, 121 f32, 1 f16)