{2025.06}[2024a] Add OSU with CUDA #1401

Open
casparvl wants to merge 3 commits into EESSI:main from casparvl:osu_cuda

@casparvl
Collaborator

No description provided.

@casparvl
Collaborator Author

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80
bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/amd/zen4,accel=nvidia/cc90
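
For readers unfamiliar with the command syntax: each `bot: build` line carries space-separated `key:value` fields, and the `for:` field holds comma-separated `key=value` pairs. The sketch below only illustrates that layout. The function name `parse_bot_build` is hypothetical and this is not the bot's own parser.

```python
def parse_bot_build(line: str) -> dict:
    """Split a 'bot: build ...' comment line into its fields (illustrative only)."""
    prefix = "bot: build"
    if not line.startswith(prefix):
        raise ValueError("not a 'bot: build' command")

    fields = {}
    for token in line[len(prefix):].split():
        # Split on the first ':' only, so values may contain ':' themselves.
        key, _, value = token.partition(":")
        fields[key] = value

    # The 'for' field is a comma-separated list of key=value pairs (arch, accel).
    target = fields.pop("for", "")
    for item in target.split(","):
        if item:
            key, _, value = item.partition("=")
            fields[key] = value
    return fields


cmd = ("bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf "
       "for:arch=x86_64/intel/icelake,accel=nvidia/cc80")
parsed = parse_bot_build(cmd)
print(parsed["instance"], parsed["arch"], parsed["accel"])
```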

@eessi-bot-surf

eessi-bot-surf bot commented Feb 19, 2026

New job on instance eessi-bot-surf for repository eessi.io-2025.06-software
Building on: intel-icelake and accelerator nvidia/cc80
Building for: x86_64/intel/icelake and accelerator nvidia/cc80
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2026.02/pr_1401/19727277

date | job status | comment
Feb 19 22:27:34 UTC 2026 | submitted | job id 19727277 will be eligible to start in about 20 seconds
Feb 19 22:27:45 UTC 2026 | received | job awaits launch by Slurm scheduler
Feb 19 22:28:03 UTC 2026 | running | job 19727277 is running
Feb 19 22:34:50 UTC 2026 | finished |
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-19727277.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-intel-icelake-accel-nvidia-cc80-17715404340.tar.zst
size: 0 MiB (831648 bytes)
entries: 52
modules under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0.lua
software under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/software
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0
reprod directories under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/reprod
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0/20260219_223130UTC
other under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80
no other files in tarball
Feb 19 22:34:50 UTC 2026 | test result |
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19727277.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case
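
A note on the last check: the pattern is shown as `[\s*FAILED\s*].*Ran .* test case`. Read literally, the square brackets form a character class, and that pattern would also match the `[ PASSED ]` summary line, which contradicts the green check mark. A plausible reading (an assumption, not confirmed anywhere in this thread) is that the real pattern escapes the brackets and the backslashes were lost when the comment was rendered. The bot presumably uses grep rather than Python, so the sketch below only illustrates the difference between the two readings:

```python
import re

summary = ("[ PASSED ] Ran 0/0 test case(s) from 0 check(s) "
           "(0 failure(s), 0 skipped, 0 aborted)")

# Pattern exactly as displayed in the bot comment:
# the brackets act as a character class, so one matching character is enough.
as_displayed = re.compile(r"[\s*FAILED\s*].*Ran .* test case")

# Assumed intended pattern: literal brackets around FAILED.
assumed_intended = re.compile(r"\[\s*FAILED\s*\].*Ran .* test case")

displayed_hit = as_displayed.search(summary) is not None
intended_hit = assumed_intended.search(summary) is not None
print(displayed_hit, intended_hit)
```

With the escaped form, only a `[ FAILED ]` summary line triggers the check, which is consistent with the result reported here.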

@eessi-bot-surf

eessi-bot-surf bot commented Feb 19, 2026

New job on instance eessi-bot-surf for repository eessi.io-2025.06-software
Building on: amd-zen4 and accelerator nvidia/cc90
Building for: x86_64/amd/zen4 and accelerator nvidia/cc90
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2026.02/pr_1401/19727411

date | job status | comment
Feb 19 22:27:40 UTC 2026 | submitted | job id 19727411 will be eligible to start in about 20 seconds
Feb 19 22:27:49 UTC 2026 | received | job awaits launch by Slurm scheduler
Feb 19 22:28:18 UTC 2026 | running | job 19727411 is running
Feb 19 22:34:26 UTC 2026 | finished |
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-19727411.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-amd-zen4-accel-nvidia-cc90-17715404080.tar.zst
size: 0 MiB (832048 bytes)
entries: 52
modules under 2025.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/modules/all
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0.lua
software under 2025.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/software
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0
reprod directories under 2025.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90/reprod
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0/20260219_223108UTC
other under 2025.06/software/linux/x86_64/amd/zen4/accel/nvidia/cc90
no other files in tarball
Feb 19 22:34:26 UTC 2026 | test result |
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19727411.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator

bedroge commented Feb 20, 2026

ERROR: /cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia is a symlink pointing to /cvmfs/software.eessi.io/defaults/nvidia, which is a symlink pointing to /dev/null

Looks like the variant symlinks need to be configured for the SURF bot?
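
The error above describes a two-hop symlink chain that ends at `/dev/null`. A minimal sketch for checking such a chain by hand on a build node follows. The helper names `symlink_chain` and `ends_in_dev_null` are made up for illustration and are not part of any EESSI script.

```python
import os


def symlink_chain(path: str, max_hops: int = 16) -> list:
    """Follow symlinks one hop at a time and return every path visited."""
    chain = [path]
    while os.path.islink(path) and len(chain) <= max_hops:
        target = os.readlink(path)
        # Relative targets are resolved against the directory holding the link.
        path = os.path.normpath(os.path.join(os.path.dirname(path), target))
        chain.append(path)
    return chain


def ends_in_dev_null(path: str) -> bool:
    """True if the final hop of the chain is the null device."""
    return symlink_chain(path)[-1] == os.devnull
```

For example, `ends_in_dev_null("/cvmfs/software.eessi.io/versions/2025.06/compat/linux/x86_64/lib/nvidia")` would be expected to return `True` on a host where the variant symlink has not been configured, going by the error message quoted above.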

@casparvl
Collaborator Author

ah, yeah, same issue you had on the jsc bot...

@casparvl casparvl added accel:nvidia 2025.06-software.eessi.io 2025.06 version of software.eessi.io labels Feb 21, 2026
      options:
        accept-eula-for: cuDNN
        cuda-sanity-check-accept-missing-ptx: True
  - OSU-Micro-Benchmarks-7.5-gompi-2024a-CUDA-12.6.0.eb:
Collaborator Author

Note to self: this should go in a -2024a easystack, not a -system one.

@casparvl casparvl changed the title Add OSU with CUDA {2025.06}[2025a] Add OSU with CUDA Feb 21, 2026
@casparvl casparvl changed the title {2025.06}[2025a] Add OSU with CUDA {2025.06}[2024a] Add OSU with CUDA Feb 21, 2026
@bedroge
Collaborator

bedroge commented Feb 23, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace,accel=nvidia/cc90

@eessi-bot-jsc

eessi-bot-jsc bot commented Feb 23, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace and accelerator nvidia/cc90
Building for: aarch64/nvidia/grace and accelerator nvidia/cc90
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.02/pr_1401/14511605

date | job status | comment
Feb 23 09:30:12 UTC 2026 | submitted | job id 14511605 awaits release by job manager
Feb 23 09:30:52 UTC 2026 | released | job awaits launch by Slurm scheduler
Feb 23 09:31:56 UTC 2026 | running | job 14511605 is running
Feb 23 09:42:18 UTC 2026 | finished |
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-14511605.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-accel-nvidia-cc90-17718395410.tar.gz
size: 0 MiB (902092 bytes)
entries: 52
modules under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/modules/all
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0.lua
software under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/software
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/reprod
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0/20260223_093638UTC
other under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90
no other files in tarball
Feb 23 09:42:18 UTC 2026 | test result |
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-nvidia-gh200+default
P: latency: 2.45 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-nvidia-gh200+default
P: latency: 6.12 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-nvidia-gh200+default
P: latency: 0.25 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-nvidia-gh200+default
P: bandwidth: 19505.95 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-14511605.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@bedroge
Collaborator

bedroge commented Feb 23, 2026

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace,accel=nvidia/cc90

@eessi-bot-jsc

eessi-bot-jsc bot commented Feb 23, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace and accelerator nvidia/cc90
Building for: aarch64/nvidia/grace and accelerator nvidia/cc90
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.02/pr_1401/14511666

date | job status | comment
Feb 23 09:41:32 UTC 2026 | submitted | job id 14511666 awaits release by job manager
Feb 23 09:42:16 UTC 2026 | released | job awaits launch by Slurm scheduler
Feb 23 09:43:23 UTC 2026 | running | job 14511666 is running
Feb 23 09:53:41 UTC 2026 | finished |
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-14511666.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-aarch64-nvidia-grace-accel-nvidia-cc90-17718401980.tar.gz
size: 0 MiB (902350 bytes)
entries: 52
modules under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/modules/all
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0.lua
software under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/software
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0
reprod directories under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90/reprod
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0/20260223_094734UTC
other under 2025.06/software/linux/aarch64/nvidia/grace/accel/nvidia/cc90
no other files in tarball
Feb 23 09:53:41 UTC 2026 | test result |
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ OK ] (1/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_allreduce %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /e4bf9965 @BotBuildTests:aarch64-nvidia-gh200+default
P: latency: 2.48 us (r:0, l:None, u:None)
[ OK ] (2/4) EESSI_OSU_coll %benchmark_info=mpi.collective.osu_alltoall %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node %device_type=cpu /3da4890b @BotBuildTests:aarch64-nvidia-gh200+default
P: latency: 6.1 us (r:0, l:None, u:None)
[ OK ] (3/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_latency %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /3255009a @BotBuildTests:aarch64-nvidia-gh200+default
P: latency: 0.25 us (r:0, l:None, u:None)
[ OK ] (4/4) EESSI_OSU_pt2pt_CPU %benchmark_info=mpi.pt2pt.osu_bw %module_name=OSU-Micro-Benchmarks/7.5-gompi-2025a %scale=1_node /59f4b331 @BotBuildTests:aarch64-nvidia-gh200+default
P: bandwidth: 18982.66 MB/s (r:0, l:None, u:None)
[ PASSED ] Ran 4/4 test case(s) from 4 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-14511666.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

Checking if the SURF bot config has been updated correctly so that symlinks now work...

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80

@eessi-bot-surf

eessi-bot-surf bot commented Feb 23, 2026

New job on instance eessi-bot-surf for repository eessi.io-2025.06-software
Building on: intel-icelake and accelerator nvidia/cc80
Building for: x86_64/intel/icelake and accelerator nvidia/cc80
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2026.02/pr_1401/19929639

date | job status | comment
Feb 23 14:05:30 UTC 2026 | submitted | job id 19929639 will be eligible to start in about 20 seconds
Feb 23 14:05:45 UTC 2026 | received | job awaits launch by Slurm scheduler
Feb 23 14:06:29 UTC 2026 | running | job 19929639 is running
Feb 23 14:06:50 UTC 2026 | finished |
🤷 UNKNOWN (click triangle for detailed information)
  • Did not find bot/check-result.sh script in job's work directory.
  • Check job manually or ask an admin of the bot instance to assist you.
Feb 23 14:06:50 UTC 2026 | test result |
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job19929639.test does not exist in job directory, or parsing it failed.

@casparvl
Collaborator Author

bot/build.sh script found in '/gpfs/work1/1/eessibot/eessi-bot-surf/jobs/2026.02/pr_1401/event_b3fb6b80-10c0-11f1-8303-ffd3c03cd7a0/run_000/x86_64/intel/icelake/nvidia/cc80/eessi.io-2025.06-software', so running it!
Cloning into 'software-layer-scripts'...
ln: failed to create symbolic link './licenses': File exists
bot/build.sh finished

Eehhh, I hope this is not due to our own bot config...? Did something change on the software-layer-scripts side? Is this @hvelab's new licenses PR at work or something? :)
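
For context on the `ln` failure in the log above: creating a symlink fails with "File exists" whenever the link path is already present, for instance when the same setup step runs twice in one work directory. Whether that is what happened here is not established in this thread. The sketch below shows one idempotent way to create such a link. `ensure_symlink` is a hypothetical helper, not code from software-layer-scripts.

```python
import os


def ensure_symlink(target: str, link_path: str) -> bool:
    """Create link_path -> target unless it already exists and is correct.

    Returns True if a link was created or replaced, False if nothing changed.
    """
    if os.path.islink(link_path):
        if os.readlink(link_path) == target:
            return False  # already points at the right place
        os.unlink(link_path)  # stale link: replace it
    elif os.path.exists(link_path):
        # A real file or directory is in the way; refuse to clobber it.
        raise FileExistsError(f"{link_path} exists and is not a symlink")
    os.symlink(target, link_path)
    return True
```

Calling `ensure_symlink` a second time with the same arguments is a no-op, whereas a plain `ln -s` fails as in the log.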

@casparvl
Collaborator Author

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-jsc for:arch=aarch64/nvidia/grace,accel=nvidia/cc90

@eessi-bot-jsc

eessi-bot-jsc bot commented Feb 23, 2026

New job on instance eessi-bot-jsc for repository eessi.io-2025.06-software
Building on: nvidia-grace and accelerator nvidia/cc90
Building for: aarch64/nvidia/grace and accelerator nvidia/cc90
Job dir: /p/project1/ceasybuilders/eessibot/jobs/2026.02/pr_1401/14512537

date | job status | comment
Feb 23 14:09:38 UTC 2026 | submitted | job id 14512537 awaits release by job manager
Feb 23 14:10:24 UTC 2026 | released | job awaits launch by Slurm scheduler
Feb 23 14:11:26 UTC 2026 | finished |
🤷 UNKNOWN (click triangle for detailed information)
  • Did not find bot/check-result.sh script in job's work directory.
  • Check job manually or ask an admin of the bot instance to assist you.
Feb 23 14:11:26 UTC 2026 | test result |
🤷 UNKNOWN (click triangle for detailed information)
  • Job test file _bot_job14512537.test does not exist in job directory, or parsing it failed.

@casparvl
Collaborator Author

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80

@eessi-bot-surf

eessi-bot-surf bot commented Feb 23, 2026

New job on instance eessi-bot-surf for repository eessi.io-2025.06-software
Building on: intel-icelake and accelerator nvidia/cc80
Building for: x86_64/intel/icelake and accelerator nvidia/cc80
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2026.02/pr_1401/19941787

date | job status | comment
Feb 23 15:26:37 UTC 2026 | submitted | job id 19941787 will be eligible to start in about 20 seconds
Feb 23 15:26:46 UTC 2026 | received | job awaits launch by Slurm scheduler
Feb 23 15:27:22 UTC 2026 | running | job 19941787 is running
Feb 23 15:44:48 UTC 2026 | finished |
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-19941787.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-intel-icelake-accel-nvidia-cc80-17718613910.tar.zst
size: 0 MiB (831725 bytes)
entries: 52
modules under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0.lua
software under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/software
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0
reprod directories under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/reprod
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0/20260223_154042UTC
other under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80
no other files in tarball
Feb 23 15:44:48 UTC 2026 | test result |
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19941787.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

…can pass NCCL an explicit option to ignore missing PTX
@casparvl
Collaborator Author

bot: build repo:eessi.io-2025.06-software instance:eessi-bot-surf for:arch=x86_64/intel/icelake,accel=nvidia/cc80

@eessi-bot-surf

eessi-bot-surf bot commented Feb 23, 2026

New job on instance eessi-bot-surf for repository eessi.io-2025.06-software
Building on: intel-icelake and accelerator nvidia/cc80
Building for: x86_64/intel/icelake and accelerator nvidia/cc80
Job dir: /projects/eessibot/eessi-bot-surf/jobs/2026.02/pr_1401/19943552

date | job status | comment
Feb 23 15:48:04 UTC 2026 | submitted | job id 19943552 will be eligible to start in about 20 seconds
Feb 23 15:48:14 UTC 2026 | received | job awaits launch by Slurm scheduler
Feb 23 15:48:38 UTC 2026 | running | job 19943552 is running
Feb 23 16:05:59 UTC 2026 | finished |
😢 FAILURE (click triangle for details)
Details
✅ job output file slurm-19943552.out
✅ no message matching FATAL:
❌ found message matching ERROR:
❌ found message matching FAILED:
❌ found message matching required modules missing:
❌ no message matching No missing installations
✅ found message matching .tar.* created!
Artefacts
eessi-2025.06-software-linux-x86_64-intel-icelake-accel-nvidia-cc80-17718626360.tar.zst
size: 43 MiB (45823905 bytes)
entries: 93
modules under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/modules/all
NCCL/2.22.3-GCCcore-13.3.0-CUDA-12.6.0.lua
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0.lua
software under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/software
NCCL/2.22.3-GCCcore-13.3.0-CUDA-12.6.0
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0
reprod directories under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80/reprod
NCCL/2.22.3-GCCcore-13.3.0-CUDA-12.6.0/20260223_160205UTC
UCX-CUDA/1.16.0-GCCcore-13.3.0-CUDA-12.6.0/20260223_155924UTC
other under 2025.06/software/linux/x86_64/intel/icelake/accel/nvidia/cc80
no other files in tarball
Feb 23 16:05:59 UTC 2026 | test result |
😁 SUCCESS (click triangle for details)
ReFrame Summary
[ PASSED ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)
Details
✅ job output file slurm-19943552.out
❌ found message matching ERROR:
✅ no message matching [\s*FAILED\s*].*Ran .* test case

@casparvl
Collaborator Author

MPI perftest: disabled
checking for ucp/api/ucp.h... no
checking for ucs/sys/uid.h... no
configure: WARNING: UCX not found
UCX support: no
configure: error: UCX is not available

This happens during the UCC-CUDA configure step. Strange, because UCX is listed as a dependency in the UCC-CUDA easyconfig.

I'll try interactively...
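
One quick thing to check in such an interactive session is whether the header that configure looks for is visible on the include search path at all. The sketch below is a rough approximation: a real configure run compiles a test program rather than checking for the file, and treating `CPATH` as the relevant variable is an assumption. `find_header` is a hypothetical helper.

```python
import os


def find_header(header: str, search_path: str):
    """Return the first directory entry in search_path that contains header, else None."""
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a trailing ':'
        candidate = os.path.join(directory, header)
        if os.path.isfile(candidate):
            return candidate
    return None


# Assumption: the build environment exposes include directories via CPATH.
print(find_header("ucp/api/ucp.h", os.environ.get("CPATH", "")))
```

If this prints `None` with the UCC-CUDA build environment loaded, the UCX include directory is probably not reaching the compiler, which would match the configure output above.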

@ocaisa
Member

ocaisa commented Feb 23, 2026

I suspect easybuilders/easybuild-framework#5124

@casparvl
Collaborator Author

I haven't checked, but this could very well be the issue. And your upstream issue is surprisingly recent. Isn't it strange we haven't noticed this before? I mean we changed to this config a while ago, no?

Labels

2025.06-software.eessi.io 2025.06 version of software.eessi.io accel:nvidia

3 participants