CRAMboozle is a tool and Snakemake workflow for de-identifying sequencing data stored in BAM or CRAM format, protecting the genetic privacy of donor individuals while preserving essential alignment data for analysis.
CRAMboozle is a derivative of BAMboozle by Christoph Ziegenhain and Rickard Sandberg (Karolinska Institutet). The original tool is described in:
Ziegenhain, C., Sandberg, R. BAMboozle removes genetic variation from human sequence data for open data sharing. Nat Commun 12, 6216 (2021). https://doi.org/10.1038/s41467-021-26152-8
CRAMboozle extends the original BAMboozle (v0.5.0) with the following changes:
| Feature | BAMboozle | CRAMboozle |
|---|---|---|
| Input formats | BAM only | BAM or CRAM (auto-detected) |
| Output format | BAM | CRAM v3.1 by default (configurable) |
| Compression | Default BAM | CRAM v3.1 with level 0–9 and advanced codecs (LZMA, BZIP2, FQZ, TOK, ARITH) |
| CPU allocation | Manual `--p` flag | Auto-detects available CPUs (still overridable) |
| Batch processing | Single file at a time | Snakemake workflow for parallel multi-sample processing on SLURM clusters |
| Reference validation | None | Validates CRAM–reference compatibility before processing |
The core de-identification logic (sequence replacement, tag sanitization, splice handling) is preserved from the original BAMboozle.
CRAMboozle replaces observed read sequences with reference genome sequence and sanitizes auxiliary tags, removing donor-specific genetic variation. For full details on the de-identification strategy (SNP replacement, insertion/deletion handling, clipping, splicing, tag sanitization, strict mode), see the BAMboozle paper and original README.
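As a rough illustration of the core idea (reference bases replace whatever the read contained), the sketch below rebuilds a read's sequence from a reference string and a CIGAR string. It is not CRAMboozle's actual code: the function name `reference_sequence_for_read` is hypothetical, the reference is a plain Python string rather than an indexed FASTA, and only match (`M`, `=`, `X`) and splice (`N`) operations are handled. Indels and clipping need the extra care described in the BAMboozle paper and are deliberately left out.

```python
import re

# One CIGAR element: a length followed by an operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def reference_sequence_for_read(reference: str, start: int, cigar: str) -> str:
    """Return the reference bases covered by an alignment.

    `start` is the 0-based reference position of the first aligned base.
    Only match (M, =, X) and splice (N) operations are handled in this sketch.
    """
    cursor = start
    pieces = []
    for length_text, op in CIGAR_RE.findall(cigar):
        length = int(length_text)
        if op in "M=X":
            # Aligned bases: copy the reference, discarding what the read had.
            pieces.append(reference[cursor:cursor + length])
            cursor += length
        elif op == "N":
            # Spliced region: advance past the intron without emitting bases.
            cursor += length
        else:
            raise ValueError(f"CIGAR operation {op!r} is not covered by this sketch")
    return "".join(pieces)


# Toy data: the observed read carries a donor SNP relative to the reference.
reference = "ACGTACGTACGTACGT"
observed = "GTTC"
sanitized = reference_sequence_for_read(reference, start=2, cigar="4M")
print(observed, "->", sanitized)
```

After this step the sequence no longer distinguishes the donor from the reference at the aligned positions, while the alignment coordinates stay the same.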
- Python 3.10+
- pysam 0.21.0+ (for CRAM v3.1 support)
- Snakemake (for batch workflow; optional for single-file use)
- Reference genome FASTA file, indexed with `samtools faidx`
python CRAMboozle.py \
--input sample.bam \
--out sample_deidentified.cram \
--fa /path/to/reference.fasta

Key options:
| Flag | Description |
|---|---|
| `--input` | Input BAM or CRAM file |
| `--out` | Output file path (CRAM by default) |
| `--fa` | Reference genome FASTA (indexed) |
| `--p N` | Number of processes (default: all available CPUs) |
| `--strict` | Also sanitize mapping quality & extra auxiliary tags |
| `--keepunmapped` | Keep unmapped reads in output |
| `--keepsecondary` | Keep secondary alignments in output |
| `--force-bam` | Force BAM output instead of CRAM |
| `--compression-level` | CRAM compression level 0–9 (default: 9) |
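When calling the tool from another script, it can help to assemble the command line from the flags listed above. The following is a minimal sketch; the helper name `build_cramboozle_command` and its keyword arguments are hypothetical, and it assumes `--compression-level` takes its value as the next argument. Only the flag names themselves come from the table.

```python
import shlex
import sys


def build_cramboozle_command(
    input_path,
    out_path,
    fasta_path,
    *,
    processes=None,
    strict=False,
    keep_unmapped=False,
    keep_secondary=False,
    force_bam=False,
    compression_level=None,
    script="CRAMboozle.py",
):
    """Return an argv list for a single-file CRAMboozle run."""
    cmd = [sys.executable, script,
           "--input", input_path, "--out", out_path, "--fa", fasta_path]
    if processes is not None:
        cmd += ["--p", str(processes)]
    if strict:
        cmd.append("--strict")
    if keep_unmapped:
        cmd.append("--keepunmapped")
    if keep_secondary:
        cmd.append("--keepsecondary")
    if force_bam:
        cmd.append("--force-bam")
    if compression_level is not None:
        # The table documents levels 0 through 9.
        if not 0 <= compression_level <= 9:
            raise ValueError("compression level must be between 0 and 9")
        cmd += ["--compression-level", str(compression_level)]
    return cmd


cmd = build_cramboozle_command(
    "sample.bam", "sample_deidentified.cram", "/path/to/reference.fasta",
    strict=True,
)
# Print the command for inspection instead of running it.
print(shlex.join(cmd))
```

To execute the result, pass the list to `subprocess.run(cmd, check=True)` from the directory containing `CRAMboozle.py`.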
The included Snakemake workflow processes many samples in parallel on a SLURM cluster.
Edit config/samples.yaml:
samples:
  sample1: /path/to/sample1.bam
  sample2: /path/to/sample2.cram
  sample3: /path/to/sample3.bam

Tip: See `config/generate_samples.txt` for a one-liner to auto-generate this file from a directory of CRAMs.
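For larger cohorts the manifest can also be generated from Python. The sketch below is an alternative to the one-liner in `config/generate_samples.txt`, not a copy of it: the function name `write_samples_manifest` is hypothetical, sample names are taken from the file names without their extension, and they are assumed to be simple identifiers that need no YAML quoting. Paths are written as double-quoted strings so that spaces and similar characters stay valid YAML.

```python
import json
from pathlib import Path


def write_samples_manifest(data_dir, manifest_path):
    """Write a samples.yaml listing every BAM/CRAM file found in data_dir.

    Returns the list of sample names written, in sorted order.
    """
    files = sorted(
        p for p in Path(data_dir).iterdir()
        if p.is_file() and p.suffix in {".bam", ".cram"}
    )
    names = [p.stem for p in files]
    if len(set(names)) != len(names):
        # e.g. sample1.bam and sample1.cram would collide on one key.
        raise ValueError("two input files map to the same sample name")

    lines = ["samples:"]
    for name, path in zip(names, files):
        # json.dumps yields a double-quoted string, which YAML also accepts.
        lines.append(f"  {name}: {json.dumps(str(path.resolve()))}")
    Path(manifest_path).write_text("\n".join(lines) + "\n")
    return names
```

A call such as `write_samples_manifest("/path/to/alignments", "config/samples.yaml")` overwrites the manifest, so review the result before starting the workflow.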
Edit config/config.yaml to set:
- `reference_genome`: path to your indexed FASTA reference
- `results_dir`: output directory (default: `results/`)
- `strict_mode`, `keep_unmapped`, `keep_secondary`: processing options
Edit config/cluster_slurm.yaml to match your cluster's partition names, memory, and CPU limits.
# Dry run first (recommended)
snakemake -s CRAMboozle.snakefile \
--cluster-config config/cluster_slurm.yaml \
--cluster "sbatch -p {cluster.partition} --mem={cluster.mem} -t {cluster.time} -c {cluster.ncpus} -n {cluster.ntasks} -o {cluster.output} -J {cluster.JobName}" \
-j 40 --latency-wait 60 --keep-going -np
# Remove -np to execute

For local execution (small datasets, no cluster):

snakemake -s CRAMboozle.snakefile -j 4

For each sample, the workflow produces:
- `{sample}_deidentified.cram`: de-identified CRAM file
- `{sample}_deidentified.cram.crai`: CRAM index
- `logs/{sample}_cramboozle.log`: processing log
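After a batch run it can be convenient to confirm that every sample produced all three files. The following sketch assumes that the outputs, including the `logs/` folder, sit directly under the results directory; that layout is an assumption and may differ from what the workflow actually writes. The function name `missing_outputs` is hypothetical.

```python
from pathlib import Path


def missing_outputs(samples, results_dir="results"):
    """Map each sample name to the expected output files that do not exist.

    Samples with a complete set of outputs are omitted from the result.
    """
    results = Path(results_dir)
    missing = {}
    for sample in samples:
        expected = [
            results / f"{sample}_deidentified.cram",
            results / f"{sample}_deidentified.cram.crai",
            # Assumed location: logs/ inside the results directory.
            results / "logs" / f"{sample}_cramboozle.log",
        ]
        absent = [str(path) for path in expected if not path.exists()]
        if absent:
            missing[sample] = absent
    return missing
```

An empty dictionary from `missing_outputs(["sample1", "sample2"])` means every expected file is present. It says nothing about whether the contents are valid.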
CRAMboozle/
├── CRAMboozle.py            # Core de-identification script
├── CRAMboozle.snakefile     # Snakemake workflow
├── config/
│   ├── config.yaml          # Pipeline settings
│   ├── samples.yaml         # Sample manifest
│   └── cluster_slurm.yaml   # SLURM cluster resources
└── results/                 # Output directory
- Reference mismatch: CRAMboozle validates the reference before processing CRAM files. If validation fails, check your CRAM headers (`samtools view -H file.cram | grep '^@SQ'`) and ensure the correct reference build is specified.
- Missing index: The reference FASTA must be indexed (`samtools faidx reference.fasta`). Input files should be coordinate-sorted and indexed, though CRAMboozle will attempt to sort and index them if not.
- Memory issues: Increase `mem` in `config/cluster_slurm.yaml`.
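To narrow down a reference mismatch, the `@SQ` lines from the header can be compared with the `.fai` index of the FASTA. The sketch below does this on plain text, so it works on the output of `samtools view -H` and the contents of `reference.fasta.fai`. It is not CRAMboozle's own validation routine: the function names are hypothetical, and only contig names and lengths are compared, not checksums.

```python
def parse_sq_lines(header_text):
    """Return {contig name: length} from the @SQ lines of a SAM-style header."""
    contigs = {}
    for line in header_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Fields after the record type look like TAG:value.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def parse_fai(fai_text):
    """Return {contig name: length} from .fai text (name and length come first)."""
    contigs = {}
    for line in fai_text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            contigs[name] = int(length)
    return contigs


def reference_mismatches(header_contigs, fasta_contigs):
    """List contigs from the header that are absent or differ in length."""
    problems = []
    for name, length in header_contigs.items():
        if name not in fasta_contigs:
            problems.append(f"{name}: missing from reference")
        elif fasta_contigs[name] != length:
            problems.append(
                f"{name}: header length {length} != reference length {fasta_contigs[name]}"
            )
    return problems


# Toy inputs for illustration only.
header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chrM\tLN:200\n"
fai = "chr1\t1000\t6\t60\t61\nchrM\t210\t1030\t60\t61\n"
for problem in reference_mismatches(parse_sq_lines(header), parse_fai(fai)):
    print(problem)
```

A length difference on a single contig, as with `chrM` in the toy data, usually points to a different reference build or a different mitochondrial sequence rather than a corrupted file.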
This project is licensed under the GNU General Public License v3.0 — the same license as the original BAMboozle. See LICENSE for details.
CRAMboozle is derived from BAMboozle by Christoph Ziegenhain and Rickard Sandberg, originally released under GPL-3.0. If you use CRAMboozle, please cite the original BAMboozle paper:
Ziegenhain, C., Sandberg, R. BAMboozle removes genetic variation from human sequence data for open data sharing. Nat Commun 12, 6216 (2021). https://doi.org/10.1038/s41467-021-26152-8
CRAMboozle modifications by Robert Patton (Ha Lab, Fred Hutch Cancer Center).