# Char-level LSTM Memorizer PoC

**Original article:** [LSTM or Transformer as "malware packer"](https://bednarskiwsieci.pl/en/blog/lstm-or-transformer-as-malware-packer/)
A simple proof-of-concept demonstrating how to **embed any text file** (e.g., source code) into the weights of a character-level LSTM neural network model and then **accurately reconstruct** its contents during inference.
## Features
- Builds a character-level vocabulary plus a special Beginning-of-Sequence (BOS) token (see the sketch below).
- Trains an LSTM on a single file until overfitting (memorization).
- Generates the entire character sequence starting from the BOS token.
- Compares SHA-256 checksums of the original and generated files.
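For orientation, this is roughly what the vocabulary step looks like. A minimal sketch with illustrative names (`BOS`, `build_vocab`), not necessarily the identifiers used in `main.py`:
```python
# Minimal sketch of the character vocabulary with a BOS token.
# Names are illustrative, not the project's exact API.
BOS = "<BOS>"

def build_vocab(text: str) -> tuple[dict[str, int], dict[int, str]]:
    chars = [BOS] + sorted(set(text))  # BOS token gets index 0
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

text = open("examples/bubble_sort.py", encoding="utf-8").read()
stoi, itos = build_vocab(text)

# The training sequence is BOS followed by every character of the file;
# the model learns to predict each next character, so generation can start from BOS alone.
ids = [stoi[BOS]] + [stoi[ch] for ch in text]
```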
## Requirements
- Python 3.12+
- PyTorch 2.7.1+
- NumPy 2.3.1+
- Safetensors 0.5.3+
- Matplotlib 3.10.3+
- Ruff 0.12.2+ (optional)

Use `uv` to manage dependencies. If you don't have `uv` installed, you can install it via pip:
```bash
pip install uv
```
## Installation
1. Clone the repository:
```bash
git clone https://github.com/piotrmaciejbednarski/lstm-memorizer.git
cd lstm-memorizer
```
2. Synchronize using `uv`:
```bash
uv sync
```
## Example
One of the generated examples can be found in the `examples` directory. The model weights in `model.safetensors` were trained on `bubble_sort.py`.
You can run the whole pipeline yourself, first training (encoding) and then generation (decoding), to verify that the reconstructed file is byte-for-byte identical to the original.
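If you want to check the result by hand, verification is just a SHA-256 comparison of the two files. A minimal sketch using Python's standard library (file paths are illustrative, adjust them to where you wrote the reconstructed file):
```python
import hashlib
from pathlib import Path

def sha256_of(path: str) -> str:
    """Hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

# Paths are illustrative; point them at the original and the reconstructed file.
original = sha256_of("examples/bubble_sort.py")
generated = sha256_of("output/bubble_sort_generated.py")
print("identical" if original == generated else "different")
```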
### Training
```bash
uv run main.py train ./examples/bubble_sort.py \
--epochs 4000 \
--hidden 32 \
--layers 2 \
--lr 1e-3 \
--weights ./output/model.safetensors
```
```
Using device: mps
Epoch 1/4000 loss=4.0081
Epoch 500/4000 loss=0.5532
Epoch 1000/4000 loss=0.1054
Epoch 1500/4000 loss=0.0395
Epoch 2000/4000 loss=0.0220
Epoch 2500/4000 loss=0.0120
Epoch 3000/4000 loss=0.0076
Epoch 3500/4000 loss=0.0051
Epoch 4000/4000 loss=0.0060
Model saved to ./output/model.safetensors
```
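Under the hood, training amounts to overfitting a small LSTM on this single character sequence with cross-entropy loss. A stripped-down sketch of such a loop, assuming illustrative names and the hyperparameters from the command above (not the project's exact code):
```python
import torch
import torch.nn as nn

# Build the BOS-prefixed character sequence for one file (illustrative, not main.py's exact code).
text = open("examples/bubble_sort.py", encoding="utf-8").read()
chars = ["<BOS>"] + sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
ids = torch.tensor([0] + [stoi[c] for c in text])
inputs, targets = ids[:-1].unsqueeze(0), ids[1:].unsqueeze(0)  # predict the next character

class CharLSTM(nn.Module):
    def __init__(self, vocab: int, hidden: int = 32, layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, x, state=None):
        out, state = self.lstm(self.embed(x), state)
        return self.head(out), state

model = CharLSTM(len(chars))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(4000):  # deliberately overfit until the file is memorized
    logits, _ = model(inputs)
    loss = loss_fn(logits.view(-1, len(chars)), targets.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
```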
### Generation