psspnn

This code reproduces the protein secondary structure prediction neural network described in Holley, L.H. & Karplus, M. (1989). Protein secondary structure prediction with a neural network. Proc. Natl. Acad. Sci. USA 86, 152–156. doi:10.1073/pnas.86.1.152

Architecture

The network is a feed-forward neural network with sigmoid activations:

Input (17 × 21 = 357 units)
   ↓ sliding window of 17 residues, one-hot encoded
   ↓ 21 bits per position: 20 amino acids + 1 null (terminal padding)
Hidden (2 units, default)
   ↓
Output (2 units)
   (0.9, 0.1) = helix   (0.1, 0.9) = sheet   (0.1, 0.1) = coil
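
As a rough illustration of the diagram above, the forward pass can be sketched in a few lines of numpy. This is a minimal sketch, not the package's `HolleyKarplusNet`: the function names, the weight initialisation range, and the presence of bias terms are assumptions made for the example.

```python
import numpy as np

WINDOW, ALPHABET, HIDDEN, OUTPUTS = 17, 21, 2, 2  # sizes taken from the diagram


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def forward(x, w1, b1, w2, b2):
    """Return (hidden, output) activations for a batch x of shape (n, 357)."""
    hidden = sigmoid(x @ w1 + b1)
    output = sigmoid(hidden @ w2 + b2)
    return hidden, output


# Hypothetical initialisation: small uniform weights, zero biases.
rng = np.random.default_rng(0)
n_in = WINDOW * ALPHABET
w1 = rng.uniform(-0.1, 0.1, size=(n_in, HIDDEN))
b1 = np.zeros(HIDDEN)
w2 = rng.uniform(-0.1, 0.1, size=(HIDDEN, OUTPUTS))
b2 = np.zeros(OUTPUTS)

x = np.zeros((5, n_in))  # five dummy input windows
_, y = forward(x, w1, b1, w2, b2)
print(y.shape)
```

With bias terms included as assumed here, the default 357-2-2 layout has 357×2 + 2 + 2×2 + 2 = 722 trainable parameters.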

Training uses full-batch backpropagation (Rumelhart et al. 1986 doi:10.1038/323533a0) minimising the sum of squared errors. Training stops when the fractional change in loss per cycle drops below 2×10⁻⁴.
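
The training procedure described above can be sketched as follows. This is a hedged illustration rather than the code in `training.py`: the function name `train`, the initialisation, the toy data, and the toy learning rate of 1.0 are all assumptions. Gradients are averaged over the batch, and the loop stops once the fractional change in the summed squared error falls below `tol`.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train(x, t, hidden=2, lr=1.0, tol=2e-4, max_cycles=2000, seed=0):
    """Full-batch backprop on a one-hidden-layer sigmoid network (SSE loss)."""
    rng = np.random.default_rng(seed)
    w1 = rng.uniform(-0.1, 0.1, (x.shape[1], hidden))
    b1 = np.zeros(hidden)
    w2 = rng.uniform(-0.1, 0.1, (hidden, t.shape[1]))
    b2 = np.zeros(t.shape[1])
    n = len(x)
    losses = []
    for _ in range(max_cycles):
        h = sigmoid(x @ w1 + b1)
        y = sigmoid(h @ w2 + b2)
        loss = float(np.sum((y - t) ** 2))
        losses.append(loss)
        # Stop when the fractional change in loss per cycle is below tol.
        if len(losses) > 1 and abs(losses[-2] - loss) / losses[-2] < tol:
            break
        # Backpropagate the SSE gradient through both sigmoid layers.
        d_out = 2.0 * (y - t) * y * (1.0 - y)
        d_hid = (d_out @ w2.T) * h * (1.0 - h)
        # Mean-gradient updates over the whole batch.
        w2 -= lr * (h.T @ d_out) / n
        b2 -= lr * d_out.sum(axis=0) / n
        w1 -= lr * (x.T @ d_hid) / n
        b1 -= lr * d_hid.sum(axis=0) / n
    return (w1, b1, w2, b2), losses


# Toy data (not protein data): "helix" when the first two bits are set.
rng = np.random.default_rng(1)
x = rng.integers(0, 2, size=(40, 8)).astype(float)
is_helix = (x[:, 0] == 1) & (x[:, 1] == 1)
t = np.where(is_helix[:, None], [0.9, 0.1], [0.1, 0.1])
params, losses = train(x, t)
print(f"{len(losses)} cycles, loss {losses[0]:.2f} -> {losses[-1]:.2f}")
```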

The 62 benchmark proteins are those used in Kabsch & Sander (1983) FEBS Lett 155:179, split into 48 training and 14 test proteins as in the original paper.

Post-processing applies minimum-length rules from Kabsch & Sander (1983): helix runs < 4 residues and strand runs < 2 residues are reclassified as coil.
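
The minimum-length rule can be illustrated with a short run-length pass over a predicted state string. This is a sketch of the rule as stated above, not necessarily how `predict.py` implements it; the names `MIN_RUN` and `enforce_min_lengths` are hypothetical.

```python
from itertools import groupby

# Minimum run lengths from the rule above; coil has no minimum.
MIN_RUN = {"H": 4, "E": 2}


def enforce_min_lengths(states: str) -> str:
    """Reclassify helix runs < 4 and strand runs < 2 residues as coil."""
    out = []
    for state, run in groupby(states):
        length = len(list(run))
        if length < MIN_RUN.get(state, 1):
            state = "C"
        out.append(state * length)
    return "".join(out)


print(enforce_min_lengths("CHHHCECEEHHHHC"))  # CCCCCCCEEHHHHC
```

A single pass is enough because short runs are only ever turned into coil, which cannot create a new short helix or strand run.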

Output

With 2 output units you can encode 3 classes using a binary scheme:

Output 1	Output 2	Class
0.9	0.1	Helix (H)
0.1	0.9	Sheet (E)
0.1	0.1	Coil (C)

Offset targets (0.1/0.9 instead of 0/1) keep the sigmoid outputs away from saturation, a standard practice for sigmoid-output networks (Rumelhart et al. 1986). Coil is the "default": it is what you get when neither output unit is strongly activated. Because coil is essentially "neither helix nor sheet", two binary decisions are sufficient to distinguish all three states. A 3-unit one-hot output layer (one neuron per class) would also work but would add the weights and bias of a third output unit.
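
One way to turn the two activations back into a class label is to pick the larger output and require it to exceed a threshold, falling back to coil otherwise. The sketch below assumes a threshold of 0.5 (the midpoint of the 0.1/0.9 targets); the threshold actually used by the package is not stated here, and the function name `decode` is hypothetical.

```python
import numpy as np


def decode(outputs, threshold=0.5):
    """Map rows of (helix, sheet) activations to an H/E/C string.

    threshold=0.5 is an assumption for illustration only.
    """
    outputs = np.asarray(outputs, dtype=float)
    helix, sheet = outputs[:, 0], outputs[:, 1]
    labels = np.full(len(outputs), "C")  # coil is the default state
    labels[(helix > sheet) & (helix > threshold)] = "H"
    labels[(sheet >= helix) & (sheet > threshold)] = "E"
    return "".join(labels)


print(decode([[0.9, 0.1], [0.1, 0.9], [0.1, 0.1], [0.6, 0.7]]))  # HECE
```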

Requirements

Python ≥ 3.10
numpy, biopython, pydssp, click

Installation

git clone https://github.com/bioteam/psspnn.git
cd psspnn
pip install .

Usage

1. Download the benchmark PDB files

psspnn download --split all

Files are cached to ~/.psspnn/pdb_cache/. Re-running is safe; already-cached files are skipped.
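
The skip-if-cached behaviour can be pictured with a small helper like the one below. It is only a sketch: the function name, the `fetch` callable (which stands in for the real download step), and the `<id>.pdb` file naming are assumptions, not the package's actual layout.

```python
from pathlib import Path


def fetch_cached(pdb_id, cache_dir, fetch):
    """Return the cached file for pdb_id, calling fetch(pdb_id) only if missing.

    fetch is a hypothetical callable returning the file contents as text.
    """
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{pdb_id.lower()}.pdb"  # naming scheme is an assumption
    if not path.exists():
        path.write_text(fetch(pdb_id))
    return path
```

Calling the helper twice with the same ID triggers only one download, which is why re-running the command is safe.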

2. Train

psspnn train --verbose

Trains on the 48 proteins from the training set. Progress is printed every cycle. The model is saved to ~/.psspnn/model.npz.

Options:

Flag	Default	Description
`--window`	17	Input window size
`--hidden`	2	Hidden layer size (0 = no hidden layer)
`--lr`	10.0	Learning rate
`--max-cycles`	2000	Maximum training cycles
`--tol`	2e-4	Fractional-change stopping threshold
`--seed`	—	Random seed for reproducibility
`--model-out`	`~/.psspnn/model.npz`	Output path

The training proteins have varying lengths (48 proteins totalling ~8,315 residues, average ~173 per protein). For each residue in a protein, a window of 17 residues centred on that position is extracted and one-hot encoded as a 17×21 = 357-element input vector. That window becomes one training sample, with the target being the DSSP label (H/E/C) of the central residue.
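
The window extraction described above can be sketched like this. It is an illustration, not the code in `encoding.py`: the amino-acid ordering, the function name `encode_windows`, and the choice to raise `KeyError` on non-standard residues are assumptions.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # ordering is an assumption
NULL_INDEX = len(AMINO_ACIDS)          # 21st bit: terminal padding
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def encode_windows(sequence, window=17):
    """One-hot encode a window centred on every residue: shape (n, window*21)."""
    half = window // 2
    width = len(AMINO_ACIDS) + 1
    # Pad both termini with the null symbol so every residue has a full window.
    codes = [NULL_INDEX] * half + [INDEX[aa] for aa in sequence] + [NULL_INDEX] * half
    x = np.zeros((len(sequence), window, width))
    for centre in range(len(sequence)):
        for offset in range(window):
            x[centre, offset, codes[centre + offset]] = 1.0
    return x.reshape(len(sequence), window * width)


seq = "LFTGHPETLEKFDKFKHLKTEAEMKASEDLKKHGTVVLTALGGILK"
x = encode_windows(seq)
print(x.shape)  # one 357-element row per residue
```

Each row contains exactly 17 ones, one per window position; for the first residue, the eight positions before the sequence start all have the null bit set.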

3. Evaluate

psspnn evaluate --split test

Prints per-protein percent correct and aggregate quality indices (percent correct, Matthews correlation coefficients, PC(S)), alongside the expected values from Table 1 of the paper (~63% on the test set).
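
For orientation, two of these indices can be computed from a pair of observed and predicted state strings as sketched below. The function names are hypothetical and this is not the code in `metrics.py`; PC(S) is left out because its exact definition is not given here. The Matthews coefficient is computed one state at a time, treating that state as the positive class.

```python
import math


def percent_correct(observed, predicted):
    """Percentage of residues whose predicted state matches the observed one."""
    hits = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * hits / len(observed)


def matthews_cc(observed, predicted, state):
    """Matthews correlation coefficient for one state versus the rest."""
    tp = tn = fp = fn = 0
    for o, p in zip(observed, predicted):
        if p == state and o == state:
            tp += 1
        elif p == state:
            fp += 1
        elif o == state:
            fn += 1
        else:
            tn += 1
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


obs = "HHHHCCCCEE"
pred = "HHHCCCCCEH"
print(percent_correct(obs, pred))             # 80.0
print(round(matthews_cc(obs, pred, "H"), 3))  # 0.583
```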

4. Predict

psspnn predict LFTGHPETLEKFDKFKHLKTEAEMKASEDLKKHGTVVLTALGGILK

Outputs predictions as a string of H (helix), E (sheet), C (coil):

Sequence:   LFTGHPETLEKFDKFKHLKTEAEMKASEDLKKHGTVVLTALGGILK
Prediction: CCCCCHHHHHHHHHHHHHHHHHHHHHHHHHHHHCEEEHHHHHCCCC

Use --format table for per-residue output with raw network activations.

Expected results

Metric	Training set	Test set (paper)
Percent correct	68.5%	63.2%
C_helix	0.49	0.41
C_sheet	0.47	0.32
C_coil	0.43	0.36
PC(helix)	65.3%	59.2%
PC(sheet)	63.4%	53.3%
PC(coil)	71.1%	66.9%

Package structure

src/psspnn/
├── data/
│   ├── protein_ids.py   Ordered list of 62 benchmark PDB IDs
│   ├── download.py      PDB file retrieval via biopython
│   ├── dssp.py          DSSP assignment via pydssp → H/E/C labels
│   └── encoding.py      One-hot window encoding and dataset assembly
├── model/
│   ├── network.py       HolleyKarplusNet (forward pass, save/load)
│   ├── training.py      Full-batch backprop with SSE loss
│   └── predict.py       Threshold + contiguity post-processing
├── evaluation/
│   └── metrics.py       Matthews CC, percent correct, PC(S)
└── cli.py               Command-line interface

Running tests

pytest

Notes

The learning rate is not stated in the paper. Rumelhart et al. (1986) used 0.1 for per-pattern updates; since this implementation uses full-batch training with mean gradients, the default is 10.0. Adjust with --lr if convergence is slow.
The training protein IDs are sourced from Kabsch & Sander (1983) FEBS Lett 155:179 table 1. Some 1983-era PDB entries have been superseded; failed downloads are logged as warnings.
The DSSP (Define Secondary Structure of Proteins; Kabsch & Sander, 1983) 8-to-3 state mapping used is: {H, G, I} → H, {E, B} → E, {T, S, ' '} → C.
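
The 8-to-3 state mapping in the last note can be written as a small lookup table. This is a sketch with hypothetical names (`DSSP_TO_3`, `reduce_dssp`); it assumes the blank DSSP state is encoded as a space, as written above, and raises `KeyError` for any other symbol.

```python
# 8-state DSSP codes reduced to helix (H), sheet (E), coil (C).
DSSP_TO_3 = {
    "H": "H", "G": "H", "I": "H",
    "E": "E", "B": "E",
    "T": "C", "S": "C", " ": "C",
}


def reduce_dssp(dssp_string: str) -> str:
    """Map a string of 8-state DSSP codes to the 3-state H/E/C alphabet."""
    return "".join(DSSP_TO_3[s] for s in dssp_string)


print(reduce_dssp("HGIEBTS "))  # HHHEECCC
```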
