Literature Similarity Analyzer

A machine learning project that measures similarity between literary texts for authorship attribution and stylistic analysis.

Overview

This project analyzes literary texts using statistical methods and machine learning to:

- Determine whether two texts were written by the same author
- Measure stylistic similarity between texts
- Cluster texts by author based on linguistic features

The system extracts word n-grams from each text, scores pairs of texts with Jaccard and Tanimoto similarity measures, and passes those scores to a decision tree classifier.

Features

Text Processing:
- Support for FB2, EPUB, and TXT formats
- Text normalization (lemmatization, punctuation removal)
- N-gram generation (2-grams, 3-grams, 4-grams)

Similarity Measures:
- Jaccard similarity for n-grams
- Jaccard similarity for vocabulary
- Tanimoto coefficient for weighted n-grams

Machine Learning:
- Decision Tree classifier
- Performance metrics (accuracy, precision, recall)
- Pre-computed feature vectors for quick analysis

Installation

Clone the repository:

git clone https://github.com/ivanVgusev/LitSim.git
cd LitSim

Install dependencies:
```
pip install -r requirements.txt
```

Download the NLTK tokenizer data:
```
import nltk
nltk.download('punkt')
```

Usage

Basic Text Comparison

from main import compare_authors

result = compare_authors(
    "path/to/book1.txt",
    "path/to/book2.txt",
    stats_path="path/to/project/root/",
    n=3,
    lemmatise=False
)
print("Same author" if result == [0] else "Different authors")

Processing your own data

from corpus_processing import main_processing

main_processing(
    base_path="/path/to/project/root/",
    lit_folder_name="your_literature_folder")

Configuration

The system can be configured through several parameters:

Parameter	Options	Description
N-gram size	2, 3, 4	The length of word sequences to analyze
Lemmatisation	True/False	Whether to lemmatise text

Example configurations are provided in the Configuration_comparison.xlsx file.
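
The two parameters above span a small grid of six combinations. As a minimal sketch, the grid can be enumerated as below. The names `NGRAM_SIZES`, `LEMMATISE_OPTIONS`, and `all_configurations` are hypothetical and are not part of the project's code.

```python
from itertools import product

# Option values taken from the configuration table above.
NGRAM_SIZES = (2, 3, 4)
LEMMATISE_OPTIONS = (True, False)


def all_configurations():
    """Return one dict per (n-gram size, lemmatisation) combination."""
    return [
        {"n": n, "lemmatise": lemmatise}
        for n, lemmatise in product(NGRAM_SIZES, LEMMATISE_OPTIONS)
    ]


configs = all_configurations()
for config in configs:
    print(config)
```

Each dictionary uses the same keyword names (`n`, `lemmatise`) as the `compare_authors` call in the Usage section, so it could presumably be unpacked into that call with `**config`.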

Project Structure

LitSim/
├── literature/ # Input texts (organized by author)
├── literature_test/ # Test texts for evaluation
├── values/ # Precomputed features (non-normalized)
├── values_normalised/ # Precomputed features (normalized)
├── corpus_processing.py # Text processing pipeline
├── main.py # Main comparison interface
├── ml.py # Machine learning functions
├── n_grams.py # N-gram processing
├── writers_and_readers.py # File format readers
├── statistical_methods.py # Similarity calculations
├── string_cleaner.py # Text cleaning utilities
├── plots_charts.py # Visualization functions
└── progress_monitor.py # Progress tracking

Methodology

Text Processing:

- Files are read and cleaned (punctuation and numbers removed)
- Optional normalization (words reduced to lemma forms)
- N-grams are extracted (2-4 word sequences)
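
The cleaning and n-gram steps above can be sketched roughly as follows. This is an illustration only, not the code in `string_cleaner.py` or `n_grams.py`: the function names `clean_text` and `word_ngrams` are hypothetical, lowercasing is an assumption, and lemmatisation is left out.

```python
import re


def clean_text(text: str) -> list[str]:
    """Lowercase the text (an assumption) and keep only runs of letters.

    Punctuation and digits act as separators, so they are dropped.
    """
    # [^\W\d_]+ matches one or more Unicode letters (no digits, no underscore).
    return re.findall(r"[^\W\d_]+", text.lower())


def word_ngrams(words: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every run of n consecutive words as a tuple."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


words = clean_text("It was the best of times, it was the worst of times (1859).")
trigrams = word_ngrams(words, 3)
print(trigrams[:2])
```

Note that this simple pattern also splits words at apostrophes and hyphens; a real pipeline may treat those differently.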

Feature Extraction:

- Jaccard similarity for unique n-grams
- Jaccard similarity for vocabulary
- Tanimoto coefficient for weighted n-gram counts
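
For reference, here is a minimal sketch of the two measures. Jaccard similarity is the size of the intersection divided by the size of the union of two sets. For the Tanimoto coefficient, the sketch assumes the common count-vector form a·b / (|a|² + |b|² − a·b); the exact formulation in `statistical_methods.py` may differ. The function names are hypothetical.

```python
from collections import Counter


def jaccard(items_a, items_b) -> float:
    """Jaccard similarity of the unique items in two collections."""
    set_a, set_b = set(items_a), set(items_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty inputs
    return len(set_a & set_b) / len(union)


def tanimoto(counts_a: Counter, counts_b: Counter) -> float:
    """Tanimoto coefficient of two count vectors (assumed formulation)."""
    dot = sum(count * counts_b[item] for item, count in counts_a.items())
    norm_a = sum(count * count for count in counts_a.values())
    norm_b = sum(count * count for count in counts_b.values())
    denominator = norm_a + norm_b - dot
    return dot / denominator if denominator else 0.0


# Toy bigram lists, made up purely for illustration.
bigrams_a = [("the", "cat"), ("cat", "sat"), ("the", "cat")]
bigrams_b = [("the", "cat"), ("cat", "ran")]

print(jaccard(bigrams_a, bigrams_b))
print(tanimoto(Counter(bigrams_a), Counter(bigrams_b)))
```

The same `jaccard` function covers both the n-gram and the vocabulary feature: pass n-gram tuples for the former and single words for the latter.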

Machine Learning:

Decision Tree classifier trained on:
- Positive samples (same-author comparisons)
- Negative samples (different-author comparisons)
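
One way to build such samples is to pair every two books in the corpus and label each pair by whether the authors match. The sketch below is an assumption about how this could be done, not the project's actual code. The label values follow the Usage example, where `[0]` is read as "same author"; the function name and the toy corpus are hypothetical.

```python
from itertools import combinations

# Label convention assumed from the Usage example (0 means same author).
SAME_AUTHOR, DIFFERENT_AUTHORS = 0, 1


def labelled_pairs(books_by_author: dict[str, list[str]]) -> list[tuple[str, str, int]]:
    """Pair every two books and label each pair as same or different author."""
    books = [
        (author, title)
        for author, titles in books_by_author.items()
        for title in titles
    ]
    pairs = []
    for (author_a, book_a), (author_b, book_b) in combinations(books, 2):
        label = SAME_AUTHOR if author_a == author_b else DIFFERENT_AUTHORS
        pairs.append((book_a, book_b, label))
    return pairs


# Hypothetical toy corpus, for illustration only.
corpus = {"Author A": ["a1.txt", "a2.txt"], "Author B": ["b1.txt"]}
pairs = labelled_pairs(corpus)
print(pairs)
```

Each pair would then be converted into a vector of similarity scores and used to fit the decision tree, for example with scikit-learn's `DecisionTreeClassifier` (the README does not name the library, so that choice is an assumption). Different-author pairs usually outnumber same-author pairs, so some balancing or sampling may be needed.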

Results

Performance metrics from sample runs (mean over 100 trials):

Corpus 1 (4 authors, 42 books)

n-gram	Normalised	Accuracy	Precision	Recall
2	True	0.68	0.77	0.78
2	False	0.69	0.78	0.78
3	True	0.69	0.79	0.78
3	False	0.72	0.81	0.80
4	True	0.76	0.83	0.83
4	False	0.66	0.77	0.75

Corpus 2 (8 authors, 69 books)

n-gram	Normalised	Accuracy	Precision	Recall
2	True	0.76	0.84	0.86
2	False	0.72	0.81	0.85
3	True	0.76	0.84	0.86
3	False	0.72	0.81	0.85
4	True	0.76	0.84	0.86
4	False	0.72	0.81	0.85
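
As a reminder of how the three reported metrics are defined, the sketch below computes them for a single trial and averages them across trials. It is not the project's evaluation code. Treating label `0` (same author) as the positive class is an assumption, the function names are hypothetical, and the label lists are made-up toy values rather than project results.

```python
from statistics import mean


def trial_metrics(y_true, y_pred, positive=0):
    """Return (accuracy, precision, recall) for one trial."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(t == positive and p == positive for t, p in pairs)
    false_pos = sum(t != positive and p == positive for t, p in pairs)
    false_neg = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    predicted_pos = true_pos + false_pos
    actual_pos = true_pos + false_neg
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return accuracy, precision, recall


def mean_metrics(trials):
    """Average each metric across a list of (accuracy, precision, recall) tuples."""
    return tuple(mean(values) for values in zip(*trials))


# Toy labels, for illustration only.
single = trial_metrics([0, 0, 1, 1], [0, 1, 1, 1])
print(single)
print(mean_metrics([single, (0.25, 0.0, 0.5)]))
```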

Contributing

Contributions are welcome! Please fork the repository and submit a pull request with your improvements.

1. Fork the project
2. Create your feature branch (git checkout -b feature/YourFeature)
3. Commit your changes (git commit -m 'Add YourFeature')
4. Push to the branch (git push origin feature/YourFeature)
5. Open a Pull Request

License

Distributed under the MIT License. See LICENSE for more information.
