Gemma

Gemma is a Generic Metadata Mapper written in Python. It allows to map metadata in JSON or XML format into another representation, e.g. for indexing at Elastic, using a custom mapping description.

Manuscript metadata mapping

Useful files:

./sample/schema-for-xml-response.json contains the schema to map the xml response. The mapping has been done according to the TEI, suggested by Germaine;
mapping_functions.py contains the functions needed to run the mapping of the metadata, in order to create a JSON document ready to be indexed by elasticSearch.

Ubuntu installation and settings

apt-get update
apt-get install python3 python3-pip
pip3 install xmltodict wget
export PYTHONIOENCODING=UTF-8

MacOS installation and settings

If not present, install Homebrew:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Then, install Python 3:

brew install python3

Check whether Homebrew have already installed pip pointing to the Homebrew’d Python 3 for you:

which pip
which pip3

If your pip already points to Python3:

pip install xmltodict wget

Otherwise:

pip3 install xmltodict wget

Explanation and instructions

There are two codes: one to download all the manuscript metadata (i.e. the responses) from the episteme repo, the other to map the metadata according to the schema. Both are partially interactive, so the input and output folders must be inserted by the user.

Download the manuscript metadata

The code retrieves all the manuscript resources in the episteme repo, and downloads the XML response naming it locally as its id. The http connection is hardcoded, while the local folder (which must already exist) where to download the files must be insterted by the user. To run the code:

python3 retrieve_response.py $local_folder

Create the JSON file for elasticSearch

The code creates a JSON file for each manuscript, extracting the relevant metadata (suggested by Germaine) from the XML response file. These metadata can be then indexed in elasticSearch. The mapping is done following a schema, provided by the ./sample/schema-for-xml-response.json file. The mapping cannot be granular for some entries, due to a couple of reasons:

In some files, the content is in the expected key, while in some others it is in a more nested key. For example, it could be in source.settlement or in source.settlement.rd.#text. This applies to:
- source.settlement
- content.locus
In some files, the value is a list, which is more complicated to manage. This will be done in the second version. Up to now, all the list (with its nested dictionaries, which can in future be indexed) is just casted as a string. This applies to:
- dimensions
- hand note
- provenance event
- listBibl.bibl

To run the code:

python3 mapping.py $schema $input_folder $output_folder

where $schema is the schema file, in this case ./sample/schema-for-xml-response.json, $input_folder is the folder where the XML responses are, $output_folder is the folder (which must already exist) where the JSON files will be created. The output JSON files are named as the XML file + the extension .elastic.json, so for example manuscript_id.xml.elastic.json.

To be fixed in version 2:

manage nested dictionaries within lists
manage mandatory fields (the ones in the basic minimal schema)
manage entries which have None in the XML response in a different way from missing entries
go beyond full-text search, starting e.g. from dates and dimensions

Elasticsearch

To download, install and run an Elasticsearch instance, follow the instructions of the web page: https://www.elastic.co/downloads/elasticsearch

Indexing

The code elastic_indexing.py allows to index all the manuscripts JSON files (by default) in a given folder, which must be provided. Addotional packages are needed:

pip3 install requests, elasticsearch

To run it:

python3 elastic_indexing.py JSON_directory

Search using the Python code

The code elastic_indexing.py already contains a function using the python elasticsearch client and an example of search. An other option is to use Kibana.

Seach using Kibana

To download, install and run Kibana on top of your Elasticsearch instance, follow the instructions on the web page: https://www.elastic.co/downloads/kibana

To search all manuscripts, write on Kibana dashboard:

GET episteme/_search
{
  "query": {
	    "match_all": { }
  }
}

In the response, hits.value is the number of found hits:

"hits" : {
  "total" : {
  "value" : 81,
  "relation" : "eq"
}

To search by exact match of a word, write on Kibana dashboard:

GET episteme/_search
{
  "query": {
    "match": {
      "index.name.with.dots.if.nested": "word to match"
    }
  }
}

To search by exact match of a sentence, write on Kibana dashboard:

    GET episteme/_search
    {
      "query": {
        "match_phrase": {
          "index.name.with.dots.if.nested": "sentence to match"
        }
      }
    }

To select the index name, please consult the JSON schema schema-for-xml-response.json. It works also with Greek, special French, and German letters.

Name		Name	Last commit message	Last commit date
Latest commit History 21 Commits
sample		sample
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
elastic_indexing.py		elastic_indexing.py
mapping.py		mapping.py
mapping_functions.py		mapping_functions.py
mapping_single.py		mapping_single.py
requirements.dist.txt		requirements.dist.txt
retrieve_response.py		retrieve_response.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Gemma

Manuscript metadata mapping

Ubuntu installation and settings

MacOS installation and settings

Explanation and instructions

Download the manuscript metadata

Create the JSON file for elasticSearch

To be fixed in version 2:

Elasticsearch

Indexing

Search using the Python code

Seach using Kibana

About

Uh oh!

Releases 1

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Gemma

Manuscript metadata mapping

Ubuntu installation and settings

MacOS installation and settings

Explanation and instructions

Download the manuscript metadata

Create the JSON file for elasticSearch

To be fixed in version 2:

Elasticsearch

Indexing

Search using the Python code

Seach using Kibana

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases 1

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages