Skip to content

Normalized search data

Andrew Brehaut edited this page Jun 7, 2016 · 6 revisions

The front end would like to be able to use one normalized datastructure to represent all searches regardless of the specific interface used to generate the query. This will scale from a simple text box which search all facets of an entry, through to a complex query builder with advanced features such as per facet terms and boolean operations.

Proposed datastructure

To do this, I propose a simple, uniform, JSON representation using a simple tree of objects using the form:

{
     "op": "...", 
     "val": [ <children> ]
}

Where op is a string key for the operation name (such as "and", "phrase", "not", ...), and val is any child values (specific values depend on the type of the op. and takes an array of query nodes, phrase takes and array of strings, not takes a single query node).

NOTE: This document is not concerned with specific operations as much as the mechanism for communicating any query in a normalized form.

For example, a simple search for "keep on the" might result in the query:

{
    "op": "phrase",
    "val": [
        "keep on the"
    ]
}

An advance search using booleans might be "forgotten realms and goblinoids and not hobgoblins" producing the query document:

{
    "op": "and",
    "val": [
        {
            "op": "phrase",
            "val": [
                "forgotten",
                "realms"
            ]
        },
        {
            "op": "and",
            "val": [
                {
                    "op": "phrase",
                    "val": [
                        "goblinoids"
                    ]
                },
                {
                    "op": "not",
                    "val": {
                        "op": "phrase",
                        "val": [
                            "hobgoblins"
                        ]
                    }
                }
            ]
        }
    ]
}

Additional nodes may constrain the search a particular set of facets. Note that the data structure does not concern itself with whether a given facet is stored as a database field on the entry (intrinsic) or as a tag (extrinsic). It is up to the backend to compile the appropriate query based on its own implementation. Such as searching for any adventure with the location or setting being the underdark or forgotten realms:

{
    "op": "facet",
    "val": {
        "facets": ["setting", "location"],
        "val": {
            "op": "or",
            "val": [
                {
                    "op": "phrase",
                    "val": ["forgotten", "realms"]
                },
                {
                    "op": "phrase",
                    "val": ["underdark"]
                }
            ]
        }
    }
}

Normalization

As the UI is expected to provide rich autocompletion, the intent is that the phrase nodes in the tree will automatically be normalized based on autocompletion data, including the removal of stop words ("the", "a", "of", ...), sounds-alike matching (using double metaphone, and possible a stemmer such as the porter stemmer), and ideally also synonym matching ("illithid" and "mindflayer" both return the same phrase). NOTE: The normalization via sounds-alike matching etc is to find the appropriate autocomplete match, not producing the resulting query terms.

This is to minimize the amount of work the server implementation has to do, and also improve the quality of results.

This process is a key reason for the front end to use a single query format regardless of the interface used.

Backend implementation

The following python/django pseudocode presents a possible implementation strategy for processing these queries. (This code has not been executed so definitely contains bugs)

from django.db.models import Q

# a mapping from the public names for facets to the database field name
intrinsic_facets = {
    "title": "title",
    "author": "author",
    "setting": "setting"
}


def queryField(word, facet):
    if facet in intrinsic_facets:
        return Q(**{"%s__icontains" % intrinsic_facets[facet]: word})
    return Q(tags__facet__exact: facet, tags__value__icontains: word)

def queryAllFields(word):
    intrinsics = Q(dict(("%s__icontains" % val, word) for intrinsic_facets.itervalues()))
    return intrinsics | Q(tag__value__icontains = word)

def queryFields(phrase, facets): 
    q = Q()
    if facets == None:                    # no facets specified. search everything
        for word in phrase:
            q = q & queryAllFields(word)
    else:                                 # search only specified facets 
        for word in phrase:
            for facet in facets:
                q = q & queryField(word, facet) 
    return q

# NOTE: and and or are likely not properly parenthesizing their sub queries
def buildQuery(queryJson, facets=None):
    op = queryJson.op

    if op == "phrase":
        return queryFields(queryJson.val, facets)
    if op == "and": 
        return buildQuery(queryJson.val[0], facets) & buildQuery(queryJson.val[1], facets)
    if op == "or": 
        return buildQuery(queryJson.val[0], facets) & buildQuery(queryJson.val[1], facets)
    if op == "not":
        return ~buildQuery(queryJson.val, facets)
    if op == "facets":
        return buildQuery(queryJson.val.val, queryJson.val.facets)
Clone this wiki locally