a life in search

Running Elasticsearch Curator as an AWS Lambda function

2018-06-24T00:00:00+02:00

AWS Lambda functions can be used to run the Elasticsearch Curator CLI in a serverless way. This can be very convenient as you don’t need yet another server in order to run your Curator jobs, especially if you are using an Elasticsearch cloud hosting service and don’t have any machines directly available.

AWS Lambda

AWS Lambda is a serverless compute service that allows for the execution of event triggered units of code called functions. Lambda functions can be written in Java, nodejs, C# and Python and some common use cases are processing items as they are added to Kinesis queues or to S3 buckets. Another nice approach is to cloudify old fashioned cron-jobs using Lambda functions and CloudWatch Events and since the price of running Lambda functions is based on execution time this is much cheaper than having an EC2 virtual machine running 24/7. The execution unit of Lambda functions are called GB-seconds and is calculated as the memory (GB) x execution time (seconds), if your function executes for 0,5 seconds using 0,128 GB of memory this will equal 0,064 GB seconds. Even better is that the first 400,000 GB seconds each month are free. Currently there is a maximum timeout of 300 seconds for a single execution of a function so Lambda functions are not suitable for long running tasks.

Elasticsearch Curator

Elasticsearch Curator lets you manage your Elasticsearch indices and snapshots and is a handy tool for doing various maintanance tasks. It is a python library that can be used either by directly using the API or by using the CLI and specify tasks using action files. The Curator documentation contains several examples of action files doing varios tasks (Note that you can specify several actions in a file and that they will be executed in order).

Running the Curator CLI as a Lambda function

Using the Curator Python API in your Lambda Function is pretty straight forward but maybe you want to move existing Curator jobs that are using the the CLI and have actions specified in action files. To do this you can simply execute the Curator CLI from within your Lambda function:

run("curator.yml", "actions.yml", dry_run=os.environ.get('DRY_RUN', True))

Doing this and packaging the action file together with the Lambda function in the deploy will then allow you to setup functions running different Curator jobs. To simplify it even more I have created a small lambda-curator package on github with a full implementation, including packaging and deployment of the function.

Integration with AWS Elasticsearch

If you are running your Elasticsearch cluster using AWS Elasticsearch you get another nice feature straight out of the box. The Curator requests will automatically be signed by the IAM role of your Lambda function (when having set aws_sign_request in the curator.yml) and will therefore be authenticated against your cluster (using the IAM).

Facet filtering with EPiServer Find

2016-06-02T00:00:00+02:00

Sometimes you want to have a facet calculated on just a subset of the result or have multiple facets each being calculated on a different subset of the result. To enable this I’ve created a small extension project to the Episerver Find API, FacetFilter2Find, that enables passing a filter to TermsFacetFor(…) that will filter the result set when calculating the facet.

How to use the FacetFilter2Find extension

To use it you simply pass a filter when requesting the facet:

result = client.Search&lt;Document&gt;()
                        .TermsFacetFor(x => x.Category, x => x.Type.Match("pdf"))
                        .GetResult();

and fetch the resulting facet:

facet = result.TermsFacetFor(x => x.Category);

In order to specify multiple facets on a single field, each having a different filter, one must specify a custom name:

result = client.Search&lt;Document&gt;()
                        .TermsFacetFor(x => x.Category, x => x.Type.Match("pdf"), x => x.Name = "PdfCategories")
                        .TermsFacetFor(x => x.Category, x => x.Type.Match("doc"), x => x.Name = "DocCategories")
                        .GetResult();

and fetch the resulting facets:

pdfTypeFacet = result.Facets["PdfCategories"] as TermsFacet;
docTypeFacet = result.Facets["DocCategories"] as TermsFacet;

I hope you may find this useful when making your site awesome with EPiServer Find!

Integration tests with Protractor

2014-05-21T00:00:00+02:00

After playing around with Protractor for a while I wanted to create some integration tests where I would bootstrap the entire server/database before each test to get a consistent state to start from in each test. The problem is that we often have to wait for everything to setup before running the tests. Using Jasmine as test framework the obvious solution for this would be to use the async callback in beforeEach do something like:

beforeEach((done) ->
    server.start(() ->
        done()
    )
)

and server.start:

start = (done) ->
    server = spawn('coffee', [ 'server.coffee' ])
    server.stdout.on('data', (data) ->
        if data.toString() == 'App started\n'
            done()
    )

Simply put, the server spawns a new process that starts our server using the command: ‘coffee server.coffee’ and then listen to stdout for the log output ‘App started\n’ and then issues the callback done (this is of course purely an example).

The problem is that Protractor overrides the async callback in beforeEach and never waits for our server to actually start. The reason is that Protractor helps you with handling all the async calls of WebdriverJS and creates its own control flow using promises that overrides Jasmines async callback. Instead of using the Jasmine async callback we need to make the server.start return a promise and then add that to the Protractor control flow.

First we need to import q (the Promise module) into our project using:

npm install q

Then we modify our server.start to return a promise instead of using the callback:

Q = require('q')

start = () ->
    deferred = Q.defer()
    server = spawn('coffee', [ 'server.coffee' ])
    server.stdout.on('data', (data) ->
        if data.toString() == 'App started\n'
            deferred.resolve()
    )

    return deferred.promise

Finally we add that to the Protractor control flow in the beforeEach function:

beforeEach(() ->
    protractor.promise.controlFlow().execute(() ->
        return server.start()
    )
)

Voila! Now Protractor waits for our server to actually start before continuing with running the tests. Using this approach we can create integration tests that bootstraps the server before each test and completely isolates tests from each other enabling us to create better test scenarios that don’t depend on what order the tests are runned.

Using Protractor with CoffeeScript

2014-05-12T00:00:00+02:00

I’ve been playing around with AngularJS for the last couple of weeks during my spare time while being on paternity leave (read: late at night). I really like the framework and last night I came across Protractor, the end to end testing framework, and for the first time I think I saw some light in the the UI-testing-tunnel and thought I should share some ideas in a couple of posts. First I’ll start with how to use Protractor with CoffeScript and how to get a nice fluent syntax in your testing scenarios (thanks to cyranix for getting me inspired).

Using Protractor with CoffeeScript

I like using CoffeScript (even though I’m kind of verbose when I use it compared to most as I like keeping the parentheses) and thought it would be great if I could write my Page Objects as CoffeScript classes and chain the functions giving me a fluent syntax like:

startPage.
    .clickLogin()
            .setUserName('john.doe@acme.com')
            .setPassword('donttellanyone')
            .submit()
    browser.waitForAngular()
    
    expect(browser.getLocationAbsUrl()).toMatch('/app/#!/dashboard')

So, how do we do this? First we need to register CoffeScript in the protractor configuration to be able to use it in our scenarios:

protractor-conf.js

exports.config = {
    allScriptsTimeout: 11000,

    specs: [
        'e2e/*.scenarios.coffee'
    ],

    capabilities: {
        'browserName': 'chrome'
    },

    baseUrl: 'http://localhost:10000/app',

    framework: 'jasmine',

    jasmineNodeOpts: {
        defaultTimeoutInterval: 30000
    }
};

Note: If you don’t have CoffeScript installed grab it with:

npm install coffee-script

Next we need to create our Page Objects and we will start with the start page:

start_page.coffee

LoginPage = require('./login_page')

class StartPage
    constructor: ->
        @loginLink = element(By.id('loginLink'))

    get: ->
        browser.get('app/#!/')
        return @

    clickLogin: ->
        @loginLink.click()
        browser.waitForAngular()
        return new LoginPage()

module.exports = StartPage

We grab the login link element in the constructor (I’ve just set the id in the html view so that it is easy to get hold of in the tests and making the test a bit more robust if I decide to move the login link). When someone then clicks login we click the link and wait for angular to complete the request and then we return the Page Object for the login page.

The Page Object for the login page looks similar in structure but have a username and password field that the user can fill out (Here I use the By.model to get hold of the input fields for the username and password. Also note that I use return @ to return this in the functions to make it chainable):

login_page.coffee

LoginPage = require('./login_page')

class LoginPage
    constructor: ->
        @username = element(By.model('user.name'))
        @password = element(By.model('user.password'))
        @loginButton = element(By.id('loginButton'))
        @errorMessage = element(By.id('loginErrorMessage'))

    get: ->
        browser.get('app/#!/auth/login')
        return @

    getErrorMessage: ->
        return @errorMessage.getText()

    setUserName: (text) ->
        @username.sendKeys(text)
        return @

    clearUserName: ->
        @username.clear()
        return @

    setPassword: (text) ->
        @password.sendKeys(text)
        return @

    clearPassword: ->
        @password.clear()
        return @

    submit: ->
        @loginButton.click()

module.exports = LoginPage

Now we can simply require the start page Page Object in our scenarios to get the fluent syntax in our testing scenario:

login.scenarios.coffee

StartPage = require('../PageObjects/start_page')

describe('', () ->
    loginPage = null

    beforeEach(() ->
        startPage = new StartPage()
        startPage.get()
        loginPage = startPage.clickLogin()
    )

    describe('login', () ->
        it('should login and redirect to the dashboard with valid user credentials', () ->
            loginPage
                .setUserName('john.doe@acme.com')
                .setPassword('donttellanyone')
                .submit()
            browser.waitForAngular()
            
            expect(browser.getLocationAbsUrl()).toMatch('/app/#!/dashboard')
        )

...more tests
}

This way we get a nice structure and a pretty neat fluent syntax in our testing scenarios. Next up is how to bootstrap your server/database before each test scenario so that you can make integration test scenarios from a consistent state every time. I’ll get back to that in a future post.

Language detection using Elasticsearch

2013-10-30T00:00:00+01:00

A couple of weeks ago I had the privilege to speak at the Elasticsearch Meetup in Stockholm where I did show a slightly tweaked version of the ‘language categorizer’ I did write about a couple of months ago and would like to share the changes I did (the naive approach wasn’t accurate enough to be presented live in public ;-)).

First, instead of using the default mapping for the ‘text’ field we use the nGram-tokenizer where we specify that we want to index all 2 and 3 letter ngrams of the text. Every language basically have different frequencies for each letter sequence and we want to find the language not by the words in the query but by how common the letter sequences are (as an example ‘th’ is very common in english but rare in swedish). This way we will actually be able to detect the language for a word sequence even if we haven’t actually seen any of the words in the training set. So, we extend our mappings to use the nGram-tokenizer:

{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "ngram_analyzer" : {
                    "tokenizer" : "ngram_tokenizer"
                }
            },
            "tokenizer" : {
                "ngram_tokenizer" : {
                    "type" : "nGram",
                    "min_gram" : "2",
                    "max_gram" : "3",
                    "token_chars": [ "letter", "digit" ]
                }
            }
        }
    },
    "mappings": {
        "data": {
            "_parent": {
                "type": "category"
            },
            "properties": {
                "text": {
                    "type": "string", 
                    "analyzer": "ngram_analyzer"                
                }
            }
        }
    }
}

When querying we then want to calculate the average score for each language using the has_child query but we don’t want the query to filter out any hits but rather return a 0-score for a document that doesn’t match (i.e. we don’t want the score for one language to be based on maybe one tenth of all documents for that language and the score for another to be based on half of the documents for that language). To do this we use the ‘boosting’ query where we simply give a boost to all documents matching the query:

{
  "query":{
    "has_child": {
      "type": "data",
      "score_type" : "avg",
      "query": {
        "boosting" : {
            "positive" : {
                "match": {
                  "text": "skriver en bok"
                }
            },
            "negative" : {
               "match_all": { }
            },
            "negative_boost" : 1
        }
    }
    }
  }
}

Using these simple ‘tweaks’ we have actually created a very accurate language detector. If you want to give it a try you can create training sets from the Leipzig Corpora Collection for the different languages you want to detect.

Indexing only referenced VPP-files with EPiServer Find

2013-05-02T00:00:00+02:00

The EPiServer Find CMS integration does not index any files stored in the VPP by default. A convention is included in the integration that index files visible in the file manager and it is enabled by setting the VisibleInFilemanagerVPPIndexingConvention on the FilieIndexer conventions:

FileIndexer.Instance.Conventions.ShouldIndexVPPConvention 
  = new VisibleInFilemanagerVPPIndexingConvention();

However, this convention can be a little bit aggressive. As soon as an editor adds a file it is searchable and even though no access control mechanism are overruled some might think it is hidden until it is actually used on the site. So how do we proceed to achieve this?

Built in into the CMS there is the ContentSoftLinkRepository where we can query if files (or any IContent for that matter) is linked from within another IContent. By using this we can then create a file indexing convention that checks if the file is linked from some indexed IContent and if so we index it:

FileIndexer.Instance.Conventions.ForInstancesOf&lt;UnifiedFile&gt;().ShouldIndex(x =>
{
    var contentRepository = 
        ServiceLocation.ServiceLocator.Current.GetInstance&lt;IContentRepository&gt;();
    var contentSoftLinkRepository = 
        ServiceLocation.ServiceLocator.Current.GetInstance&lt;ContentSoftLinkRepository&gt;();
    var softLinks = contentSoftLinkRepository.Load(x.VirtualPath);

    try
    {
        foreach (var softLink in softLinks)
        {
            
            if (softLink.SoftLinkType == ReferenceType.ExternalReference ||
                softLink.SoftLinkType == ReferenceType.ImageReference)
            {
                var content = 
                    contentRepository.Get&lt;IContent&gt;(softLink.OwnerContentLink);
                if (!ContentIndexer.Instance.Conventions.ShouldIndexConvention.ShouldIndex(content).Value) // don't index referenced file if content is marked as not indexed

                {
                    continue;
                }

                // only index if content is published

                var publicationStatus = 
                    content.PublishedInLanguage()[softLink.OwnerLanguage.Name];

                if (publicationStatus != null &amp;&amp;
                    (publicationStatus.StartPublish == null ||
                     publicationStatus.StartPublish &lt; DateTime.Now) &amp;&amp;
                    (publicationStatus.StopPublish == null ||
                     DateTime.Now &lt; publicationStatus.StopPublish))
                {
                    return true;
                }
            }
        }
    }
    catch
    {
        // ooops something went wrong. Better not index this one ;-)

    }

    return false;
});

Using this convention only files that are referenced from an indexed IContent, that also is published (as by default also unpublished IContent is indexed to provide better querying in editor mode).

I hope you may find this useful when making your site awesome with EPiServer Find!

Nested filtering with EPiServer Find

2013-03-15T00:00:00+01:00

Some have already noticed one of the x2find mebers, Nested2Find, that enables nested object mappings and filtering to the EPiServer Find API. I would like to give a short description of what it does and how it can help you in some filtering scenarios.

Sometimes we need to be able to filter documents based on matching a specific object in a lists of complex objects on the document. Say for instance that we have documents that have a list of all authors that have contributed. Authors that all have a set of properties such as name and address. We then want to find all documents where one of the authors match a specific set of criterias, say all documents that have a swedish author named Henrik. This is what Nested2Find enables you to do. It lets you define nested lists of complex objects on a document for which you then later can specify a matching criteria when querying.

How to use the Nested2Find extension

Add the nested conventions to the conventions:

client.Conventions.AddNestedConventions();

Create an object containing a NestedList<> of objects (NestedList<> is simply a typed List<>):

public class Document
{
    public Document()
    {
        Authors = new NestedList&lt;Author&gt;();
    }
    public string Title { get; set; }
    public NestedList&lt;Author&gt; Authors { get; set; }
    public string Body { get; set; }
}

public class Author
{
    public string Name { get; set; }
    public string Address { get; set; }
    public string Country { get; set; }
}

Index and start filtering:

result = client.Search&lt;Document&gt;()
             .Filter(x => x.Authors, p => p.Name.Match("Henrik") &amp; p.Country.Match("Sweden"))
             .GetResult();

or:

result = client.Search&lt;Document&gt;()
            .Filter(x => x.Authors.MatchItem(p => p.FirstName.Match("Henrik") &amp; p.Country.Match("Sweden")))
            .GetResult();

I hope you may find this useful when making your site awesome with EPiServer Find!

Categorizing using Elasticsearch

2013-03-08T00:00:00+01:00

I’m fortunate to work at a company that once a month have a ‘hack day’ when we are allowed to just try out new and crazy ideas. As an advocate of using search for so much more than just the ‘search page’ I decided to do a small demo of how to use Elasticsearch to do categorization and I wanted to share some of my ideas.

The approach

We often have a large amount of data that we know falls into different categories and that we can use as a sample space for predicting unseen data. If we index all this data that we have and look at the unseen data as ‘queries’ we should be able to construct queries where the score when querying the known data space represent a similarity between the ‘objects’. However, now that we have a result where each ‘object’ in the result has a similarity score with the ‘queried’ object how should we interpret it? What if we group all items in the results by each category that we know that they fall into and then calculate the avg similarity score within that group and return the category for which the avg score is the highest? Depending on your data and problem space constructing the similarity query might be more or less easy but the beauty of this approach is that is very easy to implement in Elasticsearch using parent-child mappings.

Demo: Implementing a language categorizer

We start by creating an index (I’ve named it ‘myindex’) and add a parent mapping between a category type and a data type:

http://localhost:9200/myindex

{
    "mappings": {
        "data": {
            "_parent": {
                "type": "category"
            }
        }
    }
}

Then we add the known categories:

http://localhost:9200/myindex/category/sv

{
    "name": "Swedish"
}
...

For the ‘known’ data we index it and relate it to the parent category:

http://localhost:9200/myindex/data/1?parent=sv

{
    "text": "det här är en text skriven på svenska"
}
...

Using the has_child-query we can then easily achieve the described approach of searching for categories and have them returned based on the avg score of a child-query issued on the data. (In this case we just do a simple query_string-query with the text we want do do language detection for, ‘en svensk text’):

http://localhost:9200/myindex/category/_search

{
  "query":{
    "has_child": {
      "type": "data",
      "score_type" : "avg",
      "query" : {
        "query_string": {
          "query": "en svensk text"
        }
      }
    }
  }
}

Resulting in:

{
    "took": 27,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.3125,
        "hits": [
            {
                "_index": "myindex",
                "_type": "category",
                "_id": "sv",
                "_score": 0.3125,
                "_source": {
                    "name": "Swedish"
                }
            }
        ]
    }
}

Voila! In just 4 steps you know have your very own language detector.

Discussion

The described language detector might be a bit simplistic but can easily be more advanced by adding more advanced analyzers for your indexed data using stemmers or alike for the appropriate language. However the approach can be tweaked to do much more advanced queries on more complex objects using all the fancy query types available in Elasticsearch (just ignore the ConstantScore-query ;-)). Only your imagination stops you from creating appropriate similarity queries that might give you really good results in just a few easy steps!

So stop looking at your ‘search engine’ as a search page provider and see it as a great tool for not just querying but also a great source for analyzing your data!

Note:

The Elasticsearch0.20.x releases have a bug in the has_child-query calculating the sum and not the avg score when using the score_type=avg. I’ve issued a pull request but if you want to try it out you can clone and build from the master branch where the upgraded Lucene distribution circumvents the problem.

Hierarchical faceting with EPiServer Find

2013-02-26T00:00:00+01:00

A quite common use case for facets is to show a listing of the number of documents in a result for the different categories on the site (quite common is maybe an understatement as this is often the “hello world!” of faceting). A document can occur in maybe one or more categories and this is where your search index really stands out since it doesn’t care if you have one or two categories associated with the document it will return your facet in no time anyway. Sometimes your categories have a hierarchical structure that you want to reflect in your facet. Lets say that you have documents about cars and want to have a category tree based on manufacturer and name, ie. Volvo/XC60. How can you achieve a facet where you get an aggregated count for each level in the category tree? For example:

Volkswagen (7)
    Passat (5)
    Tiguan (2)
Volvo (10)
    V70 (5)
    V60 (2)
    …

The simplest way of doing this is by associating each document with all levels for each of its categories. Then by using a terms facet when fetching your result you will get an aggregated count for each node in your category tree. Voila, there is your hierarchical facet but with just a few lines of code you can get Find to do all that dirty work of your hands and all you have to do is to pass a category string (each level separated by a ‘/’) and return a facet that parse the result and reflects the nested structure of the category tree. I will leave out the implementation details but at HierarchicalFacet2Find you can fetch your own copy of the code that does that work for you.

How to use the HierarchicalFacet2Find extension

Add a Hierarchy property to the document:

public class Document
{
    [Id]
    public string Id { get; set; }

    public Hierarchy Hierarchy { get; set; }
}

Set the hierarchy path:

document.Hierarchy = "A/B/C/D";

Index and request a HierarchicalFacet when searching:

result = client.Search&lt;Document&gt;()
            .HierarchicalFacetFor(x => x.Hierarchy)
            .GetResult();

Fetch it from the result:

facet = result.HierarchicalFacetFor(x => x.Hierarchy)

Loop over the nested hierarchy paths:

foreach(var hierarchyPath in facet)
{
    hierarchyPath.Path;
    hierarchyPath.Count;

    foreach (var subHierarchyPath in hierarchyPath)
    {
        subHierarchyPath.Path;
        subHierarchyPath.Count;
        ...
    }
}

I hope you may find this useful when making your site awesome with EPiServer Find!

Time to live with EPiServer Find

2013-02-15T00:00:00+01:00

The latest release of EPiServer Find contains, apart for a number of bug fixes, one new feature and that is the ability to set a time to live-value on indexed documents. The value is expressed as a TimeSpan and specifies how long the document should reside in the index before it is automatically deleted. This can be really useful if you index documents continuously (say that you index all items that users currently are looking at on your site) but only what to show the latest (i.e. what users are currently looking at on the site) and don’t want to flood your index over time (i.e I don’t care what someone looked at yesterday). Under these circumstances the time to live-feature can really help you by doing that dirty cleanup job that we all hate to to do.

How to use time to live

Given an object with a TimeToLive property (or TimeSpan):

public class WithTimeToLive
{
    [Id]
    public string Id { get; set; }

    public TimeToLive TimeToLive { get; set; }
}

We first need to configure the client and register the TimeToLive-property for the given object type:

client.Conventions.ForInstancesOf&lt;WithTimeToLive&gt;()
    .TimeToLiveIs(x => x.TimeToLive);

When indexing an object we simply register a TimeSpan value to the property specifying how long it should reside in the index:

indexObject = new WithTimeToLive()
    {
        Id = "123",
        TimeToLive = new TimeSpan(0, 5, 0)
    };

I hope you may find this useful when making your site awesome with EPiServer Find!

Note:

The granularity of time to live is 60 seconds meaning that the documents will be deleted within 60 seconds of the actual time to live.

Updated:

Instead of configuring the TimeToLive-property by the conventions it can be annotated by the TimeToLiveAttribute:

public class WithTimeToLive
{
    [Id]
    public string Id { get; set; }
    
    [TimeToLive]
    public TimeToLive TimeToLive { get; set; }
}

or passed in the index call:

client.Index(indexObjext, x => x.TimeToLive = new TimeSpan(0, 5, 0));