AWS Lambda is a serverless compute service that executes event-triggered units of code called functions. Lambda functions can be written in Java, Node.js, C# and Python, and common use cases include processing items as they are added to Kinesis streams or to S3 buckets. Another nice approach is to cloudify old-fashioned cron jobs using Lambda functions and CloudWatch Events; since the price of running Lambda functions is based on execution time, this is much cheaper than having an EC2 virtual machine running 24/7. The billing unit for Lambda functions is the GB-second, calculated as memory (GB) x execution time (seconds); if your function executes for 0.5 seconds using 0.128 GB of memory, that equals 0.064 GB-seconds. Even better, the first 400,000 GB-seconds each month are free. Currently there is a maximum timeout of 300 seconds for a single execution of a function, so Lambda functions are not suitable for long-running tasks.
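As a sanity check, the GB-seconds arithmetic above can be expressed in a couple of lines of Python (a minimal illustration of the formula, not an official AWS pricing calculator):

```python
def gb_seconds(memory_gb, duration_s):
    """Lambda billing units: memory (GB) multiplied by execution time (s)."""
    return memory_gb * duration_s

# 0.5 seconds at 0.128 GB, as in the example above
print(gb_seconds(0.128, 0.5))
```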
Elasticsearch Curator lets you manage your Elasticsearch indices and snapshots and is a handy tool for various maintenance tasks. It is a Python library that can be used either directly through its API or via the CLI, specifying tasks in action files. The Curator documentation contains several examples of action files for various tasks (note that you can specify several actions in a single file and that they will be executed in order).
Using the Curator Python API in your Lambda function is pretty straightforward, but maybe you want to move existing Curator jobs that use the CLI and have actions specified in action files. To do this you can simply execute the Curator CLI from within your Lambda function:
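A minimal sketch of such a handler, assuming the curator CLI is packaged with the function; the file names curator.yml and action.yml are illustrative:

```python
import subprocess

def build_curator_command(config_path="curator.yml", action_path="action.yml"):
    # The standard Curator CLI invocation: curator --config <config> <action file>
    return ["curator", "--config", config_path, action_path]

def handler(event, context):
    # check=True makes the Lambda invocation fail if Curator exits non-zero,
    # so failed maintenance jobs show up as failed invocations in CloudWatch
    subprocess.run(build_curator_command(), check=True)
```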
Packaging the action file together with the Lambda function in the deployment then allows you to set up functions running different Curator jobs. To simplify it even more I have created a small lambda-curator package on GitHub with a full implementation, including packaging and deployment of the function.
If you are running your Elasticsearch cluster on AWS Elasticsearch you get another nice feature straight out of the box: the Curator requests will automatically be signed with the IAM role of your Lambda function (provided aws_sign_request is set in curator.yml) and will therefore be authenticated against your cluster using IAM.
To use it you simply pass a filter when requesting the facet:
and fetch the resulting facet:
In order to specify multiple facets on a single field, each having a different filter, one must specify a custom name:
and fetch the resulting facets:
I hope you may find this useful when making your site awesome with EPiServer Find!
and server.start:
Simply put, server.start spawns a new process that starts our server with the command ‘coffee server.coffee’, then listens to stdout for the log output ‘App started\n’ and issues the done callback (this is of course purely an example).
The problem is that Protractor overrides the async callback in beforeEach and never waits for our server to actually start. The reason is that Protractor helps you handle all the async calls of WebdriverJS and creates its own control flow using promises, which overrides Jasmine's async callback. Instead of using the Jasmine async callback we need to make server.start return a promise and then add that to the Protractor control flow.
First we need to import q (the Promise module) into our project using:
npm install q
Then we modify our server.start to return a promise instead of using the callback:
Finally we add that to the Protractor control flow in the beforeEach function:
Voila! Now Protractor waits for our server to actually start before running the tests. Using this approach we can create integration tests that bootstrap the server before each test and completely isolate tests from each other, enabling us to create better test scenarios that don't depend on the order in which the tests are run.
I like using CoffeeScript (even though I'm kind of verbose when I use it compared to most, as I like keeping the parentheses) and thought it would be great if I could write my Page Objects as CoffeeScript classes and chain the functions, giving me a fluent syntax like:
So, how do we do this? First we need to register CoffeeScript in the Protractor configuration to be able to use it in our scenarios:
protractor-conf.js
Note: If you don’t have CoffeeScript installed, grab it with:
npm install coffee-script
Next we need to create our Page Objects and we will start with the start page:
start_page.coffee
We grab the login link element in the constructor (I’ve set an id on it in the HTML view so that it is easy to get hold of in the tests, which also makes the test a bit more robust if I decide to move the login link). When someone then clicks login we click the link, wait for Angular to complete the request and return the Page Object for the login page.
The Page Object for the login page is similar in structure but has a username and a password field that the user can fill out. (Here I use By.model to get hold of the input fields for the username and password. Also note that I use return @ to return this in the functions to make them chainable):
login_page.coffee
Now we can simply require the start page Page Object in our scenarios to get the fluent syntax in our testing scenario:
login.scenarios.coffee
This way we get a nice structure and a pretty neat fluent syntax in our testing scenarios. Next up is how to bootstrap your server/database before each test scenario so that you can make integration test scenarios from a consistent state every time. I’ll get back to that in a future post.
First, instead of using the default mapping for the ‘text’ field we use the nGram tokenizer, specifying that we want to index all 2- and 3-letter n-grams of the text. Every language basically has different frequencies for each letter sequence, and we want to detect the language not by the words in the query but by how common the letter sequences are (as an example, ‘th’ is very common in English but rare in Swedish). This way we will actually be able to detect the language of a word sequence even if we haven’t seen any of its words in the training set. So, we extend our mappings to use the nGram tokenizer:
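The settings change might look roughly like this, sketched as a Python dict ready to send as the index-creation body (the tokenizer and analyzer names are illustrative, not taken from the original post):

```python
import json

# nGram tokenizer emitting all 2- and 3-letter sequences of the text field,
# so scoring reflects letter-sequence frequencies rather than whole words
index_settings = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "my_ngram": {"type": "nGram", "min_gram": 2, "max_gram": 3}
            },
            "analyzer": {
                "my_ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "my_ngram",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "data": {
            "_parent": {"type": "category"},
            "properties": {
                "text": {"type": "string", "analyzer": "my_ngram_analyzer"}
            },
        }
    },
}

print(json.dumps(index_settings, indent=2))
```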
When querying we then want to calculate the average score for each language using the has_child query, but we don’t want the query to filter out any hits; instead it should return a 0-score for documents that don’t match (i.e. we don’t want the score for one language to be based on maybe one tenth of all documents for that language while the score for another is based on half of the documents for that language). To do this we use the ‘boosting’ query, where we simply give a boost to all documents matching the query:
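The post uses a boosting query here, but its exact body is not shown, so treat the following as a hedged sketch of the same idea using a bool query instead: the match_all must clause guarantees that no child is filtered out and every child contributes at least a baseline score, while the n-gram text query in the should clause raises the score of matching children, so the avg is computed over all documents of each language:

```python
import json

# Hedged alternative to the post's boosting query: keep every child in
# scoring (match_all) and reward n-gram matches (should clause), so the
# has_child avg covers ALL training documents per language
avg_over_all_query = {
    "query": {
        "has_child": {
            "type": "data",
            "score_type": "avg",
            "query": {
                "bool": {
                    "must": {"match_all": {}},
                    "should": {"match": {"text": "en svensk text"}},
                }
            },
        }
    }
}

print(json.dumps(avg_over_all_query, indent=2))
```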
Using these simple ‘tweaks’ we have actually created a very accurate language detector. If you want to give it a try you can create training sets from the Leipzig Corpora Collection for the different languages you want to detect.
However, this convention can be a little bit aggressive. As soon as an editor adds a file it is searchable, and even though no access control mechanisms are overruled, some might assume a file is hidden until it is actually used on the site. So how do we proceed to achieve this?
Built into the CMS is the ContentSoftLinkRepository, where we can query whether files (or any IContent for that matter) are linked from within another IContent. Using this we can create a file indexing convention that checks if the file is linked from some indexed IContent and, if so, indexes it:
With this convention, only files that are referenced from an indexed and published IContent are indexed (by default, unpublished IContent is also indexed to provide better querying in edit mode).
I hope you may find this useful when making your site awesome with EPiServer Find!
Sometimes we need to be able to filter documents based on matching a specific object in a list of complex objects on the document. Say for instance that we have documents with a list of all authors that have contributed, where each author has a set of properties such as name and address. We then want to find all documents where one of the authors matches a specific set of criteria, say all documents that have a Swedish author named Henrik. This is what Nested2Find enables you to do. It lets you define nested lists of complex objects on a document, for which you can later specify matching criteria when querying.
Add the nested conventions to the conventions:
Create an object containing a NestedList<> of objects (NestedList<> is simply a typed List<>):
Index and start filtering:
or:
We often have a large amount of data that we know falls into different categories and that we can use as a sample space for predicting unseen data. If we index all the data we have and treat the unseen data as ‘queries’, we should be able to construct queries where the score against the known data space represents a similarity between the ‘objects’. However, now that we have a result where each ‘object’ has a similarity score with the ‘queried’ object, how should we interpret it? What if we group all items in the result by the category we know they fall into, calculate the avg similarity score within each group and return the category for which the avg score is the highest? Depending on your data and problem space, constructing the similarity query might be more or less easy, but the beauty of this approach is that it is very easy to implement in Elasticsearch using parent-child mappings.
We start by creating an index (I’ve named it ‘myindex’) and add a parent mapping between a category type and a data type:
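Sketched as an index-creation body (the field names are illustrative; the essential part is the _parent mapping on the child type, which is what enables has_child queries with child scoring later on):

```python
import json

# Parent type "category" with child type "data"; creating the index with
# this body establishes the parent-child relation between the two types
create_index_body = {
    "mappings": {
        "category": {"properties": {"name": {"type": "string"}}},
        "data": {
            "_parent": {"type": "category"},
            "properties": {"text": {"type": "string"}},
        },
    }
}

print(json.dumps(create_index_body, indent=2))
```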
Then we add the known categories:
For the ‘known’ data we index it and relate it to the parent category:
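The two indexing steps above can be sketched as a list of HTTP requests against the index (the IDs and sample sentences are made up for illustration; the parent query-string parameter is what links each data document to its category):

```python
# One parent "category" document per language, plus training sentences
# indexed as "data" children routed to their parent via ?parent=<id>
categories = {"sv": {"name": "swedish"}, "en": {"name": "english"}}
samples = [("sv", "en svensk mening"), ("en", "an english sentence")]

def index_requests(index="myindex"):
    reqs = []
    for cid, body in categories.items():
        reqs.append(("PUT", f"/{index}/category/{cid}", body))
    for i, (parent, text) in enumerate(samples):
        reqs.append(("PUT", f"/{index}/data/{i}?parent={parent}", {"text": text}))
    return reqs

for method, path, body in index_requests():
    print(method, path, body)
```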
Using the has_child query we can then easily achieve the described approach of searching for categories and having them returned based on the avg score of a child query issued on the data. (In this case we just do a simple query_string query with the text we want to do language detection for, ‘en svensk text’):
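Roughly, the search body could look like this (score_type was the parameter name for child scoring in the 0.20.x-era Elasticsearch discussed below):

```python
import json

# has_child query scoring each category by the avg score of its children;
# the child query is just a query_string over the text to classify
detect_query = {
    "query": {
        "has_child": {
            "type": "data",
            "score_type": "avg",
            "query": {"query_string": {"query": "en svensk text"}},
        }
    }
}

print(json.dumps(detect_query, indent=2))
```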
Resulting in:
Voila! In just 4 steps you now have your very own language detector.
The described language detector might be a bit simplistic but can easily be made more advanced by adding more sophisticated analyzers for your indexed data, using stemmers or the like for the appropriate language. The approach can also be tweaked to do much more advanced queries on more complex objects using all the fancy query types available in Elasticsearch (just ignore the ConstantScore query ;-)). Only your imagination stops you from creating appropriate similarity queries that might give you really good results in just a few easy steps!
So stop looking at your ‘search engine’ as just a search page provider and see it as a great tool not only for querying but also as a great source for analyzing your data!
The Elasticsearch 0.20.x releases have a bug in the has_child query, calculating the sum instead of the avg score when using score_type=avg. I’ve issued a pull request, but if you want to try it out you can clone and build from the master branch, where the upgraded Lucene distribution circumvents the problem.
The simplest way of doing this is to associate each document with all levels of each of its categories. Then, by using a terms facet when fetching your result, you get an aggregated count for each node in your category tree. Voila, there is your hierarchical facet. But with just a few lines of code you can get Find to take all that dirty work off your hands: all you have to do is pass a category string (each level separated by a ‘/’) and get back a facet that parses the result and reflects the nested structure of the category tree. I will leave out the implementation details, but at HierarchicalFacet2Find you can fetch your own copy of the code that does the work for you.
Add a Hierarchy property to the document:
Set the hierarchy path:
Index and request a HierarchicalFacet when searching:
Fetch it from the result:
Loop over the nested hierarchy paths:
I hope you may find this useful when making your site awesome with EPiServer Find!
Given an object with a TimeToLive property (a TimeSpan):
We first need to configure the client and register the TimeToLive-property for the given object type:
When indexing an object we simply assign a TimeSpan value to the property, specifying how long the object should reside in the index:
I hope you may find this useful when making your site awesome with EPiServer Find!
The granularity of time to live is 60 seconds, meaning that documents will be deleted within 60 seconds of their actual time to live.
Instead of configuring the TimeToLive property via conventions, it can be annotated with the TimeToLiveAttribute:
or passed in the index call: