
Ayende @ Rahien
Unnatural acts on source code

Using Lucene – External Indexes

5 hours 38 min ago

Lucene is a document indexing engine; that is its sole goal, and it does so beautifully. The interesting bit about using Lucene is that it probably wouldn’t be your main data store, but it is likely to be an important piece of your architecture.

The major shift in thinking with Lucene is that while indexing is relatively expensive, querying is free (well, not really, but you get my drift). Compare that to a relational database, where it is usually the inserts that are cheap, and the queries that cause us issues. RDBMS are also very good at giving us different views on top of our existing data, but the more you want from them, the more they have to do. We hit that query performance limit again. And we haven’t started talking about locking, transactions or concurrency yet.

Lucene doesn’t know how to do things you didn’t set it up to do. But what it does do, it does very fast.

Add to that the fact that in most applications, reads happen far more often than writes, and you get a different constraint system. Because queries are expensive on the RDBMS, we try to make few of them, and we try to make a single query do most of the work. That isn’t necessarily the best strategy, but it is a very common one.

With Lucene, it is cheap to query, so it makes a lot more sense to perform several individual queries and process their results together to get the final result that you need. It may require somewhat more work (although there are things like Solr that would do it for you), but it results in far better system performance overall.

In addition to that, since the Lucene index is important, but can always be re-created from source data (it may take some time, though), it doesn’t require all the ceremony associated with DB servers. Instead of buying an expensive server, get a few cheap ones. Lucene scales easily, after all. And since you only use Lucene for indexing, your actual DB querying pattern shifts. Instead of making complex queries in the database, you make them in Lucene, and you only hit the DB with queries by primary key, which are the fastest possible way to get the data.

In effect, you outsourced your queries from the expensive machines to the cheap ones, and for a change, you actually got better performance overall.
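A minimal sketch of that query pattern, with plain in-memory dictionaries standing in for both Lucene and the database (all names here are hypothetical, and this is not the Lucene API): several cheap index lookups are combined, and the data store is only ever hit by primary key.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ExternalIndexSketch
{
    // Stand-in for the database: rows are only ever fetched by primary key.
    static readonly Dictionary<int, string> Database = new Dictionary<int, string>
    {
        { 1, "Rhino Mocks release notes" },
        { 2, "NHibernate mapping guide" },
        { 3, "Rhino ETL release notes" },
    };

    // Stand-in for the external index: term -> primary keys of matching rows.
    static readonly Dictionary<string, HashSet<int>> Index = BuildIndex();

    // The index is derived data, so it can always be rebuilt from the source.
    static Dictionary<string, HashSet<int>> BuildIndex()
    {
        var index = new Dictionary<string, HashSet<int>>();
        foreach (var row in Database)
        {
            foreach (var term in row.Value.ToLowerInvariant().Split(' '))
            {
                HashSet<int> ids;
                if (!index.TryGetValue(term, out ids))
                    index[term] = ids = new HashSet<int>();
                ids.Add(row.Key);
            }
        }
        return index;
    }

    // One cheap index lookup per term; the results are combined in memory.
    public static IEnumerable<int> Search(params string[] terms)
    {
        IEnumerable<int> result = null;
        foreach (var term in terms)
        {
            HashSet<int> ids;
            if (!Index.TryGetValue(term.ToLowerInvariant(), out ids))
                return Enumerable.Empty<int>();
            result = result == null ? ids : result.Intersect(ids);
        }
        return (result ?? Enumerable.Empty<int>()).OrderBy(id => id);
    }

    public static void Main()
    {
        // The "database" is only touched by primary key, after the index did the search.
        foreach (var id in Search("rhino", "release"))
            Console.WriteLine("{0}: {1}", id, Database[id]);
    }
}
```

With a real Lucene index the lookups would be Lucene queries, but the shape stays the same: search in the index, then load by id.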

Categories: Blogs

Hibernate Profiler New Feature: Parameters Values

Tue, 03/09/2010 - 13:12

One of the annoying things about the Hibernate port of the profiler was that JDBC didn’t provide us with the parameter values.

Eric has just fixed that, and it is now live:

hibernate-parameters

Enjoy…

Categories: Blogs

Cut the abstractions by putting test hooks

Tue, 03/09/2010 - 11:00

I have been hearing about testable code for a long time now. It looks somewhat like this, although I had to cut on the number of interfaces along the way.

We go through a lot of contortions to be able to do something fairly simple: avoid hitting a data store in our tests.

This is actually inaccurate: we are putting a lot of effort into being able to do that without changing production code. There are even a lot of explanations of how testable code is decoupled, easy to change, etc.

In my experience, one common problem is that we put too much abstraction in our code. Sometimes it actually serves a purpose, but in many cases, it is just something that we do to enable testing. But we still pay the hit in design complexity anyway.

We can throw all of that away, and keep only what we need to run production code, but that would mean that we would have a harder time with the tests. But we can resolve the issue very easily by making my infrastructure aware of testing, such as this example:

image

But now your production code was changed by tests?!

Yes, it was, so? I never really got the problem people had with this, but in this day and age, putting in the hooks to make testing easier just makes sense. Yes, you can go with the “let us add abstractions until we can do it”, but it is much cheaper and faster to go with this approach.

Moreover, notice that this is part of the infrastructure code, which I don’t consider as production code (you don’t touch it very often, although of course it has to be production ready), so I don’t have any issue with this.
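The screenshot is not reproduced here, so the following is only a hypothetical sketch of what such a hook could look like. `DataStore`, `OpenForTests` and `UserService` are made-up names, not the code from the original image. The infrastructure class exposes a hook that is null in production and that tests can set to avoid the real data store.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical infrastructure class; none of these names come from the post.
public static class DataStore
{
    // Test hook: null in production, set by tests to avoid the real store.
    public static Func<IDictionary<string, string>> OpenForTests;

    public static IDictionary<string, string> Open()
    {
        if (OpenForTests != null)
            return OpenForTests();
        return OpenProductionStore();
    }

    static IDictionary<string, string> OpenProductionStore()
    {
        // Placeholder: a real implementation would connect to the database here.
        throw new InvalidOperationException("No production store in this sketch.");
    }
}

// Production code talks to the infrastructure directly, with no extra interfaces.
public class UserService
{
    public string GetEmail(string userId)
    {
        string email;
        return DataStore.Open().TryGetValue(userId, out email) ? email : null;
    }
}

public static class TestHookSketch
{
    public static void Main()
    {
        // A test swaps in an in-memory store through the hook.
        var fake = new Dictionary<string, string> { { "ayende", "ayende@ayende.com" } };
        DataStore.OpenForTests = () => fake;
        Console.WriteLine(new UserService().GetEmail("ayende"));
    }
}
```

The trade-off is exactly the one discussed above: `UserService` stays free of test-only interfaces, at the cost of one test-aware member in the infrastructure.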

Nitpicker corner: Let us skip the whole TypeMock discussion, shall we?

TDD fanatic corner: I don’t really care about testable code, I care about tested code. If I have a regression suite, that is just what I need.

Categories: Blogs

Lessons learned from building the NHibernate Profiler

Mon, 03/08/2010 - 11:00

Last week I gave a talk about the things I learned from building NH Prof.

Skills Matter had recorded the session and made it available.

Looking forward to your comments, but I should add the disclaimer that this was after a full day of teaching and on 50 minutes of sleep in the last 48 hours.

Categories: Blogs

Slaying relational hydras (or dating them)

Mon, 03/08/2010 - 09:46

Sometimes client confidentiality can be really annoying, because the problem sets & suggested solutions that come up are really interesting. That said, since I am interested in having future clients, it is pretty much a must-have. As such, the current post represents a real world customer problem, but probably in a totally different context. In fact, I would be surprised if the customer was able to recognize the problem as his.

That said, the problem is actually quite simple. Consider a dating site, where you can input your details and what you seek, and the site will match you with the appropriate person. I am going to ignore a lot of things here, so if you actually have built a dating site, try not to cringe.

At the most basic level, we have two screens: the My Details screen, where users can specify their stats and their preferences:

image

And the results screen, which shows the user the candidates matching their preferences.

There is just one interesting tidbit: the list of qualities is pretty big (hundreds or thousands of potential qualities).

Can you design a relational model that would be a good fit for this? And allow efficient searching?

I gave it some thought, and I can’t think of one, but maybe you can.

I’ll follow up on this post in a day or two, showing how to implement the problem using Raven.

Categories: Blogs

Rhino Divan DB – Performance

Sun, 03/07/2010 - 11:00

The usual caveat applies, I am running this in process, using small data size and small number of documents.

These aren’t so much real benchmarks, but they are good enough to give me a rough indication of where I am heading, and whether or not I am going in completely the wrong direction.

Those two should be obvious:

image image

This one is more interesting. RDB doesn’t do immediate indexing; I chose to accept higher CUD throughput and make indexing a background process. That means that the index may be inconsistent for a short while, but it greatly reduces the amount of work required to insert/update a document.

But how long is that short while in which the index and the DB may be inconsistent? The average time seems to be around 25 ms in my tests, with some spikes toward 100 ms in some of the starting results. In general, it looks like things improve the longer the database runs. Trying it out with 5,000 documents gives me an average update duration of 27 ms, but note that I am testing the absolute worst usage pattern: lots of small documents inserted one at a time, with index requests coming in as well.

image

Be that as it may, having an inconsistency period measured in a few milliseconds seems acceptable to me. Especially since RDB is nice enough to actually tell me if there are any inconsistencies in the results, so I can choose whether to accept them or retry the request.
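A rough sketch of that accept-or-retry choice, assuming a result object that carries a staleness flag. `QueryResult`, `IsStale` and `QueryUntilFresh` are assumptions made for illustration, not RDB’s actual API.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical result type: the names are assumptions, not RDB's actual API.
public class QueryResult
{
    public bool IsStale;
    public List<string> Documents = new List<string>();
}

public static class StaleQuerySketch
{
    // Retry until the index reports that it has caught up, or give up
    // and return the last (possibly stale) result.
    public static QueryResult QueryUntilFresh(
        Func<QueryResult> query, int maxAttempts, TimeSpan delay)
    {
        QueryResult result = query();
        for (int attempt = 1; attempt < maxAttempts && result.IsStale; attempt++)
        {
            Thread.Sleep(delay);
            result = query();
        }
        return result;
    }

    public static void Main()
    {
        int calls = 0;
        // Fake index: stale on the first two calls, fresh from the third call on.
        Func<QueryResult> fakeQuery = () => new QueryResult { IsStale = ++calls < 3 };
        var result = QueryUntilFresh(fakeQuery, 5, TimeSpan.FromMilliseconds(25));
        Console.WriteLine("stale: {0}, calls: {1}", result.IsStale, calls);
    }
}
```

A caller that can live with slightly old data would simply skip the retry loop and use the first result.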

Categories: Blogs

Getting code ready for production

Sat, 03/06/2010 - 11:00

I am currently doing the production-ready pass through the Rhino DivanDB code base, and I thought that this change was interesting enough to post about:

public void Execute()
{
    while(context.DoWork)
    {
        bool foundWork = false;
        transactionalStorage.Batch(actions =>
        {
           var task = actions.GetFirstTask();
           if(task == null)
           {
               actions.Commit(); 
               return;
           }
           foundWork = true;

           task.Execute(context);

           actions.CompleteCurrentTask();

           actions.Commit();
        });
        if(foundWork == false)
            context.WaitForWork();
    }
}

This is the “just get things working” phase. When getting a piece of code ready for production, I am looking for several things:

  • If this is running in production, and I get the log file, will I be able to understand what is going on?
  • Should this code handle any exceptions?
  • What happens if I send values from a previous version? From a future version?
  • Am I doing unbounded operations?
  • For error handling, can I avoid any memory allocations?

The result for this piece of code was:

public void Execute()
{
    while(context.DoWork)
    {
        bool foundWork = false;
        transactionalStorage.Batch(actions =>
        {
            var taskAsJson = actions.GetFirstTask();
            if (taskAsJson == null)
            {
                actions.Commit();
                return;
            }
            log.DebugFormat("Executing {0}", taskAsJson);
            foundWork = true;

            Task task;
            try
            {
                task = Task.ToTask(taskAsJson);
                try
                {
                    task.Execute(context);
                }
                catch (Exception e)
                {
                    if (log.IsWarnEnabled)
                    {
                        log.Warn(string.Format("Task {0} has failed and was deleted without completing any work", taskAsJson), e);
                    }
                }
            }
            catch (Exception e)
            {
                log.Error("Could not create instance of a task from " + taskAsJson, e);
            }

            actions.CompleteCurrentTask();
            actions.Commit();
        });
        if(foundWork == false)
            context.WaitForWork();
    }
}

The code size blows up really quickly.

Categories: Blogs

Sometimes I have code blinders on

Fri, 03/05/2010 - 11:00

This is a piece of code that I am using in RDB, at some point, it threw a null reference exception:

image

I am ashamed to admit that I started doing some really deep debugging to understand the bug (this happens only under very strange circumstances).

When I figured out what it was, I was deeply ashamed, this is easy.

Categories: Blogs

Actual scenario testing with Raven

Fri, 03/05/2010 - 01:09

Yesterday I posted about doing scenario testing with Raven, and I showed the concept of what I am doing. This time, I wanted to show what I am actually talking about, and how this is implemented. Here are the current scenarios for Raven.

image

Each scenario looks something like this (showing PutAndGetDocument here):

image

And the second request:

image

The scenarios are being picked up using:

public class AllScenariosWithoutExplicitScenario
{
    [Theory]
    [PropertyData("ScenariosWithoutExplicitScenario")]
    public void Execute(string file)
    {
        new Scenario(Path.Combine(ScenariosPath, file+".saz")).Execute();
    }

    public static string ScenariosPath
    {
        get
        {
            return Directory.Exists(@"..\..\bin") // running in VS
                       ? @"..\..\Scenarios" : @"..\Raven.Scenarios\Scenarios";
        }
    }

    public static IEnumerable<object[]> ScenariosWithoutExplicitScenario
    {
        get
        {
            foreach (var file in Directory.GetFiles(ScenariosPath,"*.saz"))
            {
                if (typeof(Scenario).Assembly.GetType("Raven.Scenarios." +
                          Path.GetFileNameWithoutExtension(file) +"Scenario") != null)
                    continue;
                yield return new object[] {Path.GetFileNameWithoutExtension(file)};
            };
        }
    }
}

There are two reasons why I am ignoring explicit scenarios. Adding a class for a specific scenario allows me to run the scenario in the debugger, and also allows me to selectively skip certain scenarios if I need to.

Scenario.Execute is fairly involved: it parses the Fiddler saz file, builds the appropriate requests and compares the results to the expected responses. It is also smart enough to handle changing things like ETags and pass them along.

The end result is that I can very easily add new scenarios as I add new features that require tests.

Categories: Blogs

Is select (System.Uri) broken?

Thu, 03/04/2010 - 20:05

I can’t really figure out what is going on!

Take a look:

image

The value :

http://localhost:58080/indexes/categoriesByName?query=CategoryName%3ABeverages&start=0&pageSize=25

And the problem is that I can’t figure out why calling this once would fail, but calling it the second time would work. That is leaving aside the fact that this looks like a pretty good url to me.

Any ideas? This is perfectly reproducible on one project, but I can’t reproduce this on another project.

Updates:

  • This is System.Uri
  • The issue is that it fails the first time, and works the second!
  • The exception is:
  • System.ArgumentNullException: Value cannot be null.
    Parameter name: str
       at System.Security.Permissions.FileIOPermission.HasIllegalCharacters(String[] str)
       at System.Security.Permissions.FileIOPermission.AddPathList(FileIOPermissionAccess access, AccessControlActions control, String[] pathListOrig, Boolean checkForDuplicates, Boolean needFullPath, Boolean copyPathList)
       at System.Security.Permissions.FileIOPermission..ctor(FileIOPermissionAccess access, String path)
       at System.Uri.ParseConfigFile(String file, IdnScopeFromConfig& idnStateConfig, IriParsingFromConfig& iriParsingConfig)
       at System.Uri.GetConfig(UriIdnScope& idnScope, Boolean& iriParsing)
       at System.Uri.InitializeUriConfig()
       at System.Uri.InitializeUri(ParsingError err, UriKind uriKind, UriFormatException& e)
       at System.Uri.CreateThis(String uri, Boolean dontEscape, UriKind uriKind)
       at System.Uri..ctor(String uriString)
       at Raven.Scenarios.Scenario.GetUri_WorkaroundForStrangeBug(String uriString) in C:\Work\ravendb\Raven.Scenarios\Scenario.cs:line 155

  • This is a console application.

Okay, I can reproduce this now, here is how I got there:

public class Strange : MarshalByRefObject
{
    public void WTF()
    {
        Console.WriteLine(AppDomain.CurrentDomain.SetupInformation.ConfigurationFile);
        new Uri("http://localhost:58080/indexes/categoriesByName?query=CategoryName%3ABeverages&start=0&pageSize=25");
    }
}

public class Program
{
    private static void Main()
    {
        var instanceAndUnwrap = (Strange) AppDomain.CreateDomain("test", null, new AppDomainSetup
        {
            ConfigurationFile = ""
        }).CreateInstanceAndUnwrap("ConsoleApplication5", "ConsoleApplication5.Strange");
        instanceAndUnwrap.WTF();
    }
}

That took some time to figure out.

The reason that I got this issue is that I am running this code as part of a unit test, and xUnit seems to be running my code under these conditions.

Categories: Blogs

Scenario based testing in Rhino DivanDB

Thu, 03/04/2010 - 11:00

Here is a unit test testing Rhino DivanDB:

image

Here is a test that tests the same thing, using scenario based approach:

image

What are those strange files? Well, let us take a peek at the first one:

0_PutDoc.request 0_PutDoc.response

PUT /docs HTTP/1.1
Content-Length: 283

{
    "_id": "ayende",
    "email": "ayende@ayende.com",
    "projects": [
        "rhino mocks",
        "nhibernate",
        "rhino service bus",
        "rhino divan db",
        "rhino persistent hash table",
        "rhino distributed hash table",
        "rhino etl",
        "rhino security",
        "rampaging rhinos"
    ]
}

HTTP/1.1 201 Created
Connection: close
Content-Length: 15
Content-Type: application/json; charset=utf-8
Date: Sat, 27 Feb 2010 08:12:08 GMT
Server: Kayak

{"id":"ayende"}

Those are just test files, corresponding to the request and the expected response.

RDB turns those into tests by issuing each request in turn and asserting on the actual output. This is slightly more complicated than it seems, because some requests contain things like dates or generated guids. The scenario runner is aware of those and resolves them automatically. Another issue is dealing with potentially stale requests, especially because we are issuing requests on the same data immediately. Again, this is something that the scenario runner handles internally, and we don’t have to worry about it.

There are some things here that may not be immediately apparent. We are doing pure state-based testing; in fact, this is black box testing. The scenarios define the external API of the system, which is a nice addition.

We don’t care about the actual implementation, look at the unit test, we need to setup a db instance, start the background threads, etc. If I modify the DocumentDatabase constructor, or the initialization process, I need to touch each test that uses it. I can try to encapsulate that, but in many cases, you really can’t do that upfront. SpinBackgroundWorkers, for example, is something that is required in only some of the unit tests, and it is a late addition. So I would have to go and add it to each of the tests that require it.

Because the scenarios don’t have any intrinsic knowledge about the server, any required change is something that you would have to do in a single location, nothing more.

Users can send me failure scenarios. I am using this extensively with NH Prof (InitializeOfflineLogging), and it is amazing. When a user runs into a problem, I can tell them, please send me a Fiddler trace of the issue, and I can turn that into a repeatable test in a matter of moments.

I actually thought about using Fiddler’s saz files as the format for my scenarios, but I would have to actually understand them first. :-) It doesn’t look hard, but flat files seemed easier still.

Actually, I went ahead and made the modification, because now I have even less friction: just record a Fiddler session, drop it in a folder, and I have a test. It turned out that the Fiddler format is very easy to work with.

Categories: Blogs

Challenge: Robust enumeration over external code

Thu, 03/04/2010 - 02:54

Here is an interesting little problem:

public class Program
{
    private static void Main()
    {
        foreach (int i in RobustEnumerating(Enumerable.Range(0, 10), FaultyFunc))
        {
            Console.WriteLine(i);
        }
    }

    public static IEnumerable<T> RobustEnumerating<T>(
        IEnumerable<T> input,Func<IEnumerable<T>, IEnumerable<T>> func)
    {
        // how to do this?
        return func(input);

    }

    public static IEnumerable<int> FaultyFunc(IEnumerable<int> source)
    {
        foreach (int i in source)
        {
            yield return i/(i%2);
        }
    }
}

This code should not throw, but print:

1
3
5
7
9

Can you make this happen? You can only change the RobustEnumerating method, nothing else in the code.
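One possible approach, offered as a sketch rather than the intended answer: once an iterator throws it cannot be resumed, so the wrapper can feed the function one item at a time and drop any item that makes it throw. This assumes that the function treats items independently. A function that needs to see the whole sequence at once (an aggregate, say) would behave differently under this wrapper.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RobustEnumerationSketch
{
    public static IEnumerable<T> RobustEnumerating<T>(
        IEnumerable<T> input, Func<IEnumerable<T>, IEnumerable<T>> func)
    {
        foreach (T item in input)
        {
            var results = new List<T>();
            try
            {
                // Feed func one item at a time, so a failure only loses that item.
                results.AddRange(func(new[] { item }));
            }
            catch (Exception)
            {
                continue; // skip the item that made func throw
            }
            // C# does not allow yield return inside a try block that has a catch
            // clause, so the results are buffered above and yielded here.
            foreach (T result in results)
                yield return result;
        }
    }

    // Same function as in the challenge: throws DivideByZeroException for even numbers.
    public static IEnumerable<int> FaultyFunc(IEnumerable<int> source)
    {
        foreach (int i in source)
        {
            yield return i / (i % 2);
        }
    }

    public static void Main()
    {
        foreach (int i in RobustEnumerating(Enumerable.Range(0, 10), FaultyFunc))
        {
            Console.WriteLine(i);
        }
    }
}
```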

Categories: Blogs

Git is teh SUCK

Wed, 03/03/2010 - 06:55

Today, I had two separate incidents in which my git repository was corrupted! To the point that nothing worked: not git fsck, not git reflog, not git just-work-or-i-WILL-shoot-you.

The first time, there was no harm done, I just cloned my repository again, and moved on. The second time that it happened, it was after I had ~10 commits locally that weren’t pushed. I had my working copy intact, but I didn’t want to lose the history. I asked around, and got a couple of suggestions to move to mercurial instead, because git has no engineering behind it.

Based on that feedback, I …

Oh, wait, it isn’t this sort of a post.

What I actually did was set up Process Monitor and watch what git.exe was actually doing. I noticed that it was searching for a .git/objects directory, and couldn’t find it anywhere in the path. Indeed, looking there myself, it appeared clear that there was no objects directory under the .git dir. And checking in other repositories showed that they had it. So now I knew why, but I still had no idea who the #*@# decided to randomly @#$%( my repository, totally derailing my productivity.

That is where having multiple personalities comes in handy: he did it. The one that isn’t writing this blog post. At some point during the day, there was a need to zip the repository and send it somewhere. Since the working copy is full of crap, that idiot issued the following:

ls -R obj | rm –F

ls -R bin | rm –F

(Not the exact commands, the idiot used the UI to do a search & delete).

You can guess the following from there. At this point, having come to this astounding discovery, I heroically went to the recycle bin, found the objects directory there, and rescued it! All is well, except that there is still a thrashing for uncommon stupidity owed.

And remember, it wasn’t me, it was the other one who did that!

And yes, the spelling mistake in the title is intentional.

Categories: Blogs

DotNetRocks on Domain Specific Languages

Tue, 03/02/2010 - 19:59

Last week I recorded a DNR session about Domain Specific Languages.

I also talked just a bit about writing your own DSL, some challenges that I ran into since the book, the Boo language and why it is suitable for DSLs and how the entire process works.

Looking forward to your comments.

Categories: Blogs

Where do git repositories go when they die?

Tue, 03/02/2010 - 14:33

My RDB repository started giving me this error:

fatal: Not a git repository (or any of the parent directories): .git

I don’t think that I did anything to it, but it is still dead.

image

Any ideas how to recover this?

Update: Found why Git doesn't like my repository, it doesn't have .git\objects, but I have no idea where it could have gone to… or why.

Categories: Blogs

Rhino DivanDB – A full coding sample – Embedded

Tue, 03/02/2010 - 11:00

Rhino Divan DB is going to come in at least two forms, embedded, and remote. The following is a full example of starting DivanDB, defining a view, adding some documents and then querying the database.

Note that here we want to ensure that we get the most up to date result, so we refuse to accept a potentially stale query.

image

This outputs the right result, by the way :-)

Categories: Blogs

Rhino Divan DB – Design

Mon, 03/01/2010 - 11:00

One of the things that I wanted to do with RDB is to create an explicit actor model inside the codebase. I have been using a similar structure inside NH Prof, and it has been quite successful. The design goals for RDB are:

Assumptions for the database construction

Get / Put / Delete semantics for Json documents.

All those operations can access batches of documents to work on. Those operations fully implement ACID. Which means that if you got a successful response for a document Put, you can rely on the document always being there.

Those operations should be considered cheap.

Reboot / crash resistant

The DB can crash / restart, but no loss of functionality may occur; as soon as it restarts, everything goes on as usual. There can be no in-memory data structures / work that cannot be recovered from persistent structures.

Views for searching

The DB uses views, defined using linq expressions, for supporting search capabilities. Those views are background indexed (so no holding up request processing for views). When you get a result from a query you always know if the result is stale or not.

Adding a view to an existing database is a cheap operation, regardless of the database size. During view construction, the view can be queried (but its results will be considered stale). Reboot during view construction will not impact the construction process.

Indexing a document twice is a stable operation, which means that a view can always choose to re-index things if it so chooses.

Overall design

image

RDB stores two major pieces of information in transactional storage.

Documents, obviously, which are stored in a format that allows sending the document content to the user quickly, and tasks.

Tasks are how RDB maintains state over crashes / reboots, and they also form the base of async work of the database. Any work that is going to take some time for the database to perform is written to transactional storage as a task. Those tasks are things like: “View ‘peopleByName’ should index documents 1 – 42”.

There are background threads working off of this task queue, performing the work and removing each task when it is completed.

The results of each view are written to a Lucene index (one per view).

So far I have the entire structure done. I need to do some polishing, and I have a different OSS strategy to go with, but things are looking good.

Categories: Blogs

Black box reverse engineering speculation

Sun, 02/28/2010 - 12:10

Terrance has pointed me to a really interesting feature in Solr, called Facets. After reading the documentation, I am going to try and guess how this is implemented, based on my understanding of how Lucene works.

But first, let me explain what Facets are. Facets are a way to break down a search result in a way that would give the user more meaningful results. I haven’t looked at the code, and haven’t read any further than that single link, but I think that I can safely extrapolate from that. I mean, the worst case that could happen is that I would look stupid.

Anyway, Lucene, the underpinning of Solr, is a document indexing engine. It has no ability to do any sort of aggregation, and I am pretty sure that Solr didn’t sneak in something relational when no one was looking. So how can it do this sort of thing?

Well, let us look at a simple example: ?q=camera&facet=true&facet.field=manu, which will give us the following results:

<!-- search results snipped -->
<lst name="facet_fields">
  <lst name="manu">
    <int name="Canon USA">17</int>
    <int name="Olympus">12</int>
    <int name="Sony">12</int>
    <int name="Panasonic">9</int>
    <int name="Nikon">4</int>
  </lst>
</lst>

Remember what we said about Lucene being an indexing engine? You can query the index itself very efficiently, and this sort of result is something that Lucene can provide you instantly.

Moreover, when we start talking about faceting on prices, which looks something like this:

?q=camera&facet=true&facet.query=price:[* TO 100]
    &facet.query=price:[100 TO 200]&facet.query=price:[200 TO 300]
    &facet.query=price:[300 TO 400]&facet.query=price:[400 TO 500]
    &facet.query=price:[500 TO *]

It gets even nicer. If I had that problem (which I actually do, but that is a story for another day), I would resolve it using multiple individual Lucene searches. Something like:

  • type:camera –> get docs
  • type:camera price:[* TO 100] –> but just get count
  • type:camera price:[100 TO 200] –> but just get count

In essence, Solr functions as a query batching mechanism on top of Lucene, and then massages the data into a form that is easy to consume by the front end. That is quite impressive.
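To make the guess concrete, here is a small sketch of “facets as a batch of count-only queries”. It runs against an in-memory list instead of a Lucene index, and `Product` and `CountByPriceRange` are invented names, so it only illustrates the speculation above, not how Solr actually does it.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public class Product
{
    public string Type;
    public decimal Price;
}

public static class FacetSketch
{
    // Runs one count-only "query" per price range. Ranges are inclusive on
    // both ends, so a price that sits exactly on a boundary is counted twice.
    public static Dictionary<string, int> CountByPriceRange(
        List<Product> index, string type, decimal[] bounds)
    {
        var facets = new Dictionary<string, int>();
        for (int i = 0; i <= bounds.Length; i++)
        {
            // Open-ended on either side, like price:[* TO 100] and price:[500 TO *].
            decimal? low = i == 0 ? (decimal?)null : bounds[i - 1];
            decimal? high = i == bounds.Length ? (decimal?)null : bounds[i];
            string label = string.Format("price:[{0} TO {1}]",
                low.HasValue ? low.Value.ToString(CultureInfo.InvariantCulture) : "*",
                high.HasValue ? high.Value.ToString(CultureInfo.InvariantCulture) : "*");
            // Each facet is its own cheap query; only the count is kept.
            facets[label] = index.Count(p => p.Type == type
                && (low == null || p.Price >= low)
                && (high == null || p.Price <= high));
        }
        return facets;
    }

    public static void Main()
    {
        var index = new List<Product>
        {
            new Product { Type = "camera", Price = 80m },
            new Product { Type = "camera", Price = 150m },
            new Product { Type = "camera", Price = 450m },
            new Product { Type = "lens", Price = 90m },
        };
        foreach (var facet in CountByPriceRange(index, "camera", new[] { 100m, 200m }))
            Console.WriteLine("{0} -> {1}", facet.Key, facet.Value);
    }
}
```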

By doing this aggregation, Solr can provide some really impressive capabilities, on top of a really simple concept. I am certainly going to attempt something similar for Raven.

Of course, I may have headed in the completely wrong direction, in which case I am busy wiping egg off my face.

Categories: Blogs

User Interface Delegation – Rhino Divan DB

Sat, 02/27/2010 - 21:47

No, this isn’t a post about how to do UI delegation in software. This is me, looking at HTML code and realizing that I have a deep and troubling disadvantage: while I know how to build UI, I am just not very good at it.

For a certain value of done, Rhino Divan DB is done. Architecture is stable, infrastructure is in place, everything seems to be humming along nicely. There is still some work to be done (error handling, production worthiness, etc) but those are relatively minor things.

The most major item on the table, however, is providing a basic UI on top of the DB. I already have a working prototype, but I came to realize that I am tackling something that I am just not good at.

This is where you, hopefully, come into play. I am interested in someone with a good grasp of HTML/JS to build a very simple CRUD interface on top of a JSON API. That is simple enough, but the small caveat is that my deadlines are short: I would like to see something working tomorrow.

Is anyone up for the challenge?

Categories: Blogs

NHibernate donation campaign

Sat, 02/27/2010 - 12:21

NHibernate is the most popular Open Source Object Relational Mapper in the .NET framework. As an Open Source project, all the work done on it is done for free.  We would like to be able to dedicate more time to NHibernate, but even as a labor of love, the amount of time that we can spend on a free project is limited.

In order to facilitate that, we opened a donation campaign that will allow you to donate money to the project.

 NHibernate and make a donation at www.pledgie.com !

What is this money going to be used for?

This money will go directly to NHibernate development, primarily to sponsor the time required for development of NHibernate itself.

Donation Matching

Moreover, my company, Hibernating Rhinos, is going to match any donation to this campaign (to a total limit of $5,000), as a way to give back to the NHibernate project for the excellent software it produced.

In addition to that, my company will also sponsor any resources needed for the project, such as production servers (NHibernate’s Jira is already running on our servers), build machines, etc.

Why should you donate?

If you are a user of NHibernate, you gained a lot from building on such a solid foundation. We ask you to donate so that we can make the project even better. If your company uses NHibernate, ask it to donate to this campaign.

Thanks,

~Ayende

Categories: Blogs