Building a scalable real-time search architecture with Sphinx

Intro

People store a lot of documents and other business knowledge on Papyrs and so we wanted to add search functionality so people could get to their documents more quickly. Here we’re going to give the technical explanation of how we got it all to work.

Much to our surprise we couldn’t find any package out there that met our (pretty basic) criteria. We were looking for:

  • really fast search (so results can be displayed as you type)
  • real-time indexing of new or changed documents (otherwise people who try our product for the first time won’t find the document they just created)
  • reliable unicode support (7 bits sure ain’t enough for everybody)
  • support for infix searches (important for reasons mentioned later)
  • an indexer and searcher that can scale relatively easily to multiple processes or servers when/if the need arises
  • stable performance (no segfaults please)
  • a search engine that lets us change the schema of the documents we’re indexing without breaking anything.
  • easy integration with a Python web app (using Django)

We looked at a number of search engines:

Lucene, Solr, Sphinx and PostgreSQL Full Text Search. We played with all of them but only Sphinx came close to meeting our criteria above. We’re pretty confident, looking back, that we made the right decision.

General introduction to Sphinx

Sphinx has two parts, an indexer and a search daemon. The search daemon listens for search queries such as “alpha & (delta | gamma)” and goes through the indexes for matches. The indexer reads data from a data source (relational database, XML pipe) and indexes it according to the document schema. When indexing has finished, it rotates (swaps) the index currently used by the search daemon with the new one. The old index is then deleted. This means (re)indexing and searching can happen in parallel, and even on different physical machines if needed.

Implementation

We have different sorts of documents: Pages, Comments, Attached files, Profiles, and filled out Forms. These documents are non-uniform: different sorts of documents have different attributes. So we don’t want to hard-code the structure of the index in sphinx.conf. Instead we’ll use sphinx XML pipe functionality and generate the schema structure and data from the Django Model as needed. So for each Django Model we create a sphinx index. Then when a user searches we do a search for every document type and combine the results and display them to the user.

We connect Sphinx to Python with the Python library sphinxapi.py included in the Sphinx package. It’s a pretty straightforward mapping of API functions to Python methods. You can set the match mode, how the matches are sorted, which indexes to search through and so on. There are also a number of open source libraries that connect Django and Sphinx. We looked at Django-Sphinx but it hasn’t been maintained in the past couple of years and it doesn’t support XML based data sources (which we want to use). It instead generates a sphinx.conf file with the indexes and schema structures in there.

Generating XML data

So let’s illustrate how XML generation works using an example Comment model. We add a Sphinx metaclass for each Django Model we want to index.

The classes Attr and Field are simple wrapper classes that we use to generate the Sphinx schema from. They also make sure that the generated XML data is of the correct type. Sphinx has a very fragile XML parser, so we have to make sure that boolean columns only contain boolean values, that everything is escaped properly and so on.
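To give an idea of the shape of such wrappers, here is a minimal sketch. Everything in it (the method name to_xml, the supported attribute types, the CommentSphinxSchema class and its column names) is an assumption made for illustration, not the actual Papyrs code, and it is written as plain Python so it runs without Django.

```python
from xml.sax.saxutils import escape


class Field:
    """Full-text field: free text that Sphinx tokenizes and indexes."""

    def __init__(self, name):
        self.name = name

    def to_xml(self, value):
        # Escape &, < and > so the text cannot break the XML document.
        return escape("" if value is None else str(value))


class Attr:
    """Typed attribute stored next to the document (used for filtering)."""

    def __init__(self, name, attr_type):
        if attr_type not in ("bool", "int", "timestamp"):
            raise ValueError("unsupported attribute type: %r" % attr_type)
        self.name = name
        self.attr_type = attr_type

    def to_xml(self, value):
        # Coerce to the declared type so the output never contains "None",
        # "True" or other strings a strict parser would reject.
        if self.attr_type == "bool":
            return "1" if value else "0"
        return str(int(value))


class CommentSphinxSchema:
    """Hypothetical schema for the example Comment model."""
    fields = [Field("text")]
    attrs = [Attr("is_deleted", "bool"), Attr("updated_at", "timestamp")]
```

The point of routing every value through to_xml is that type coercion and escaping happen in exactly one place, whatever the model looks like.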

Using the SphinxSchema definition above we can easily generate the XML schema:
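As a rough sketch of that step, the function below turns a list of field names and (name, type) attribute pairs into the schema element of Sphinx's xmlpipe2 format. The FIELDS and ATTRS lists and the function name are hypothetical stand-ins for whatever the schema class provides.

```python
from xml.sax.saxutils import quoteattr

# Hypothetical schema for the Comment model; names are illustrative only.
FIELDS = ["text", "author_name"]
ATTRS = [("is_deleted", "bool"), ("updated_at", "timestamp")]


def schema_xml(fields, attrs):
    """Return the <sphinx:schema> element of an xmlpipe2 stream."""
    lines = ["<sphinx:schema>"]
    for name in fields:
        # quoteattr adds the surrounding quotes and escapes the value.
        lines.append("<sphinx:field name=%s/>" % quoteattr(name))
    for name, attr_type in attrs:
        lines.append("<sphinx:attr name=%s type=%s/>"
                     % (quoteattr(name), quoteattr(attr_type)))
    lines.append("</sphinx:schema>")
    return "\n".join(lines)


print(schema_xml(FIELDS, ATTRS))
```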

So with the combination of schema and a Django QuerySet we can now generate the XML data for sphinx to index. Pseudocode:
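A minimal sketch of that generation step, under a few assumptions: records are plain dicts here (in Django they could come from something like queryset.values(...).iterator()), the only attribute types handled are bool, int and timestamp, and the function name is made up for this example.

```python
from xml.sax.saxutils import escape


def docset_xml(fields, attrs, records):
    """Yield an xmlpipe2 stream line by line.

    fields:  list of full-text field names
    attrs:   list of (name, type) pairs, type in {"bool", "int", "timestamp"}
    records: iterable of dicts with an "id" key plus every field and attr
    """
    yield '<?xml version="1.0" encoding="utf-8"?>'
    yield "<sphinx:docset>"
    yield "<sphinx:schema>"
    for name in fields:
        yield '<sphinx:field name="%s"/>' % name
    for name, attr_type in attrs:
        yield '<sphinx:attr name="%s" type="%s"/>' % (name, attr_type)
    yield "</sphinx:schema>"
    for record in records:
        yield '<sphinx:document id="%d">' % record["id"]
        for name in fields:
            # Free text must be escaped or one stray "&" breaks the stream.
            yield "<%s>%s</%s>" % (name, escape(str(record[name])), name)
        for name, attr_type in attrs:
            value = record[name]
            if attr_type == "bool":
                text = "1" if value else "0"
            else:
                text = str(int(value))
            yield "<%s>%s</%s>" % (name, text, name)
        yield "</sphinx:document>"
    yield "</sphinx:docset>"
```

Because it is a generator, the output can be written to stdout as it is produced, so a large table never has to fit in memory at once.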

This works but we have to optimize: we don’t want to reindex everything when a single record changes. So we use two indexes for every document type: a main index and a delta index. The main index is the large index that contains everything that hasn’t been touched recently and the delta contains those documents that have been recently created, modified or deleted. The delta index is small and can be re-indexed frequently. The easiest way to accomplish this is to give every model an “updated_at” timestamp, and every time a record is changed you update the timestamp.

Then you just partition the indexes into parts: the main index contains all records where [0 <= updated_at <= last_merge_time]. The delta contains all records where [last_merge_time < updated_at <= last_delta_time]. More partitions can be added if needed, but two indexes per document type will probably be good enough unless you have a huge database or documents change very frequently. Anyway, every time a user changes a document the indexer starts and re-indexes all files that have been changed since last_merge_time and updates last_delta_time to the current time (technically, the time when it *started* delta-indexing, because that's when the database transaction starts). See the illustration:



After an update the delta partition is completely re-indexed. Then the delta and main indexes are merged into one. During this time a few new documents arrive and the process starts anew.
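The partitioning rule can be sketched in a few lines. This is an illustration over plain dicts with integer timestamps; the function name is an assumption, and in Django the same split would more likely be expressed as two filters on updated_at (for example updated_at__lte and updated_at__gt).

```python
def partition(records, last_merge_time, last_delta_time):
    """Split records into (main, delta) by their updated_at timestamp.

    main:  0 <= updated_at <= last_merge_time
    delta: last_merge_time < updated_at <= last_delta_time
    Records changed after last_delta_time belong to neither list yet;
    the next delta run picks them up.
    """
    main = [r for r in records if r["updated_at"] <= last_merge_time]
    delta = [r for r in records
             if last_merge_time < r["updated_at"] <= last_delta_time]
    return main, delta


records = [
    {"id": 1, "updated_at": 10},
    {"id": 2, "updated_at": 25},
    {"id": 3, "updated_at": 40},
]
main, delta = partition(records, last_merge_time=20, last_delta_time=30)
```

After a merge, last_merge_time moves forward to the old last_delta_time, which empties the delta partition until the next change arrives.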

So how do we start the indexer from django? Easy, we just touch(1) a file whenever a document is saved. Django has a post_save signal which we use to catch all save events. We check if the model that's being saved has a SphinxRecord metaclass and if so, we wake the indexer. It's the simplest solution we could think of :).
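A minimal sketch of the touch-on-save idea follows. The wake file path, the function names and the hasattr check are assumptions for illustration. The Django wiring is left as a comment so the snippet runs on its own.

```python
import os

WAKE_FILE = "/tmp/sphinx-indexer.wake"  # hypothetical path


def wake_indexer(path=WAKE_FILE):
    """Equivalent of touch(1): create the file if needed, bump its mtime."""
    with open(path, "a"):
        pass
    os.utime(path, None)


def on_post_save(sender, instance, **kwargs):
    """Signal receiver: only models that declare a Sphinx schema wake the indexer."""
    if hasattr(sender, "SphinxRecord"):
        wake_indexer()


# In a Django project the receiver would be registered once at startup:
#   from django.db.models.signals import post_save
#   post_save.connect(on_post_save)
```

Touching a file is cheap and cannot fail in interesting ways, so the request that saves the document never has to wait for the indexer.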

Abbreviated version of the daemon that spawns the indexer (we left out error checking, logging, etc):
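A daemon of this kind can be sketched as a polling loop over the file's modification time. This is an illustration under assumptions, not the original listing: the function names, the polling interval and the idea of passing the indexing command in as an argument are all made up here.

```python
import os
import subprocess
import time


def file_mtime(path):
    """Modification time of path, or 0.0 if it does not exist yet."""
    try:
        return os.stat(path).st_mtime
    except OSError:
        return 0.0


def poll_once(path, last_seen, command):
    """Run command if path was touched after last_seen; return the new marker."""
    current = file_mtime(path)
    if current > last_seen:
        # Any number of touches since the last run collapse into one run.
        # subprocess.call blocks, so two runs can never overlap.
        subprocess.call(command)
        return current
    return last_seen


def main_loop(path, command, poll_interval=0.5):
    """Poll forever; command is an argv list for the update/merge script."""
    last_seen = file_mtime(path)
    while True:
        new_seen = poll_once(path, last_seen, command)
        if new_seen == last_seen:
            time.sleep(poll_interval)
        last_seen = new_seen
```

A touch that arrives while the command is still running changes the modification time again, so the next poll starts another run and nothing is missed.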

It's just busy waiting until a process touches the PID file, then starts the sphinx indexer. Note that because we spawn new processes we can easily change the python code for updating/merging without having to restart this daemon. Also note that when multiple people touch the pid file the indexer is still only started once. And this way we also know for sure that the delta index and merge processes will never run at the same time.

Let's do a quick back of the envelope estimate: Delta indexing typically takes between 2 and 10 seconds, and if we merge at least once every 500 delta indexes, then that's 1 merge roughly every hour. We currently index only a couple million documents and the indexes are only a few gigabytes large. Merging a delta and a main index is essentially the merge step of the merge sort algorithm. The two indexes are just interleaved, so the merge step takes roughly the time needed to copy the files. Copying a few gigabytes worth of indexes every hour is absolutely fine from a performance point of view so this straightforward main+delta solution is good enough for our purposes. And yep, in practice the indexer is running pretty much all day and night, because people are adding documents to Papyrs all the time.
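The estimate is easy to check. The small calculation below assumes the indexer runs delta passes back to back, which is the worst case for merge frequency; the function name is just for this example.

```python
def merge_interval_minutes(deltas_per_merge, seconds_per_delta):
    """Minutes between merges if delta runs happen back to back."""
    return deltas_per_merge * seconds_per_delta / 60.0


for seconds in (2, 10):
    minutes = merge_interval_minutes(500, seconds)
    print("%2d s per delta run: one merge every %.0f minutes" % (seconds, minutes))
```

With 2 to 10 seconds per delta run that works out to a merge every 17 to 83 minutes or so, which is where the "roughly every hour" figure comes from.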

Ghosting

Ghosting is when you delete a document but it still shows up in the search results for a while after. Suppose the main index contains document ids {1, 2, 3} and delta is {4, 5}. Then you change the title of document 2 and as a result it goes to the delta index. So main: {1, 2, 3}, delta: {2, 4, 5}. When you search for the document's new title it shows up exactly as expected. Because document 2 has the same primary key in the main and delta index Sphinx knows only to return the result from the delta index, so you don't get duplicate results. Perfect. Now you delete document 2 and you're left with: main: {1, 2, 3}, delta: {4, 5}. And when you search for the old document title it suddenly shows up, because the document is still in the main index. That's called ghosting and we want to keep it from happening.

The solution: we give every document type an attribute is_deleted. We then search with a sphinx filter is_deleted=False. Sphinx doesn't let us change fields (variable length text) but sphinx does allow us to update boolean values, integers and timestamps in a search index. So, whenever a document is modified we set is_deleted=True in the main index and in the delta index. This ensures that the old document doesn't show up in the search results at all anymore. Then, a few seconds later the new delta index will be ready that contains the updated document.
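The logic can be modelled with two dictionaries. This is a toy simulation, not Sphinx itself: in a real setup the flag would be flipped through the API's attribute update call (UpdateAttributes in sphinxapi.py) and the search would carry a filter on is_deleted (SetFilter). The function names and sample documents are made up.

```python
def search(main, delta, word):
    """Return ids of live documents containing word.

    A document present in both indexes is taken from delta, mirroring
    the behaviour described above.
    """
    hits = []
    for doc_id in sorted(set(main) | set(delta)):
        doc = delta[doc_id] if doc_id in delta else main[doc_id]
        if not doc["is_deleted"] and word in doc["text"]:
            hits.append(doc_id)
    return hits


def mark_deleted(index, doc_id):
    """Flip the flag in place; the indexed text is never rewritten."""
    if doc_id in index:
        index[doc_id]["is_deleted"] = True


main = {
    1: {"text": "budget", "is_deleted": False},
    2: {"text": "old title", "is_deleted": False},
    3: {"text": "roadmap", "is_deleted": False},
}
delta = {
    4: {"text": "minutes", "is_deleted": False},
    5: {"text": "travel plan", "is_deleted": False},
}

# Document 2 has been deleted from the database and dropped from delta,
# but the main index has not been rebuilt yet.
ghost = search(main, delta, "old")   # still finds document 2
mark_deleted(main, 2)
mark_deleted(delta, 2)
fixed = search(main, delta, "old")   # the flag hides it immediately
```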

Permissions

With Papyrs different people in a group have different permissions. So we have to make sure that we display documents to a user if and only if the user has sufficient permissions to at least view that document. So after Sphinx comes up with a list of documents that match what the user searched for, we simply filter out those documents that the user can't access.
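As a sketch, the post-filtering step is a single pass over the matches. The permission rule shown here (group membership) and the field names are hypothetical; the real check depends on the application's permission model.

```python
def filter_by_permission(matches, user, can_view):
    """Keep only the matches can_view approves, preserving rank order."""
    return [m for m in matches if can_view(user, m)]


def can_view(user, match):
    """Hypothetical rule: the document's group must be one of the user's."""
    return match["group"] in user["groups"]


matches = [
    {"id": 11, "group": "sales"},
    {"id": 12, "group": "board"},
    {"id": 13, "group": "sales"},
]
user = {"name": "alice", "groups": {"sales"}}
visible = filter_by_permission(matches, user, can_view)
```

One consequence of filtering after the search is that a page of results can come back shorter than requested, so it helps to ask Sphinx for more matches than will be displayed.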

Indexing attachments

We index inside attachments, such as PDFs, Excel spreadsheets, Word documents and so on. This means we have to extract the text content of these different document formats. For this we just use the packages out there: ps2text for PDF files, antiword for MS Word documents. However, many of these text extraction tools mangle the text somewhat. Newlines and punctuation go missing, lines are concatenated without spaces between them, and garbage characters end up in the middle of words. We clean up the output by simply removing all suspicious looking characters and stripping all HTML tags from it.
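A minimal sketch of such a cleanup pass, assuming that "suspicious" means control and other non-printable characters (Unicode categories starting with C) and that a simple regular expression is good enough for stripping tags. The function name and both rules are assumptions, not the actual Papyrs code.

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_extracted_text(text):
    """Strip tags and non-printable characters, then normalize whitespace."""
    # Replace tags with a space so words on either side do not fuse.
    text = TAG_RE.sub(" ", text)
    # Drop control/format/unassigned characters but keep whitespace,
    # which is collapsed in the next step.
    text = "".join(
        ch for ch in text
        if ch.isspace() or not unicodedata.category(ch).startswith("C")
    )
    return WHITESPACE_RE.sub(" ", text).strip()


sample = "<b>Hello</b>\x00 wor\x07ld\n\nfoo"
cleaned = clean_extracted_text(sample)
```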

If all content is really clean then you rarely have to search for only part of a word. But when some of the content is a bit messy then infix search becomes really valuable. Half the spaces in a document may be missing and you're still going to find matches with perfect accuracy.
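To see why, consider a toy model of infix matching: a query matches if it occurs anywhere inside an indexed token, subject to a minimum length (Sphinx has a min_infix_len index setting for this). The functions below are an illustration of the idea only and say nothing about how Sphinx implements it.

```python
def infixes(word, min_len=3):
    """All substrings of word that are at least min_len characters long."""
    return {word[i:j]
            for i in range(len(word))
            for j in range(i + min_len, len(word) + 1)}


def matches(query, document_tokens, min_len=3):
    """True if query occurs inside any token, as an infix index would allow."""
    return any(query in infixes(token, min_len) for token in document_tokens)


# Text extracted from a PDF with the spaces lost:
tokens = "thequarterlyreport isattached".split()
found = matches("report", tokens)      # whole-word search would miss this
missing = matches("invoice", tokens)
```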

Tips

  • make sure you bind the search daemon to localhost otherwise everybody can connect to it. If you have a dedicated sphinx server, set up an SSH tunnel (e.g. ssh -f -N remote_server -L [local_port]:localhost:[remote_port]) because sphinx doesn't have any built-in authentication or encryption.
  • if sphinx segfaults for unclear reasons it's probably because of the forking model you configured in sphinx.conf.
  • we tried Sphinx's alpha real-time index support, but it was still very unstable (segfaults galore) and it doesn't support infix searching. It's in active development though, so that might be much better soon!
  • compile Sphinx from source with at least libexpat and iconv support.

Conclusion

We've had this setup in production for almost 3 months now and it all works great. Searches typically take just a few milliseconds and new results are added to the index within 5 seconds on average. We've spent a lot of time making sure that search "just works". So we thought we might as well document what decisions we made and why. This is the document I wish existed when I started working on Papyrs search.

Phew, that's it. This turned out a lot longer than I had anticipated, but as I'm proofreading this there isn't much I can leave out. Thanks for reading this far. If you found this interesting, please spread the word or leave a comment!

PS: I could open up the source (it's a few hundred lines of Python) and throw it on github. I'd have to spend an afternoon refactoring it though, so let me know if you're interested.

More Google Apps Integration

Two weeks ago we added integration with Google Mail to easily handle workflows. Today we’re launching two new features for our Google Apps users: integration with Google Docs & Google Calendar!

Google Calendar Integration

First, Google Calendar support:



Google Calendar events on a Papyrs Page.

You can create as many Google Calendars as you need. You can create Calendars for upcoming milestones, meetings, travel schedule, and so forth. Then you can simply drag a Media Widget on a Papyrs page and pick the Google Calendar of your choosing. People you share the Papyrs page with will then be able to view the events on the calendar.

The old Google Calendar widget (which doesn’t work so well) is still available. It’s now called “Classic Calendar Widget”.

Tip: Papyrs + Thymer

As many of you know already, you can place your Thymer tasks and deadlines on a Google Calendar using iCal (read more). So the next logical step is to put Thymer deadlines and milestones on a Papyrs page. So now you can. Next to each Thymer calendar event you’ll find a small link directly back to the Thymer task. Pretty handy!

Google Docs Integration

Google has some great applications. An online word processor (Google Docs), an online spreadsheet editor (Google Spreadsheets) and so on. And if you want to effectively work together with your colleagues it helps if you can keep related Google Files on a page. With the new Google Docs integration in Papyrs, files from Google Docs can now be added easily to a Papyrs page, just like any other attachment. Simply add an attachment and use the Google Apps tab, where you can browse or search for the Google documents you want to attach:



Browse Google Docs files.

You click on the files and you finally get a list on your page that looks like this:



Documents, Spreadsheets, and Presentations attached to a Papyrs Page.

This makes it easy to create pages to get an overview, organize, and discuss all relevant documents, whether you stored them on your PC, in Google Docs, or already have them on Papyrs. You can take advantage of these features as a Google Apps user or you can attach Google documents from your personal Gmail account. Both work, so use whatever suits you best.

Reorder attachments

You can now change the order of attachments with drag and drop. Only a small change, but apparently something quite a few of you wanted!

Note for Google Apps users

The features outlined above won’t work until you grant Papyrs access to your Google Apps account. To grant Papyrs access navigate to
https://www.google.com/a/cpanel/YOURDOMAIN.COM/Dashboard
and then click on the Papyrs logo on the bottom of the page. Then, under Data access you’ll see an option to grant access to Papyrs.

If you have trouble getting these features to work, just drop us a line at team@stunf.com and we’ll figure out what’s wrong.

Also, Google Apps for Business is free for up to 10 users (Google Apps signup).

Little stuff

  • The Papyrs logo (or the Logo of your organization) in the upper left corner of every page is now clickable. It takes you back to your homepage.
  • Image Galleries now work better on mobile devices.
  • Attachment links are now clickable in the notification emails we send out.

More to come…?

We improved the integration of Papyrs with Google Apps by popular demand. So if you’d like to see even more Google Apps integration, let us know. Or maybe you have other ideas for Papyrs, such as feature suggestions, or perhaps you found a bug. Just shoot us an email.

We’ve been working hard on a lot of things we can’t reveal just yet. Some big, some small. More soon!

New Papyrs Feature: Custom Widgets

Custom Widgets

We added this feature by popular request. Papyrs supports a lot of widgets out of the box but sometimes you want to add something to a page that suits your personal needs completely. This is where Custom Widgets come in. With Custom Widgets you can add pretty much any 3rd party widget to Papyrs. Social media widgets (for example: Facebook, Flickr, LinkedIn), RSS news, polls, maps, and much more. The list is endless!

For most custom widgets, you can simply copy-paste the embed code from the 3rd party’s website into the Custom Widget in Papyrs. If you happen to know how to use some HTML code yourself, you can also use a Custom Widget to customize a Papyrs page even more, and add elements that look exactly the way you want to. Custom widgets can use any HTML or Javascript code, so custom widgets give you complete flexibility. This flexibility does not compromise the security of Papyrs, as we’ve made sure that custom widgets can be safely contained within a page.

So how do you add a custom widget? It’s easy:



Add a custom widget in 3 easy steps.

Below are some ideas of what you can use Custom Widgets for.

Examples



An RSS widget shows the news from the NYTimes (powered by webrss.com)



A poll widget from widgetbox.com



A basic table



Add a LinkedIn profile or company widget to a page

That’s it

We’ve also made a lot of changes behind the scenes and made Papyrs a bit faster. More news next week!

New Papyrs Feature: Show changes between versions of pages

View changes between versions of pages

Today we added a feature to Papyrs that makes it really easy to see how a page has changed over time. It will show you a visual comparison that highlights exactly what has changed.



See at a glance which text has changed (click for a larger screenshot)

Deleted text is in red, newly added text in green. Changes to image galleries, forms, and other widgets are also highlighted for easy comparison.



Select which two versions you wish to compare

Compare pages with a click of a button. That’s it. Have a nice weekend everybody!

Search for Papyrs & more

Today we’re launching a few new features for Papyrs. Let’s start with the biggest one:

Search

With Papyrs it is now possible to search within all your pages, comments, contacts and profiles, attachments, and submitted forms. Just start typing in the search bar and Papyrs returns the results nearly instantaneously, as you type. Never again waste more than a few seconds tracking down that document, that attachment to a certain form, or a discussion with your co-worker. Of course Papyrs will only show items in the search results for which you have the necessary permissions.


Search your entire Papyrs intranet in milliseconds: Papyrs finds your pages, comments, contacts, files, and submitted forms (click for larger screenshot)

Searches within attachments

Papyrs will also search within attachments. You can find text contained in PDF files, Word documents and Excel spreadsheets. Do you want us to search within other types of documents as well? Just drop us a line at team@stunf.com and let us know.

Advanced Search

There are also a number of advanced search options:

Command                             Result
Keyword                             Finds all pages, files, etc., that contain that keyword.
Keyword AnotherKeyword              Finds all pages, files, etc., that contain both keywords.
Keyword -AnotherKeyword             Finds all pages, files, etc., that contain Keyword but not AnotherKeyword.
Keyword | AnotherKeyword            Finds all pages, files, etc., that contain either of the keywords.
(One & Keyword) | AnotherKeyword    Use parentheses for more complex queries.
“exact match”                       Use quotes to search for an exact sequence of keywords.

Prettier URLs

In the previous version of Papyrs links looked pretty complex. A typical page would have a link that looked like https://yoursite.papyrs.com/page/5441/This-is-title-of-the-page. Hard to remember and hard to type! Also, if you shared the page with others by email or if you made the page open to the public the link would change (it would start with https://yoursite-public instead). That just wasn’t very convenient, so we’ve changed it. Links to your Papyrs pages now look like this: https://yoursite.papyrs.com/Your-Page. Simple and easy to type! If the page is shared or public, people you share the page with can access the page with the same link.

When you go to the document’s settings page you can change the location (URL):



Change the location of the page

Smaller changes

1. You can now share profile pages by email address and make profile pages public.

2. You can now link to conversations on pages and on the Activity Feed. Easy if you want to send a link to a conversation via Instant Message or by email.



Link to conversations on the Activity Feed

That’s it

We’re really thrilled to start the year with these new additions, and we think the Search is going to save everybody a lot of time! More new features coming soon!