Author Archives: Diederik

Improving intranet engagement

One of the main hurdles you face when setting up an intranet is user engagement. You have to get people to actively use the systems you put in place. Even if everybody is on the same page and believes that getting organized with an intranet will save time in the long run, the motivation may not be there to explore new software and to change old habits.

You set up an intranet and then, after a couple of months, everybody has reverted to their old style of working and the company is back at square one. Different versions of files are emailed back and forth like before. Every now and then important documents get lost. There is no clear overview of what’s happening in the organization and the company has spent a bunch of money on an intranet solution that didn’t stick. This is a large (and very frequent) problem today.

We’ve been thinking a lot about this problem because we have an intranet product (Papyrs) and it works on a subscription basis. This implies that as long as companies actively use our intranet product they will renew their subscription and we make money. When employee engagement drops off and the intranet stops getting updated with new information the value of the intranet decreases rapidly and the subscription inevitably gets canceled.

This, by the way, is one of the big advantages of subscription software. The software company (that’s us in this case) only makes money for as long as value is provided to the customer. Given that acquiring new customers is much more expensive than keeping existing customers happy (typically by a factor of 10), keeping our customers happy is absolutely crucial. This is in stark contrast with old enterprise-style intranet packages where the big sale happens up front. After the software company cashes the check it doesn’t matter much whether the customer is still happy 6 months down the road. And so, unsurprisingly, those intranet systems are often delivered late, over budget and, to make matters even worse, they match the needs of the people in the organization so poorly that they are then left unused.

So to summarize the problem: An intranet is only valuable when lots of information is stored on it and when this information is kept current. This in turn means that user engagement is vital. Otherwise the company will revert to their ineffective old ways and we’re back to square one.

Engagement

So what keeps people more involved with your intranet? We’re going to look at all these issues from our perspective (as the creators of Papyrs), but the insights apply to intranets generally, and to other forms of social software where user engagement is critical for success.

1. Let people get involved easily

  • User friendly interface
  • Straightforward functionality
  • Users shouldn’t be able to break anything

2. Keep everybody on the same page with email

  • Let people choose the subjects they get emailed about
  • Make it clear who will get email updates and when

3. Encourage everybody to contribute to the intranet

  • Set permissions liberally (your coworkers really aren’t going to vandalize pages)
  • Encourage everybody in your organization to make changes and improvements wherever they see fit

4. Make everything look inviting and appealing

  • Keep all pages organized with a sensible structure.
  • Split up pages that get too large
  • Add links to related pages

5. Fast universal search

  • The more data on your intranet the more important search becomes
  • Find-as-you type search helps a lot
  • It must be fast & reliable. Many intranet solutions have a search box that doesn’t always return results that you know exist

6. Access from mobile devices

  • So people can access important documents and discussions on the go

We designed Papyrs with the issues above in mind, so our users don’t have to worry at all about most of the above. Of course it’s still up to our users to write quality content for the intranet and to keep everything organized, but Papyrs does a lot of the heavy lifting for you.

Setting up an intranet for a business isn’t easy. If people refuse to use the intranet software the intranet will fail. If people can’t easily find important information or easily contribute to the intranet, the intranet will fail. A lot has to go right for an intranet to become a central activity hub for your organization. So when you create an intranet make sure to vet it on the aforementioned points. Or just take the easy road, and sign up for a free Papyrs trial.

Why Backpack is getting phased out

February 2013 Update: We’ve added a way to import your pages from Backpack into Papyrs.

We just read on TheNextWeb that Backpack, a web app from 37signals, is no longer accepting new signups. Existing users aren’t left out in the cold, but you can no longer sign up for their service. Backpack is pitched as an “Easy intranet for your business”, much like our product Papyrs. We have always considered Papyrs a good alternative to Backpack because with both products you can create pages, organize your documents and discuss and plan everything about your organization in one place. Of course there are also plenty of differences. For instance, Backpack had a lot of functionality for calendars and reminders, whereas Papyrs allows far greater flexibility in the sort of pages you can create with anything from Web Forms to Twitter widgets. And as we explain below, we also have a very different vision about intranet software. Now that Backpack is being phased out one of our competitors is, for all intents and purposes, a thing of the past.

So why did 37signals decide to phase out Backpack?

They didn’t say exactly, but we can pretty much guess why. When you first launch a product it has only a few core features. A new product must do a few things, and it must do those few things really well. Then, as you continue working on your product and make improvements to it (based on the feedback you get from your initial customers) the number of features in the product grows slowly but steadily.

Suppose you start with 3 different products that solve 3 distinct problems. The first product is essentially a user-friendly wiki. Companies can dump all their internal documents on there so everybody stays on the same page. The second product is for project management: it has a calendar, tasks, timelines, and some index of active and archived projects. The third product is, let’s say, for collaborative text editing. Multiple people can work on a document simultaneously and see each other’s changes and annotations. So three products: a wiki, a project management tool and a collaborative document editor.

Initially all is great and these three products solve a different issue and everybody’s happy. Happy? Well, no, just *mostly* happy. Because the moment you have customers the feature requests keep rolling in and most feature requests are going to be pretty reasonable. Your users are going to want to have one login for the different products. They’re going to want one unified place where they can see all activity and updates. And if two out of three products have a calendar, why don’t the events of one product show up on the calendar of the other?

And so the apps converge…

Essentially, every time you add functionality the products become more alike. These web 2.0 productivity products naturally converge towards one another. When you add tasks to the wiki app, it becomes more like the project management app. And when you add rich-text notes to the project management app it becomes more like the wiki app. The lines between the apps get blurred and, if you’re not careful, the apps you built that people love slowly grow into monstrosities that attempt to do everything and as a result become complex, clunky and altogether unappealing.

This is what we realized a little over a year ago when we started working on Papyrs. When it’s inevitable that users want a lot of functionality but all users want different functionality you need to make some hard choices. Even though you know you can’t keep everybody happy you still want to keep as many people as happy as you can.

We think there are two good solutions:

1) you figure out which features are most critical and you create a new app that combines all the best bits from the original 3 apps. You just focus on extreme simplicity and ruthlessly cut functionality across the board. This is what 37signals did and they launched Basecamp Next earlier this year. It has some of the functionality of Backpack, but not all. It also has most of the functionality of the original Basecamp, but not all.

2) you create a product that maximizes flexibility and gives users the choice of which functionality they want to include. This is what we did with Papyrs. Papyrs pages are just made of simple widgets that people can drag & drop onto a page. Our users use the widgets they need and ignore the ones they don’t. The collection of widgets available to them simply grows over time. We have all sorts of customers who use Papyrs in completely different ways. And because of our widget approach we can add functionality to widgets to make our customers happier without making the product more complex for newcomers.

Conclusion

So although we were a little surprised to see Backpack shutting down, it makes a lot of sense. When you have different products and you notice that they keep growing towards one another and get a lot of overlapping functionality you have to make some tough decisions. 37signals decided to create a new product that has most (but not all) of the functionality from their old products Basecamp and Backpack.

We decided with Papyrs to start with a platform that allows us to add functionality where needed without having to make sacrifices in usability. So hopefully we chose wisely and we won’t have to make a Papyrs Next a few years down the road :)

Papyrs Image Improvements

Image widget

We made a number of improvements to the way you deal with images in Papyrs:

1. You can now reorder images with drag & drop, right from your page.



Reorder images with drag & drop

2. You can add links to images.



Click on an image to add or remove a link

3. Images now show up in email updates.



Images show up in emails. Much better this way :)

That’s it for today. More updates coming soon!

Papyrs Interface updates

Quick update, everybody!

Our old popup dialogs, although functional, weren’t exactly shiny. So we decided to give them a much needed face-lift:

Feedback Dialog



Feedback Dialog Before. Yikes!



Feedback Dialog After. Ahh… much better :)

Media Widget Dialog



Media Widget Dialog Before…



… and after

We changed over a dozen dialogs in total. The dialogs now also work much better on mobile devices (such as the iPad) and Papyrs now looks much better in Internet Explorer 9. And as always, more improvements to come.

Building a scalable real-time search architecture with Sphinx

Intro

People store a lot of documents and other business knowledge on Papyrs and so we wanted to add search functionality so people could get to their documents more quickly. Here we’re going to give the technical explanation of how we got it all to work.

Much to our surprise we couldn’t find any package out there that met our (pretty basic) criteria. We were looking for:

  • really fast search (so results can be displayed as you type)
  • real-time indexing of new or changed documents (otherwise people who try our product for the first time won’t find the document they just created)
  • reliable unicode support (7 bits sure ain’t enough for everybody)
  • support for infix searches (important for reasons mentioned later)
  • an indexer and searcher that can scale relatively easily to multiple processes or servers when/if the need arises
  • stable performance (no segfaults please)
  • a search engine that lets us change the schema of the documents we’re indexing without breaking anything.
  • easy integration with a Python web app (using Django)

We looked at a number of search engines:

Lucene, Solr, Sphinx and PostgreSQL Full Text Search. We played with all of them but only Sphinx came close to meeting our criteria above. We’re pretty confident, looking back, that we made the right decision.

General introduction to Sphinx

Sphinx has two parts, an indexer and a search daemon. The search daemon listens for search queries such as “alpha & (delta | gamma)” and goes through the indexes for matches. The indexer reads data from a data source (relational database, XML pipe) and indexes it according to the document schema. When indexing has finished, it rotates (swaps) the index currently used by the search daemon with the new one. The old index is then deleted. This means (re)indexing and searching can happen in parallel, and even on different physical machines if needed.

Implementation

We have different sorts of documents: Pages, Comments, Attached files, Profiles, and filled out Forms. These documents are non-uniform: different sorts of documents have different attributes. So we don’t want to hard-code the structure of the index in sphinx.conf. Instead we’ll use sphinx XML pipe functionality and generate the schema structure and data from the Django Model as needed. So for each Django Model we create a sphinx index. Then when a user searches we do a search for every document type and combine the results and display them to the user.

We connect Sphinx to Python with the Python library sphinxapi.py included in the Sphinx package. It’s a pretty straightforward mapping of API functions to Python methods. You can set the match mode, how the matches are sorted, which indexes to search through and so on. There are also a number of open source libraries that connect Django and Sphinx. We looked at Django-Sphinx but it hasn’t been maintained in the past couple of years and it doesn’t support XML based data sources (which we want to use). It instead generates a sphinx.conf file with the indexes and schema structures in there.

Generating XML data

So let’s illustrate how XML generation works using an example Comment model. We add a Sphinx metaclass for each Django Model we want to index.

The classes Attr and Field are simple wrapper classes that we use to generate the Sphinx schema from. They also make sure that when the XML data is generated that the data is of the correct type. Sphinx has a very fragile XML parser, so we have to make sure that boolean columns only contain boolean values, that everything is escaped properly and so on.
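A minimal sketch of what such wrapper classes could look like. The class bodies, the supported type names and the example Comment schema are assumptions made for illustration, not the code running in Papyrs:

```python
from xml.sax.saxutils import escape


class Field(object):
    """Full-text field: always serialized as escaped text."""

    def __init__(self, name):
        self.name = name

    def to_xml_value(self, value):
        return escape("" if value is None else str(value))


class Attr(object):
    """Typed attribute: coerces values so the XML only holds valid data."""

    # Assumed subset of Sphinx attribute types, enough for the example.
    COERCE = {
        "bool": lambda v: "1" if v else "0",
        "int": lambda v: str(int(v)),
        "timestamp": lambda v: str(int(v)),  # seconds since the epoch
    }

    def __init__(self, name, attr_type):
        if attr_type not in self.COERCE:
            raise ValueError("unsupported attribute type: %r" % attr_type)
        self.name = name
        self.attr_type = attr_type

    def to_xml_value(self, value):
        return self.COERCE[self.attr_type](value)


# Hypothetical schema for a Comment model.
COMMENT_SCHEMA = [
    Field("text"),
    Attr("author_id", "int"),
    Attr("updated_at", "timestamp"),
    Attr("is_deleted", "bool"),
]
```

Doing the coercion in one place means a stray None or an unescaped ampersand is caught before it ever reaches the XML parser.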

Using the SphinxSchema definition above we can easily generate the XML schema:
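As an illustration, the schema part of an xmlpipe2 stream can be produced from a plain list of field names and (name, type) attribute pairs. The function name and the example columns are hypothetical; the element names follow the xmlpipe2 format:

```python
from xml.sax.saxutils import quoteattr


def schema_xml(fields, attrs):
    """Build the <sphinx:schema> block of an xmlpipe2 stream.

    fields: list of full-text field names.
    attrs:  list of (name, type) tuples, e.g. ("is_deleted", "bool").
    """
    lines = ["<sphinx:schema>"]
    for name in fields:
        lines.append("<sphinx:field name=%s/>" % quoteattr(name))
    for name, attr_type in attrs:
        lines.append(
            "<sphinx:attr name=%s type=%s/>"
            % (quoteattr(name), quoteattr(attr_type))
        )
    lines.append("</sphinx:schema>")
    return "\n".join(lines)
```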

So with the combination of schema and a Django QuerySet we can now generate the XML data for sphinx to index. Pseudocode:

This works but we have to optimize: we don’t want to reindex everything when a single record changes. So we use two indexes for every document type: a main index and a delta index. The main index is the large index that contains everything that hasn’t been touched recently and the delta contains those documents that have been recently created, modified or deleted. The delta index is small and can be re-indexed frequently. The easiest way to accomplish this is to give every model an “updated_at” timestamp, and every time a record is changed you update the timestamp.

Then you just partition the indexes into parts: the main index contains all records where [0 <= updated_at <= last_merge_time]. The delta contains all records where [last_merge_time < updated_at <= last_delta_time]. More partitions can be added if needed, but two indexes per document type will probably be good enough unless you have a huge database or documents change very frequently. Anyway, every time a user changes a document the indexer starts, re-indexes all the documents that have been changed since last_merge_time and updates last_delta_time to the current time (technically, the time when it *started* delta-indexing, because that's when the database transaction starts). See the illustration:



After an update the delta partition is completely re-indexed. Then the delta and main indexes are merged into one. During this time a few new documents arrive and the process starts anew.
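The partitioning rule can be written down in a few lines. This is only a sketch of the two intervals described above; the function name and the (id, updated_at) tuple layout are assumptions:

```python
def partition(records, last_merge_time, last_delta_time):
    """Split (doc_id, updated_at) pairs into main and delta id lists.

    main:  0 <= updated_at <= last_merge_time
    delta: last_merge_time < updated_at <= last_delta_time
    Anything newer than last_delta_time waits for the next delta run.
    """
    main, delta = [], []
    for doc_id, updated_at in records:
        if updated_at <= last_merge_time:
            main.append(doc_id)
        elif updated_at <= last_delta_time:
            delta.append(doc_id)
    return main, delta
```

After a merge, last_merge_time simply moves up to last_delta_time, which empties the delta partition again.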

So how do we start the indexer from django? Easy, we just touch(1) a file whenever a document is saved. Django has a post_save signal which we use to catch all save events. We check if the model that's being saved has a SphinxRecord metaclass and if so, we wake the indexer. It's the simplest solution we could think of :).
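A sketch of such a signal handler, assuming the marker is a nested class called SphinxRecord on the model and that the wake-up file lives at a made-up path. The handler itself has no Django dependency, so the hookup is shown as a comment:

```python
import os

WAKE_FILE = "/tmp/sphinx-indexer.wake"  # hypothetical path


def touch(path):
    """Create the file if needed and bump its modification time."""
    with open(path, "a"):
        os.utime(path, None)


def wake_indexer_on_save(sender, instance=None, wake_file=WAKE_FILE, **kwargs):
    """post_save handler: wake the indexer only for indexed models."""
    if hasattr(sender, "SphinxRecord"):
        touch(wake_file)
        return True
    return False


# In a Django project the handler would be registered once at startup:
#   from django.db.models.signals import post_save
#   post_save.connect(wake_indexer_on_save)
```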

Abbreviated version of the daemon that spawns the indexer (we left out error checking, logging, etc):

It's just busy waiting until a process touches the PID file, then starts the sphinx indexer. Note that because we spawn new processes we can easily change the python code for updating/merging without having to restart this daemon. Also note that when multiple people touch the pid file the indexer is still only started once. And this way we also know for sure that the delta index and merge processes will never run at the same time.
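A minimal sketch of a watcher along these lines, assuming it polls the file's modification time and that the indexer is started with a hypothetical command such as ["python", "update_index.py"]. Error checking and logging are left out here as well:

```python
import os
import subprocess
import time


def mtime(path):
    """Modification time of the wake-up file, or 0.0 if it is missing."""
    try:
        return os.stat(path).st_mtime
    except OSError:
        return 0.0


def poll_once(path, last_seen, run):
    """Call run() if the file was touched since last_seen.

    Several touches between two polls collapse into a single run.
    Returns the timestamp to compare against on the next poll.
    """
    current = mtime(path)
    if current > last_seen:
        run()
        return current
    return last_seen


def watch(path, command, poll_interval=0.5):
    """Loop forever; the indexer runs in a fresh process each time."""
    last_seen = mtime(path)
    while True:
        # subprocess.call blocks, so two indexer runs never overlap.
        last_seen = poll_once(path, last_seen, lambda: subprocess.call(command))
        time.sleep(poll_interval)
```

Because the timestamp is read before the indexer starts, a touch that arrives while it is running is picked up on the next poll and triggers one more run.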

Let's do a quick back of the envelope estimate: Delta indexing typically takes between 2 and 10 seconds, and if we merge at least once every 500 delta indexes, then that's 1 merge roughly every hour. We currently index only a couple million documents and the indexes are only a few gigabytes large. Merging a delta and a main index is essentially the merge step of the merge sort algorithm. The two indexes are just interleaved, so the merge step takes roughly the time needed to copy the files. Copying a few gigabytes worth of indexes every hour is absolutely fine from a performance point of view so this straightforward main+delta solution is good enough for our purposes. And yep, in practice the indexer is running pretty much all day and night, because people are adding documents to Papyrs all the time.
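The estimate is easy to check, assuming the indexer runs delta passes back to back with no idle time in between:

```python
DELTAS_PER_MERGE = 500
DELTA_SECONDS_MIN = 2
DELTA_SECONDS_MAX = 10

# Time between two merges if every delta run takes the minimum or maximum.
fastest_minutes = DELTAS_PER_MERGE * DELTA_SECONDS_MIN / 60.0
slowest_minutes = DELTAS_PER_MERGE * DELTA_SECONDS_MAX / 60.0

print("one merge every %.0f to %.0f minutes" % (fastest_minutes, slowest_minutes))
```

That gives a range of about 17 to 83 minutes, so "roughly every hour" is a fair middle.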

Ghosting

Ghosting is when you delete a document but it still shows up in the search results for a while after. Suppose the main index contains document ids {1, 2, 3} and delta is {4, 5}. Then you change the title of document 2 and as a result it goes to the delta index. So main: {1, 2, 3}, delta: {2, 4, 5}. When you search for the document's new title it shows up exactly as expected. Because document 2 has the same primary key in the main and delta index Sphinx knows only to return the result from the delta index, so you don't get duplicate results. Perfect. Now you delete document 2 and you're left with: main: {1, 2, 3}, delta: {4, 5}. And when you search for the old document title it suddenly shows up, because the document is still in the main index. That's called ghosting and we want to keep it from happening.

The solution: we give every document type an attribute is_deleted. We then search with a sphinx filter is_deleted=False. Sphinx doesn't let us change fields (variable length text) but sphinx does allow us to update boolean values, integers and timestamps in a search index. So, whenever a document is modified we set is_deleted=True in the main index and in the delta index. This ensures that the old document doesn't show up in the search results at all anymore. Then, a few seconds later the new delta index will be ready that contains the updated document.
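The effect of the flag can be shown with a toy model in which each index is a plain dictionary. This only mimics the logic; with sphinxapi the filter and the in-place update would presumably go through SetFilter and UpdateAttributes, which are not used in this sketch:

```python
def mark_deleted(indexes, doc_id):
    """Flag a document in every index that still contains it."""
    for index in indexes:
        if doc_id in index:
            index[doc_id]["is_deleted"] = True


def search(main, delta, term):
    """Return {doc_id: title} for live documents matching term."""
    results = {}
    for index in (main, delta):  # a delta hit replaces a main hit
        for doc_id, doc in index.items():
            if term in doc["title"] and not doc["is_deleted"]:
                results[doc_id] = doc["title"]
    return results


main = {
    1: {"title": "alpha", "is_deleted": False},
    2: {"title": "budget", "is_deleted": False},
    3: {"title": "gamma", "is_deleted": False},
}
delta = {
    4: {"title": "delta", "is_deleted": False},
    5: {"title": "epsilon", "is_deleted": False},
}

# Document 2 is deleted from the database but still sits in the main index.
ghost = search(main, delta, "budget")      # the stale entry still matches
mark_deleted([main, delta], 2)
after_flag = search(main, delta, "budget")  # filtered out by is_deleted
```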

Permissions

With Papyrs different people in a group have different permissions. So we have to make sure that we display documents to a user if and only if the user has sufficient permissions to at least view that document. So after Sphinx comes up with a list of documents that match what the user searched for, we simply filter out those documents that the user can't access.
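The post-filtering step amounts to something like the following. The can_view callback and the toy access table are hypothetical stand-ins for the real permission check:

```python
def visible_results(matches, user, can_view):
    """Keep only the matched documents this user is allowed to view."""
    return [doc_id for doc_id in matches if can_view(user, doc_id)]


# Toy permission table: document id -> set of users with view access.
ACCESS = {
    10: {"alice", "bob"},
    11: {"alice"},
    12: set(),
}


def toy_can_view(user, doc_id):
    return user in ACCESS.get(doc_id, set())
```

Filtering after the search keeps the index free of permission data, at the cost of sometimes fetching matches that are then thrown away.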

Indexing attachments

We index inside attachments, such as PDFs, Excel spreadsheets, Word documents and so on. This means we have to extract the text content of these different document formats. For this we just use the packages out there: ps2text for PDF files, antiword for MS Word documents. However, many of these text extraction tools mangle the text somewhat. Newlines and punctuation go missing, lines are concatenated without spaces between them, and garbage characters end up in the middle of words. We clean up the output by simply removing all suspicious looking characters and stripping all HTML tags from it.
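A sketch of such a cleanup pass. What counts as "suspicious" is an assumption here: control and other non-printable characters (Unicode categories starting with "C") and the U+FFFD replacement character are dropped, tags are stripped with a simple regular expression, and whitespace is collapsed:

```python
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def clean_extracted_text(raw):
    """Strip tags and junk characters from text extractor output."""
    text = TAG_RE.sub(" ", raw)
    kept = []
    for ch in text:
        if ch == u"\ufffd":
            continue  # decoder's "unknown character" marker
        # Keep whitespace, drop control/format/unassigned characters.
        if ch.isspace() or not unicodedata.category(ch).startswith("C"):
            kept.append(ch)
    return WHITESPACE_RE.sub(" ", "".join(kept)).strip()
```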

If all content is really clean then you rarely have to search for only part of a word. But when some of the content is a bit messy then infix search becomes really valuable. Half the spaces in a document may be missing and you're still going to find matches with perfect accuracy.
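A small illustration of why this matters, using plain string matching rather than Sphinx itself (in Sphinx, infix indexing is switched on per index with the min_infix_len setting). The sample text is made up:

```python
def whole_word_match(term, text):
    """Match only when term is a complete whitespace-separated word."""
    return term.lower() in text.lower().split()


def infix_match(term, text):
    """Match when term occurs anywhere, even inside a longer token."""
    return term.lower() in text.lower()


# Extracted text where the spaces between some words went missing.
mangled = "Quarterlybudget reportdraft for the board"
```

Here whole_word_match("budget", mangled) fails because the extractor glued two words together, while infix_match("budget", mangled) still finds the document.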

Tips

  • make sure you bind the search daemon to localhost, otherwise everybody can connect to it. If you have a dedicated sphinx server, set up an SSH tunnel (e.g. ssh -f -N remote_server -L [local_port]:localhost:[remote_port]) because sphinx doesn't have any built-in authentication or encryption.
  • if sphinx segfaults for unclear reasons it's probably because of the forking model you configured in sphinx.conf.
  • we tried Sphinx' alpha real-time index support, but it was still very unstable (segfaults galore) and it doesn't support infix searching. It's in active development though, so that might be much better soon!
  • compile Sphinx from source with at least libexpat and iconv support.

Conclusion

We've had this setup in production for almost 3 months now and it all works great. Searches typically take just a few milliseconds and new results are added to the index within 5 seconds on average. We've spent a lot of time making sure that search "just works". So we thought we might as well document what decisions we made and why. This is the document I wish existed when I started working on Papyrs search.

Phew, that's it. This turned out a lot longer than I had anticipated, but as I'm proofreading this there isn't much I can leave out. Thanks for reading this far. If you found this interesting, please spread the word or leave a comment!

PS: I could open up the source (it's a few hundred lines of Python) and throw it on github. I'd have to spend an afternoon refactoring it though, so let me know if you're interested.