Stunf – Page 3 – About the web, software development, productivity, startups and our products.

June 4, 2012 (updated September 11, 2014)

New feature: Poll widget

Today we’re launching a new widget for Papyrs: the poll widget.

Although it’s already possible to create all kinds of survey forms using the form widgets, a poll can be useful if you want feedback on a single question, and immediately show the results. With Papyrs, you can now add two types of poll widgets. After adding a Simple Poll widget, people can vote for one of the available choices. They can also view the results of the poll, with a bar showing the percentage of the vote for each option.

The other poll widget is the Preference table. Using this widget, people can select multiple options, and a table is shown with the names of people who voted and the selected options. This is great for deciding between multiple alternatives/preferences with a group of people, from planning a date for a meeting to deciding between various designs for a new project.

Some examples:

Poll - Preference table (meeting attendance)

To add a poll widget to your page, drag the Media/Widget to your page, and select the Poll tab. Only users with View & Submit permissions or higher can vote in a poll, but everyone with access to a page can see the results.

We hope you find the new widget useful. More updates coming soon!

May 30, 2012 (updated September 11, 2014)

Redesigning the Feedback Dialog

We consider our support one of the main “features” of our products. We try to respond as fast as possible, and we’re always curious to hear what you like and don’t like. Although the current feedback process seems to work well, we wanted to see if it could be improved even further.

As an experiment, we redesigned the feedback dialog in Papyrs. There are several goals for the redesign. First of all we want to make it easier to access, so we get more feedback if people have questions or comments. To get a better sense of user satisfaction/overall happiness with using the product, we also added smiley icons. Most people already write things like “we like it!”, but as it only takes a second to select your mood, we expect to get an even better picture. Finally, we also added an option to talk to us directly. Though this is an experiment, we think it might be interesting for questions that would take too long to handle over email, and to get to know our customers better. Because our users are in so many different timezones, when you check the phone icon, we’ll contact you about a date and time that’s convenient for you and call you back.

Here’s the updated feedback dialog. We hope you like it!

April 6, 2012 (updated September 11, 2014)

Papyrs Interface updates

Quick update, everybody!

Our old popup dialogs, although functional, weren’t exactly shiny. So we decided to give them a much needed face-lift:

Feedback Dialog

Feedback Dialog Before. Yikes!

Feedback Dialog After. Ahh… much better :)

Media Widget Dialog

Media Widget Dialog Before…

… and after

We changed over a dozen dialogs in total. The dialogs now also work much better on mobile devices (such as the iPad) and Papyrs now looks much better in Internet Explorer 9. And as always, more improvements to come.

April 5, 2012 (updated September 11, 2014)

Building a scalable real-time search architecture with Sphinx

Intro

People store a lot of documents and other business knowledge on Papyrs and so we wanted to add search functionality so people could get to their documents more quickly. Here we’re going to give the technical explanation of how we got it all to work.

Much to our surprise we couldn’t find any package out there that met our (pretty basic) criteria. We were looking for:

really fast search (so results can be displayed as you type)
real-time indexing of new or changed documents (otherwise people who try our product for the first time won’t find the document they just created)
reliable unicode support (7 bits sure ain’t enough for everybody)
support for infix searches (important for reasons mentioned later)
an indexer and searcher that can scale relatively easily to multiple processes or servers when/if the need arises
stable performance (no segfaults please)
a search engine that lets us change the schema of the documents we’re indexing without breaking anything.
easy integration with a Python web app (using Django)

We looked at a number of search engines: Lucene, Solr, Sphinx and PostgreSQL Full Text Search. We played with all of them but only Sphinx came close to meeting our criteria above. We’re pretty confident, looking back, that we made the right decision.

General introduction to Sphinx

Sphinx has two parts, an indexer and a search daemon. The search daemon listens for search queries such as “alpha & (delta | gamma)” and goes through the indexes for matches. The indexer reads data from a data source (relational database, XML pipe) and indexes it according to the document schema. When indexing has finished, it rotates (swaps) the index currently used by the search daemon with the new one. The old index is then deleted. This means (re)indexing and searching can happen in parallel, and even on different physical machines if needed.

Implementation

We have different sorts of documents: Pages, Comments, Attached files, Profiles, and filled out Forms. These documents are non-uniform: different sorts of documents have different attributes. So we don’t want to hard-code the structure of the index in sphinx.conf. Instead we’ll use sphinx XML pipe functionality and generate the schema structure and data from the Django Model as needed. So for each Django Model we create a sphinx index. Then when a user searches we do a search for every document type and combine the results and display them to the user.
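A minimal sketch of the "search every document type and combine" step. How the results are actually ranked against each other isn't spelled out here, so sorting by match weight is an assumption, and `combined_search` and the per-type search functions are hypothetical names:

```python
import heapq


def combined_search(query, searchers, limit=10):
    """Query every document type and return the best matches overall.

    searchers maps a document type to a function that returns a list of
    (weight, doc_id) pairs for the query. Ranking by weight is an assumption.
    """
    results = []
    for doc_type, run_search in searchers.items():
        for weight, doc_id in run_search(query):
            results.append((weight, doc_type, doc_id))
    # Highest weight first; ties fall back to doc type and id.
    return heapq.nlargest(limit, results)


# Stand-ins for one Sphinx query per index.
searchers = {
    "page": lambda q: [(5, 1), (2, 2)],
    "comment": lambda q: [(4, 9)],
}
print(combined_search("budget", searchers))
# [(5, 'page', 1), (4, 'comment', 9), (2, 'page', 2)]
```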

We connect Sphinx to Python with the Python library sphinxapi.py included in the Sphinx package. It’s a pretty straightforward mapping of API functions to Python methods. You can set the match mode, how the matches are sorted, which indexes to search through and so on. There are also a number of open source libraries that connect Django and Sphinx. We looked at Django-Sphinx but it hasn’t been maintained in the past couple of years and it doesn’t support XML based data sources (which we want to use). It instead generates a sphinx.conf file with the indexes and schema structures in there.

Generating XML data

So let’s illustrate how XML generation works using an example Comment model. We add a Sphinx metaclass for each Django Model we want to index.

The classes Attr and Field are simple wrapper classes that we use to generate the Sphinx schema. They also make sure that the generated XML data is of the correct type. Sphinx has a very fragile XML parser, so we have to make sure that boolean columns only contain boolean values, that everything is escaped properly and so on.
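Here's a rough sketch of what such wrappers could look like. This is not the original listing: the method name `to_xml_value`, the set of supported types, the 0/1 encoding for booleans and the example Comment schema are all assumptions made for illustration.

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape


class Field:
    """Full-text field: always emitted as escaped text."""

    def __init__(self, name):
        self.name = name

    def to_xml_value(self, value):
        # None becomes an empty string so the parser never sees "None".
        return escape("" if value is None else str(value))


class Attr:
    """Typed attribute: coerces values to what the declared type allows."""

    def __init__(self, name, type_):
        if type_ not in ("int", "bool", "timestamp"):
            raise ValueError(f"unsupported attribute type: {type_}")
        self.name = name
        self.type = type_

    def to_xml_value(self, value):
        if self.type == "bool":
            return "1" if value else "0"
        if self.type == "timestamp":
            # Assumes timezone-aware datetimes; emits Unix seconds.
            return str(int(value.timestamp()))
        return str(int(value))


# Hypothetical schema for a Comment model.
COMMENT_SCHEMA = [
    Field("text"),
    Attr("author_id", "int"),
    Attr("is_deleted", "bool"),
    Attr("updated_at", "timestamp"),
]

print(Field("text").to_xml_value("a < b & c"))  # a &lt; b &amp; c
```

The point of routing every value through a wrapper is that a bad value fails loudly in Python instead of producing XML that trips up the indexer.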

Using the SphinxSchema definition above we can easily generate the XML schema:
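A minimal sketch of that step (not the original listing). The `<sphinx:schema>`, `<sphinx:field>` and `<sphinx:attr>` elements are part of Sphinx's xmlpipe2 format; the tuple-based `SCHEMA` and the function name `schema_xml` are assumptions made for illustration:

```python
from xml.sax.saxutils import quoteattr

# Hypothetical schema: (kind, name, type) tuples; fields carry no type.
SCHEMA = [
    ("field", "text", None),
    ("attr", "author_id", "int"),
    ("attr", "is_deleted", "bool"),
    ("attr", "updated_at", "timestamp"),
]


def schema_xml(schema):
    """Render the <sphinx:schema> element of an xmlpipe2 stream."""
    lines = ["<sphinx:schema>"]
    for kind, name, type_ in schema:
        if kind == "field":
            lines.append(f"  <sphinx:field name={quoteattr(name)}/>")
        else:
            lines.append(
                f"  <sphinx:attr name={quoteattr(name)} type={quoteattr(type_)}/>"
            )
    lines.append("</sphinx:schema>")
    return "\n".join(lines)


print(schema_xml(SCHEMA))
```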

So with the combination of schema and a Django QuerySet we can now generate the XML data for sphinx to index. Pseudocode:
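Something along these lines, as a runnable sketch rather than the original pseudocode. It assumes records shaped like the dicts a QuerySet's `.values()` would return, each with an integer `id`; the function name `docset_xml` and the `columns` argument are made up for the example:

```python
from xml.sax.saxutils import escape


def docset_xml(schema_xml, records, columns):
    """Yield an xmlpipe2 stream line by line.

    records: iterable of dicts, each with an integer "id".
    columns: names of the fields/attributes to emit for each record.
    """
    yield '<?xml version="1.0" encoding="utf-8"?>'
    yield "<sphinx:docset>"
    yield schema_xml
    for record in records:
        yield f'<sphinx:document id="{int(record["id"])}">'
        for name in columns:
            value = record[name]
            if isinstance(value, bool):
                value = int(value)  # booleans go out as 0/1
            yield f"  <{name}>{escape(str(value))}</{name}>"
        yield "</sphinx:document>"
    yield "</sphinx:docset>"


rows = [{"id": 7, "text": "R&D budget", "is_deleted": False}]
for line in docset_xml("<sphinx:schema/>", rows, ["text", "is_deleted"]):
    print(line)
```

Writing it as a generator means the whole data set never has to sit in memory: the indexer can read the stream as it's produced.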

This works but we have to optimize: we don’t want to reindex everything when a single record changes. So we use two indexes for every document type: a main index and a delta index. The main index is the large index that contains everything that hasn’t been touched recently and the delta contains those documents that have been recently created, modified or deleted. The delta index is small and can be re-indexed frequently. The easiest way to accomplish this is to give every model an “updated_at” timestamp, and every time a record is changed you update the timestamp.

Then you just partition the indexes into parts: the main index contains all records where [0 <= updated_at <= last_merge_time]. The delta contains all records where [last_merge_time < updated_at <= last_delta_time]. More partitions can be added if needed, but two indexes per document type will probably be good enough unless you have a huge database or documents change very frequently. Anyway, every time a user changes a document the indexer starts and re-indexes all the files that have been changed since last_merge_time and updates last_delta_time to the current time (technically, the time when it *started* delta-indexing, because that's when the database transaction starts). See the illustration:

After an update the delta partition is completely re-indexed. Then the delta and main indexes are merged into one. During this time a few new documents arrive and the process starts anew.
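The two ranges can be written down directly. This sketch uses plain integers as timestamps and dicts as records, both assumptions to keep the example self-contained:

```python
def partition(records, last_merge_time, last_delta_time):
    """Split records into (main, delta) by their updated_at timestamp."""
    main = [r for r in records if 0 <= r["updated_at"] <= last_merge_time]
    delta = [
        r for r in records
        if last_merge_time < r["updated_at"] <= last_delta_time
    ]
    return main, delta


records = [
    {"id": 1, "updated_at": 100},
    {"id": 2, "updated_at": 250},
    {"id": 3, "updated_at": 400},
]
main, delta = partition(records, last_merge_time=200, last_delta_time=300)
print([r["id"] for r in main])   # [1]
print([r["id"] for r in delta])  # [2]
```

Record 3 was changed after the delta run started, so it lands in neither partition yet. The next delta run picks it up.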

So how do we start the indexer from django? Easy, we just touch(1) a file whenever a document is saved. Django has a post_save signal which we use to catch all save events. We check if the model that's being saved has a SphinxRecord metaclass and if so, we wake the indexer. It's the simplest solution we could think of :).
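A sketch of that hook. The trigger file path is made up, and checking for a nested `SphinxRecord` class with `hasattr` is an assumption about how indexed models are marked. The Django wiring is shown as a comment so the example runs without Django installed:

```python
from pathlib import Path

# Hypothetical location of the trigger file watched by the indexer daemon.
TRIGGER_FILE = Path("/tmp/sphinx-reindex.trigger")


def wake_indexer(trigger=TRIGGER_FILE):
    """Equivalent of touch(1): create the file or bump its mtime."""
    trigger.touch()


def on_post_save(sender, instance=None, trigger=TRIGGER_FILE, **kwargs):
    """Signal receiver: wake the indexer only for indexed models."""
    # Assumption: indexed models carry a nested class named SphinxRecord.
    if hasattr(sender, "SphinxRecord"):
        wake_indexer(trigger)
        return True
    return False


# In a Django project this would be registered with:
#   from django.db.models.signals import post_save
#   post_save.connect(on_post_save)
```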

Abbreviated version of the daemon that spawns the indexer (we left out error checking, logging, etc):
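A minimal sketch of such a daemon (not the original listing). It polls the file's modification time with a short sleep between checks; the names `poll_once` and `main` and the one-second interval are assumptions:

```python
import os
import subprocess
import time


def mtime(path):
    """Return the file's modification time, or 0.0 if it does not exist."""
    try:
        return os.path.getmtime(path)
    except FileNotFoundError:
        return 0.0


def poll_once(trigger_path, last_seen, run_indexer):
    """Run the indexer if the trigger file was touched since last_seen."""
    current = mtime(trigger_path)
    if current > last_seen:
        # The mtime is read *before* indexing: touches that arrive while
        # the indexer runs are picked up by the next poll.
        run_indexer()
        return current
    return last_seen


def main(trigger_path, indexer_cmd, interval=1.0):
    """Loop forever, spawning indexer_cmd whenever the file is touched."""
    last_seen = mtime(trigger_path)
    while True:
        # subprocess.call blocks, so only one indexer runs at a time.
        last_seen = poll_once(
            trigger_path, last_seen, lambda: subprocess.call(indexer_cmd)
        )
        time.sleep(interval)
```

Because the check only compares timestamps, any number of touches between two polls still results in a single indexer run.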

It's just busy waiting until a process touches the PID file, then starts the sphinx indexer. Note that because we spawn new processes we can easily change the python code for updating/merging without having to restart this daemon. Also note that when multiple people touch the pid file the indexer is still only started once. And this way we also know for sure that the delta index and merge processes will never run at the same time.

Let's do a quick back of the envelope estimate: Delta indexing typically takes between 2 and 10 seconds, and if we merge at least once every 500 delta indexes, then that's 1 merge roughly every hour. We currently index only a couple million documents and the indexes are only a few gigabytes large. Merging a delta and a main index is essentially the merge step of the merge sort algorithm. The two indexes are just interleaved, so the merge step takes roughly the time needed to copy the files. Copying a few gigabytes worth of indexes every hour is absolutely fine from a performance point of view so this straightforward main+delta solution is good enough for our purposes. And yep, in practice the indexer is running pretty much all day and night, because people are adding documents to Papyrs all the time.
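The "roughly every hour" figure can be checked quickly, assuming delta runs happen back to back (the function name is just for illustration):

```python
def minutes_between_merges(seconds_per_delta, deltas_per_merge=500):
    """Time between merges if delta indexing runs back to back."""
    return seconds_per_delta * deltas_per_merge / 60


for seconds in (2, 10):
    print(f"{seconds}s per delta: {minutes_between_merges(seconds):.0f} min")
# 2s per delta: 17 min
# 10s per delta: 83 min
```

So the interval between merges is somewhere between a quarter of an hour and an hour and a half, which averages out to about an hour.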

Ghosting

Ghosting is when you delete a document but it still shows up in the search results for a while after. Suppose the main index contains document ids {1, 2, 3} and delta is {4, 5}. Then you change the title of document 2 and as a result it goes to the delta index. So main: {1, 2, 3}, delta: {2, 4, 5}. When you search for the document's new title it shows up exactly as expected. Because document 2 has the same primary key in the main and delta index Sphinx knows only to return the result from the delta index, so you don't get duplicate results. Perfect. Now you delete document 2 and you're left with: main: {1, 2, 3}, delta: {4, 5}. And when you search for the old document title it suddenly shows up, because the document is still in the main index. That's called ghosting and we want to keep it from happening.

The solution: we give every document type an attribute is_deleted. We then search with a sphinx filter is_deleted=False. Sphinx doesn't let us change fields (variable length text) but sphinx does allow us to update boolean values, integers and timestamps in a search index. So, whenever a document is modified we set is_deleted=True in the main index and in the delta index. This ensures that the old document doesn't show up in the search results at all anymore. Then, a few seconds later the new delta index will be ready that contains the updated document.
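A toy model of the problem and the fix, using plain dicts in place of real Sphinx indexes. The `search` function only imitates the "delta wins on the same id" behaviour and the is_deleted filter, and is not Sphinx's API:

```python
def search(main, delta, term):
    """Return ids whose title contains term, skipping deleted documents."""
    merged = {**main, **delta}  # same id in both: the delta entry wins
    return sorted(
        doc_id
        for doc_id, doc in merged.items()
        if term in doc["title"] and not doc["is_deleted"]
    )


main = {
    1: {"title": "alpha", "is_deleted": False},
    2: {"title": "old title", "is_deleted": False},
    3: {"title": "gamma", "is_deleted": False},
}
delta = {4: {"title": "delta", "is_deleted": False}}

# Document 2 was deleted, so it is absent from the delta, but the stale
# copy in main still matches: that is the ghost.
print(search(main, delta, "old"))  # [2]

# Flag the stale copy in place; no re-index of main is needed.
main[2]["is_deleted"] = True
print(search(main, delta, "old"))  # []
```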

Permissions

With Papyrs different people in a group have different permissions. So we have to make sure that we display documents to a user if and only if the user has sufficient permissions to at least view that document. So after Sphinx comes up with a list of documents that match what the user searched for, we simply filter out those documents that the user can't access.
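In code this is a plain post-filter over the matches. The permission table and the `can_view` callback below are hypothetical stand-ins for the real permission check:

```python
def visible_results(matches, user, can_view):
    """Keep only the matches the user is allowed to view, in rank order."""
    return [doc for doc in matches if can_view(user, doc)]


# Hypothetical permission table: document id -> users with view access.
ACL = {10: {"ann", "bob"}, 11: {"ann"}, 12: {"bob"}}
matches = [11, 10, 12]  # ids as ranked by the search engine

print(visible_results(matches, "bob", lambda u, d: u in ACL.get(d, set())))
# [10, 12]
```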

Indexing attachments

We index inside attachments, such as PDFs, Excel spreadsheets, Word documents and so on. This means we have to extract the text content of these different document formats. For this we just use the packages out there: ps2text for PDF files, antiword for MS Word documents. However, many of these text extraction tools mangle the text somewhat. Newlines and punctuation go missing, lines are concatenated without spaces between them, and garbage characters end up in the middle of words. We clean up the output by simply removing all suspicious looking characters and stripping all HTML tags from it.
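A minimal sketch of that kind of cleanup. What counts as "suspicious" is an assumption here (control characters and the Unicode replacement character), not the original rule set:

```python
import re

TAG_RE = re.compile(r"<[^>]+>")
# Control characters (except tab, newline, carriage return) and U+FFFD.
JUNK_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\ufffd]")
SPACE_RE = re.compile(r"\s+")


def clean_extracted_text(raw):
    """Strip HTML tags and junk characters, then collapse whitespace."""
    text = TAG_RE.sub(" ", raw)   # a tag becomes a space so words stay apart
    text = JUNK_RE.sub("", text)  # junk inside a word is dropped outright
    return SPACE_RE.sub(" ", text).strip()


print(clean_extracted_text("<p>Quar\x00terly\n\nre\ufffdport</p>"))
# Quarterly report
```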

If all content is really clean then you rarely have to search for only part of a word. But when some of the content is a bit messy then infix search becomes really valuable. Half the spaces in a document may be missing and you're still going to find matches with perfect accuracy.
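A tiny illustration of the difference, using plain string matching rather than Sphinx itself (both function names are made up):

```python
def word_match(text, term):
    """Whole-word matching: the term must equal one of the tokens."""
    return term in text.lower().split()


def infix_match(text, term):
    """Infix matching: the term may occur anywhere inside a token."""
    return any(term in token for token in text.lower().split())


mangled = "annualbudgetreport 2012"  # spaces lost during extraction
print(word_match(mangled, "budget"))   # False
print(infix_match(mangled, "budget"))  # True
```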

Tips

make sure you bind the search daemon to localhost, otherwise everybody can connect to it. If you have a dedicated sphinx server, set up an SSH tunnel (e.g. ssh -f -N remote_server -L [local_port]:localhost:[remote_port]) because sphinx doesn't have any built-in authentication or encryption.
if sphinx segfaults for unclear reasons it's probably because of the forking model you configured in sphinx.conf.
we tried Sphinx's alpha real-time index support, but it was still very unstable (segfaults galore) and it doesn't support infix searching. It's in active development though, so that might be much better soon!
compile Sphinx from source with at least libexpat and iconv support.

Conclusion

We've had this setup in production for almost 3 months now and it all works great. Searches typically take just a few milliseconds and new results are added to the index within 5 seconds on average. We've spent a lot of time making sure that search "just works". So we thought we might as well document what decisions we made and why. This is the document I wish existed when I started working on Papyrs search.

Phew, that's it. This turned out a lot longer than I had anticipated, but as I'm proofreading this there isn't much I can leave out. Thanks for reading this far. If you found this interesting, please spread the word or leave a comment!

PS: I could open up the source (it's a few hundred lines of Python) and throw it on github. I'd have to spend an afternoon refactoring it though, so let me know if you're interested.

Posted on March 22, 2012 (updated September 11, 2014)
More Google Apps Integration

Two weeks ago we added integration with Google Mail to easily handle workflows. Today we’re launching two new features for our Google Apps users: integration with Google Docs & Google Calendar!

Google Calendar Integration

First, Google Calendar support:

Google Calendar events on a Papyrs Page.

You can create as many Google Calendars as you need. You can create Calendars for upcoming milestones, meetings, travel schedule, and so forth. Then you can simply drag a Media Widget on a Papyrs page and pick the Google Calendar of your choosing. People you share the Papyrs page with will then be able to view the events on the calendar.

The old Google Calendar widget (which doesn’t work so well) is still available. It’s now called “Classic Calendar Widget”.

Tip: Papyrs + Thymer

As many of you know already, you can place your Thymer tasks and deadlines on a Google Calendar using iCal (read more). So the next logical step is to put Thymer deadlines and milestones on a Papyrs page. So now you can. Next to each Thymer calendar event you’ll find a small link directly back to the Thymer task. Pretty handy!

Google Docs Integration

Google has some great applications: an online word processor (Google Docs), an online spreadsheet editor (Google Spreadsheets) and so on. And if you want to effectively work together with your colleagues it helps if you can keep related Google files on a page. With the new Google Docs integration in Papyrs, files from Google Docs can now be added easily to a Papyrs page, just like any other attachment. Simply add an attachment and use the Google Apps tab, where you can browse or search for the Google documents you want to attach:

Browse Google Docs files.

You click on the files and you finally get a list on your page that looks like this:

Documents, Spreadsheets, and Presentations attached to a Papyrs Page.

This makes it easy to create pages to get an overview, organize, and discuss all relevant documents, whether you stored them on your PC, in Google Docs, or already have them on Papyrs. You can take advantage of these features as a Google Apps user or you can attach Google documents from your personal Gmail account. Both work, so use whatever suits you best.

Reorder attachments

You can now change the order of attachments with drag and drop. Only a small change, but apparently something quite a few of you wanted!

Note for Google Apps users

The features outlined above won’t work until you grant Papyrs access to your Google Apps account. To grant Papyrs access navigate to
https://www.google.com/a/cpanel/YOURDOMAIN.COM/Dashboard
and then click on the Papyrs logo on the bottom of the page. Then, under Data access you’ll see an option to grant access to Papyrs.

If you have trouble getting these features to work, just drop us a line at team@stunf.com and we’ll figure out what’s wrong.

Also, Google Apps for Business is free for up to 10 users (Google Apps signup).

Little stuff

The Papyrs logo (or the Logo of your organization) in the upper left corner of every page is now clickable. It takes you back to your homepage.
Image Galleries now work better on mobile devices.
Attachment links are now clickable in the notification emails we send out.

More to come…?

We improved the integration of Papyrs with Google Apps by popular demand. So if you’d like to see even more Google Apps integration, let us know. Or maybe you have other ideas for Papyrs, a feature suggestion, or a bug you found. Just shoot us an email.

We’ve been working hard on a lot of things we can’t reveal just yet. Some big, some small. More soon!
