GDPR for Cloud Software Providers

Introduction

In 2016 the European Union created the General Data Protection Regulation (GDPR), and a few months from now, on May 25, 2018 to be precise, it will go into full effect. This regulation has sweeping implications for any EU business that deals with personal data and for any business outside of the EU that stores or processes data of EU citizens. Did you suddenly get emails from webapps that are going to delete your account after years of non-use? That’s right, that’s GDPR at work.

As a cloud intranet software provider the GDPR regulation affects everything we do, so we knew we had to start our preparations for compliance way in advance of the May 2018 deadline. In the past year we’ve been very busy and in this document we describe the measures and precautions we’ve had to take, and we reflect on some of the lesson we’ve learned.

We’ve created this document in the hope that it is of value to others that are grappling with the implications of GDPR. We found that it is easy to overlook complete areas that are affected by GDPR. And we got it relatively easy: we are a 100% digital cloud-based company. We don’t have filing cabinets filled with documents, or old computers stuck in dusty closets. The only thing we had to worry about is the software we wrote ourselves. And yet, becoming GDPR compliant was still a ton of work. So if we had to go through all this trouble as a small company with only a couple of products, we cannot imagine how big the impact of GDPR will be for larger companies that have many employees and products that go back decades. Suffice to say, GDPR is no joke.

Unlike most of our blogposts, this one is going to be a little more technical. With that said, let’s dive in!

Data Storage

The first and most critical thing to figure out is what kinds of data we store, and in which systems. We’ve been in business for 10 years which means that we have accumulated a lot of cruft. Old services, like backup systems, that have been replaced by newer services, old database tables, old diagnostics data, and so on. So first we had to make an exhaustive list of all places where data is stored, and then we had to figure out an appropriate GDPR strategy for each one. Following the YAGNI mantra, deletion was our solution of choice.

Our primary product is a drag&drop intranet SaaS webapp called Papyrs. Users can create pages, add file attachments, image galleries, comments, forms, and so on. Because we respect the privacy of our customers we don’t know if any of the data stored on our service by our customers is sensitive data protected by GDPR. Consequently we take the most conservative assumption: everything stored by our customers is critically confidential data, or PII (personally identifiable information) in GDPR parlance. This means that when a customer closes their account all their data has to be permanently incinerated.

That’s easier said than done. Papyrs has several types of data. We’ll briefly cover each type.

File uploads

We don’t use a storage service like Amazon S3 so we don’t have to worry about an orphaned bucket floating around somewhere. All files customers upload to Papyrs end up in the customer’s directory on our own CDN (mirrored using rsync). The metadata on each file (e.g. permissions, statistics, ownership) are stored in the database: every file has a corresponding database record. The database is transactional, but the filesystem is not. When an upload fails the database transaction is rolled back, but a partially uploaded or broken file may remain. In an ideal world the database and the filesystem would guaranteed to be in sync, but we have to make do with the abstractions we have. Partial uploads still get deleted eventually and it cannot lead to errors visible to the end user, but we still have to be mindful of the subtleties of dealing with the filesystem.

Papyrs dynamically resizes images as needed (to improve load times and to save bandwidth for mobile devices), which means that for every image uploaded to Papyrs there can be many cached thumbnails. The story here is similar to that of regular file uploads, except that we have to be extra careful to make sure all the right thumbnails get deleted.

The lesson here is to make sure you keep track of stray files, and that you don’t delete a row from a database until you’re absolutely sure all the corresponding files on the filesystem are gone. Otherwise it may too late to identify which files have to be deleted!

Fishermen don’t have to worry about GDPR

Log files

All systems that make up the Papyrs service keep log files that contain information about system health, performance, errors and warnings, and other diagnostic information. Although typically these log files contain nothing even remotely sensitive or confidential, some information in log files (like IP addresses) may be considered PII under GDPR. Figuring out exactly which lines in which log files relate to which user on Papyrs is difficult and exactly the kind of privacy-invading work we don’t want to to. So instead we fall back on our pessimistic strategy: we treat all data, except for the data we know for sure is innocuous, as confidential. Our solution: automatically delete (rotate) all log files older than a few weeks. In practice we rarely needed diagnostic information older than that, anyway.

Here the Python RotatingFileHandler and TimedRotatingFileHandler modules are a great solution. Of course we also rely on Debian’s logrotate whenever possible.

But what to do with services that don’t listen to the SIGHUP signal sent by logrotate, and that don’t have any in-built way to rotate their own log files? Well, there is an easy trick:

$ tail -n 1000 logfile.log | sponge logfile.log

This truncates the file and puts the last 1000 lines back into the logfile. You can’t beat the simplicity, but if you want to cull the beginning of your log files this way you have to be mindful of one thing: while most logging systems with multiple writers use a mutex to synchronize writing to the same file, this bash hack doesn’t. So you may end up with a malformed line or two in your log file as two processes write to the same logfile simultaneously. If that’s unacceptable, you can instead create your own lazy log rotation by sticking this in a cron file:

$ cp logfile.log > logfile.log.1 && echo “” > logfile.log

The copy here is deliberate: you can’t move the file. This is because the services that are writing to the logfile will just keep writing to their open file handle, blithely unaware that the file has been moved, renamed, or even deleted. The logrotate option COPYTRUNCATE works like this too.

Cache files and other temporary data

Memcached, message queues, and other systems for transient storage can easily end up containing confidential data. Luckily, we didn’t have to do anything here. Everything we store in memcached expires quickly, so we don’t have to worry about stale data residing in an in-memory store. We made that decision a long time ago for performance reasons. Too many web applications end up relying on memcached for performance, so much so that when memcached resets the database and workers can no longer keep up with all the traffic. We didn’t want to worry about that, so our memcached data can be purged at any time. Everything in our message queue has a deterministic lifetime as well.

This is one of those cases where good architectural design choices end up paying off in unexpected ways! Always a nice surprise.

Full Text Search

We use Sphinx for full text search. Sphinx is terrific. Performance is great and the quality and ranking of the search results is superior. One problem is that we have one Sphinx database for all our user data, and a full text search database like Sphinx isn’t designed for quickly deleting a range of records. Re-creating the entire full text index isn’t an option, as that would take too long.

The only updates sphinx knows is “merge & swap”. A main index and a delta index are merged together into a new index and then the new index is swapped with the old one. The delta index can contain delete indicators. Then during the merge phase those rows with delete indicators get dropped from the main index.

Relational Database

Deleting user data from a database is either trivial or practically impossible, depending on what it means to delete. Files on a filesystem can be securely deleted if needed. Or you can simply create an encrypted directory and throw away the key to get rid of the data securely. But when rows are deleted from a database like postgres or mysql the data isn’t guaranteed to be gone from the hard drive. The records may persist in the binary logs used for database mirroring, or the data may simply persist as junk data inside the database, even after issuing a regular VACUUM.

How far do we want to go here? Do we have to issue a VACUUM FULL (which re-creates the entire database table) to securely delete data? Does that really provide additional guarantees when data may still persist on sectors of hard drives? We don’t know. The GDPR informs us that we have to “erase” data (including copies and replications thereof), but we have no idea what it actually means in the context of software. We think deletion to the point where we cannot recover the data — even if we wanted to — qualifies as real erasure.

I’m sure the fine folks who wrote this regulation thought carefully about what it really means to erase data. They’re just sparing us the technical details so we can experience the joy of figuring it out for ourselves.

Backups

Backups and GDPR are, on a fundamental level, diametrically opposed to each other. With a good backup solution you know that even if disaster strikes no user data is lost and the service can be entirely restored. To comply with GDPR a user has the “right to be forgotten” which means their data cannot be recovered and is lost forever. So how can we satisfy these conflicting goals? Clearly we cannot just destroy our backups, but we can’t keep customer data in perpetuity either.

Our solution: have backups rotate on a strict schedule, and delete old backups, including remote and offsite backups. The only difficulty is that this adds a delay to irrevocable data deletion. If, somehow, data ends up in a backup set that should already have been deleted, then it could take another 6 months after we fix the mistake for the data to be entirely gone from all backups.

There is no perfectly satisfactory solution here. For future products we’ll probably create backups strictly on a per-customer basis. Then we can easily backup/restore an individual customer’s data as needed, and if a customer exercises their GDPR “right to be forgotten” then we can immediately purge all their data from our systems.

Third party services

We don’t share any customer data with 3rd party services. We don’t share data with advertisers. We don’t use any software for sales lead generation. We like to build everything ourselves, and that served us well so far. The only 3rd party service we use is Pingdom, and they don’t have access to any confidential data of ours.

Data Export

GDPR gives customers the “right to data portability”. We already had a Papyrs Backup system in place, so we didn’t have to do anything here.

PCs, Laptops and Phones

Here we just have to take some common sense precautions. There shouldn’t be any sensitive customer data on our work devices, but old devices get securely wiped just in case. We use full disk encryption on all machines (macOS FileVault and TrueCrypt) and iPhones because they are encrypted. If somehow an old device ends up in the trash we can be confident no data leak will happen as a consequence.

Employee records

We don’t just have GDPR obligations toward our customers, but to employees and contractors as well. They, too, have a right to know what data we have on them, and they have a right to be forgotten after their employment has ended. There’s not much to say here, except that having a central place where all data is stored makes deletion a whole lot easier.

Customer Rights

So far we’ve mostly talked about the technical implications of GDPR. GDPR also specifies that our customers have a number of concrete rights: We have an obligation to tell customers what data we have on them and how we use it. We have an obligation to notify our customers when data leaks occur. Customers have a right to tell us not to process their data.

We can imagine these rights lead to a lot of difficulty for many businesses, but not for us. Thankfully we don’t need to collect any personal data from our customers except for a name and address for billing purposes, so we don’t. Our business model is really straightforward and we don’t need to engage in any shenanigans. Customers are free to come and go as they please, and they can take their data with them.

Appreciation of the GDPR landscape

Conclusion

The main thing we learned in the past year while preparing for GDPR is that it’s much easier to design a new service with GDPR in mind than to retrofit GDPR compliance onto systems that were created many years ago. We are really lucky that we already stored 90% of our data in a structured way where every piece of data had a known owner (so we can easily figure out what to delete). However, we had to make sure that all data on our servers was accounted for, not 90%. If you have a pile of mystery data somewhere — even when nobody has access to it — that’s still a GDPR liability. As the saying goes, when you’re 90% done you still have 90% of the work ahead of you.

We hear that companies are only now starting with their GDPR preparations, but time is running out. GDPR compliance isn’t something that can be outsourced or something a small task force can do by itself, because GDPR affects affects customer support, sales, operations, marketing, HR, and every other part of your organization. GDPR affects all data, and data is everywhere.

If your organization has to be GDPR-ready in a few months and you haven’t started yet, I wish you luck :)

Thanks for reading!

October 10, 2012September 11, 2014

Improving intranet engagement

One of the main hurdles you face when setting up an intranet is user engagement. You have to get people to actively use the systems you put in place. Even if everybody is on the same page and believes that getting organized with an intranet will save time in the long run the motivation may not be there to explore new software and to change old habits.

You set up an intranet and then, after a couple of months everybody has reverted to their old style of working and the company is back at square one. Different versions of files are emailed back and forth like before. Every now and then important documents get lost. There is no clear overview of what’s happening in the organization and the company spent a bunch of money on an intranet solution that didn’t stick. This is a large (and very frequent) problem today.

We’ve been thinking a lot about this problem because we have an intranet product (Papyrs) and it works on a subscription basis. This implies that as long as companies actively use our intranet product they will renew their subscription and we make money. When employee engagement drops off and the intranet stops getting updated with new information the value of the intranet decreases rapidly and the subscription inevitably gets canceled.

This, by the way, is one of the big advantages of subscription software. The software company (that’s us in this case) only makes money for as long as value is provided to the customer. Given that aquiring new customers is much more expensive than keeping existing customers happy (typically by a factor 10) keeping our customers happy is absolutely crucial. This in stark contrast with old enterprise-style intranet packages where the big sale happens up front. After the software company cashes the check it doesn’t matter much whether the customer is still happy 6 months down the road. And so, unsurprisingly those intranet systems are often delivered over time, over budget and to make matters even worse, they match the needs of the people in the organization so poorly that they are then left unused.

So to summarize the problem: An intranet is only valuable when lots of information is stored on it and when this information is kept current. This in turn means that user engagement is vital. Otherwise the company will revert to their ineffective old ways and we’re back to square one.

Engagement

So what keeps people more involved with your intranet? We’re going to look at all these issues from our perspective (as the creators of Papyrs), but the insights apply to intranets generally, and to other forms of social software where user engagement is critical for success.

1. Let people get involved easily

User friendly interface
Straightforward functionality
Users shouldn’t be able to break anything

2. Keep everybody on the same page with email

Let people choose the subjects they get emailed about
Make it clear who will get email updates and when

3. Encourage everybody to contribute to the intranet

Set permissions liberally (your coworkers really aren’t going to vandalize pages)
Encourage everybody in your organization to make changes and improvements wherever they see fit

4. Make everything look inviting and appealing

Keep all pages organized with a sensible structure.
Split up pages that get too large
Add links to related pages

5. Fast universal search

The more data on your intranet the more important search becomes
Find-as-you type search helps a lot
It must be fast & reliable. Many intranet solutions have a search box that don’t always return results that you know exist

6. Access from mobile devices

So people can access important documents and discussions on the go

We designed Papyrs with the issues above in mind, so our users don’t have to worry at all about most the above. Of course it’s still up to our users to write quality content for the intranet and to keep everything organized but Papyrs does a lot of the heavy lifting for you.

Setting up an intranet for a business isn’t easy. If people refuse to use the intranet software the intranet will fail. If people can’t easily find important information or easily contribute to the internet the intranet will fail. A lot has to go right for an intranet to become a central activity hub for your organization. So when you create an intranet make sure to vet it on the aforementioned points. Or just take the easy road, and sign up for a free Papyrs trial.

June 27, 2012September 11, 2014

Why Backpack is getting phased out

February 2013 Update: We’ve added a way to import your pages from Backpack into Papyrs.

We just read on TheNextWeb that Backpack, a web app from 37signals is no longer accepting new signups. Existing users aren’t left out in the cold, but you can no longer sign up for their service. Backpack is pitched as an “Easy intranet for your business”. Much like our product Papyrs. So we have always considered Papyrs good alternative to Backpack because with both products you can create pages, organize your documents and discuss and plan everything about your organization in one place. Of course there are also plenty of differences. For instance, Backpack had a lot of functionality for calendars and reminders, whereas Papyrs allows far greater flexibility in the sort of pages you can create with anything from Web Forms to Twitter widgets. And as we explain below, we also have a very different vision about intranet software. Now that Backpack is being phased out one of our competitors is, for all intents and purposes, a thing of the past.

So why did 37signals decide to phase out Backpack?

They didn’t say exactly, but we can pretty much guess why. When you first launch a product it has only a few core features. A new product must do a few things, and it must do those few things really well. Then, as you continue working on your product and make improvements to it (based on the feedback you get from your initial customers) the number of features in the product grows slowly but steadily.

Suppose you start with 3 different products that solve 3 distinct problems. The first product is essentially a user-friendly wiki. Companies can dump all their internal documents on there so everybody stays on the same page. The second product is for project management: it has a calendar, tasks, timelines, and some index of active and archived projects. The third product is, let’s say, for collaborative text editing. Multiple people can work on a document simulaneously and see each other’s changes and annotations. So three products: a wiki, a project management tool and a collaborative document editor.

Initially all is great and these three products solve a different issue and everybody’s happy. Happy? Well, no, just *mostly* happy. Because the moment you have customers the feature requests keep rolling in and most feature requests are going to be pretty reasonable. Your users are going to want to have one login for the different products. They’re going to want one unified place where they can see all activity and updates. And if two out of three products have a calendar, why don’t the events of one product show up on the calendar of the other?

And so the apps converge…

Essentially, every time you add functionality the products become more alike. These web 2.0 productivity products naturally converge towards one-another. When you add tasks to the wiki-app, it becomes more like the project managing app. And when you add rich-text notes to the project manager app it becomes more like the wiki-app. The lines between the apps get blurred and if you’re not careful the apps you built that people love slowly grow into monstrosities that attempt to do everything and as a result become complex and clunky and altogether unappealing.

This is what we realized a little over a year ago when we started working on Papyrs. When it’s inevitable that users want a lot of functionality but all users want different functionality you need to make some hard choices. Even though you know you can’t keep everybody happy you still want to keep as many people as happy as you can.

We think there are two good solutions:

1) you figure out which features are most critical and you create a new app that combines the all best bits from the original 3 apps. You just focus on extreme simplicity and ruthlessly cut functionality across the board. This is what 37signals did and they launched Basecamp Next earlier this year. It has some of the functionality of Backpack, but not all. It also has most of the functionality of the original Basecamp, but not all.

2) you create a product that maximizes flexibility and give users the choice which functionality they want to include. This is what we did with Papyrs. Because Papyrs pages are just made of simple widgets that people can just drag&drop onto a page. Our users use the widgets they need and ignore the ones they don’t. The collection of widgets available to them simply grows over time. We have all sorts of customers who use Papyrs in completely different ways. And because of our widget approach we can add functionality to widgets to make our customers happier without making the product more complex for newcomers.

Conclusion

So although we were a little surprised to see Backpack shutting down, it makes a lot of sense. When you have different products and you notice that they keep growing towards one another and get a lot of overlapping functionality you have to make some tough decisions. 37signals decided to create a new product that has most (but not all) of the functionality from their old products Basecamp and Backpack.

We decided with Papyrs to start with a platform that allows us to add functionality where needed without having to make sacrifices in usability. So hopefully we chose wisely and we won’t have to make a Papyrs Next a few years down the road :)

June 7, 2012September 11, 2014

Papyrs Image Improvements

Image widget

We made a number of improvements to the way you deal with images in Papyrs:

1. You can now reorder images with drag & drop, right from your page.

Reorder images with drag & drop

2. You can add links to images.

Click on an image to add or remove a link

3. Images now show up in email updates.

Images show up in emails. Much better this way :)

That’s it for today. More updates coming soon!

April 6, 2012September 11, 2014

Papyrs Interface updates

Quick update, everybody!

Our old popup dialogs, although functional, weren’t exactly shiny. So we decided to give them a much needed face-lift:

Feedback Dialog

Feedback Dialog Before. Yikes!

Feedback Dialog After. Ahh… much better :)

Media Widget Dialog

Media Widget Dialog Before…

… and after

We changed over a dozen dialogs in total. The dialogs now also work much better on mobile devices (such as the iPad) and Papyrs now looks much better in Internet Explorer 9. And as always, more improvements to come.