GDPR for Cloud Software Providers

Introduction

In 2016 the European Union created the General Data Protection Regulation (GDPR), and a few months from now, on May 25, 2018 to be precise, it will go into full effect. This regulation has sweeping implications for any EU business that deals with personal data and for any business outside of the EU that stores or processes data of EU citizens. Did you suddenly get emails from webapps that are going to delete your account after years of non-use? That’s right, that’s GDPR at work.

As a cloud intranet software provider the GDPR regulation affects everything we do, so we knew we had to start our preparations for compliance way in advance of the May 2018 deadline. In the past year we’ve been very busy and in this document we describe the measures and precautions we’ve had to take, and we reflect on some of the lesson we’ve learned.

We’ve created this document in the hope that it is of value to others that are grappling with the implications of GDPR. We found that it is easy to overlook complete areas that are affected by GDPR. And we got it relatively easy: we are a 100% digital cloud-based company. We don’t have filing cabinets filled with documents, or old computers stuck in dusty closets. The only thing we had to worry about is the software we wrote ourselves. And yet, becoming GDPR compliant was still a ton of work. So if we had to go through all this trouble as a small company with only a couple of products, we cannot imagine how big the impact of GDPR will be for larger companies that have many employees and products that go back decades. Suffice to say, GDPR is no joke.

Unlike most of our blogposts, this one is going to be a little more technical. With that said, let’s dive in!

Data Storage

The first and most critical thing to figure out is what kinds of data we store, and in which systems. We’ve been in business for 10 years which means that we have accumulated a lot of cruft. Old services, like backup systems, that have been replaced by newer services, old database tables, old diagnostics data, and so on. So first we had to make an exhaustive list of all places where data is stored, and then we had to figure out an appropriate GDPR strategy for each one. Following the YAGNI mantra, deletion was our solution of choice.

Our primary product is a drag&drop intranet SaaS webapp called Papyrs. Users can create pages, add file attachments, image galleries, comments, forms, and so on. Because we respect the privacy of our customers we don’t know if any of the data stored on our service by our customers is sensitive data protected by GDPR. Consequently we take the most conservative assumption: everything stored by our customers is critically confidential data, or PII (personally identifiable information) in GDPR parlance. This means that when a customer closes their account all their data has to be permanently incinerated.

That’s easier said than done. Papyrs has several types of data. We’ll briefly cover each type.

File uploads

We don’t use a storage service like Amazon S3 so we don’t have to worry about an orphaned bucket floating around somewhere. All files customers upload to Papyrs end up in the customer’s directory on our own CDN (mirrored using rsync). The metadata on each file (e.g. permissions, statistics, ownership) are stored in the database: every file has a corresponding database record. The database is transactional, but the filesystem is not. When an upload fails the database transaction is rolled back, but a partially uploaded or broken file may remain. In an ideal world the database and the filesystem would guaranteed to be in sync, but we have to make do with the abstractions we have. Partial uploads still get deleted eventually and it cannot lead to errors visible to the end user, but we still have to be mindful of the subtleties of dealing with the filesystem.

Papyrs dynamically resizes images as needed (to improve load times and to save bandwidth for mobile devices), which means that for every image uploaded to Papyrs there can be many cached thumbnails. The story here is similar to that of regular file uploads, except that we have to be extra careful to make sure all the right thumbnails get deleted.

The lesson here is to make sure you keep track of stray files, and that you don’t delete a row from a database until you’re absolutely sure all the corresponding files on the filesystem are gone. Otherwise it may too late to identify which files have to be deleted!


fish-no-gdprFishermen don’t have to worry about GDPR

Log files

All systems that make up the Papyrs service keep log files that contain information about system health, performance, errors and warnings, and other diagnostic information. Although typically these log files contain nothing even remotely sensitive or confidential, some information in log files (like IP addresses) may be considered PII under GDPR. Figuring out exactly which lines in which log files relate to which user on Papyrs is difficult and exactly the kind of privacy-invading work we don’t want to to. So instead we fall back on our pessimistic strategy: we treat all data, except for the data we know for sure is innocuous, as confidential. Our solution: automatically delete (rotate) all log files older than a few weeks. In practice we rarely needed diagnostic information older than that, anyway.

Here the Python RotatingFileHandler and TimedRotatingFileHandler modules are a great solution. Of course we also rely on Debian’s logrotate whenever possible.

But what to do with services that don’t listen to the SIGHUP signal sent by logrotate, and that don’t have any in-built way to rotate their own log files? Well, there is an easy trick:

$ tail -n 1000 logfile.log | sponge logfile.log

This truncates the file and puts the last 1000 lines back into the logfile. You can’t beat the simplicity, but if you want to cull the beginning of your log files this way you have to be mindful of one thing: while most logging systems with multiple writers use a mutex to synchronize writing to the same file, this bash hack doesn’t. So you may end up with a malformed line or two in your log file as two processes write to the same logfile simultaneously. If that’s unacceptable, you can instead create your own lazy log rotation by sticking this in a cron file:

$ cp logfile.log > logfile.log.1 && echo “” > logfile.log

The copy here is deliberate: you can’t move the file. This is because the services that are writing to the logfile will just keep writing to their open file handle, blithely unaware that the file has been moved, renamed, or even deleted. The logrotate option COPYTRUNCATE works like this too.

Cache files and other temporary data

Memcached, message queues, and other systems for transient storage can easily end up containing confidential data. Luckily, we didn’t have to do anything here. Everything we store in memcached expires quickly, so we don’t have to worry about stale data residing in an in-memory store. We made that decision a long time ago for performance reasons. Too many web applications end up relying on memcached for performance, so much so that when memcached resets the database and workers can no longer keep up with all the traffic. We didn’t want to worry about that, so our memcached data can be purged at any time. Everything in our message queue has a deterministic lifetime as well.

This is one of those cases where good architectural design choices end up paying off in unexpected ways! Always a nice surprise.

Full Text Search

We use Sphinx for full text search. Sphinx is terrific. Performance is great and the quality and ranking of the search results is superior. One problem is that we have one Sphinx database for all our user data, and a full text search database like Sphinx isn’t designed for quickly deleting a range of records. Re-creating the entire full text index isn’t an option, as that would take too long.

The only updates sphinx knows is “merge & swap”. A main index and a delta index are merged together into a new index and then the new index is swapped with the old one. The delta index can contain delete indicators. Then during the merge phase those rows with delete indicators get dropped from the main index.

Relational Database

Deleting user data from a database is either trivial or practically impossible, depending on what it means to delete. Files on a filesystem can be securely deleted if needed. Or you can simply create an encrypted directory and throw away the key to get rid of the data securely. But when rows are deleted from a database like postgres or mysql the data isn’t guaranteed to be gone from the hard drive. The records may persist in the binary logs used for database mirroring, or the data may simply persist as junk data inside the database, even after issuing a regular VACUUM.

How far do we want to go here? Do we have to issue a VACUUM FULL (which re-creates the entire database table) to securely delete data? Does that really provide additional guarantees when data may still persist on sectors of hard drives? We don’t know. The GDPR informs us that we have to “erase” data (including copies and replications thereof), but we have no idea what it actually means in the context of software. We think deletion to the point where we cannot recover the data — even if we wanted to — qualifies as real erasure.

I’m sure the fine folks who wrote this regulation thought carefully about what it really means to erase data. They’re just sparing us the technical details so we can experience the joy of figuring it out for ourselves.

Backups

Backups and GDPR are, on a fundamental level, diametrically opposed to each other. With a good backup solution you know that even if disaster strikes no user data is lost and the service can be entirely restored. To comply with GDPR a user has the “right to be forgotten” which means their data cannot be recovered and is lost forever. So how can we satisfy these conflicting goals? Clearly we cannot just destroy our backups, but we can’t keep customer data in perpetuity either.

Our solution: have backups rotate on a strict schedule, and delete old backups, including remote and offsite backups. The only difficulty is that this adds a delay to irrevocable data deletion. If, somehow, data ends up in a backup set that should already have been deleted, then it could take another 6 months after we fix the mistake for the data to be entirely gone from all backups.

There is no perfectly satisfactory solution here. For future products we’ll probably create backups strictly on a per-customer basis. Then we can easily backup/restore an individual customer’s data as needed, and if a customer exercises their GDPR “right to be forgotten” then we can immediately purge all their data from our systems.

Third party services

We don’t share any customer data with 3rd party services. We don’t share data with advertisers. We don’t use any software for sales lead generation. We like to build everything ourselves, and that served us well so far. The only 3rd party service we use is Pingdom, and they don’t have access to any confidential data of ours.

Data Export

GDPR gives customers the “right to data portability”. We already had a Papyrs Backup system in place, so we didn’t have to do anything here.

PCs, Laptops and Phones

Here we just have to take some common sense precautions. There shouldn’t be any sensitive customer data on our work devices, but old devices get securely wiped just in case. We use full disk encryption on all machines (macOS FileVault and TrueCrypt) and iPhones because they are encrypted. If somehow an old device ends up in the trash we can be confident no data leak will happen as a consequence.

Employee records

We don’t just have GDPR obligations toward our customers, but to employees and contractors as well. They, too, have a right to know what data we have on them, and they have a right to be forgotten after their employment has ended. There’s not much to say here, except that having a central place where all data is stored makes deletion a whole lot easier.

Customer Rights

So far we’ve mostly talked about the technical implications of GDPR. GDPR also specifies that our customers have a number of concrete rights: We have an obligation to tell customers what data we have on them and how we use it. We have an obligation to notify our customers when data leaks occur. Customers have a right to tell us not to process their data.

We can imagine these rights lead to a lot of difficulty for many businesses, but not for us. Thankfully we don’t need to collect any personal data from our customers except for a name and address for billing purposes, so we don’t. Our business model is really straightforward and we don’t need to engage in any shenanigans. Customers are free to come and go as they please, and they can take their data with them.


gdpr-landscapeAppreciation of the GDPR landscape

Conclusion

The main thing we learned in the past year while preparing for GDPR is that it’s much easier to design a new service with GDPR in mind than to retrofit GDPR compliance onto systems that were created many years ago. We are really lucky that we already stored 90% of our data in a structured way where every piece of data had a known owner (so we can easily figure out what to delete). However, we had to make sure that all data on our servers was accounted for, not 90%. If you have a pile of mystery data somewhere — even when nobody has access to it — that’s still a GDPR liability. As the saying goes, when you’re 90% done you still have 90% of the work ahead of you.

We hear that companies are only now starting with their GDPR preparations, but time is running out. GDPR compliance isn’t something that can be outsourced or something a small task force can do by itself, because GDPR affects affects customer support, sales, operations, marketing, HR, and every other part of your organization. GDPR affects all data, and data is everywhere.

If your organization has to be GDPR-ready in a few months and you haven’t started yet, I wish you luck :)

Thanks for reading!

Email as a platform

Google just announced Actions in the Inbox.

By adding some markup (Microdata or JSON-LD) to the HTML emails you send out, Gmail can now display quick action buttons next to your email.

Some might see this as simply another random extension, or a Gmail gimmick à la Gmail Labs extensions. We rather see it as part of a trend in which communication channels and devices are changed into platforms supporting an entire “ecosystem of apps”. Mobile phones used to be for calling, emails used to be for sending someone a text message (and I guess sunglasses used to be for just protecting your eyes :). Not anymore. All these things have turned into platforms with many apps running on top. All this added functionality makes it much more interesting for consumers, and more exciting for developers to hack new apps on top of. The companies behind them of course know that they can win the real battles by having the better ecosystem (would you switch to Windows Phone if it doesn’t have your favorite YouTube app, or to Hotmail if it doesn’t have your Movies Info app).

Many people would agree that email as it works today just isn’t good enough anymore. People try to use it for much more than it was originally intended. It’s no longer just a way to send people a message, it’s also a todo list, a CRM application, a help desk, a way to organize events, split bills, and much more. Because of this, the way email works desperately needs to be changed, at least extended.

One approach is to completely reinvent email as we know it, Email 2.0. We think this is a step in the wrong direction. One famous example of a very cool but failed attempt is Google Wave. It’s exciting to work on these projects, and of course developers love to completely rewrite and reinvent things, but let’s not forget email is already very popular and it’s hard to make everybody switch. It’s also not needed, which brings us to the other alternative: there are many ways email can be extended with new (open!) standards. It’s called Actions now, but it might look more like complete “email apps” in the future. We already know people want to use their email more effectively, and in fact using some clever hacks, some “apps” were already built and proven to be popular (Rapportive, for example). Plus, as a platform, it’s already used before as a “human API” to build services on (like sending an email to trigger some action like creating a blog post, an invoice, or – like with a weekend project of ours – create an online form).

One of the reasons email is so popular is because it just works, everywhere. So of course we need to deal with things like mobile views and graceful degradation to make sure that remains to be the case. But a player like Gmail is big enough to take the first step into allowing developers to build more apps around email, using new, open standards. And I believe that with us many other developers would be happy to write them, and make email much more powerful and effective than it is today.

Papyrs Blog

Just a quick note — to better structure our blog posts, we’ve launched a separate blog for Papyrs. If you’re looking for new Papyrs features announcements, tips and related intranet news: you can find it at blog.papyrs.com.

Improving intranet engagement

One of the main hurdles you face when setting up an intranet is user engagement. You have to get people to actively use the systems you put in place. Even if everybody is on the same page and believes that getting organized with an intranet will save time in the long run the motivation may not be there to explore new software and to change old habits.

You set up an intranet and then, after a couple of months everybody has reverted to their old style of working and the company is back at square one. Different versions of files are emailed back and forth like before. Every now and then important documents get lost. There is no clear overview of what’s happening in the organization and the company spent a bunch of money on an intranet solution that didn’t stick. This is a large (and very frequent) problem today.

We’ve been thinking a lot about this problem because we have an intranet product (Papyrs) and it works on a subscription basis. This implies that as long as companies actively use our intranet product they will renew their subscription and we make money. When employee engagement drops off and the intranet stops getting updated with new information the value of the intranet decreases rapidly and the subscription inevitably gets canceled.

This, by the way, is one of the big advantages of subscription software. The software company (that’s us in this case) only makes money for as long as value is provided to the customer. Given that aquiring new customers is much more expensive than keeping existing customers happy (typically by a factor 10) keeping our customers happy is absolutely crucial. This in stark contrast with old enterprise-style intranet packages where the big sale happens up front. After the software company cashes the check it doesn’t matter much whether the customer is still happy 6 months down the road. And so, unsurprisingly those intranet systems are often delivered over time, over budget and to make matters even worse, they match the needs of the people in the organization so poorly that they are then left unused.

So to summarize the problem: An intranet is only valuable when lots of information is stored on it and when this information is kept current. This in turn means that user engagement is vital. Otherwise the company will revert to their ineffective old ways and we’re back to square one.

Engagement

So what keeps people more involved with your intranet? We’re going to look at all these issues from our perspective (as the creators of Papyrs), but the insights apply to intranets generally, and to other forms of social software where user engagement is critical for success.

1. Let people get involved easily

  • User friendly interface
  • Straightforward functionality
  • Users shouldn’t be able to break anything

2. Keep everybody on the same page with email

  • Let people choose the subjects they get emailed about
  • Make it clear who will get email updates and when

3. Encourage everybody to contribute to the intranet

  • Set permissions liberally (your coworkers really aren’t going to vandalize pages)
  • Encourage everybody in your organization to make changes and improvements wherever they see fit

4. Make everything look inviting and appealing

  • Keep all pages organized with a sensible structure.
  • Split up pages that get too large
  • Add links to related pages

5. Fast universal search

  • The more data on your intranet the more important search becomes
  • Find-as-you type search helps a lot
  • It must be fast & reliable. Many intranet solutions have a search box that don’t always return results that you know exist

6. Access from mobile devices

  • So people can access important documents and discussions on the go

We designed Papyrs with the issues above in mind, so our users don’t have to worry at all about most the above. Of course it’s still up to our users to write quality content for the intranet and to keep everything organized but Papyrs does a lot of the heavy lifting for you.

Setting up an intranet for a business isn’t easy. If people refuse to use the intranet software the intranet will fail. If people can’t easily find important information or easily contribute to the internet the intranet will fail. A lot has to go right for an intranet to become a central activity hub for your organization. So when you create an intranet make sure to vet it on the aforementioned points. Or just take the easy road, and sign up for a free Papyrs trial.