Privacy-Preserving Key-Value Stores: Randomization in Practice
Modern data analytics systems are composed of two types of nodes: compute and storage (e.g., Amazon S3, Redis, MongoDB). The storage nodes typically offer a key-value interface and are often used to store data encoded in a columnar format (e.g., Parquet files). Due to growing data sizes in datacenters, there is increasing interest in using specialized hardware devices, namely Field-Programmable Gate Arrays (FPGAs), to implement the storage nodes in an energy-efficient manner; example projects include Caribou and BlueDBM. These works open up new opportunities in near-data processing, that is, executing part of the application logic inside the storage node, close to the data, where bandwidths are higher than on the network, thereby reducing data-movement bottlenecks.
In parallel with the effort of building energy-efficient storage nodes, privacy is playing an increasingly important role in the datacenter. Under stricter data-protection regulations, such as GDPR, companies must invest more in technology that allows them to enforce privacy rules in an automated fashion. One way to achieve this is through transparent “data perturbation”: masking the true values of data entries in use cases where the exact values are not strictly required, for instance when training an internal machine learning model on customer data.
In this project, the goal is to add a module to an existing FPGA-based key-value store that, upon receiving read requests from the client, replaces on the fly the values of a column in a Parquet file with randomized data drawn from the same distribution as the original column. The value distribution of each column is precomputed in the client library and stored as a histogram alongside the Parquet file. After implementing this module, the work can proceed in two possible directions: 1) gathering statistics on the data distributions inside the FPGA (as a column is written by the client), which would enable the storage node to provide this privacy-preserving functionality without any assistance from the client library, and 2) instead of replacing the data entirely with random values drawn from a given distribution, adding noise of a specific distribution to the data in order to prepare it for use in differential-privacy applications. Both research directions have a good chance of being published at an FPGA conference.
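To make the two randomization modes concrete, the following is a minimal Python sketch of the client-side logic they imply: precomputing a column's histogram, drawing replacement values from that histogram (mode matching the base project), and adding Laplace noise (the standard mechanism in differential privacy, matching direction 2). The function names are illustrative only and not part of any existing library; the FPGA module would implement the sampling step in hardware.

```python
import numpy as np

def build_histogram(column, bins=32):
    """Client side: precompute the value distribution of a column.

    Returns per-bin counts and bin edges, which would be stored
    alongside the Parquet file.
    """
    counts, edges = np.histogram(column, bins=bins)
    return counts, edges

def sample_like(counts, edges, n, rng=None):
    """Replacement mode: draw n values from the stored histogram, so the
    returned data follows the original distribution without exposing any
    original entry."""
    rng = rng or np.random.default_rng()
    # Pick bins proportionally to their counts, then draw uniformly
    # within each chosen bin.
    probs = counts / counts.sum()
    idx = rng.choice(len(counts), size=n, p=probs)
    return rng.uniform(edges[idx], edges[idx + 1])

def add_laplace_noise(column, sensitivity, epsilon, rng=None):
    """Noise mode (direction 2): perturb each value with Laplace noise of
    scale sensitivity/epsilon, as used in differential privacy."""
    rng = rng or np.random.default_rng()
    return column + rng.laplace(0.0, sensitivity / epsilon, size=len(column))
```

In this sketch the histogram replacement keeps only bin-level information, while the Laplace variant keeps each entry but bounds what any single record can reveal; which trade-off is appropriate depends on the application consuming the column.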
Skills needed: VHDL/Verilog coding, debugging FPGA projects (at least in simulation), ideally some Go and Python experience
Skills to be acquired: designing HW/SW systems, working with network-facing FPGA designs, possibly HLS coding