Name: nss_memcache
Author: Will Drewry <will@alum.bu.edu>
Copyright: 2006 Will Drewry
License: GPLv2
Date: 23 Oct 2006

1. Summary
2. Rationale
   1. Background
   2. So why nss_memcache?
3. Use cases
4. Scope
5. Design
   1. NSS module
   2. Caching daemon
6. Implementation
   1. Components
   2. Plan
   3. Code
   4. Security Concerns
7. Status
8. Future work

Summary
=======

nss_memcache is an NSS (Name Service Switch) module that caches responses
sourced from any other modules used on the system.  The cached data is stored
in a local memcached instance and is separated by user.


Rationale
=========


Background
----------

The NSS system is an approach for cleanly extending name services using
modules.  It is used in both GNU-based systems and on Solaris.  Services using
the NSS system are often handled using modules in a first-in-first-out
sense.  If the required data is resolvable by the first listed module (in
/etc/nsswitch.conf), that data will be returned from that module and all other
modules will be skipped.  

In general, the FIFO-approach works quite well, but it gets complicated when
external name resolution systems get involved.  When the NSS modules are mapped
into a process's memory and resolution takes place, a distinct connection is
made to the external system per-process and, at worst, per-request.  Because of
this, it is highly desirable to add some form of caching.

Enter `nscd', the name service caching daemon.  nscd is the solution to this
problem.  It caches module data. It is multi-threaded. It fails a lot.

nscd works outside the NSS system design by intercepting data via a UNIX
socket.  Not only does it not follow the general design, it is also rather
complex.  Problems on both Linux and Solaris have been seen for years.


So why nss_memcache?
--------------------

With the advent of easy-to-use thread local storage (TLS), it is easier to
write a caching solution that works within the NSS system itself.  It becomes
easy for the systems administrator to select which services to cache within
nsswitch.conf instead of a separate config.  In addition, it allows for a
cleaner separation of caching code and the main NSS system.


Use cases
=========

* User A works in a large environment that relies on nss_ldap for
  passwd and group data.  User A types "ls -la" on the prompt but currently has
  to wait for each nss_ldap lookup even though most files are owned by
  "root" or "usera". (rel1)

* User B administers a small shell server which relies on a postgresql database
  using nss_pgsql and uses nscd to avoid constant lookups. Currently, nscd
  fails, blocking the progress of many processes on the system. Users are
  complaining. (rel1)

* User C administers a high load internal mail server which uses nss_mysql and
  nscd. When mysql blocks or nscd hangs, mail gets delayed or even rejected.
  User C is considering moving to nss_db but doesn't want to deal with the
  fuss. (rel1)

* User D works in a large environment that relies on nss_mysql for
  passwd data, and attempts to use bash username completion (cd ~<TAB>).
  User D doesn't want to get coffee while waiting for getpwent() to iterate.
  (rel2: per-user, controlled cache dumps)

* User E runs a large scale install of OpenLDAP for account management and is
  tired of setting up replication slaves. User E would prefer a limited number
  of slaves and more automatically configured caches. (rel3: kerberized remote caches)


Scope
=====

The scope of this project is to supply a caching subsystem similar to "nscd"
but without relying on any special libc hooks.  Ideally, it should be a stable,
robust drop-in replacement for "nscd".  Future work may expand this scope to
consider a distributed caching subsystem.

Design
======

The nss_memcache project consists of two components:
- NSS system module: nss_memcache
- A caching daemon


NSS module: nss_memcache
------------------------

The nss_memcache component has several constraints which must be met:
- must work within the NSS system
- thread-safe
- will not block on failure

The largest difference between nss_memcache and nscd is that nss_memcache will
not rely on libc to provide a special interface for caching.  Like nscd, there
is a caching daemon, as described below, but unlike nscd, there is an NSS
component.  In particular, the module not only performs lookups against the
cache daemon, it also performs updates.  This is done by falling through to the
other NSS modules on the system, such as NIS or LDAP.  Once data is returned, the
cache daemon is updated and the value is returned to the caller.

The basic algorithm for this module is as follows:
1. Receive request from client code
2. Query cache daemon for match
3. On match, return
4. On no match, call the same function that was called, e.g. getpwuid()
5. When the call reaches nss_memcache again, return unavailable
6. If the call returns a match, update the cache server and return
7. If not, update the cache server with a negative match and return
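
The steps above can be sketched in C roughly as follows.  This is only an
illustration under stated assumptions, not the project's code: a small
in-memory table stands in for the cache daemon, resolver_lookup() stands in for
the libc entry point (such as getpwuid()) that walks the modules in order, and
every name below, including the status enum, is hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum status { ST_SUCCESS, ST_NOTFOUND, ST_UNAVAIL };

/* Hypothetical stand-in for the cache daemon: a tiny in-memory table. */
struct entry { int used; unsigned uid; int negative; char name[32]; };
static struct entry cache[16];
static int backend_calls;           /* counts fall-throughs, for illustration */

/* Per-thread flag, set while this thread is calling back out (step 5).
 * __thread is a GCC/Clang extension. */
static __thread int in_fallthrough;

static struct entry *cache_get(unsigned uid)
{
    for (size_t i = 0; i < sizeof cache / sizeof cache[0]; i++)
        if (cache[i].used && cache[i].uid == uid)
            return &cache[i];
    return NULL;
}

/* name == NULL records a negative match. */
static void cache_set(unsigned uid, const char *name)
{
    if (name != NULL && strlen(name) >= sizeof cache[0].name)
        return;                     /* too long to cache in this sketch */
    for (size_t i = 0; i < sizeof cache / sizeof cache[0]; i++) {
        if (!cache[i].used) {
            cache[i].used = 1;
            cache[i].uid = uid;
            cache[i].negative = (name == NULL);
            snprintf(cache[i].name, sizeof cache[i].name, "%s",
                     name ? name : "");
            return;
        }
    }
    /* table full: skip caching rather than fail the lookup */
}

static enum status module_lookup(unsigned uid, char *out, size_t outlen);

/* Stand-in for the libc entry point: tries the cache module first, then a
 * fake backend that only knows uid 0 and uid 1000. */
static enum status resolver_lookup(unsigned uid, char *out, size_t outlen)
{
    enum status st = module_lookup(uid, out, outlen);
    if (st != ST_UNAVAIL)
        return st;                  /* the cache module answered */
    backend_calls++;
    if (uid == 0 || uid == 1000) {
        snprintf(out, outlen, "%s", uid == 0 ? "root" : "alice");
        return ST_SUCCESS;
    }
    return ST_NOTFOUND;
}

/* The cache module itself, following steps 1-7. */
static enum status module_lookup(unsigned uid, char *out, size_t outlen)
{
    struct entry *e;
    enum status st;

    assert(out != NULL && outlen > 0);
    if (in_fallthrough)
        return ST_UNAVAIL;          /* step 5: let the next module answer */
    e = cache_get(uid);             /* step 2 */
    if (e != NULL) {                /* step 3, including negative matches */
        if (e->negative)
            return ST_NOTFOUND;
        snprintf(out, outlen, "%s", e->name);
        return ST_SUCCESS;
    }
    in_fallthrough = 1;             /* step 4: re-enter the same function */
    st = resolver_lookup(uid, out, outlen);
    in_fallthrough = 0;
    if (st == ST_SUCCESS)
        cache_set(uid, out);        /* step 6 */
    else if (st == ST_NOTFOUND)
        cache_set(uid, NULL);       /* step 7 */
    return st;
}
```

Calling resolver_lookup() twice for the same uid should reach the fake backend
only once; the second call is answered from the table, whether the first
result was a match or a negative match.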

This approach introduces challenges around proper locking and thread-safe
behavior.  In particular, a call back into the same function, e.g. getpwuid(),
should fall through only if a fallthrough is indeed in progress, but in a
threaded environment this cannot be taken for granted.  To get around this
problem, a static thread-local "fallthrough" variable is used to note when the
called function has fallen through to the other NSS modules.  Updates to this
thread-local variable are then made under libc_locks.  The locks ensure that
user-level threaded applications, which do not benefit from thread-local
storage, do not cause unexpected updates.  However, this means that user-level
threaded applications may not benefit as much from nss_memcache.  (More
attention must be paid to this.)
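
As a small illustration of why a thread-local flag helps with kernel threads,
the sketch below sets a "fallthrough" flag in one thread and samples it from
another.  It assumes POSIX threads and the GCC/Clang __thread extension; the
function names are hypothetical, and it does not cover the user-level
threading case or the libc locks mentioned above.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* One copy of this flag exists per kernel thread. */
static __thread int fallthrough;

/* Runs in a second thread: reports the flag value that thread sees. */
static void *read_flag(void *arg)
{
    assert(arg != NULL);
    *(int *)arg = fallthrough;
    return NULL;
}

/* Sets the flag in the calling thread, then samples it from a new thread.
 * Returns the value the other thread saw, or -1 if no thread could be
 * created. */
static int flag_seen_by_other_thread(void)
{
    pthread_t t;
    int seen = -1;

    fallthrough = 1;                /* this thread is "falling through" */
    if (pthread_create(&t, NULL, read_flag, &seen) != 0) {
        fallthrough = 0;
        return -1;
    }
    pthread_join(t, NULL);
    fallthrough = 0;
    return seen;
}
```

The second thread is expected to see 0, so a lookup it makes at the same time
would still consult the cache instead of being treated as a fallthrough.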

With the first two constraints addressed, the third is somewhat
straightforward.  nss_memcache will avoid blocking program execution on failure
primarily through the use of non-blocking I/O.  The only acceptable place for
nss_memcache to block is on callbacks into other NSS modules.
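
One way this could look in C is sketched below, assuming a UNIX stream socket
and poll() with a timeout.  The helper names and the idea of treating every
connection failure as a cache miss are assumptions made for illustration, not
the project's actual interface.

```c
#include <assert.h>
#include <fcntl.h>
#include <poll.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Returns a connected, non-blocking fd, or -1 if the daemon cannot be
 * reached right away.  The caller treats -1 as "cache unavailable". */
static int cache_connect(const char *path)
{
    struct sockaddr_un addr;
    size_t n;
    int fd, flags;

    assert(path != NULL);
    n = strlen(path);
    if (n >= sizeof addr.sun_path)
        return -1;
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fd);
        return -1;
    }
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    memcpy(addr.sun_path, path, n + 1);
    if (connect(fd, (struct sockaddr *)&addr, sizeof addr) != 0) {
        close(fd);                  /* missing, refused or busy: give up */
        return -1;
    }
    return fd;
}

/* Waits at most timeout_ms for data, then reads once.
 * Returns the byte count, 0 on end of file, or -1 on timeout or error. */
static ssize_t cache_read(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd p;

    p.fd = fd;
    p.events = POLLIN;
    p.revents = 0;
    if (poll(&p, 1, timeout_ms) <= 0)
        return -1;
    return read(fd, buf, len);
}
```

With helpers like these, a dead or slow cache daemon costs the caller at most
one failed connect or one short timeout before the lookup falls through to the
other modules.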

With this design in mind, nss_memcache should be used ahead of all modules to
be cached on the system, e.g.:

  passwd: memcache [NOTFOUND=return] ldap nis compat


Caching daemon: memcached
-------------------------

As the name of the project indicates, memcached was chosen as the caching
daemon for use with nss_memcache. Initially, this project was to be called
nss_cache and have a custom caching daemon written around libevent.

Enter memcache.  memcache is a generic object caching system written around
libevent.  Not only does it meet the basic criteria of needing a fast,
single-threaded caching daemon, it also already supports UNIX domain sockets
and has a simple protocol and several existing APIs for multiple languages. In
addition, memcache has been used behind production web services, like
LiveJournal, for several years.  It is actively supported and has a good track
record of robustness.

memcache is not perfect, however.  It lacks any awareness of users and has no
way to enforce security.  This limitation, however, is easy enough to overcome.
The memcache daemon will be extended to retrieve peer credentials (e.g. using
SO_PEERCRED) over UNIX sockets.  This extension will allow the separation of
user cache spaces with a minimal amount of added work.  When nss_memcache
connects to memcached over the socket, memcached will check the peer's
credentials and prefix the key given in each command (get, set, etc.) with the
user's uid number.
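
For reference, retrieving peer credentials on Linux could look like the sketch
below.  SO_PEERCRED and struct ucred are Linux-specific (other systems offer
different mechanisms), and peer_uid() is a hypothetical helper name, not part
of memcached.

```c
#define _GNU_SOURCE                 /* exposes struct ucred in glibc */
#include <assert.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Stores the uid of the process at the other end of a UNIX socket.
 * Returns 0 on success, -1 on failure. */
static int peer_uid(int fd, uid_t *uid)
{
    struct ucred cred;
    socklen_t len = sizeof cred;

    assert(uid != NULL);
    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) != 0)
        return -1;
    *uid = cred.uid;
    return 0;
}
```

The daemon would call something like this once per accepted connection and
keep the uid alongside the connection state, so that each later command can be
prefixed without trusting anything the client sends.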

All "get", "set", "incr", and "decr" requests over the UNIX socket will have
the key prefixed with the caller's uid.  This is true in every case except when
the uid number is 0. In that case, no prefix will be enforced.  This will
allow the root user to easily prepopulate caches or even forcibly clear
user-specific caches.  In addition, "get" requests using user-separation will
first check for a non-prefixed key set by the superuser prior to searching for
a user-specific key.  (This functionality may be offloaded to nss_memcache, but
how is still unclear.)
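
The prefixing and lookup order described above might be expressed as follows.
This is a sketch only: the "<uid>:key" form follows the notation used under
Future work, while the function names, the KEY_MAX constant and the example
key "passwd:alice" are assumptions made for illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>

#define KEY_MAX 251                 /* memcached keys: 250 bytes, plus NUL */

/* Writes the key as the daemon would store it: "<uid>:key" for ordinary
 * users, the bare key for uid 0.  Returns 0, or -1 if it does not fit. */
static int effective_key(uid_t uid, const char *key, char *out, size_t outlen)
{
    int n;

    assert(key != NULL && out != NULL);
    if (strlen(key) == 0)
        return -1;
    if (uid == 0)
        n = snprintf(out, outlen, "%s", key);
    else
        n = snprintf(out, outlen, "%lu:%s", (unsigned long)uid, key);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Fills keys[] in the order a "get" would try them: the root-owned,
 * non-prefixed key first, then the user-specific one.
 * Returns the number of candidates, or -1 on error. */
static int get_candidates(uid_t uid, const char *key, char keys[2][KEY_MAX])
{
    if (effective_key(0, key, keys[0], KEY_MAX) != 0)
        return -1;
    if (uid == 0)
        return 1;
    if (effective_key(uid, key, keys[1], KEY_MAX) != 0)
        return -1;
    return 2;
}
```

For uid 1000 and the key "passwd:alice", the candidates would be
"passwd:alice" followed by "1000:passwd:alice"; for uid 0 only the bare key is
tried.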


Implementation
==============

The entire project will be written in C with an eye towards readability and
extensibility.

Components
----------

The memcached patches should be as simple and unobtrusive as possible.  It is
highly desirable to have them accepted upstream by the maintainers of
memcached.

nss_memcache should be templated as much as possible.  Given that memcache
accepts arbitrary objects, it seems that the largest challenge will be coming
up with ways to marshal the data in a generic fashion.
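
As one possible starting point, a passwd entry could be flattened into a single
buffer as sketched below.  The layout (two 32-bit ids followed by four
NUL-terminated strings) and the function names are assumptions, and the
password field is deliberately left out, in line with the security concerns
listed later.

```c
#include <assert.h>
#include <pwd.h>
#include <stdint.h>
#include <string.h>

/* Assumed layout: uint32 uid, uint32 gid (host byte order, local use only),
 * then name, gecos, dir, shell as NUL-terminated strings. */

/* Returns the number of bytes written, or 0 if buf is too small. */
static size_t pw_pack(const struct passwd *pw, char *buf, size_t buflen)
{
    const char *s[4];
    uint32_t ids[2];
    size_t need = sizeof ids;
    char *p;

    assert(pw != NULL && buf != NULL);
    s[0] = pw->pw_name; s[1] = pw->pw_gecos;
    s[2] = pw->pw_dir;  s[3] = pw->pw_shell;
    ids[0] = (uint32_t)pw->pw_uid;
    ids[1] = (uint32_t)pw->pw_gid;
    for (int i = 0; i < 4; i++)
        need += strlen(s[i]) + 1;
    if (need > buflen)
        return 0;
    memcpy(buf, ids, sizeof ids);
    p = buf + sizeof ids;
    for (int i = 0; i < 4; i++) {
        size_t n = strlen(s[i]) + 1;
        memcpy(p, s[i], n);
        p += n;
    }
    return need;
}

/* Points the string fields of *pw into buf, which must outlive *pw.
 * Returns 0 on success, -1 if the buffer is truncated or malformed. */
static int pw_unpack(char *buf, size_t len, struct passwd *pw)
{
    static char no_passwd[] = "x";  /* passwords are never cached */
    uint32_t ids[2];
    char *s[4], *p, *end;

    if (len < sizeof ids)
        return -1;
    memcpy(ids, buf, sizeof ids);
    p = buf + sizeof ids;
    end = buf + len;
    for (int i = 0; i < 4; i++) {
        char *nul = memchr(p, '\0', (size_t)(end - p));
        if (nul == NULL)
            return -1;
        s[i] = p;
        p = nul + 1;
    }
    pw->pw_uid = ids[0];
    pw->pw_gid = ids[1];
    pw->pw_name = s[0];
    pw->pw_passwd = no_passwd;
    pw->pw_gecos = s[1];
    pw->pw_dir = s[2];
    pw->pw_shell = s[3];
    return 0;
}
```

A group entry would need a variable-length member list on top of this, which
is where a more generic, templated scheme starts to pay off.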

The libmemcache API was originally slated to be used in nss_memcache, but due
to a high number of libmemcache-induced bugs, it has been set aside. Instead, a
very lightweight memcache C API will be used.
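
To give a sense of how small such an API can be, the sketch below formats
"get" and "set" commands for memcached's text protocol.  The command syntax is
the standard one; the function names and buffer conventions are hypothetical
and are not taken from the project.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Formats "get <key>\r\n".  Returns the command length, or -1 if it does
 * not fit in out. */
static int mc_format_get(char *out, size_t outlen, const char *key)
{
    int n;

    assert(out != NULL && key != NULL);
    n = snprintf(out, outlen, "get %s\r\n", key);
    return (n < 0 || (size_t)n >= outlen) ? -1 : n;
}

/* Formats "set <key> <flags> <exptime> <bytes>\r\n<data>\r\n".
 * The result may contain binary data and is not NUL-terminated.
 * Returns the total length, or -1 if it does not fit in out. */
static int mc_format_set(char *out, size_t outlen, const char *key,
                         unsigned flags, unsigned exptime,
                         const void *data, size_t datalen)
{
    int n;

    assert(out != NULL && key != NULL && data != NULL);
    n = snprintf(out, outlen, "set %s %u %u %zu\r\n",
                 key, flags, exptime, datalen);
    if (n < 0 || (size_t)n + datalen + 2 > outlen)
        return -1;
    memcpy(out + n, data, datalen);
    memcpy(out + n + datalen, "\r\n", 2);
    return n + (int)datalen + 2;
}
```

A successful "set" is answered with "STORED\r\n", and a "get" hit with
"VALUE <key> <flags> <bytes>\r\n", the data block and "END\r\n", so the reply
parser can stay similarly small.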


Plan
----

The initial goals of this project are to create a robust drop-in for passwd and
group caching. Once this goal is achieved, this project will be showcased to a
few potential test users for feedback and comments.

From there, the code will be polished further and any future enhancements, such as
distributed (kerberized) caching, will be added.

Code
----

This code will live on code.google.com for now.


Security Concerns
-----------------

- abuse
  - one user claiming all the memcache daemon's allocated memory: ???
    - may be solvable with user bucketing?
  - user cache tainting: memcached should separate by UID #
- sensitive data: avoid caching passwords
- bad code: :-(


Status
======
- proof-of-concept nss_memcache using TLS and libc locks
  (may be able to ditch the libc locks since TLS is in use...maybe)
- nss_memcache getpwuid_r() working
- libmemcache sucks - need to replace with libmemcache-lite


Future work
===========
Now:
- Make flags 32-bit and hold the UID of the caller ??
- Make nss_memcache do two lookups per req (0,getuid())
- Add key pattern match *get() to memcache()
- Make lookups (get, bget) check a master entry then a user-specific entry:
  e.g. client requests should first check root-owned (non-prefixed) "key"
       then lookup "<uid>:key".

Future:
- libmemcache-lite -- add hashing, auth, etc
- nss_dmemcache:
  - network based memcache with buckets for distributed servers
  - kerberos or ssh-agent used for user cache separation on updates/lookups
  - admin credentials can push "master" value updates
    - replace LDAP replication with server-per-office plus nss_kmemcache servers