5.3. Semaphores and Mutexes
So let us look at how we
can
add locking to
scull. Our goal is to make our operations on the
scull data structure
atomic, meaning that the entire operation
happens at once as far as other threads of execution are concerned.
For our memory leak example, we need to ensure that if one thread
finds that a particular chunk of memory must be allocated, it has the
opportunity to perform that allocation before any other thread can
make that test. To this end, we must set up critical
sections: code that can be executed by only one thread at
any given time.
Not all
critical sections are the same, so
the kernel provides different primitives for different needs. In this
case, every access to the scull data structure
happens in process context as a result of a direct user request; no
accesses will be made from interrupt handlers or other asynchronous
contexts. There are no particular latency (response time)
requirements; application programmers understand that I/O requests
are not usually satisfied immediately. Furthermore, the
scull is not holding any other critical system
resource while it is accessing its own data structures. What all this
means is that if the scull driver goes to sleep
while waiting for its turn to access the data structure, nobody is
going to mind.
"Go to sleep" is a well-defined
term in this context. When a Linux process reaches a point where it
cannot make any further processes, it goes to sleep (or
"blocks"), yielding the processor
to somebody else until some future time when it can get work done
again. Processes often sleep when waiting for I/O to complete. As we
get deeper into the kernel, we will encounter a number of situations
where we cannot sleep. The write method in
scull is not one of those situations, however.
So we can use a locking mechanism that might cause the process to
sleep while waiting for access to the critical section.
Just as importantly, we will be performing an operation (memory
allocation with kmalloc) that could
sleep—so
sleeps
are a possibility in any case. If our critical sections are to work
properly, we must use a locking primitive that works when a thread
that owns the lock sleeps. Not all locking mechanisms can be used
where sleeping is a possibility (we'll see some that
don't later in this chapter). For our present needs,
however, the mechanism that fits best is a
semaphore.
Semaphores are a well-understood concept in computer science. At its
core, a semaphore is a single integer value combined with a pair of
functions
that are typically called P and
V. A process wishing to enter a critical section
will call P on the relevant semaphore; if the
semaphore's value is greater than zero, that value
is decremented by one and the process continues. If, instead, the
semaphore's value is 0 (or less),
the process must wait until somebody else releases the semaphore.
Unlocking
a semaphore is accomplished by calling V; this
function increments the value of the semaphore and, if necessary,
wakes up processes that are waiting.
When semaphores are used for mutual
exclusion—keeping multiple processes from running
within a critical section simultaneously—their value will be
initially set to 1. Such a semaphore can be held
only by a single process or thread at any given time. A semaphore
used in this mode is sometimes called a mutex,
which is, of course, an abbreviation for "mutual
exclusion." Almost all semaphores found in the Linux
kernel are used for mutual exclusion.
5.3.1. The Linux Semaphore Implementation
The Linux kernel
provides
an implementation of semaphores that conforms to the above semantics,
although the terminology is a little different. To use semaphores,
kernel code must include
<asm/semaphore.h>. The relevant type is
struct semaphore; actual
semaphores can be declared and initialized in a few ways. One is to
create a semaphore directly, then set it up with
sema_init:
void sema_init(struct semaphore *sem, int val);
where val is the initial value to assign to a
semaphore.
Usually, however, semaphores are used in a mutex mode. To make this
common case a little easier, the kernel has provided a set of helper
functions and macros. Thus, a mutex can be declared and initialized
with one of the following:
DECLARE_MUTEX(name);
DECLARE_MUTEX_LOCKED(name);
Here, the result is a semaphore variable (called
name) that is initialized to 1
(with DECLARE_MUTEX)
or
0 (with DECLARE_MUTEX_LOCKED).
In the latter case, the mutex starts out in a locked state; it will
have to be explicitly unlocked before any thread will be allowed
access.
If the mutex must be initialized at runtime (which is the case if it
is allocated dynamically, for example), use one of the following:
void init_MUTEX(struct semaphore *sem);
void init_MUTEX_LOCKED(struct semaphore *sem);
In the Linux world, the P function is called
down—or some variation of that name. Here,
"down" refers to the fact that the
function decrements the value of the semaphore and, perhaps after
putting the caller to sleep for a while to wait for the semaphore to
become available, grants access to the protected resources. There are
three versions of down:
void down(struct semaphore *sem);
int down_interruptible(struct semaphore *sem);
int down_trylock(struct semaphore *sem);
down decrements
the
value of the semaphore and waits as long as need be.
down_interruptible does the same, but the
operation is interruptible. The interruptible version is almost
always the one you will want; it allows a user-space process that is
waiting on a semaphore to be interrupted by the user. You do not, as
a general rule, want to use noninterruptible operations unless there
truly is no alternative. Non-interruptible operations are a good way
to create unkillable processes (the dreaded "D
state" seen in ps), and annoy
your users. Using down_interruptible requires
some extra care, however, if the operation is interrupted, the
function returns a nonzero value, and the caller does
not hold the semaphore. Proper use of
down_interruptible requires always checking the
return value and responding accordingly.
The final version (down_trylock) never sleeps;
if the semaphore is not available at the time of the call,
down_trylock returns immediately with a nonzero
return value.
Once a thread has successfully called one of the versions of
down, it is said to be
"holding" the semaphore (or to have
"taken out" or
"acquired" the semaphore). That
thread is now entitled to access the critical section protected by
the semaphore. When the operations requiring mutual exclusion are
complete, the semaphore must be returned. The Linux equivalent to
V is up:
void up(struct semaphore *sem);
Once up has been called,
the caller no longer holds the semaphore.
As you would expect, any thread that takes out a semaphore is
required to release it with one (and only one) call to
up. Special care is often required in error
paths; if an error is encountered while a semaphore is held, that
semaphore
must be released before returning the error status to the caller.
Failure to free a semaphore is an easy error to make; the result
(processes hanging in seemingly unrelated places) can be hard to
reproduce and track down.
5.3.2. Using Semaphores in scull
The semaphore mechanism gives scull
a
tool that can be used to avoid race conditions while accessing the
scull_dev data structure. But it is up to us to
use that tool correctly. The keys to proper use of locking primitives
are to specify exactly which resources are to be protected and to
make sure that every access to those resources uses the proper
locking. In our example driver, everything of interest is contained
within the scull_dev structure, so that is the
logical scope for our locking regime.
Let's look again at that structure:
struct scull_dev {
struct scull_qset *data; /* Pointer to first quantum set */
int quantum; /* the current quantum size */
int qset; /* the current array size */
unsigned long size; /* amount of data stored here */
unsigned int access_key; /* used by sculluid and scullpriv */
struct semaphore sem; /* mutual exclusion semaphore */
struct cdev cdev; /* Char device structure */
};
Toward the bottom of the structure is a member called
sem which is, of course, our semaphore. We have
chosen to use a separate semaphore for each virtual
scull device. It would have been equally correct
to use a single, global semaphore. The various
scull devices share no resources in common,
however, and there is no reason to make one process wait while
another process is working with a different
scull device. Using a separate semaphore for
each device allows operations on different devices to proceed in
parallel and, therefore, improves performance.
Semaphores must be initialized before use. scull
performs this initialization at load time in this loop:
for (i = 0; i < scull_nr_devs; i++) {
scull_devices[i].quantum = scull_quantum;
scull_devices[i].qset = scull_qset;
init_MUTEX(&scull_devices[i].sem);
scull_setup_cdev(&scull_devices[i], i);
}
Note that the semaphore must be initialized
before the scull device is
made available to the rest of the system. Therefore,
init_MUTEX is called before
scull_setup_cdev. Performing these operations in
the opposite order would create a race condition where the semaphore
could be accessed before it is ready.
Next, we must go through the code and make sure that no accesses to
the scull_dev data structure are made without
holding the semaphore. Thus, for example,
scull_write begins with this code:
if (down_interruptible(&dev->sem))
return -ERESTARTSYS;
Note the check on the return value of
down_interruptible; if it returns nonzero, the
operation was interrupted. The usual thing to do in this situation is
to return -ERESTARTSYS. Upon seeing this return
code, the higher layers of the kernel will either restart the call
from the beginning or return the error to the user. If you return
-ERESTARTSYS, you must first undo any user-visible
changes that might have been made, so that the right thing happens
when the system call is retried. If you cannot undo things in this
manner, you should return -EINTR instead.
scull_write must release the semaphore whether
or not it was able to carry out its other tasks successfully. If all
goes well, execution falls into the final few lines of the function:
out:
up(&dev->sem);
return retval;
This code frees the semaphore and returns whatever status is called
for. There are several places in scull_write
where things can go wrong; these include memory allocation failures
or a fault while trying to copy data from user space. In those cases,
the code performs a goto out, ensuring that the
proper cleanup is done.
5.3.3. Reader/Writer Semaphores
Semaphores perform mutual
exclusion for all callers, regardless of what each thread may want to
do. Many tasks break down into two distinct types of work, however:
tasks that only need to read the protected data structures and those
that must make changes. It is often possible to allow multiple
concurrent readers, as long as nobody is trying to make any changes.
Doing so can optimize performance significantly; read-only tasks can
get their work done in parallel without having to wait for other
readers to exit the critical section.
The Linux kernel provides a special type of semaphore called a
rwsem (or "reader/writer
semaphore") for this situation. The use of
rwsems in drivers is relatively
rare, but they are occasionally useful.
Code using rwsems must include
<linux/rwsem.h>. The relevant data type
for reader/writer semaphores is struct
rw_semaphore; an rwsem must be explicitly
initialized at runtime with:
void init_rwsem(struct rw_semaphore *sem);
A newly initialized rwsem is available for the next task (reader or
writer) that comes along. The interface for code needing read-only
access is:
void down_read(struct rw_semaphore *sem);
int down_read_trylock(struct rw_semaphore *sem);
void up_read(struct rw_semaphore *sem);
A call to down_read provides read-only access to
the protected resources, possibly concurrently with other readers.
Note that down_read may put the calling process
into an uninterruptible sleep. down_read_trylock
will not wait if read access is unavailable; it returns nonzero if
access was granted, 0 otherwise. Note that the
convention for down_read_trylock differs from
that of most kernel functions, where success is indicated by a return
value of 0. A
rwsem obtained with
down_read must eventually be freed with
up_read.
The interface for writers is similar:
void down_write(struct rw_semaphore *sem);
int down_write_trylock(struct rw_semaphore *sem);
void up_write(struct rw_semaphore *sem);
void downgrade_write(struct rw_semaphore *sem);
down_write,
down_write_trylock, and
up_write all behave just like their reader
counterparts, except, of course, that they provide write access. If
you have a situation where a writer lock is needed for a quick
change, followed by a longer period of read-only access, you can use
downgrade_write to allow other readers in once
you have finished making changes.
An rwsem allows either one writer or an unlimited number of readers
to hold the semaphore. Writers get priority; as soon as a writer
tries to enter the critical section, no readers will be allowed in
until all writers have completed their work. This implementation can
lead to reader starvation—where readers
are denied access for a long time—if you have a large number of
writers contending for the semaphore. For this reason, rwsems are
best used when write access is required only rarely, and writer
access is held for short periods of time.
|