aboutsummaryrefslogtreecommitdiff
path: root/docs/devel
diff options
context:
space:
mode:
authorPaolo Bonzini <pbonzini@redhat.com>2017-06-06 16:46:26 +0200
committerPaolo Bonzini <pbonzini@redhat.com>2017-06-07 18:22:03 +0200
commitac06724a715864942e2b5e28f92d5d5421f0a0b0 (patch)
tree8eeb9a6aeff09669b65573b1d856426cdf87d8bd /docs/devel
parent90bb0c04214545beb75044a2742f711335103269 (diff)
docs: create config/, devel/ and spin/ subdirectories
Developer documentation should be its own manual. As a start, move all developer-oriented files to a separate directory. Also move non-text files to their own directories: docs/config/ for QEMU -readconfig input, and docs/spin/ for formal models to be used with the SPIN model checker. Reviewed-by: Daniel P. Berrange <berrange@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Diffstat (limited to 'docs/devel')
-rw-r--r--docs/devel/atomics.txt388
-rw-r--r--docs/devel/bitmaps.md505
-rw-r--r--docs/devel/blkdebug.txt162
-rw-r--r--docs/devel/blkverify.txt69
-rw-r--r--docs/devel/build-system.txt512
-rw-r--r--docs/devel/lockcnt.txt277
-rw-r--r--docs/devel/memory.txt316
-rw-r--r--docs/devel/migration.txt555
-rw-r--r--docs/devel/multi-thread-tcg.txt350
-rw-r--r--docs/devel/multiple-iothreads.txt137
-rw-r--r--docs/devel/qapi-code-gen.txt1310
-rw-r--r--docs/devel/rcu.txt390
-rw-r--r--docs/devel/tracing.txt442
-rw-r--r--docs/devel/virtio-migration.txt108
-rw-r--r--docs/devel/writing-qmp-commands.txt607
15 files changed, 6128 insertions, 0 deletions
diff --git a/docs/devel/atomics.txt b/docs/devel/atomics.txt
new file mode 100644
index 0000000000..3ef5d85b1b
--- /dev/null
+++ b/docs/devel/atomics.txt
@@ -0,0 +1,388 @@
+CPUs perform independent memory operations effectively in random order.
+but this can be a problem for CPU-CPU interaction (including interactions
+between QEMU and the guest). Multi-threaded programs use various tools
+to instruct the compiler and the CPU to restrict the order to something
+that is consistent with the expectations of the programmer.
+
+The most basic tool is locking. Mutexes, condition variables and
+semaphores are used in QEMU, and should be the default approach to
+synchronization. Anything else is considerably harder, but it's
+also justified more often than one would like. The two tools that
+are provided by qemu/atomic.h are memory barriers and atomic operations.
+
+Macros defined by qemu/atomic.h fall in three camps:
+
+- compiler barriers: barrier();
+
+- weak atomic access and manual memory barriers: atomic_read(),
+ atomic_set(), smp_rmb(), smp_wmb(), smp_mb(), smp_mb_acquire(),
+ smp_mb_release(), smp_read_barrier_depends();
+
+- sequentially consistent atomic access: everything else.
+
+
+COMPILER MEMORY BARRIER
+=======================
+
+barrier() prevents the compiler from moving the memory accesses either
+side of it to the other side. The compiler barrier has no direct effect
+on the CPU, which may then reorder things however it wishes.
+
+barrier() is mostly used within qemu/atomic.h itself. On some
+architectures, CPU guarantees are strong enough that blocking compiler
+optimizations already ensures the correct order of execution. In this
+case, qemu/atomic.h will reduce stronger memory barriers to simple
+compiler barriers.
+
+Still, barrier() can be useful when writing code that can be interrupted
+by signal handlers.
+
+
+SEQUENTIALLY CONSISTENT ATOMIC ACCESS
+=====================================
+
+Most of the operations in the qemu/atomic.h header ensure *sequential
+consistency*, where "the result of any execution is the same as if the
+operations of all the processors were executed in some sequential order,
+and the operations of each individual processor appear in this sequence
+in the order specified by its program".
+
+qemu/atomic.h provides the following set of atomic read-modify-write
+operations:
+
+ void atomic_inc(ptr)
+ void atomic_dec(ptr)
+ void atomic_add(ptr, val)
+ void atomic_sub(ptr, val)
+ void atomic_and(ptr, val)
+ void atomic_or(ptr, val)
+
+ typeof(*ptr) atomic_fetch_inc(ptr)
+ typeof(*ptr) atomic_fetch_dec(ptr)
+ typeof(*ptr) atomic_fetch_add(ptr, val)
+ typeof(*ptr) atomic_fetch_sub(ptr, val)
+ typeof(*ptr) atomic_fetch_and(ptr, val)
+ typeof(*ptr) atomic_fetch_or(ptr, val)
+ typeof(*ptr) atomic_xchg(ptr, val)
+ typeof(*ptr) atomic_cmpxchg(ptr, old, new)
+
+all of which return the old value of *ptr. These operations are
+polymorphic; they operate on any type that is as wide as an int.
+
+Sequentially consistent loads and stores can be done using:
+
+ atomic_fetch_add(ptr, 0) for loads
+ atomic_xchg(ptr, val) for stores
+
+However, they are quite expensive on some platforms, notably POWER and
+ARM. Therefore, qemu/atomic.h provides two primitives with slightly
+weaker constraints:
+
+ typeof(*ptr) atomic_mb_read(ptr)
+ void atomic_mb_set(ptr, val)
+
+The semantics of these primitives map to Java volatile variables,
+and are strongly related to memory barriers as used in the Linux
+kernel (see below).
+
+As long as you use atomic_mb_read and atomic_mb_set, accesses cannot
+be reordered with each other, and it is also not possible to reorder
+"normal" accesses around them.
+
+However, and this is the important difference between
+atomic_mb_read/atomic_mb_set and sequential consistency, it is important
+for both threads to access the same volatile variable. It is not the
+case that everything visible to thread A when it writes volatile field f
+becomes visible to thread B after it reads volatile field g. The store
+and load have to "match" (i.e., be performed on the same volatile
+field) to achieve the right semantics.
+
+
+These operations operate on any type that is as wide as an int or smaller.
+
+
+WEAK ATOMIC ACCESS AND MANUAL MEMORY BARRIERS
+=============================================
+
+Compared to sequentially consistent atomic access, programming with
+weaker consistency models can be considerably more complicated.
+In general, if the algorithm you are writing includes both writes
+and reads on the same side, it is generally simpler to use sequentially
+consistent primitives.
+
+When using this model, variables are accessed with atomic_read() and
+atomic_set(), and restrictions to the ordering of accesses is enforced
+using the memory barrier macros: smp_rmb(), smp_wmb(), smp_mb(),
+smp_mb_acquire(), smp_mb_release(), smp_read_barrier_depends().
+
+atomic_read() and atomic_set() prevents the compiler from using
+optimizations that might otherwise optimize accesses out of existence
+on the one hand, or that might create unsolicited accesses on the other.
+In general this should not have any effect, because the same compiler
+barriers are already implied by memory barriers. However, it is useful
+to do so, because it tells readers which variables are shared with
+other threads, and which are local to the current thread or protected
+by other, more mundane means.
+
+Memory barriers control the order of references to shared memory.
+They come in six kinds:
+
+- smp_rmb() guarantees that all the LOAD operations specified before
+ the barrier will appear to happen before all the LOAD operations
+ specified after the barrier with respect to the other components of
+ the system.
+
+ In other words, smp_rmb() puts a partial ordering on loads, but is not
+ required to have any effect on stores.
+
+- smp_wmb() guarantees that all the STORE operations specified before
+ the barrier will appear to happen before all the STORE operations
+ specified after the barrier with respect to the other components of
+ the system.
+
+ In other words, smp_wmb() puts a partial ordering on stores, but is not
+ required to have any effect on loads.
+
+- smp_mb_acquire() guarantees that all the LOAD operations specified before
+ the barrier will appear to happen before all the LOAD or STORE operations
+ specified after the barrier with respect to the other components of
+ the system.
+
+- smp_mb_release() guarantees that all the STORE operations specified *after*
+ the barrier will appear to happen after all the LOAD or STORE operations
+ specified *before* the barrier with respect to the other components of
+ the system.
+
+- smp_mb() guarantees that all the LOAD and STORE operations specified
+ before the barrier will appear to happen before all the LOAD and
+ STORE operations specified after the barrier with respect to the other
+ components of the system.
+
+ smp_mb() puts a partial ordering on both loads and stores. It is
+ stronger than both a read and a write memory barrier; it implies both
+ smp_mb_acquire() and smp_mb_release(), but it also prevents STOREs
+ coming before the barrier from overtaking LOADs coming after the
+ barrier and vice versa.
+
+- smp_read_barrier_depends() is a weaker kind of read barrier. On
+ most processors, whenever two loads are performed such that the
+ second depends on the result of the first (e.g., the first load
+ retrieves the address to which the second load will be directed),
+ the processor will guarantee that the first LOAD will appear to happen
+ before the second with respect to the other components of the system.
+ However, this is not always true---for example, it was not true on
+ Alpha processors. Whenever this kind of access happens to shared
+ memory (that is not protected by a lock), a read barrier is needed,
+ and smp_read_barrier_depends() can be used instead of smp_rmb().
+
+ Note that the first load really has to have a _data_ dependency and not
+ a control dependency. If the address for the second load is dependent
+ on the first load, but the dependency is through a conditional rather
+ than actually loading the address itself, then it's a _control_
+ dependency and a full read barrier or better is required.
+
+
+This is the set of barriers that is required *between* two atomic_read()
+and atomic_set() operations to achieve sequential consistency:
+
+ | 2nd operation |
+ |-----------------------------------------------|
+ 1st operation | (after last) | atomic_read | atomic_set |
+ ---------------+----------------+-------------+----------------|
+ (before first) | | none | smp_mb_release |
+ ---------------+----------------+-------------+----------------|
+ atomic_read | smp_mb_acquire | smp_rmb | ** |
+ ---------------+----------------+-------------+----------------|
+ atomic_set | none | smp_mb()*** | smp_wmb() |
+ ---------------+----------------+-------------+----------------|
+
+ * Or smp_read_barrier_depends().
+
+ ** This requires a load-store barrier. This is achieved by
+ either smp_mb_acquire() or smp_mb_release().
+
+ *** This requires a store-load barrier. On most machines, the only
+ way to achieve this is a full barrier.
+
+
+You can see that the two possible definitions of atomic_mb_read()
+and atomic_mb_set() are the following:
+
+ 1) atomic_mb_read(p) = atomic_read(p); smp_mb_acquire()
+ atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v); smp_mb()
+
+ 2) atomic_mb_read(p) = smp_mb() atomic_read(p); smp_mb_acquire()
+ atomic_mb_set(p, v) = smp_mb_release(); atomic_set(p, v);
+
+Usually the former is used, because smp_mb() is expensive and a program
+normally has more reads than writes. Therefore it makes more sense to
+make atomic_mb_set() the more expensive operation.
+
+There are two common cases in which atomic_mb_read and atomic_mb_set
+generate too many memory barriers, and thus it can be useful to manually
+place barriers instead:
+
+- when a data structure has one thread that is always a writer
+ and one thread that is always a reader, manual placement of
+ memory barriers makes the write side faster. Furthermore,
+ correctness is easy to check for in this case using the "pairing"
+ trick that is explained below:
+
+ thread 1 thread 1
+ ------------------------- ------------------------
+ (other writes)
+ smp_mb_release()
+ atomic_mb_set(&a, x) atomic_set(&a, x)
+ smp_wmb()
+ atomic_mb_set(&b, y) atomic_set(&b, y)
+
+ =>
+ thread 2 thread 2
+ ------------------------- ------------------------
+ y = atomic_mb_read(&b) y = atomic_read(&b)
+ smp_rmb()
+ x = atomic_mb_read(&a) x = atomic_read(&a)
+ smp_mb_acquire()
+
+ Note that the barrier between the stores in thread 1, and between
+ the loads in thread 2, has been optimized here to a write or a
+ read memory barrier respectively. On some architectures, notably
+ ARMv7, smp_mb_acquire and smp_mb_release are just as expensive as
+ smp_mb, but smp_rmb and/or smp_wmb are more efficient.
+
+- sometimes, a thread is accessing many variables that are otherwise
+ unrelated to each other (for example because, apart from the current
+ thread, exactly one other thread will read or write each of these
+ variables). In this case, it is possible to "hoist" the implicit
+ barriers provided by atomic_mb_read() and atomic_mb_set() outside
+ a loop. For example, the above definition atomic_mb_read() gives
+ the following transformation:
+
+ n = 0; n = 0;
+ for (i = 0; i < 10; i++) => for (i = 0; i < 10; i++)
+ n += atomic_mb_read(&a[i]); n += atomic_read(&a[i]);
+ smp_mb_acquire();
+
+ Similarly, atomic_mb_set() can be transformed as follows:
+ smp_mb():
+
+ smp_mb_release();
+ for (i = 0; i < 10; i++) => for (i = 0; i < 10; i++)
+ atomic_mb_set(&a[i], false); atomic_set(&a[i], false);
+ smp_mb();
+
+
+The two tricks can be combined. In this case, splitting a loop in
+two lets you hoist the barriers out of the loops _and_ eliminate the
+expensive smp_mb():
+
+ smp_mb_release();
+ for (i = 0; i < 10; i++) { => for (i = 0; i < 10; i++)
+ atomic_mb_set(&a[i], false); atomic_set(&a[i], false);
+ atomic_mb_set(&b[i], false); smb_wmb();
+ } for (i = 0; i < 10; i++)
+ atomic_set(&a[i], false);
+ smp_mb();
+
+ The other thread can still use atomic_mb_read()/atomic_mb_set()
+
+
+Memory barrier pairing
+----------------------
+
+A useful rule of thumb is that memory barriers should always, or almost
+always, be paired with another barrier. In the case of QEMU, however,
+note that the other barrier may actually be in a driver that runs in
+the guest!
+
+For the purposes of pairing, smp_read_barrier_depends() and smp_rmb()
+both count as read barriers. A read barrier shall pair with a write
+barrier or a full barrier; a write barrier shall pair with a read
+barrier or a full barrier. A full barrier can pair with anything.
+For example:
+
+ thread 1 thread 2
+ =============== ===============
+ a = 1;
+ smp_wmb();
+ b = 2; x = b;
+ smp_rmb();
+ y = a;
+
+Note that the "writing" thread is accessing the variables in the
+opposite order as the "reading" thread. This is expected: stores
+before the write barrier will normally match the loads after the
+read barrier, and vice versa. The same is true for more than 2
+access and for data dependency barriers:
+
+ thread 1 thread 2
+ =============== ===============
+ b[2] = 1;
+ smp_wmb();
+ x->i = 2;
+ smp_wmb();
+ a = x; x = a;
+ smp_read_barrier_depends();
+ y = x->i;
+ smp_read_barrier_depends();
+ z = b[y];
+
+smp_wmb() also pairs with atomic_mb_read() and smp_mb_acquire().
+and smp_rmb() also pairs with atomic_mb_set() and smp_mb_release().
+
+
+COMPARISON WITH LINUX KERNEL MEMORY BARRIERS
+============================================
+
+Here is a list of differences between Linux kernel atomic operations
+and memory barriers, and the equivalents in QEMU:
+
+- atomic operations in Linux are always on a 32-bit int type and
+ use a boxed atomic_t type; atomic operations in QEMU are polymorphic
+ and use normal C types.
+
+- Originally, atomic_read and atomic_set in Linux gave no guarantee
+ at all. Linux 4.1 updated them to implement volatile
+ semantics via ACCESS_ONCE (or the more recent READ/WRITE_ONCE).
+
+ QEMU's atomic_read/set implement, if the compiler supports it, C11
+ atomic relaxed semantics, and volatile semantics otherwise.
+ Both semantics prevent the compiler from doing certain transformations;
+ the difference is that atomic accesses are guaranteed to be atomic,
+ while volatile accesses aren't. Thus, in the volatile case we just cross
+ our fingers hoping that the compiler will generate atomic accesses,
+ since we assume the variables passed are machine-word sized and
+ properly aligned.
+ No barriers are implied by atomic_read/set in either Linux or QEMU.
+
+- atomic read-modify-write operations in Linux are of three kinds:
+
+ atomic_OP returns void
+ atomic_OP_return returns new value of the variable
+ atomic_fetch_OP returns the old value of the variable
+ atomic_cmpxchg returns the old value of the variable
+
+ In QEMU, the second kind does not exist. Currently Linux has
+ atomic_fetch_or only. QEMU provides and, or, inc, dec, add, sub.
+
+- different atomic read-modify-write operations in Linux imply
+ a different set of memory barriers; in QEMU, all of them enforce
+ sequential consistency, which means they imply full memory barriers
+ before and after the operation.
+
+- Linux does not have an equivalent of atomic_mb_set(). In particular,
+ note that smp_store_mb() is a little weaker than atomic_mb_set().
+ atomic_mb_read() compiles to the same instructions as Linux's
+ smp_load_acquire(), but this should be treated as an implementation
+ detail. QEMU does have atomic_load_acquire() and atomic_store_release()
+ macros, but for now they are only used within atomic.h. This may
+ change in the future.
+
+
+SOURCES
+=======
+
+* Documentation/memory-barriers.txt from the Linux kernel
+
+* "The JSR-133 Cookbook for Compiler Writers", available at
+ http://g.oswego.edu/dl/jmm/cookbook.html
diff --git a/docs/devel/bitmaps.md b/docs/devel/bitmaps.md
new file mode 100644
index 0000000000..a2e8d51163
--- /dev/null
+++ b/docs/devel/bitmaps.md
@@ -0,0 +1,505 @@
+<!--
+Copyright 2015 John Snow <jsnow@redhat.com> and Red Hat, Inc.
+All rights reserved.
+
+This file is licensed via The FreeBSD Documentation License, the full text of
+which is included at the end of this document.
+-->
+
+# Dirty Bitmaps and Incremental Backup
+
+* Dirty Bitmaps are objects that track which data needs to be backed up for the
+ next incremental backup.
+
+* Dirty bitmaps can be created at any time and attached to any node
+ (not just complete drives.)
+
+## Dirty Bitmap Names
+
+* A dirty bitmap's name is unique to the node, but bitmaps attached to different
+ nodes can share the same name.
+
+* Dirty bitmaps created for internal use by QEMU may be anonymous and have no
+ name, but any user-created bitmaps may not be. There can be any number of
+ anonymous bitmaps per node.
+
+* The name of a user-created bitmap must not be empty ("").
+
+## Bitmap Modes
+
+* A Bitmap can be "frozen," which means that it is currently in-use by a backup
+ operation and cannot be deleted, renamed, written to, reset,
+ etc.
+
+* The normal operating mode for a bitmap is "active."
+
+## Basic QMP Usage
+
+### Supported Commands ###
+
+* block-dirty-bitmap-add
+* block-dirty-bitmap-remove
+* block-dirty-bitmap-clear
+
+### Creation
+
+* To create a new bitmap, enabled, on the drive with id=drive0:
+
+```json
+{ "execute": "block-dirty-bitmap-add",
+ "arguments": {
+ "node": "drive0",
+ "name": "bitmap0"
+ }
+}
+```
+
+* This bitmap will have a default granularity that matches the cluster size of
+ its associated drive, if available, clamped to between [4KiB, 64KiB].
+ The current default for qcow2 is 64KiB.
+
+* To create a new bitmap that tracks changes in 32KiB segments:
+
+```json
+{ "execute": "block-dirty-bitmap-add",
+ "arguments": {
+ "node": "drive0",
+ "name": "bitmap0",
+ "granularity": 32768
+ }
+}
+```
+
+### Deletion
+
+* Bitmaps that are frozen cannot be deleted.
+
+* Deleting the bitmap does not impact any other bitmaps attached to the same
+ node, nor does it affect any backups already created from this node.
+
+* Because bitmaps are only unique to the node to which they are attached,
+ you must specify the node/drive name here, too.
+
+```json
+{ "execute": "block-dirty-bitmap-remove",
+ "arguments": {
+ "node": "drive0",
+ "name": "bitmap0"
+ }
+}
+```
+
+### Resetting
+
+* Resetting a bitmap will clear all information it holds.
+
+* An incremental backup created from an empty bitmap will copy no data,
+ as if nothing has changed.
+
+```json
+{ "execute": "block-dirty-bitmap-clear",
+ "arguments": {
+ "node": "drive0",
+ "name": "bitmap0"
+ }
+}
+```
+
+## Transactions
+
+### Justification
+
+Bitmaps can be safely modified when the VM is paused or halted by using
+the basic QMP commands. For instance, you might perform the following actions:
+
+1. Boot the VM in a paused state.
+2. Create a full drive backup of drive0.
+3. Create a new bitmap attached to drive0.
+4. Resume execution of the VM.
+5. Incremental backups are ready to be created.
+
+At this point, the bitmap and drive backup would be correctly in sync,
+and incremental backups made from this point forward would be correctly aligned
+to the full drive backup.
+
+This is not particularly useful if we decide we want to start incremental
+backups after the VM has been running for a while, for which we will need to
+perform actions such as the following:
+
+1. Boot the VM and begin execution.
+2. Using a single transaction, perform the following operations:
+ * Create bitmap0.
+ * Create a full drive backup of drive0.
+3. Incremental backups are now ready to be created.
+
+### Supported Bitmap Transactions
+
+* block-dirty-bitmap-add
+* block-dirty-bitmap-clear
+
+The usages are identical to their respective QMP commands, but see below
+for examples.
+
+### Example: New Incremental Backup
+
+As outlined in the justification, perhaps we want to create a new incremental
+backup chain attached to a drive.
+
+```json
+{ "execute": "transaction",
+ "arguments": {
+ "actions": [
+ {"type": "block-dirty-bitmap-add",
+ "data": {"node": "drive0", "name": "bitmap0"} },
+ {"type": "drive-backup",
+ "data": {"device": "drive0", "target": "/path/to/full_backup.img",
+ "sync": "full", "format": "qcow2"} }
+ ]
+ }
+}
+```
+
+### Example: New Incremental Backup Anchor Point
+
+Maybe we just want to create a new full backup with an existing bitmap and
+want to reset the bitmap to track the new chain.
+
+```json
+{ "execute": "transaction",
+ "arguments": {
+ "actions": [
+ {"type": "block-dirty-bitmap-clear",
+ "data": {"node": "drive0", "name": "bitmap0"} },
+ {"type": "drive-backup",
+ "data": {"device": "drive0", "target": "/path/to/new_full_backup.img",
+ "sync": "full", "format": "qcow2"} }
+ ]
+ }
+}
+```
+
+## Incremental Backups
+
+The star of the show.
+
+**Nota Bene!** Only incremental backups of entire drives are supported for now.
+So despite the fact that you can attach a bitmap to any arbitrary node, they are
+only currently useful when attached to the root node. This is because
+drive-backup only supports drives/devices instead of arbitrary nodes.
+
+### Example: First Incremental Backup
+
+1. Create a full backup and sync it to the dirty bitmap, as in the transactional
+examples above; or with the VM offline, manually create a full copy and then
+create a new bitmap before the VM begins execution.
+
+ * Let's assume the full backup is named 'full_backup.img'.
+ * Let's assume the bitmap you created is 'bitmap0' attached to 'drive0'.
+
+2. Create a destination image for the incremental backup that utilizes the
+full backup as a backing image.
+
+ * Let's assume it is named 'incremental.0.img'.
+
+ ```sh
+ # qemu-img create -f qcow2 incremental.0.img -b full_backup.img -F qcow2
+ ```
+
+3. Issue the incremental backup command:
+
+ ```json
+ { "execute": "drive-backup",
+ "arguments": {
+ "device": "drive0",
+ "bitmap": "bitmap0",
+ "target": "incremental.0.img",
+ "format": "qcow2",
+ "sync": "incremental",
+ "mode": "existing"
+ }
+ }
+ ```
+
+### Example: Second Incremental Backup
+
+1. Create a new destination image for the incremental backup that points to the
+ previous one, e.g.: 'incremental.1.img'
+
+ ```sh
+ # qemu-img create -f qcow2 incremental.1.img -b incremental.0.img -F qcow2
+ ```
+
+2. Issue a new incremental backup command. The only difference here is that we
+ have changed the target image below.
+
+ ```json
+ { "execute": "drive-backup",
+ "arguments": {
+ "device": "drive0",
+ "bitmap": "bitmap0",
+ "target": "incremental.1.img",
+ "format": "qcow2",
+ "sync": "incremental",
+ "mode": "existing"
+ }
+ }
+ ```
+
+## Errors
+
+* In the event of an error that occurs after a backup job is successfully
+ launched, either by a direct QMP command or a QMP transaction, the user
+ will receive a BLOCK_JOB_COMPLETE event with a failure message, accompanied
+ by a BLOCK_JOB_ERROR event.
+
+* In the case of an event being cancelled, the user will receive a
+ BLOCK_JOB_CANCELLED event instead of a pair of COMPLETE and ERROR events.
+
+* In either case, the incremental backup data contained within the bitmap is
+ safely rolled back, and the data within the bitmap is not lost. The image
+ file created for the failed attempt can be safely deleted.
+
+* Once the underlying problem is fixed (e.g. more storage space is freed up),
+ you can simply retry the incremental backup command with the same bitmap.
+
+### Example
+
+1. Create a target image:
+
+ ```sh
+ # qemu-img create -f qcow2 incremental.0.img -b full_backup.img -F qcow2
+ ```
+
+2. Attempt to create an incremental backup via QMP:
+
+ ```json
+ { "execute": "drive-backup",
+ "arguments": {
+ "device": "drive0",
+ "bitmap": "bitmap0",
+ "target": "incremental.0.img",
+ "format": "qcow2",
+ "sync": "incremental",
+ "mode": "existing"
+ }
+ }
+ ```
+
+3. Receive an event notifying us of failure:
+
+ ```json
+ { "timestamp": { "seconds": 1424709442, "microseconds": 844524 },
+ "data": { "speed": 0, "offset": 0, "len": 67108864,
+ "error": "No space left on device",
+ "device": "drive1", "type": "backup" },
+ "event": "BLOCK_JOB_COMPLETED" }
+ ```
+
+4. Delete the failed incremental, and re-create the image.
+
+ ```sh
+ # rm incremental.0.img
+ # qemu-img create -f qcow2 incremental.0.img -b full_backup.img -F qcow2
+ ```
+
+5. Retry the command after fixing the underlying problem,
+ such as freeing up space on the backup volume:
+
+ ```json
+ { "execute": "drive-backup",
+ "arguments": {
+ "device": "drive0",
+ "bitmap": "bitmap0",
+ "target": "incremental.0.img",
+ "format": "qcow2",
+ "sync": "incremental",
+ "mode": "existing"
+ }
+ }
+ ```
+
+6. Receive confirmation that the job completed successfully:
+
+ ```json
+ { "timestamp": { "seconds": 1424709668, "microseconds": 526525 },
+ "data": { "device": "drive1", "type": "backup",
+ "speed": 0, "len": 67108864, "offset": 67108864},
+ "event": "BLOCK_JOB_COMPLETED" }
+ ```
+
+### Partial Transactional Failures
+
+* Sometimes, a transaction will succeed in launching and return success,
+ but then later the backup jobs themselves may fail. It is possible that
+ a management application may have to deal with a partial backup failure
+ after a successful transaction.
+
+* If multiple backup jobs are specified in a single transaction, when one of
+ them fails, it will not interact with the other backup jobs in any way.
+
+* The job(s) that succeeded will clear the dirty bitmap associated with the
+ operation, but the job(s) that failed will not. It is not "safe" to delete
+ any incremental backups that were created successfully in this scenario,
+ even though others failed.
+
+#### Example
+
+* QMP example highlighting two backup jobs:
+
+ ```json
+ { "execute": "transaction",
+ "arguments": {
+ "actions": [
+ { "type": "drive-backup",
+ "data": { "device": "drive0", "bitmap": "bitmap0",
+ "format": "qcow2", "mode": "existing",
+ "sync": "incremental", "target": "d0-incr-1.qcow2" } },
+ { "type": "drive-backup",
+ "data": { "device": "drive1", "bitmap": "bitmap1",
+ "format": "qcow2", "mode": "existing",
+ "sync": "incremental", "target": "d1-incr-1.qcow2" } },
+ ]
+ }
+ }
+ ```
+
+* QMP example response, highlighting one success and one failure:
+ * Acknowledgement that the Transaction was accepted and jobs were launched:
+ ```json
+ { "return": {} }
+ ```
+
+ * Later, QEMU sends notice that the first job was completed:
+ ```json
+ { "timestamp": { "seconds": 1447192343, "microseconds": 615698 },
+ "data": { "device": "drive0", "type": "backup",
+ "speed": 0, "len": 67108864, "offset": 67108864 },
+ "event": "BLOCK_JOB_COMPLETED"
+ }
+ ```
+
+ * Later yet, QEMU sends notice that the second job has failed:
+ ```json
+ { "timestamp": { "seconds": 1447192399, "microseconds": 683015 },
+ "data": { "device": "drive1", "action": "report",
+ "operation": "read" },
+ "event": "BLOCK_JOB_ERROR" }
+ ```
+
+ ```json
+ { "timestamp": { "seconds": 1447192399, "microseconds": 685853 },
+ "data": { "speed": 0, "offset": 0, "len": 67108864,
+ "error": "Input/output error",
+ "device": "drive1", "type": "backup" },
+ "event": "BLOCK_JOB_COMPLETED" }
+
+* In the above example, "d0-incr-1.qcow2" is valid and must be kept,
+ but "d1-incr-1.qcow2" is invalid and should be deleted. If a VM-wide
+ incremental backup of all drives at a point-in-time is to be made,
+ new backups for both drives will need to be made, taking into account
+ that a new incremental backup for drive0 needs to be based on top of
+ "d0-incr-1.qcow2."
+
+### Grouped Completion Mode
+
+* While jobs launched by transactions normally complete or fail on their own,
+ it is possible to instruct them to complete or fail together as a group.
+
+* QMP transactions take an optional properties structure that can affect
+ the semantics of the transaction.
+
+* The "completion-mode" transaction property can be either "individual"
+ which is the default, legacy behavior described above, or "grouped,"
+ a new behavior detailed below.
+
+* Delayed Completion: In grouped completion mode, no jobs will report
+ success until all jobs are ready to report success.
+
+* Grouped failure: If any job fails in grouped completion mode, all remaining
+ jobs will be cancelled. Any incremental backups will restore their dirty
+ bitmap objects as if no backup command was ever issued.
+
+ * Regardless of if QEMU reports a particular incremental backup job as
+ CANCELLED or as an ERROR, the in-memory bitmap will be restored.
+
+#### Example
+
+* Here's the same example scenario from above with the new property:
+
+ ```json
+ { "execute": "transaction",
+ "arguments": {
+ "actions": [
+ { "type": "drive-backup",
+ "data": { "device": "drive0", "bitmap": "bitmap0",
+ "format": "qcow2", "mode": "existing",
+ "sync": "incremental", "target": "d0-incr-1.qcow2" } },
+ { "type": "drive-backup",
+ "data": { "device": "drive1", "bitmap": "bitmap1",
+ "format": "qcow2", "mode": "existing",
+ "sync": "incremental", "target": "d1-incr-1.qcow2" } },
+ ],
+ "properties": {
+ "completion-mode": "grouped"
+ }
+ }
+ }
+ ```
+
+* QMP example response, highlighting a failure for drive2:
+ * Acknowledgement that the Transaction was accepted and jobs were launched:
+ ```json
+ { "return": {} }
+ ```
+
+ * Later, QEMU sends notice that the second job has errored out,
+ but that the first job was also cancelled:
+ ```json
+ { "timestamp": { "seconds": 1447193702, "microseconds": 632377 },
+ "data": { "device": "drive1", "action": "report",
+ "operation": "read" },
+ "event": "BLOCK_JOB_ERROR" }
+ ```
+
+ ```json
+ { "timestamp": { "seconds": 1447193702, "microseconds": 640074 },
+ "data": { "speed": 0, "offset": 0, "len": 67108864,
+ "error": "Input/output error",
+ "device": "drive1", "type": "backup" },
+ "event": "BLOCK_JOB_COMPLETED" }
+ ```
+
+ ```json
+ { "timestamp": { "seconds": 1447193702, "microseconds": 640163 },
+ "data": { "device": "drive0", "type": "backup", "speed": 0,
+ "len": 67108864, "offset": 16777216 },
+ "event": "BLOCK_JOB_CANCELLED" }
+ ```
+
+<!--
+The FreeBSD Documentation License
+
+Redistribution and use in source (Markdown) and 'compiled' forms (SGML, HTML,
+PDF, PostScript, RTF and so forth) with or without modification, are permitted
+provided that the following conditions are met:
+
+Redistributions of source code (Markdown) must retain the above copyright
+notice, this list of conditions and the following disclaimer of this file
+unmodified.
+
+Redistributions in compiled form (transformed to other DTDs, converted to PDF,
+PostScript, RTF and other formats) must reproduce the above copyright notice,
+this list of conditions and the following disclaimer in the documentation and/or
+other materials provided with the distribution.
+
+THIS DOCUMENTATION IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
+AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
+FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
+SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
+CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
+OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF
+THIS DOCUMENTATION, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+-->
diff --git a/docs/devel/blkdebug.txt b/docs/devel/blkdebug.txt
new file mode 100644
index 0000000000..43d8e8f9c6
--- /dev/null
+++ b/docs/devel/blkdebug.txt
@@ -0,0 +1,162 @@
+Block I/O error injection using blkdebug
+----------------------------------------
+Copyright (C) 2014-2015 Red Hat Inc
+
+This work is licensed under the terms of the GNU GPL, version 2 or later. See
+the COPYING file in the top-level directory.
+
+The blkdebug block driver is a rule-based error injection engine. It can be
+used to exercise error code paths in block drivers including ENOSPC (out of
+space) and EIO.
+
+This document gives an overview of the features available in blkdebug.
+
+Background
+----------
+Block drivers have many error code paths that handle I/O errors. Image formats
+are especially complex since metadata I/O errors during cluster allocation or
+while updating tables happen halfway through request processing and require
+discipline to keep image files consistent.
+
+Error injection allows test cases to trigger I/O errors at specific points.
+This way, all error paths can be tested to make sure they are correct.
+
+Rules
+-----
+The blkdebug block driver takes a list of "rules" that tell the error injection
+engine when to fail an I/O request.
+
+Each I/O request is evaluated against the rules. If a rule matches the request
+then its "action" is executed.
+
+Rules can be placed in a configuration file; the configuration file
+follows the same .ini-like format used by QEMU's -readconfig option, and
+each section of the file represents a rule.
+
+The following configuration file defines a single rule:
+
+ $ cat blkdebug.conf
+ [inject-error]
+ event = "read_aio"
+ errno = "28"
+
+This rule fails all aio read requests with ENOSPC (28). Note that the errno
+value depends on the host. On Linux, see
+/usr/include/asm-generic/errno-base.h for errno values.
+
+Invoke QEMU as follows:
+
+ $ qemu-system-x86_64
+ -drive if=none,cache=none,file=blkdebug:blkdebug.conf:test.img,id=drive0 \
+ -device virtio-blk-pci,drive=drive0,id=virtio-blk-pci0
+
+Rules support the following attributes:
+
+ event - which type of operation to match (e.g. read_aio, write_aio,
+ flush_to_os, flush_to_disk). See the "Events" section for
+ information on events.
+
+ state - (optional) the engine must be in this state number in order for this
+ rule to match. See the "State transitions" section for information
+ on states.
+
+ errno - the numeric errno value to return when a request matches this rule.
+ The errno values depend on the host since the numeric values are not
+ standarized in the POSIX specification.
+
+ sector - (optional) a sector number that the request must overlap in order to
+ match this rule
+
+ once - (optional, default "off") only execute this action on the first
+ matching request
+
+ immediately - (optional, default "off") return a NULL BlockAIOCB
+ pointer and fail without an errno instead. This
+ exercises the code path where BlockAIOCB fails and the
+ caller's BlockCompletionFunc is not invoked.
+
+Events
+------
+Block drivers provide information about the type of I/O request they are about
+to make so rules can match specific types of requests. For example, the qcow2
+block driver tells blkdebug when it accesses the L1 table so rules can match
+only L1 table accesses and not other metadata or guest data requests.
+
+The core events are:
+
+ read_aio - guest data read
+
+ write_aio - guest data write
+
+ flush_to_os - write out unwritten block driver state (e.g. cached metadata)
+
+ flush_to_disk - flush the host block device's disk cache
+
+See qapi/block-core.json:BlkdebugEvent for the full list of events.
+You may need to grep block driver source code to understand the
+meaning of specific events.
+
+State transitions
+-----------------
+There are cases where more power is needed to match a particular I/O request in
+a longer sequence of requests. For example:
+
+ write_aio
+ flush_to_disk
+ write_aio
+
+How do we match the 2nd write_aio but not the first? This is where state
+transitions come in.
+
+The error injection engine has an integer called the "state" that always starts
+initialized to 1. The state integer is internal to blkdebug and cannot be
+observed from outside but rules can interact with it for powerful matching
+behavior.
+
+Rules can be conditional on the current state and they can transition to a new
+state.
+
+When a rule's "state" attribute is non-zero then the current state must equal
+the attribute in order for the rule to match.
+
+For example, to match the 2nd write_aio:
+
+ [set-state]
+ event = "write_aio"
+ state = "1"
+ new_state = "2"
+
+ [inject-error]
+ event = "write_aio"
+ state = "2"
+ errno = "5"
+
+The first write_aio request matches the set-state rule and transitions from
+state 1 to state 2. Once state 2 has been entered, the set-state rule no
+longer matches since it requires state 1. But the inject-error rule now
+matches the next write_aio request and injects EIO (5).
+
+State transition rules support the following attributes:
+
+ event - which type of operation to match (e.g. read_aio, write_aio,
+ flush_to_os, flush_to_disk). See the "Events" section for
+ information on events.
+
+ state - (optional) the engine must be in this state number in order for this
+ rule to match
+
+ new_state - transition to this state number
+
+Suspend and resume
+------------------
+Exercising code paths in block drivers may require specific ordering amongst
+concurrent requests. The "breakpoint" feature allows requests to be halted on
+a blkdebug event and resumed later. This makes it possible to achieve
+deterministic ordering when multiple requests are in flight.
+
+Breakpoints on blkdebug events are associated with a user-defined "tag" string.
+This tag serves as an identifier by which the request can be resumed at a later
+point.
+
+See the qemu-io(1) break, resume, remove_break, and wait_break commands for
+details.
diff --git a/docs/devel/blkverify.txt b/docs/devel/blkverify.txt
new file mode 100644
index 0000000000..d556dc4e6d
--- /dev/null
+++ b/docs/devel/blkverify.txt
@@ -0,0 +1,69 @@
+= Block driver correctness testing with blkverify =
+
+== Introduction ==
+
+This document describes how to use the blkverify protocol to test that a block
+driver is operating correctly.
+
+It is difficult to test and debug block drivers against real guests. Often
+processes inside the guest will crash because corrupt sectors were read as part
+of the executable. Other times obscure errors are raised by a program inside
+the guest. These issues are extremely hard to trace back to bugs in the block
+driver.
+
+Blkverify solves this problem by catching data corruption inside QEMU the first
+time bad data is read and reporting the disk sector that is corrupted.
+
+== How it works ==
+
+The blkverify protocol has two child block devices, the "test" device and the
+"raw" device. Read/write operations are mirrored to both devices so their
+state should always be in sync.
+
+The "raw" device is a raw image, a flat file, that has identical starting
+contents to the "test" image. The idea is that the "raw" device will handle
+read/write operations correctly and not corrupt data. It can be used as a
+reference for comparison against the "test" device.
+
+After a mirrored read operation completes, blkverify will compare the data and
+raise an error if it is not identical. This makes it possible to catch the
+first instance where corrupt data is read.
+
+== Example ==
+
+Imagine raw.img has 0xcd repeated throughout its first sector:
+
+ $ ./qemu-io -c 'read -v 0 512' raw.img
+ 00000000: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
+ 00000010: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
+ [...]
+ 000001e0: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
+ 000001f0: cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd cd ................
+ read 512/512 bytes at offset 0
+ 512.000000 bytes, 1 ops; 0.0000 sec (97.656 MiB/sec and 200000.0000 ops/sec)
+
+And test.img is corrupt, its first sector is zeroed when it shouldn't be:
+
+ $ ./qemu-io -c 'read -v 0 512' test.img
+ 00000000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+ 00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+ [...]
+ 000001e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+ 000001f0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
+ read 512/512 bytes at offset 0
+ 512.000000 bytes, 1 ops; 0.0000 sec (81.380 MiB/sec and 166666.6667 ops/sec)
+
+This error is caught by blkverify:
+
+ $ ./qemu-io -c 'read 0 512' blkverify:a.img:b.img
+ blkverify: read sector_num=0 nb_sectors=4 contents mismatch in sector 0
+
+A more realistic scenario is verifying the installation of a guest OS:
+
+ $ ./qemu-img create raw.img 16G
+ $ ./qemu-img create -f qcow2 test.qcow2 16G
+ $ x86_64-softmmu/qemu-system-x86_64 -cdrom debian.iso \
+ -drive file=blkverify:raw.img:test.qcow2
+
+If the installation is aborted when blkverify detects corruption, use qemu-io
+to explore the contents of the disk image at the sector in question.
diff --git a/docs/devel/build-system.txt b/docs/devel/build-system.txt
new file mode 100644
index 0000000000..2af1e668c5
--- /dev/null
+++ b/docs/devel/build-system.txt
@@ -0,0 +1,512 @@
+ The QEMU build system architecture
+ ==================================
+
+This document aims to help developers understand the architecture of the
+QEMU build system. As with projects using GNU autotools, the QEMU build
+system has two stages, first the developer runs the "configure" script
+to determine the local build environment characteristics, then they run
+"make" to build the project. There is about where the similarities with
+GNU autotools end, so try to forget what you know about them.
+
+
+Stage 1: configure
+==================
+
+The QEMU configure script is written directly in shell, and should be
+compatible with any POSIX shell, hence it uses #!/bin/sh. An important
+implication of this is that it is important to avoid using bash-isms on
+development platforms where bash is the primary host.
+
+In contrast to autoconf scripts, QEMU's configure is expected to be
+silent while it is checking for features. It will only display output
+when an error occurs, or to show the final feature enablement summary
+on completion.
+
+Adding new checks to the configure script usually comprises the
+following tasks:
+
+ - Initialize one or more variables with the default feature state.
+
+ Ideally features should auto-detect whether they are present,
+ so try to avoid hardcoding the initial state to either enabled
+ or disabled, as that forces the user to pass a --enable-XXX
+ / --disable-XXX flag on every invocation of configure.
+
+ - Add support to the command line arg parser to handle any new
+ --enable-XXX / --disable-XXX flags required by the feature XXX.
+
+ - Add information to the help output message to report on the new
+ feature flag.
+
+ - Add code to perform the actual feature check. As noted above, try to
+ be fully dynamic in checking enablement/disablement.
+
+ - Add code to print out the feature status in the configure summary
+ upon completion.
+
+ - Add any new makefile variables to $config_host_mak on completion.
+
+
+Taking (a simplified version of) the probe for gnutls from configure,
+we have the following pieces:
+
+ # Initial variable state
+ gnutls=""
+
+ ..snip..
+
+ # Configure flag processing
+ --disable-gnutls) gnutls="no"
+ ;;
+ --enable-gnutls) gnutls="yes"
+ ;;
+
+ ..snip..
+
+ # Help output feature message
+ gnutls GNUTLS cryptography support
+
+ ..snip..
+
+ # Test for gnutls
+ if test "$gnutls" != "no"; then
+ if ! $pkg_config --exists "gnutls"; then
+ gnutls_cflags=`$pkg_config --cflags gnutls`
+ gnutls_libs=`$pkg_config --libs gnutls`
+ libs_softmmu="$gnutls_libs $libs_softmmu"
+ libs_tools="$gnutls_libs $libs_tools"
+ QEMU_CFLAGS="$QEMU_CFLAGS $gnutls_cflags"
+ gnutls="yes"
+ elif test "$gnutls" = "yes"; then
+ feature_not_found "gnutls" "Install gnutls devel"
+ else
+ gnutls="no"
+ fi
+ fi
+
+ ..snip..
+
+ # Completion feature summary
+ echo "GNUTLS support $gnutls"
+
+ ..snip..
+
+ # Define make variables
+ if test "$gnutls" = "yes" ; then
+ echo "CONFIG_GNUTLS=y" >> $config_host_mak
+ fi
+
+
+Helper functions
+----------------
+
+The configure script provides a variety of helper functions to assist
+developers in checking for system features:
+
+ - do_cc $ARGS...
+
+ Attempt to run the system C compiler passing it $ARGS...
+
+ - do_cxx $ARGS...
+
+ Attempt to run the system C++ compiler passing it $ARGS...
+
+ - compile_object $CFLAGS
+
+ Attempt to compile a test program with the system C compiler using
+ $CFLAGS. The test program must have been previously written to a file
+ called $TMPC.
+
+ - compile_prog $CFLAGS $LDFLAGS
+
+ Attempt to compile a test program with the system C compiler using
+ $CFLAGS and link it with the system linker using $LDFLAGS. The test
+ program must have been previously written to a file called $TMPC.
+
+ - has $COMMAND
+
+ Determine if $COMMAND exists in the current environment, either as a
+ shell builtin, or executable binary, returning 0 on success.
+
+ - path_of $COMMAND
+
+ Return the fully qualified path of $COMMAND, printing it to stdout,
+ and returning 0 on success.
+
+ - check_define $NAME
+
+ Determine if the macro $NAME is defined by the system C compiler
+
+ - check_include $NAME
+
+ Determine if the include $NAME file is available to the system C
+ compiler
+
+ - write_c_skeleton
+
+ Write a minimal C program main() function to the temporary file
+ indicated by $TMPC
+
+ - feature_not_found $NAME $REMEDY
+
+ Print a message to stderr that the feature $NAME was not available
+ on the system, suggesting the user try $REMEDY to address the
+ problem.
+
+ - error_exit $MESSAGE $MORE...
+
+ Print $MESSAGE to stderr, followed by $MORE... and then exit from the
+ configure script with non-zero status
+
+ - query_pkg_config $ARGS...
+
+ Run pkg-config passing it $ARGS. If QEMU is doing a static build,
+ then --static will be automatically added to $ARGS
+
+
+Stage 2: makefiles
+==================
+
+The use of GNU make is required with the QEMU build system.
+
+Although the source code is spread across multiple subdirectories, the
+build system should be considered largely non-recursive in nature, in
+contrast to common practices seen with automake. There is some recursive
+invocation of make, but this is related to the things being built,
+rather than the source directory structure.
+
+QEMU currently supports both VPATH and non-VPATH builds, so there are
+three general ways to invoke configure & perform a build.
+
+ - VPATH, build artifacts outside of QEMU source tree entirely
+
+ cd ../
+ mkdir build
+ cd build
+ ../qemu/configure
+ make
+
+ - VPATH, build artifacts in a subdir of QEMU source tree
+
+ mkdir build
+ cd build
+ ../configure
+ make
+
+ - non-VPATH, build artifacts everywhere
+
+ ./configure
+ make
+
+The QEMU maintainers generally recommend that a VPATH build is used by
+developers. Patches to QEMU are expected to ensure VPATH build still
+works.
+
+
+Module structure
+----------------
+
+There are a number of key outputs of the QEMU build system:
+
+ - Tools - qemu-img, qemu-nbd, qga (guest agent), etc
+ - System emulators - qemu-system-$ARCH
+ - Userspace emulators - qemu-$ARCH
+ - Unit tests
+
+The source code is highly modularized, split across many files to
+facilitate building of all of these components with as little duplicated
+compilation as possible. There can be considered to be two distinct
+groups of files, those which are independent of the QEMU emulation
+target and those which are dependent on the QEMU emulation target.
+
+In the target-independent set lives various general purpose helper code,
+such as error handling infrastructure, standard data structures,
+platform portability wrapper functions, etc. This code can be compiled
+once only and the .o files linked into all output binaries.
+
+In the target-dependent set lives CPU emulation, device emulation and
+much glue code. This sometimes also has to be compiled multiple times,
+once for each target being built.
+
+The utility code that is used by all binaries is built into a
+static archive called libqemuutil.a, which is then linked to all the
+binaries. In order to provide hooks that are only needed by some of the
+binaries, code in libqemuutil.a may depend on other functions that are
+not fully implemented by all QEMU binaries. To deal with this there is a
+second library called libqemustub.a which provides dummy stubs for all
+these functions. These will get lazy linked into the binary if the real
+implementation is not present. In this way, the libqemustub.a static
+library can be thought of as a portable implementation of the weak
+symbols concept. All binaries should link to both libqemuutil.a and
+libqemustub.a. e.g.
+
+ qemu-img$(EXESUF): qemu-img.o ..snip.. libqemuutil.a libqemustub.a
+
+
+Windows platform portability
+----------------------------
+
+On Windows, all binaries have the suffix '.exe', so all Makefile rules
+which create binaries must include the $(EXESUF) variable on the binary
+name. e.g.
+
+ qemu-img$(EXESUF): qemu-img.o ..snip..
+
+This expands to '.exe' on Windows, or '' on other platforms.
+
+A further complication for the system emulator binaries is that
+two separate binaries need to be generated.
+
+The main binary (e.g. qemu-system-x86_64.exe) is linked against the
+Windows console runtime subsystem. These are expected to be run from a
+command prompt window, and so will print stderr to the console that
+launched them.
+
+The second binary generated has a 'w' on the end of its name (e.g.
+qemu-system-x86_64w.exe) and is linked against the Windows graphical
+runtime subsystem. These are expected to be run directly from the
+desktop and will open up a dedicated console window for stderr output.
+
+The Makefile.target will generate the binary for the graphical subsystem
+first, and then use objcopy to relink it against the console subsystem
+to generate the second binary.
+
+
+Object variable naming
+----------------------
+
+The QEMU convention is to define variables to list different groups of
+object files. These are named with the convention $PREFIX-obj-y. For
+example the libqemuutil.a file will be linked with all objects listed
+in a variable 'util-obj-y'. So, for example, util/Makefile.obj will
+contain a set of definitions looking like
+
+ util-obj-y += bitmap.o bitops.o hbitmap.o
+ util-obj-y += fifo8.o
+ util-obj-y += acl.o
+ util-obj-y += error.o qemu-error.o
+
+When there is an object file which needs to be conditionally built based
+on some characteristic of the host system, the configure script will
+define a variable for the conditional. For example, on Windows it will
+define $(CONFIG_POSIX) with a value of 'n' and $(CONFIG_WIN32) with a
+value of 'y'. It is now possible to use the config variables when
+listing object files. For example,
+
+ util-obj-$(CONFIG_WIN32) += oslib-win32.o qemu-thread-win32.o
+ util-obj-$(CONFIG_POSIX) += oslib-posix.o qemu-thread-posix.o
+
+On Windows this expands to
+
+ util-obj-y += oslib-win32.o qemu-thread-win32.o
+ util-obj-n += oslib-posix.o qemu-thread-posix.o
+
+Since libqemutil.a links in $(util-obj-y), the POSIX specific files
+listed against $(util-obj-n) are ignored on the Windows platform builds.
+
+
+CFLAGS / LDFLAGS / LIBS handling
+--------------------------------
+
+There are many different binaries being built with differing purposes,
+and some of them might even be 3rd party libraries pulled in via git
+submodules. As such the use of the global CFLAGS variable is generally
+avoided in QEMU, since it would apply to too many build targets.
+
+Flags that are needed by any QEMU code (i.e. everything *except* GIT
+submodule projects) are put in $(QEMU_CFLAGS) variable. For linker
+flags the $(LIBS) variable is sometimes used, but a couple of more
+targeted variables are preferred. $(libs_softmmu) is used for
+libraries that must be linked to system emulator targets, $(LIBS_TOOLS)
+is used for tools like qemu-img, qemu-nbd, etc and $(LIBS_QGA) is used
+for the QEMU guest agent. There is currently no specific variable for
+the userspace emulator targets as the global $(LIBS), or more targeted
+variables shown below, are sufficient.
+
+In addition to these variables, it is possible to provide cflags and
+libs against individual source code files, by defining variables of the
+form $FILENAME-cflags and $FILENAME-libs. For example, the curl block
+driver needs to link to the libcurl library, so block/Makefile defines
+some variables:
+
+ curl.o-cflags := $(CURL_CFLAGS)
+ curl.o-libs := $(CURL_LIBS)
+
+The scope is a little different between the two variables. The libs get
+used when linking any target binary that includes the curl.o object
+file, while the cflags get used when compiling the curl.c file only.
+
+
+Statically defined files
+------------------------
+
+The following key files are statically defined in the source tree, with
+the rules needed to build QEMU. Their behaviour is influenced by a
+number of dynamically created files listed later.
+
+- Makefile
+
+The main entry point used when invoking make to build all the components
+of QEMU. The default 'all' target will naturally result in the build of
+every component. The various tools and helper binaries are built
+directly via a non-recursive set of rules.
+
+Each system/userspace emulation target needs to have a slightly
+different set of make rules / variables. Thus, make will be recursively
+invoked for each of the emulation targets.
+
+The recursive invocation will end up processing the toplevel
+Makefile.target file (more on that later).
+
+
+- */Makefile.objs
+
+Since the source code is spread across multiple directories, the rules
+for each file are similarly modularized. Thus each subdirectory
+containing .c files will usually also contain a Makefile.objs file.
+These files are not directly invoked by a recursive make, but instead
+they are imported by the top level Makefile and/or Makefile.target
+
+Each Makefile.objs usually just declares a set of variables listing the
+.o files that need building from the source files in the directory. They
+will also define any custom linker or compiler flags. For example in
+block/Makefile.objs
+
+ block-obj-$(CONFIG_LIBISCSI) += iscsi.o
+ block-obj-$(CONFIG_CURL) += curl.o
+
+ ..snip...
+
+ iscsi.o-cflags := $(LIBISCSI_CFLAGS)
+ iscsi.o-libs := $(LIBISCSI_LIBS)
+ curl.o-cflags := $(CURL_CFLAGS)
+ curl.o-libs := $(CURL_LIBS)
+
+If there are any rules defined in the Makefile.objs file, they should
+all use $(obj) as a prefix to the target, e.g.
+
+ $(obj)/generated-tcg-tracers.h: $(obj)/generated-tcg-tracers.h-timestamp
+
+
+- Makefile.target
+
+This file provides the entry point used to build each individual system
+or userspace emulator target. Each enabled target has its own
+subdirectory. For example if configure is run with the argument
+'--target-list=x86_64-softmmu', then a sub-directory 'x86_64-softmu'
+will be created, containing a 'Makefile' which symlinks back to
+Makefile.target
+
+So when the recursive '$(MAKE) -C x86_64-softmmu' is invoked, it ends up
+using Makefile.target for the build rules.
+
+
+- rules.mak
+
+This file provides the generic helper rules for invoking build tools, in
+particular the compiler and linker. This also contains the magic (hairy)
+'unnest-vars' function which is used to merge the variable definitions
+from all Makefile.objs in the source tree down into the main Makefile
+context.
+
+
+- default-configs/*.mak
+
+The files under default-configs/ control what emulated hardware is built
+into each QEMU system and userspace emulator targets. They merely
+contain a long list of config variable definitions. For example,
+default-configs/x86_64-softmmu.mak has:
+
+ include pci.mak
+ include sound.mak
+ include usb.mak
+ CONFIG_QXL=$(CONFIG_SPICE)
+ CONFIG_VGA_ISA=y
+ CONFIG_VGA_CIRRUS=y
+ CONFIG_VMWARE_VGA=y
+ CONFIG_VIRTIO_VGA=y
+ ...snip...
+
+These files rarely need changing unless new devices / hardware need to
+be enabled for a particular system/userspace emulation target
+
+
+- tests/Makefile
+
+Rules for building the unit tests. This file is included directly by the
+top level Makefile, so anything defined in this file will influence the
+entire build system. Care needs to be taken when writing rules for tests
+to ensure they only apply to the unit test execution / build.
+
+- tests/docker/Makefile.include
+
+Rules for Docker tests. Like tests/Makefile, this file is included
+directly by the top level Makefile, anything defined in this file will
+influence the entire build system.
+
+- po/Makefile
+
+Rules for building and installing the binary message catalogs from the
+text .po file sources. This almost never needs changing for any reason.
+
+
+Dynamically created files
+-------------------------
+
+The following files are generated dynamically by configure in order to
+control the behaviour of the statically defined makefiles. This avoids
+the need for QEMU makefiles to go through any pre-processing as seen
+with autotools, where Makefile.am generates Makefile.in which generates
+Makefile.
+
+
+- config-host.mak
+
+When configure has determined the characteristics of the build host it
+will write a long list of variables to config-host.mak file. This
+provides the various install directories, compiler / linker flags and a
+variety of CONFIG_* variables related to optionally enabled features.
+This is imported by the top level Makefile in order to tailor the build
+output.
+
+The variables defined here are those which are applicable to all QEMU
+build outputs. Variables which are potentially different for each
+emulator target are defined by the next file...
+
+It is also used as a dependency checking mechanism. If make sees that
+the modification timestamp on configure is newer than that on
+config-host.mak, then configure will be re-run.
+
+
+- config-host.h
+
+The config-host.h file is used by source code to determine what features
+are enabled. It is generated from the contents of config-host.mak using
+the scripts/create_config program. This extracts all the CONFIG_* variables,
+most of the HOST_* variables and a few other misc variables from
+config-host.mak, formatting them as C preprocessor macros.
+
+
+- $TARGET-NAME/config-target.mak
+
+TARGET-NAME is the name of a system or userspace emulator, for example,
+x86_64-softmmu denotes the system emulator for the x86_64 architecture.
+This file contains the variables which need to vary on a per-target
+basis. For example, it will indicate whether KVM or Xen are enabled for
+the target and any other potential custom libraries needed for linking
+the target.
+
+
+- $TARGET-NAME/config-devices.mak
+
+TARGET-NAME is again the name of a system or userspace emulator. The
+config-devices.mak file is automatically generated by make using the
+scripts/make_device_config.sh program, feeding it the
+default-configs/$TARGET-NAME file as input.
+
+
+- $TARGET-NAME/Makefile
+
+This is the entrypoint used when make recurses to build a single system
+or userspace emulator target. It is merely a symlink back to the
+Makefile.target in the top level.
diff --git a/docs/devel/lockcnt.txt b/docs/devel/lockcnt.txt
new file mode 100644
index 0000000000..2a79b3205b
--- /dev/null
+++ b/docs/devel/lockcnt.txt
@@ -0,0 +1,277 @@
+DOCUMENTATION FOR LOCKED COUNTERS (aka QemuLockCnt)
+===================================================
+
+QEMU often uses reference counts to track data structures that are being
+accessed and should not be freed. For example, a loop that invoke
+callbacks like this is not safe:
+
+ QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+ if (ioh->revents & G_IO_OUT) {
+ ioh->fd_write(ioh->opaque);
+ }
+ }
+
+QLIST_FOREACH_SAFE protects against deletion of the current node (ioh)
+by stashing away its "next" pointer. However, ioh->fd_write could
+actually delete the next node from the list. The simplest way to
+avoid this is to mark the node as deleted, and remove it from the
+list in the above loop:
+
+ QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+ if (ioh->deleted) {
+ QLIST_REMOVE(ioh, next);
+ g_free(ioh);
+ } else {
+ if (ioh->revents & G_IO_OUT) {
+ ioh->fd_write(ioh->opaque);
+ }
+ }
+ }
+
+If however this loop must also be reentrant, i.e. it is possible that
+ioh->fd_write invokes the loop again, some kind of counting is needed:
+
+ walking_handlers++;
+ QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+ if (ioh->deleted) {
+ if (walking_handlers == 1) {
+ QLIST_REMOVE(ioh, next);
+ g_free(ioh);
+ }
+ } else {
+ if (ioh->revents & G_IO_OUT) {
+ ioh->fd_write(ioh->opaque);
+ }
+ }
+ }
+ walking_handlers--;
+
+One may think of using the RCU primitives, rcu_read_lock() and
+rcu_read_unlock(); effectively, the RCU nesting count would take
+the place of the walking_handlers global variable. Indeed,
+reference counting and RCU have similar purposes, but their usage in
+general is complementary:
+
+- reference counting is fine-grained and limited to a single data
+ structure; RCU delays reclamation of *all* RCU-protected data
+ structures;
+
+- reference counting works even in the presence of code that keeps
+ a reference for a long time; RCU critical sections in principle
+ should be kept short;
+
+- reference counting is often applied to code that is not thread-safe
+ but is reentrant; in fact, usage of reference counting in QEMU predates
+ the introduction of threads by many years. RCU is generally used to
+ protect readers from other threads freeing memory after concurrent
+ modifications to a data structure.
+
+- reclaiming data can be done by a separate thread in the case of RCU;
+ this can improve performance, but also delay reclamation undesirably.
+ With reference counting, reclamation is deterministic.
+
+This file documents QemuLockCnt, an abstraction for using reference
+counting in code that has to be both thread-safe and reentrant.
+
+
+QemuLockCnt concepts
+--------------------
+
+A QemuLockCnt comprises both a counter and a mutex; it has primitives
+to increment and decrement the counter, and to take and release the
+mutex. The counter notes how many visits to the data structures are
+taking place (the visits could be from different threads, or there could
+be multiple reentrant visits from the same thread). The basic rules
+governing the counter/mutex pair then are the following:
+
+- Data protected by the QemuLockCnt must not be freed unless the
+ counter is zero and the mutex is taken.
+
+- A new visit cannot be started while the counter is zero and the
+ mutex is taken.
+
+Most of the time, the mutex protects all writes to the data structure,
+not just frees, though there could be cases where this is not necessary.
+
+Reads, instead, can be done without taking the mutex, as long as the
+readers and writers use the same macros that are used for RCU, for
+example atomic_rcu_read, atomic_rcu_set, QLIST_FOREACH_RCU, etc. This is
+because the reads are done outside a lock and a set or QLIST_INSERT_HEAD
+can happen concurrently with the read. The RCU API ensures that the
+processor and the compiler see all required memory barriers.
+
+This could be implemented simply by protecting the counter with the
+mutex, for example:
+
+ // (1)
+ qemu_mutex_lock(&walking_handlers_mutex);
+ walking_handlers++;
+ qemu_mutex_unlock(&walking_handlers_mutex);
+
+ ...
+
+ // (2)
+ qemu_mutex_lock(&walking_handlers_mutex);
+ if (--walking_handlers == 0) {
+ QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+ if (ioh->deleted) {
+ QLIST_REMOVE(ioh, next);
+ g_free(ioh);
+ }
+ }
+ }
+ qemu_mutex_unlock(&walking_handlers_mutex);
+
+Here, no frees can happen in the code represented by the ellipsis.
+If another thread is executing critical section (2), that part of
+the code cannot be entered, because the thread will not be able
+to increment the walking_handlers variable. And of course
+during the visit any other thread will see a nonzero value for
+walking_handlers, as in the single-threaded code.
+
+Note that it is possible for multiple concurrent accesses to delay
+the cleanup arbitrarily; in other words, for the walking_handlers
+counter to never become zero. For this reason, this technique is
+more easily applicable if concurrent access to the structure is rare.
+
+However, critical sections are easy to forget since you have to do
+them for each modification of the counter. QemuLockCnt ensures that
+all modifications of the counter take the lock appropriately, and it
+can also be more efficient in two ways:
+
+- it avoids taking the lock for many operations (for example
+ incrementing the counter while it is non-zero);
+
+- on some platforms, one can implement QemuLockCnt to hold the lock
+ and the mutex in a single word, making the fast path no more expensive
+ than simply managing a counter using atomic operations (see
+ docs/atomics.txt). This can be very helpful if concurrent access to
+ the data structure is expected to be rare.
+
+
+Using the same mutex for frees and writes can still incur some small
+inefficiencies; for example, a visit can never start if the counter is
+zero and the mutex is taken---even if the mutex is taken by a write,
+which in principle need not block a visit of the data structure.
+However, these are usually not a problem if any of the following
+assumptions are valid:
+
+- concurrent access is possible but rare
+
+- writes are rare
+
+- writes are frequent, but this kind of write (e.g. appending to a
+ list) has a very small critical section.
+
+For example, QEMU uses QemuLockCnt to manage an AioContext's list of
+bottom halves and file descriptor handlers. Modifications to the list
+of file descriptor handlers are rare. Creation of a new bottom half is
+frequent and can happen on a fast path; however: 1) it is almost never
+concurrent with a visit to the list of bottom halves; 2) it only has
+three instructions in the critical path, two assignments and a smp_wmb().
+
+
+QemuLockCnt API
+---------------
+
+The QemuLockCnt API is described in include/qemu/thread.h.
+
+
+QemuLockCnt usage
+-----------------
+
+This section explains the typical usage patterns for QemuLockCnt functions.
+
+Setting a variable to a non-NULL value can be done between
+qemu_lockcnt_lock and qemu_lockcnt_unlock:
+
+ qemu_lockcnt_lock(&xyz_lockcnt);
+ if (!xyz) {
+ new_xyz = g_new(XYZ, 1);
+ ...
+ atomic_rcu_set(&xyz, new_xyz);
+ }
+ qemu_lockcnt_unlock(&xyz_lockcnt);
+
+Accessing the value can be done between qemu_lockcnt_inc and
+qemu_lockcnt_dec:
+
+ qemu_lockcnt_inc(&xyz_lockcnt);
+ if (xyz) {
+ XYZ *p = atomic_rcu_read(&xyz);
+ ...
+ /* Accesses can now be done through "p". */
+ }
+ qemu_lockcnt_dec(&xyz_lockcnt);
+
+Freeing the object can similarly use qemu_lockcnt_lock and
+qemu_lockcnt_unlock, but you also need to ensure that the count
+is zero (i.e. there is no concurrent visit). Because qemu_lockcnt_inc
+takes the QemuLockCnt's lock, the count cannot become non-zero while
+the object is being freed. Freeing an object looks like this:
+
+ qemu_lockcnt_lock(&xyz_lockcnt);
+ if (!qemu_lockcnt_count(&xyz_lockcnt)) {
+ g_free(xyz);
+ xyz = NULL;
+ }
+ qemu_lockcnt_unlock(&xyz_lockcnt);
+
+If an object has to be freed right after a visit, you can combine
+the decrement, the locking and the check on count as follows:
+
+ qemu_lockcnt_inc(&xyz_lockcnt);
+ if (xyz) {
+ XYZ *p = atomic_rcu_read(&xyz);
+ ...
+ /* Accesses can now be done through "p". */
+ }
+ if (qemu_lockcnt_dec_and_lock(&xyz_lockcnt)) {
+ g_free(xyz);
+ xyz = NULL;
+ qemu_lockcnt_unlock(&xyz_lockcnt);
+ }
+
+QemuLockCnt can also be used to access a list as follows:
+
+ qemu_lockcnt_inc(&io_handlers_lockcnt);
+ QLIST_FOREACH_RCU(ioh, &io_handlers, pioh) {
+ if (ioh->revents & G_IO_OUT) {
+ ioh->fd_write(ioh->opaque);
+ }
+ }
+
+ if (qemu_lockcnt_dec_and_lock(&io_handlers_lockcnt)) {
+ QLIST_FOREACH_SAFE(ioh, &io_handlers, next, pioh) {
+ if (ioh->deleted) {
+ QLIST_REMOVE(ioh, next);
+ g_free(ioh);
+ }
+ }
+ qemu_lockcnt_unlock(&io_handlers_lockcnt);
+ }
+
+Again, the RCU primitives are used because new items can be added to the
+list during the walk. QLIST_FOREACH_RCU ensures that the processor and
+the compiler see the appropriate memory barriers.
+
+An alternative pattern uses qemu_lockcnt_dec_if_lock:
+
+ qemu_lockcnt_inc(&io_handlers_lockcnt);
+ QLIST_FOREACH_SAFE_RCU(ioh, &io_handlers, next, pioh) {
+ if (ioh->deleted) {
+ if (qemu_lockcnt_dec_if_lock(&io_handlers_lockcnt)) {
+ QLIST_REMOVE(ioh, next);
+ g_free(ioh);
+ qemu_lockcnt_inc_and_unlock(&io_handlers_lockcnt);
+ }
+ } else {
+ if (ioh->revents & G_IO_OUT) {
+ ioh->fd_write(ioh->opaque);
+ }
+ }
+ }
+ qemu_lockcnt_dec(&io_handlers_lockcnt);
+
+Here you can use qemu_lockcnt_dec instead of qemu_lockcnt_dec_and_lock,
+because there is no special task to do if the count goes from 1 to 0.
diff --git a/docs/devel/memory.txt b/docs/devel/memory.txt
new file mode 100644
index 0000000000..811b1bd3c5
--- /dev/null
+++ b/docs/devel/memory.txt
@@ -0,0 +1,316 @@
+The memory API
+==============
+
+The memory API models the memory and I/O buses and controllers of a QEMU
+machine. It attempts to allow modelling of:
+
+ - ordinary RAM
+ - memory-mapped I/O (MMIO)
+ - memory controllers that can dynamically reroute physical memory regions
+ to different destinations
+
+The memory model provides support for
+
+ - tracking RAM changes by the guest
+ - setting up coalesced memory for kvm
+ - setting up ioeventfd regions for kvm
+
+Memory is modelled as an acyclic graph of MemoryRegion objects. Sinks
+(leaves) are RAM and MMIO regions, while other nodes represent
+buses, memory controllers, and memory regions that have been rerouted.
+
+In addition to MemoryRegion objects, the memory API provides AddressSpace
+objects for every root and possibly for intermediate MemoryRegions too.
+These represent memory as seen from the CPU or a device's viewpoint.
+
+Types of regions
+----------------
+
+There are multiple types of memory regions (all represented by a single C type
+MemoryRegion):
+
+- RAM: a RAM region is simply a range of host memory that can be made available
+ to the guest.
+ You typically initialize these with memory_region_init_ram(). Some special
+ purposes require the variants memory_region_init_resizeable_ram(),
+ memory_region_init_ram_from_file(), or memory_region_init_ram_ptr().
+
+- MMIO: a range of guest memory that is implemented by host callbacks;
+ each read or write causes a callback to be called on the host.
+ You initialize these with memory_region_init_io(), passing it a
+ MemoryRegionOps structure describing the callbacks.
+
+- ROM: a ROM memory region works like RAM for reads (directly accessing
+ a region of host memory), and forbids writes. You initialize these with
+ memory_region_init_rom().
+
+- ROM device: a ROM device memory region works like RAM for reads
+ (directly accessing a region of host memory), but like MMIO for
+ writes (invoking a callback). You initialize these with
+ memory_region_init_rom_device().
+
+- IOMMU region: an IOMMU region translates addresses of accesses made to it
+ and forwards them to some other target memory region. As the name suggests,
+ these are only needed for modelling an IOMMU, not for simple devices.
+ You initialize these with memory_region_init_iommu().
+
+- container: a container simply includes other memory regions, each at
+ a different offset. Containers are useful for grouping several regions
+ into one unit. For example, a PCI BAR may be composed of a RAM region
+ and an MMIO region.
+
+ A container's subregions are usually non-overlapping. In some cases it is
+ useful to have overlapping regions; for example a memory controller that
+ can overlay a subregion of RAM with MMIO or ROM, or a PCI controller
+ that does not prevent card from claiming overlapping BARs.
+
+ You initialize a pure container with memory_region_init().
+
+- alias: a subsection of another region. Aliases allow a region to be
+ split apart into discontiguous regions. Examples of uses are memory banks
+ used when the guest address space is smaller than the amount of RAM
+ addressed, or a memory controller that splits main memory to expose a "PCI
+ hole". Aliases may point to any type of region, including other aliases,
+ but an alias may not point back to itself, directly or indirectly.
+ You initialize these with memory_region_init_alias().
+
+- reservation region: a reservation region is primarily for debugging.
+ It claims I/O space that is not supposed to be handled by QEMU itself.
+ The typical use is to track parts of the address space which will be
+ handled by the host kernel when KVM is enabled.
+ You initialize these with memory_region_init_reservation(), or by
+ passing a NULL callback parameter to memory_region_init_io().
+
+It is valid to add subregions to a region which is not a pure container
+(that is, to an MMIO, RAM or ROM region). This means that the region
+will act like a container, except that any addresses within the container's
+region which are not claimed by any subregion are handled by the
+container itself (ie by its MMIO callbacks or RAM backing). However
+it is generally possible to achieve the same effect with a pure container
+one of whose subregions is a low priority "background" region covering
+the whole address range; this is often clearer and is preferred.
+Subregions cannot be added to an alias region.
+
+Region names
+------------
+
+Regions are assigned names by the constructor. For most regions these are
+only used for debugging purposes, but RAM regions also use the name to identify
+live migration sections. This means that RAM region names need to have ABI
+stability.
+
+Region lifecycle
+----------------
+
+A region is created by one of the memory_region_init*() functions and
+attached to an object, which acts as its owner or parent. QEMU ensures
+that the owner object remains alive as long as the region is visible to
+the guest, or as long as the region is in use by a virtual CPU or another
+device. For example, the owner object will not die between an
+address_space_map operation and the corresponding address_space_unmap.
+
+After creation, a region can be added to an address space or a
+container with memory_region_add_subregion(), and removed using
+memory_region_del_subregion().
+
+Various region attributes (read-only, dirty logging, coalesced mmio,
+ioeventfd) can be changed during the region lifecycle. They take effect
+as soon as the region is made visible. This can be immediately, later,
+or never.
+
+Destruction of a memory region happens automatically when the owner
+object dies.
+
+If however the memory region is part of a dynamically allocated data
+structure, you should call object_unparent() to destroy the memory region
+before the data structure is freed. For an example see VFIOMSIXInfo
+and VFIOQuirk in hw/vfio/pci.c.
+
+You must not destroy a memory region as long as it may be in use by a
+device or CPU. In order to do this, as a general rule do not create or
+destroy memory regions dynamically during a device's lifetime, and only
+call object_unparent() in the memory region owner's instance_finalize
+callback. The dynamically allocated data structure that contains the
+memory region then should obviously be freed in the instance_finalize
+callback as well.
+
+If you break this rule, the following situation can happen:
+
+- the memory region's owner had a reference taken via memory_region_ref
+ (for example by address_space_map)
+
+- the region is unparented, and has no owner anymore
+
+- when address_space_unmap is called, the reference to the memory region's
+ owner is leaked.
+
+
+There is an exception to the above rule: it is okay to call
+object_unparent at any time for an alias or a container region. It is
+therefore also okay to create or destroy alias and container regions
+dynamically during a device's lifetime.
+
+This exceptional usage is valid because aliases and containers only help
+QEMU building the guest's memory map; they are never accessed directly.
+memory_region_ref and memory_region_unref are never called on aliases
+or containers, and the above situation then cannot happen. Exploiting
+this exception is rarely necessary, and therefore it is discouraged,
+but nevertheless it is used in a few places.
+
+For regions that "have no owner" (NULL is passed at creation time), the
+machine object is actually used as the owner. Since instance_finalize is
+never called for the machine object, you must never call object_unparent
+on regions that have no owner, unless they are aliases or containers.
+
+
+Overlapping regions and priority
+--------------------------------
+Usually, regions may not overlap each other; a memory address decodes into
+exactly one target. In some cases it is useful to allow regions to overlap,
+and sometimes to control which of an overlapping regions is visible to the
+guest. This is done with memory_region_add_subregion_overlap(), which
+allows the region to overlap any other region in the same container, and
+specifies a priority that allows the core to decide which of two regions at
+the same address are visible (highest wins).
+Priority values are signed, and the default value is zero. This means that
+you can use memory_region_add_subregion_overlap() both to specify a region
+that must sit 'above' any others (with a positive priority) and also a
+background region that sits 'below' others (with a negative priority).
+
+If the higher priority region in an overlap is a container or alias, then
+the lower priority region will appear in any "holes" that the higher priority
+region has left by not mapping subregions to that area of its address range.
+(This applies recursively -- if the subregions are themselves containers or
+aliases that leave holes then the lower priority region will appear in these
+holes too.)
+
+For example, suppose we have a container A of size 0x8000 with two subregions
+B and C. B is a container mapped at 0x2000, size 0x4000, priority 2; C is
+an MMIO region mapped at 0x0, size 0x6000, priority 1. B currently has two
+of its own subregions: D of size 0x1000 at offset 0 and E of size 0x1000 at
+offset 0x2000. As a diagram:
+
+ 0 1000 2000 3000 4000 5000 6000 7000 8000
+ |------|------|------|------|------|------|------|------|
+ A: [ ]
+ C: [CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC]
+ B: [ ]
+ D: [DDDDD]
+ E: [EEEEE]
+
+The regions that will be seen within this address range then are:
+ [CCCCCCCCCCCC][DDDDD][CCCCC][EEEEE][CCCCC]
+
+Since B has higher priority than C, its subregions appear in the flat map
+even where they overlap with C. In ranges where B has not mapped anything
+C's region appears.
+
+If B had provided its own MMIO operations (ie it was not a pure container)
+then these would be used for any addresses in its range not handled by
+D or E, and the result would be:
+ [CCCCCCCCCCCC][DDDDD][BBBBB][EEEEE][BBBBB]
+
+Priority values are local to a container, because the priorities of two
+regions are only compared when they are both children of the same container.
+This means that the device in charge of the container (typically modelling
+a bus or a memory controller) can use them to manage the interaction of
+its child regions without any side effects on other parts of the system.
+In the example above, the priorities of D and E are unimportant because
+they do not overlap each other. It is the relative priority of B and C
+that causes D and E to appear on top of C: D and E's priorities are never
+compared against the priority of C.
+
+Visibility
+----------
+The memory core uses the following rules to select a memory region when the
+guest accesses an address:
+
+- all direct subregions of the root region are matched against the address, in
+ descending priority order
+ - if the address lies outside the region offset/size, the subregion is
+ discarded
+ - if the subregion is a leaf (RAM or MMIO), the search terminates, returning
+ this leaf region
+ - if the subregion is a container, the same algorithm is used within the
+ subregion (after the address is adjusted by the subregion offset)
+ - if the subregion is an alias, the search is continued at the alias target
+ (after the address is adjusted by the subregion offset and alias offset)
+ - if a recursive search within a container or alias subregion does not
+ find a match (because of a "hole" in the container's coverage of its
+ address range), then if this is a container with its own MMIO or RAM
+ backing the search terminates, returning the container itself. Otherwise
+ we continue with the next subregion in priority order
+- if none of the subregions match the address then the search terminates
+ with no match found
+
+Example memory map
+------------------
+
+system_memory: container@0-2^48-1
+ |
+ +---- lomem: alias@0-0xdfffffff ---> #ram (0-0xdfffffff)
+ |
+ +---- himem: alias@0x100000000-0x11fffffff ---> #ram (0xe0000000-0xffffffff)
+ |
+ +---- vga-window: alias@0xa0000-0xbffff ---> #pci (0xa0000-0xbffff)
+ | (prio 1)
+ |
+ +---- pci-hole: alias@0xe0000000-0xffffffff ---> #pci (0xe0000000-0xffffffff)
+
+pci (0-2^32-1)
+ |
+ +--- vga-area: container@0xa0000-0xbffff
+ | |
+ | +--- alias@0x00000-0x7fff ---> #vram (0x010000-0x017fff)
+ | |
+ | +--- alias@0x08000-0xffff ---> #vram (0x020000-0x027fff)
+ |
+ +---- vram: ram@0xe1000000-0xe1ffffff
+ |
+ +---- vga-mmio: mmio@0xe2000000-0xe200ffff
+
+ram: ram@0x00000000-0xffffffff
+
+This is a (simplified) PC memory map. The 4GB RAM block is mapped into the
+system address space via two aliases: "lomem" is a 1:1 mapping of the first
+3.5GB; "himem" maps the last 0.5GB at address 4GB. This leaves 0.5GB for the
+so-called PCI hole, that allows a 32-bit PCI bus to exist in a system with
+4GB of memory.
+
+The memory controller diverts addresses in the range 640K-768K to the PCI
+address space. This is modelled using the "vga-window" alias, mapped at a
+higher priority so it obscures the RAM at the same addresses. The vga window
+can be removed by programming the memory controller; this is modelled by
+removing the alias and exposing the RAM underneath.
+
+The pci address space is not a direct child of the system address space, since
+we only want parts of it to be visible (we accomplish this using aliases).
+It has two subregions: vga-area models the legacy vga window and is occupied
+by two 32K memory banks pointing at two sections of the framebuffer.
+In addition the vram is mapped as a BAR at address e1000000, and an additional
+BAR containing MMIO registers is mapped after it.
+
+Note that if the guest maps a BAR outside the PCI hole, it would not be
+visible as the pci-hole alias clips it to a 0.5GB range.
+
+MMIO Operations
+---------------
+
+MMIO regions are provided with ->read() and ->write() callbacks; in addition
+various constraints can be supplied to control how these callbacks are called:
+
+ - .valid.min_access_size, .valid.max_access_size define the access sizes
+ (in bytes) which the device accepts; accesses outside this range will
+ have device and bus specific behaviour (ignored, or machine check)
+ - .valid.unaligned specifies that the *device being modelled* supports
+ unaligned accesses; if false, unaligned accesses will invoke the
+ appropriate bus or CPU specific behaviour.
+ - .impl.min_access_size, .impl.max_access_size define the access sizes
+ (in bytes) supported by the *implementation*; other access sizes will be
+ emulated using the ones available. For example a 4-byte write will be
+ emulated using four 1-byte writes, if .impl.max_access_size = 1.
+ - .impl.unaligned specifies that the *implementation* supports unaligned
+ accesses; if false, unaligned accesses will be emulated by two aligned
+ accesses.
+ - .old_mmio eases the porting of code that was formerly using
+ cpu_register_io_memory(). It should not be used in new code.
diff --git a/docs/devel/migration.txt b/docs/devel/migration.txt
new file mode 100644
index 0000000000..1b940a829b
--- /dev/null
+++ b/docs/devel/migration.txt
@@ -0,0 +1,555 @@
+= Migration =
+
+QEMU has code to load/save the state of the guest that it is running.
+These are two complementary operations. Saving the state just does
+that, saves the state for each device that the guest is running.
+Restoring a guest is just the opposite operation: we need to load the
+state of each device.
+
+For this to work, QEMU has to be launched with the same arguments the
+two times. I.e. it can only restore the state in one guest that has
+the same devices that the one it was saved (this last requirement can
+be relaxed a bit, but for now we can consider that configuration has
+to be exactly the same).
+
+Once that we are able to save/restore a guest, a new functionality is
+requested: migration. This means that QEMU is able to start in one
+machine and being "migrated" to another machine. I.e. being moved to
+another machine.
+
+Next was the "live migration" functionality. This is important
+because some guests run with a lot of state (specially RAM), and it
+can take a while to move all state from one machine to another. Live
+migration allows the guest to continue running while the state is
+transferred. Only while the last part of the state is transferred has
+the guest to be stopped. Typically the time that the guest is
+unresponsive during live migration is the low hundred of milliseconds
+(notice that this depends on a lot of things).
+
+=== Types of migration ===
+
+Now that we have talked about live migration, there are several ways
+to do migration:
+
+- tcp migration: do the migration using tcp sockets
+- unix migration: do the migration using unix sockets
+- exec migration: do the migration using the stdin/stdout through a process.
+- fd migration: do the migration using an file descriptor that is
+ passed to QEMU. QEMU doesn't care how this file descriptor is opened.
+
+All these four migration protocols use the same infrastructure to
+save/restore state devices. This infrastructure is shared with the
+savevm/loadvm functionality.
+
+=== State Live Migration ===
+
+This is used for RAM and block devices. It is not yet ported to vmstate.
+<Fill more information here>
+
+=== What is the common infrastructure ===
+
+QEMU uses a QEMUFile abstraction to be able to do migration. Any type
+of migration that wants to use QEMU infrastructure has to create a
+QEMUFile with:
+
+QEMUFile *qemu_fopen_ops(void *opaque,
+ QEMUFilePutBufferFunc *put_buffer,
+ QEMUFileGetBufferFunc *get_buffer,
+ QEMUFileCloseFunc *close);
+
+The functions have the following functionality:
+
+This function writes a chunk of data to a file at the given position.
+The pos argument can be ignored if the file is only used for
+streaming. The handler should try to write all of the data it can.
+
+typedef int (QEMUFilePutBufferFunc)(void *opaque, const uint8_t *buf,
+ int64_t pos, int size);
+
+Read a chunk of data from a file at the given position. The pos argument
+can be ignored if the file is only be used for streaming. The number of
+bytes actually read should be returned.
+
+typedef int (QEMUFileGetBufferFunc)(void *opaque, uint8_t *buf,
+ int64_t pos, int size);
+
+Close a file and return an error code.
+
+typedef int (QEMUFileCloseFunc)(void *opaque);
+
+You can use any internal state that you need using the opaque void *
+pointer that is passed to all functions.
+
+The important functions for us are put_buffer()/get_buffer() that
+allow to write/read a buffer into the QEMUFile.
+
+=== How to save the state of one device ===
+
+The state of a device is saved using intermediate buffers. There are
+some helper functions to assist this saving.
+
+There is a new concept that we have to explain here: device state
+version. When we migrate a device, we save/load the state as a series
+of fields. Some times, due to bugs or new functionality, we need to
+change the state to store more/different information. We use the
+version to identify each time that we do a change. Each version is
+associated with a series of fields saved. The save_state always saves
+the state as the newer version. But load_state sometimes is able to
+load state from an older version.
+
+=== Legacy way ===
+
+This way is going to disappear as soon as all current users are ported to VMSTATE.
+
+Each device has to register two functions, one to save the state and
+another to load the state back.
+
+int register_savevm(DeviceState *dev,
+ const char *idstr,
+ int instance_id,
+ int version_id,
+ SaveStateHandler *save_state,
+ LoadStateHandler *load_state,
+ void *opaque);
+
+typedef void SaveStateHandler(QEMUFile *f, void *opaque);
+typedef int LoadStateHandler(QEMUFile *f, void *opaque, int version_id);
+
+The important functions for the device state format are the save_state
+and load_state. Notice that load_state receives a version_id
+parameter to know what state format is receiving. save_state doesn't
+have a version_id parameter because it always uses the latest version.
+
+=== VMState ===
+
+The legacy way of saving/loading state of the device had the problem
+that we have to maintain two functions in sync. If we did one change
+in one of them and not in the other, we would get a failed migration.
+
+VMState changed the way that state is saved/loaded. Instead of using
+a function to save the state and another to load it, it was changed to
+a declarative way of what the state consisted of. Now VMState is able
+to interpret that definition to be able to load/save the state. As
+the state is declared only once, it can't go out of sync in the
+save/load functions.
+
+An example (from hw/input/pckbd.c)
+
+static const VMStateDescription vmstate_kbd = {
+ .name = "pckbd",
+ .version_id = 3,
+ .minimum_version_id = 3,
+ .fields = (VMStateField[]) {
+ VMSTATE_UINT8(write_cmd, KBDState),
+ VMSTATE_UINT8(status, KBDState),
+ VMSTATE_UINT8(mode, KBDState),
+ VMSTATE_UINT8(pending, KBDState),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+We are declaring the state with name "pckbd".
+The version_id is 3, and the fields are 4 uint8_t in a KBDState structure.
+We registered this with:
+
+ vmstate_register(NULL, 0, &vmstate_kbd, s);
+
+Note: talk about how vmstate <-> qdev interact, and what the instance ids mean.
+
+You can search for VMSTATE_* macros for lots of types used in QEMU in
+include/hw/hw.h.
+
+=== More about versions ===
+
+Version numbers are intended for major incompatible changes to the
+migration of a device, and using them breaks backwards-migration
+compatibility; in general most changes can be made by adding Subsections
+(see below) or _TEST macros (see below) which won't break compatibility.
+
+You can see that there are several version fields:
+
+- version_id: the maximum version_id supported by VMState for that device.
+- minimum_version_id: the minimum version_id that VMState is able to understand
+ for that device.
+- minimum_version_id_old: For devices that were not able to port to vmstate, we can
+ assign a function that knows how to read this old state. This field is
+ ignored if there is no load_state_old handler.
+
+So, VMState is able to read versions from minimum_version_id to
+version_id. And the function load_state_old() (if present) is able to
+load state from minimum_version_id_old to minimum_version_id. This
+function is deprecated and will be removed when no more users are left.
+
+Saving state will always create a section with the 'version_id' value
+and thus can't be loaded by any older QEMU.
+
+=== Massaging functions ===
+
+Sometimes, it is not enough to be able to save the state directly
+from one structure, we need to fill the correct values there. One
+example is when we are using kvm. Before saving the cpu state, we
+need to ask kvm to copy to QEMU the state that it is using. And the
+opposite when we are loading the state, we need a way to tell kvm to
+load the state for the cpu that we have just loaded from the QEMUFile.
+
+The functions to do that are inside a vmstate definition, and are called:
+
+- int (*pre_load)(void *opaque);
+
+ This function is called before we load the state of one device.
+
+- int (*post_load)(void *opaque, int version_id);
+
+ This function is called after we load the state of one device.
+
+- void (*pre_save)(void *opaque);
+
+ This function is called before we save the state of one device.
+
+Example: You can look at hpet.c, that uses the three function to
+ massage the state that is transferred.
+
+If you use memory API functions that update memory layout outside
+initialization (i.e., in response to a guest action), this is a strong
+indication that you need to call these functions in a post_load callback.
+Examples of such memory API functions are:
+
+ - memory_region_add_subregion()
+ - memory_region_del_subregion()
+ - memory_region_set_readonly()
+ - memory_region_set_enabled()
+ - memory_region_set_address()
+ - memory_region_set_alias_offset()
+
+=== Subsections ===
+
+The use of version_id allows to be able to migrate from older versions
+to newer versions of a device. But not the other way around. This
+makes very complicated to fix bugs in stable branches. If we need to
+add anything to the state to fix a bug, we have to disable migration
+to older versions that don't have that bug-fix (i.e. a new field).
+
+But sometimes, that bug-fix is only needed sometimes, not always. For
+instance, if the device is in the middle of a DMA operation, it is
+using a specific functionality, ....
+
+It is impossible to create a way to make migration from any version to
+any other version to work. But we can do better than only allowing
+migration from older versions to newer ones. For that fields that are
+only needed sometimes, we add the idea of subsections. A subsection
+is "like" a device vmstate, but with a particularity, it has a Boolean
+function that tells if that values are needed to be sent or not. If
+this functions returns false, the subsection is not sent.
+
+On the receiving side, if we found a subsection for a device that we
+don't understand, we just fail the migration. If we understand all
+the subsections, then we load the state with success.
+
+One important note is that the post_load() function is called "after"
+loading all subsections, because a newer subsection could change same
+value that it uses.
+
+Example:
+
+static bool ide_drive_pio_state_needed(void *opaque)
+{
+ IDEState *s = opaque;
+
+ return ((s->status & DRQ_STAT) != 0)
+ || (s->bus->error_status & BM_STATUS_PIO_RETRY);
+}
+
+const VMStateDescription vmstate_ide_drive_pio_state = {
+ .name = "ide_drive/pio_state",
+ .version_id = 1,
+ .minimum_version_id = 1,
+ .pre_save = ide_drive_pio_pre_save,
+ .post_load = ide_drive_pio_post_load,
+ .needed = ide_drive_pio_state_needed,
+ .fields = (VMStateField[]) {
+ VMSTATE_INT32(req_nb_sectors, IDEState),
+ VMSTATE_VARRAY_INT32(io_buffer, IDEState, io_buffer_total_len, 1,
+ vmstate_info_uint8, uint8_t),
+ VMSTATE_INT32(cur_io_buffer_offset, IDEState),
+ VMSTATE_INT32(cur_io_buffer_len, IDEState),
+ VMSTATE_UINT8(end_transfer_fn_idx, IDEState),
+ VMSTATE_INT32(elementary_transfer_size, IDEState),
+ VMSTATE_INT32(packet_transfer_size, IDEState),
+ VMSTATE_END_OF_LIST()
+ }
+};
+
+const VMStateDescription vmstate_ide_drive = {
+ .name = "ide_drive",
+ .version_id = 3,
+ .minimum_version_id = 0,
+ .post_load = ide_drive_post_load,
+ .fields = (VMStateField[]) {
+ .... several fields ....
+ VMSTATE_END_OF_LIST()
+ },
+ .subsections = (const VMStateDescription*[]) {
+ &vmstate_ide_drive_pio_state,
+ NULL
+ }
+};
+
+Here we have a subsection for the pio state. We only need to
+save/send this state when we are in the middle of a pio operation
+(that is what ide_drive_pio_state_needed() checks). If DRQ_STAT is
+not enabled, the values on that fields are garbage and don't need to
+be sent.
+
+Using a condition function that checks a 'property' to determine whether
+to send a subsection allows backwards migration compatibility when
+new subsections are added.
+
+For example;
+ a) Add a new property using DEFINE_PROP_BOOL - e.g. support-foo and
+ default it to true.
+ b) Add an entry to the HW_COMPAT_ for the previous version
+ that sets the property to false.
+ c) Add a static bool support_foo function that tests the property.
+ d) Add a subsection with a .needed set to the support_foo function
+ e) (potentially) Add a pre_load that sets up a default value for 'foo'
+ to be used if the subsection isn't loaded.
+
+Now that subsection will not be generated when using an older
+machine type and the migration stream will be accepted by older
+QEMU versions. pre-load functions can be used to initialise state
+on the newer version so that they default to suitable values
+when loading streams created by older QEMU versions that do not
+generate the subsection.
+
+In some cases subsections are added for data that had been accidentally
+omitted by earlier versions; if the missing data causes the migration
+process to succeed but the guest to behave badly then it may be better
+to send the subsection and cause the migration to explicitly fail
+with the unknown subsection error. If the bad behaviour only happens
+with certain data values, making the subsection conditional on
+the data value (rather than the machine type) allows migrations to succeed
+in most cases. In general the preference is to tie the subsection to
+the machine type, and allow reliable migrations, unless the behaviour
+from omission of the subsection is really bad.
+
+= Not sending existing elements =
+
+Sometimes members of the VMState are no longer needed;
+ removing them will break migration compatibility
+ making them version dependent and bumping the version will break backwards
+ migration compatibility.
+
+The best way is to:
+ a) Add a new property/compatibility/function in the same way for subsections
+ above.
+ b) replace the VMSTATE macro with the _TEST version of the macro, e.g.:
+ VMSTATE_UINT32(foo, barstruct)
+ becomes
+ VMSTATE_UINT32_TEST(foo, barstruct, pre_version_baz)
+
+ Sometime in the future when we no longer care about the ancient
+versions these can be killed off.
+
+= Return path =
+
+In most migration scenarios there is only a single data path that runs
+from the source VM to the destination, typically along a single fd (although
+possibly with another fd or similar for some fast way of throwing pages across).
+
+However, some uses need two way communication; in particular the Postcopy
+destination needs to be able to request pages on demand from the source.
+
+For these scenarios there is a 'return path' from the destination to the source;
+qemu_file_get_return_path(QEMUFile* fwdpath) gives the QEMUFile* for the return
+path.
+
+ Source side
+ Forward path - written by migration thread
+ Return path - opened by main thread, read by return-path thread
+
+ Destination side
+ Forward path - read by main thread
+ Return path - opened by main thread, written by main thread AND postcopy
+ thread (protected by rp_mutex)
+
+= Postcopy =
+'Postcopy' migration is a way to deal with migrations that refuse to converge
+(or take too long to converge) its plus side is that there is an upper bound on
+the amount of migration traffic and time it takes, the down side is that during
+the postcopy phase, a failure of *either* side or the network connection causes
+the guest to be lost.
+
+In postcopy the destination CPUs are started before all the memory has been
+transferred, and accesses to pages that are yet to be transferred cause
+a fault that's translated by QEMU into a request to the source QEMU.
+
+Postcopy can be combined with precopy (i.e. normal migration) so that if precopy
+doesn't finish in a given time the switch is made to postcopy.
+
+=== Enabling postcopy ===
+
+To enable postcopy, issue this command on the monitor prior to the
+start of migration:
+
+migrate_set_capability postcopy-ram on
+
+The normal commands are then used to start a migration, which is still
+started in precopy mode. Issuing:
+
+migrate_start_postcopy
+
+will now cause the transition from precopy to postcopy.
+It can be issued immediately after migration is started or any
+time later on. Issuing it after the end of a migration is harmless.
+
+Note: During the postcopy phase, the bandwidth limits set using
+migrate_set_speed is ignored (to avoid delaying requested pages that
+the destination is waiting for).
+
+=== Postcopy device transfer ===
+
+Loading of device data may cause the device emulation to access guest RAM
+that may trigger faults that have to be resolved by the source, as such
+the migration stream has to be able to respond with page data *during* the
+device load, and hence the device data has to be read from the stream completely
+before the device load begins to free the stream up. This is achieved by
+'packaging' the device data into a blob that's read in one go.
+
+Source behaviour
+
+Until postcopy is entered the migration stream is identical to normal
+precopy, except for the addition of a 'postcopy advise' command at
+the beginning, to tell the destination that postcopy might happen.
+When postcopy starts the source sends the page discard data and then
+forms the 'package' containing:
+
+ Command: 'postcopy listen'
+ The device state
+ A series of sections, identical to the precopy streams device state stream
+ containing everything except postcopiable devices (i.e. RAM)
+ Command: 'postcopy run'
+
+The 'package' is sent as the data part of a Command: 'CMD_PACKAGED', and the
+contents are formatted in the same way as the main migration stream.
+
+During postcopy the source scans the list of dirty pages and sends them
+to the destination without being requested (in much the same way as precopy),
+however when a page request is received from the destination, the dirty page
+scanning restarts from the requested location. This causes requested pages
+to be sent quickly, and also causes pages directly after the requested page
+to be sent quickly in the hope that those pages are likely to be used
+by the destination soon.
+
+Destination behaviour
+
+Initially the destination looks the same as precopy, with a single thread
+reading the migration stream; the 'postcopy advise' and 'discard' commands
+are processed to change the way RAM is managed, but don't affect the stream
+processing.
+
+------------------------------------------------------------------------------
+ 1 2 3 4 5 6 7
+main -----DISCARD-CMD_PACKAGED ( LISTEN DEVICE DEVICE DEVICE RUN )
+thread | |
+ | (page request)
+ | \___
+ v \
+listen thread: --- page -- page -- page -- page -- page --
+
+ a b c
+------------------------------------------------------------------------------
+
+On receipt of CMD_PACKAGED (1)
+ All the data associated with the package - the ( ... ) section in the
+diagram - is read into memory, and the main thread recurses into
+qemu_loadvm_state_main to process the contents of the package (2)
+which contains commands (3,6) and devices (4...)
+
+On receipt of 'postcopy listen' - 3 -(i.e. the 1st command in the package)
+a new thread (a) is started that takes over servicing the migration stream,
+while the main thread carries on loading the package. It loads normal
+background page data (b) but if during a device load a fault happens (5) the
+returned page (c) is loaded by the listen thread allowing the main threads
+device load to carry on.
+
+The last thing in the CMD_PACKAGED is a 'RUN' command (6) letting the destination
+CPUs start running.
+At the end of the CMD_PACKAGED (7) the main thread returns to normal running behaviour
+and is no longer used by migration, while the listen thread carries
+on servicing page data until the end of migration.
+
+=== Postcopy states ===
+
+Postcopy moves through a series of states (see postcopy_state) from
+ADVISE->DISCARD->LISTEN->RUNNING->END
+
+ Advise: Set at the start of migration if postcopy is enabled, even
+ if it hasn't had the start command; here the destination
+ checks that its OS has the support needed for postcopy, and performs
+ setup to ensure the RAM mappings are suitable for later postcopy.
+ The destination will fail early in migration at this point if the
+ required OS support is not present.
+ (Triggered by reception of POSTCOPY_ADVISE command)
+
+ Discard: Entered on receipt of the first 'discard' command; prior to
+ the first Discard being performed, hugepages are switched off
+ (using madvise) to ensure that no new huge pages are created
+ during the postcopy phase, and to cause any huge pages that
+ have discards on them to be broken.
+
+ Listen: The first command in the package, POSTCOPY_LISTEN, switches
+ the destination state to Listen, and starts a new thread
+ (the 'listen thread') which takes over the job of receiving
+ pages off the migration stream, while the main thread carries
+ on processing the blob. With this thread able to process page
+ reception, the destination now 'sensitises' the RAM to detect
+ any access to missing pages (on Linux using the 'userfault'
+ system).
+
+ Running: POSTCOPY_RUN causes the destination to synchronise all
+ state and start the CPUs and IO devices running. The main
+ thread now finishes processing the migration package and
+ now carries on as it would for normal precopy migration
+ (although it can't do the cleanup it would do as it
+ finishes a normal migration).
+
+ End: The listen thread can now quit, and perform the cleanup of migration
+ state, the migration is now complete.
+
+=== Source side page maps ===
+
+The source side keeps two bitmaps during postcopy; 'the migration bitmap'
+and 'unsent map'. The 'migration bitmap' is basically the same as in
+the precopy case, and holds a bit to indicate that page is 'dirty' -
+i.e. needs sending. During the precopy phase this is updated as the CPU
+dirties pages, however during postcopy the CPUs are stopped and nothing
+should dirty anything any more.
+
+The 'unsent map' is used for the transition to postcopy. It is a bitmap that
+has a bit cleared whenever a page is sent to the destination, however during
+the transition to postcopy mode it is combined with the migration bitmap
+to form a set of pages that:
+ a) Have been sent but then redirtied (which must be discarded)
+ b) Have not yet been sent - which also must be discarded to cause any
+ transparent huge pages built during precopy to be broken.
+
+Note that the contents of the unsentmap are sacrificed during the calculation
+of the discard set and thus aren't valid once in postcopy. The dirtymap
+is still valid and is used to ensure that no page is sent more than once. Any
+request for a page that has already been sent is ignored. Duplicate requests
+such as this can happen as a page is sent at about the same time the
+destination accesses it.
+
+=== Postcopy with hugepages ===
+
+Postcopy now works with hugetlbfs backed memory:
+ a) The linux kernel on the destination must support userfault on hugepages.
+ b) The huge-page configuration on the source and destination VMs must be
+ identical; i.e. RAMBlocks on both sides must use the same page size.
+ c) Note that -mem-path /dev/hugepages will fall back to allocating normal
+ RAM if it doesn't have enough hugepages, triggering (b) to fail.
+ Using -mem-prealloc enforces the allocation using hugepages.
+ d) Care should be taken with the size of hugepage used; postcopy with 2MB
+ hugepages works well, however 1GB hugepages are likely to be problematic
+ since it takes ~1 second to transfer a 1GB hugepage across a 10Gbps link,
+ and until the full page is transferred the destination thread is blocked.
diff --git a/docs/devel/multi-thread-tcg.txt b/docs/devel/multi-thread-tcg.txt
new file mode 100644
index 0000000000..a99b4564c6
--- /dev/null
+++ b/docs/devel/multi-thread-tcg.txt
@@ -0,0 +1,350 @@
+Copyright (c) 2015-2016 Linaro Ltd.
+
+This work is licensed under the terms of the GNU GPL, version 2 or
+later. See the COPYING file in the top-level directory.
+
+Introduction
+============
+
+This document outlines the design for multi-threaded TCG system-mode
+emulation. The current user-mode emulation mirrors the thread
+structure of the translated executable. Some of the work will be
+applicable to both system and linux-user emulation.
+
+The original system-mode TCG implementation was single threaded and
+dealt with multiple CPUs with simple round-robin scheduling. This
+simplified a lot of things but became increasingly limited as systems
+being emulated gained additional cores and per-core performance gains
+for host systems started to level off.
+
+vCPU Scheduling
+===============
+
+We introduce a new running mode where each vCPU will run on its own
+user-space thread. This will be enabled by default for all FE/BE
+combinations that have had the required work done to support this
+safely.
+
+In the general case of running translated code there should be no
+inter-vCPU dependencies and all vCPUs should be able to run at full
+speed. Synchronisation will only be required while accessing internal
+shared data structures or when the emulated architecture requires a
+coherent representation of the emulated machine state.
+
+Shared Data Structures
+======================
+
+Main Run Loop
+-------------
+
+Even when there is no code being generated there are a number of
+structures associated with the hot-path through the main run-loop.
+These are associated with looking up the next translation block to
+execute. These include:
+
+ tb_jmp_cache (per-vCPU, cache of recent jumps)
+ tb_ctx.htable (global hash table, phys address->tb lookup)
+
+As TB linking only occurs when blocks are in the same page this code
+is critical to performance as looking up the next TB to execute is the
+most common reason to exit the generated code.
+
+DESIGN REQUIREMENT: Make access to lookup structures safe with
+multiple reader/writer threads. Minimise any lock contention to do it.
+
+The hot-path avoids using locks where possible. The tb_jmp_cache is
+updated with atomic accesses to ensure consistent results. The fall
+back QHT based hash table is also designed for lockless lookups. Locks
+are only taken when code generation is required or TranslationBlocks
+have their block-to-block jumps patched.
+
+Global TCG State
+----------------
+
+We need to protect the entire code generation cycle including any post
+generation patching of the translated code. This also implies a shared
+translation buffer which contains code running on all cores. Any
+execution path that comes to the main run loop will need to hold a
+mutex for code generation. This also includes times when we need flush
+code or entries from any shared lookups/caches. Structures held on a
+per-vCPU basis won't need locking unless other vCPUs will need to
+modify them.
+
+DESIGN REQUIREMENT: Add locking around all code generation and TB
+patching.
+
+(Current solution)
+
+Mainly as part of the linux-user work all code generation is
+serialised with a tb_lock(). For the SoftMMU tb_lock() also takes the
+place of mmap_lock() in linux-user.
+
+Translation Blocks
+------------------
+
+Currently the whole system shares a single code generation buffer
+which when full will force a flush of all translations and start from
+scratch again. Some operations also force a full flush of translations
+including:
+
+ - debugging operations (breakpoint insertion/removal)
+ - some CPU helper functions
+
+This is done with the async_safe_run_on_cpu() mechanism to ensure all
+vCPUs are quiescent when changes are being made to shared global
+structures.
+
+More granular translation invalidation events are typically due
+to a change of the state of a physical page:
+
+ - code modification (self modify code, patching code)
+ - page changes (new page mapping in linux-user mode)
+
+While setting the invalid flag in a TranslationBlock will stop it
+being used when looked up in the hot-path there are a number of other
+book-keeping structures that need to be safely cleared.
+
+Any TranslationBlocks which have been patched to jump directly to the
+now invalid blocks need the jump patches reversing so they will return
+to the C code.
+
+There are a number of look-up caches that need to be properly updated
+including the:
+
+ - jump lookup cache
+ - the physical-to-tb lookup hash table
+ - the global page table
+
+The global page table (l1_map) which provides a multi-level look-up
+for PageDesc structures which contain pointers to the start of a
+linked list of all Translation Blocks in that page (see page_next).
+
+Both the jump patching and the page cache involve linked lists that
+the invalidated TranslationBlock needs to be removed from.
+
+DESIGN REQUIREMENT: Safely handle invalidation of TBs
+ - safely patch/revert direct jumps
+ - remove central PageDesc lookup entries
+ - ensure lookup caches/hashes are safely updated
+
+(Current solution)
+
+The direct jump themselves are updated atomically by the TCG
+tb_set_jmp_target() code. Modification to the linked lists that allow
+searching for linked pages are done under the protect of the
+tb_lock().
+
+The global page table is protected by the tb_lock() in system-mode and
+mmap_lock() in linux-user mode.
+
+The lookup caches are updated atomically and the lookup hash uses QHT
+which is designed for concurrent safe lookup.
+
+
+Memory maps and TLBs
+--------------------
+
+The memory handling code is fairly critical to the speed of memory
+access in the emulated system. The SoftMMU code is designed so the
+hot-path can be handled entirely within translated code. This is
+handled with a per-vCPU TLB structure which once populated will allow
+a series of accesses to the page to occur without exiting the
+translated code. It is possible to set flags in the TLB address which
+will ensure the slow-path is taken for each access. This can be done
+to support:
+
+ - Memory regions (dividing up access to PIO, MMIO and RAM)
+ - Dirty page tracking (for code gen, SMC detection, migration and display)
+ - Virtual TLB (for translating guest address->real address)
+
+When the TLB tables are updated by a vCPU thread other than their own
+we need to ensure it is done in a safe way so no inconsistent state is
+seen by the vCPU thread.
+
+Some operations require updating a number of vCPUs TLBs at the same
+time in a synchronised manner.
+
+DESIGN REQUIREMENTS:
+
+ - TLB Flush All/Page
+ - can be across-vCPUs
+ - cross vCPU TLB flush may need other vCPU brought to halt
+ - change may need to be visible to the calling vCPU immediately
+ - TLB Flag Update
+ - usually cross-vCPU
+ - want change to be visible as soon as possible
+ - TLB Update (update a CPUTLBEntry, via tlb_set_page_with_attrs)
+ - This is a per-vCPU table - by definition can't race
+ - updated by its own thread when the slow-path is forced
+
+(Current solution)
+
+We have updated cputlb.c to defer operations when a cross-vCPU
+operation with async_run_on_cpu() which ensures each vCPU sees a
+coherent state when it next runs its work (in a few instructions
+time).
+
+A new set up operations (tlb_flush_*_all_cpus) take an additional flag
+which when set will force synchronisation by setting the source vCPUs
+work as "safe work" and exiting the cpu run loop. This ensure by the
+time execution restarts all flush operations have completed.
+
+TLB flag updates are all done atomically and are also protected by the
+tb_lock() which is used by the functions that update the TLB in bulk.
+
+(Known limitation)
+
+Not really a limitation but the wait mechanism is overly strict for
+some architectures which only need flushes completed by a barrier
+instruction. This could be a future optimisation.
+
+Emulated hardware state
+-----------------------
+
+Currently thanks to KVM work any access to IO memory is automatically
+protected by the global iothread mutex, also known as the BQL (Big
+Qemu Lock). Any IO region that doesn't use global mutex is expected to
+do its own locking.
+
+However IO memory isn't the only way emulated hardware state can be
+modified. Some architectures have model specific registers that
+trigger hardware emulation features. Generally any translation helper
+that needs to update more than a single vCPUs of state should take the
+BQL.
+
+As the BQL, or global iothread mutex is shared across the system we
+push the use of the lock as far down into the TCG code as possible to
+minimise contention.
+
+(Current solution)
+
+MMIO access automatically serialises hardware emulation by way of the
+BQL. Currently ARM targets serialise all ARM_CP_IO register accesses
+and also defer the reset/startup of vCPUs to the vCPU context by way
+of async_run_on_cpu().
+
+Updates to interrupt state are also protected by the BQL as they can
+often be cross vCPU.
+
+Memory Consistency
+==================
+
+Between emulated guests and host systems there are a range of memory
+consistency models. Even emulating weakly ordered systems on strongly
+ordered hosts needs to ensure things like store-after-load re-ordering
+can be prevented when the guest wants to.
+
+Memory Barriers
+---------------
+
+Barriers (sometimes known as fences) provide a mechanism for software
+to enforce a particular ordering of memory operations from the point
+of view of external observers (e.g. another processor core). They can
+apply to any memory operations as well as just loads or stores.
+
+The Linux kernel has an excellent write-up on the various forms of
+memory barrier and the guarantees they can provide [1].
+
+Barriers are often wrapped around synchronisation primitives to
+provide explicit memory ordering semantics. However they can be used
+by themselves to provide safe lockless access by ensuring for example
+a change to a signal flag will only be visible once the changes to
+payload are.
+
+DESIGN REQUIREMENT: Add a new tcg_memory_barrier op
+
+This would enforce a strong load/store ordering so all loads/stores
+complete at the memory barrier. On single-core non-SMP strongly
+ordered backends this could become a NOP.
+
+Aside from explicit standalone memory barrier instructions there are
+also implicit memory ordering semantics which comes with each guest
+memory access instruction. For example all x86 load/stores come with
+fairly strong guarantees of sequential consistency where as ARM has
+special variants of load/store instructions that imply acquire/release
+semantics.
+
+In the case of a strongly ordered guest architecture being emulated on
+a weakly ordered host the scope for a heavy performance impact is
+quite high.
+
+DESIGN REQUIREMENTS: Be efficient with use of memory barriers
+ - host systems with stronger implied guarantees can skip some barriers
+ - merge consecutive barriers to the strongest one
+
+(Current solution)
+
+The system currently has a tcg_gen_mb() which will add memory barrier
+operations if code generation is being done in a parallel context. The
+tcg_optimize() function attempts to merge barriers up to their
+strongest form before any load/store operations. The solution was
+originally developed and tested for linux-user based systems. All
+backends have been converted to emit fences when required. So far the
+following front-ends have been updated to emit fences when required:
+
+ - target-i386
+ - target-arm
+ - target-aarch64
+ - target-alpha
+ - target-mips
+
+Memory Control and Maintenance
+------------------------------
+
+This includes a class of instructions for controlling system cache
+behaviour. While QEMU doesn't model cache behaviour these instructions
+are often seen when code modification has taken place to ensure the
+changes take effect.
+
+Synchronisation Primitives
+--------------------------
+
+There are two broad types of synchronisation primitives found in
+modern ISAs: atomic instructions and exclusive regions.
+
+The first type offer a simple atomic instruction which will guarantee
+some sort of test and conditional store will be truly atomic w.r.t.
+other cores sharing access to the memory. The classic example is the
+x86 cmpxchg instruction.
+
+The second type offer a pair of load/store instructions which offer a
+guarantee that an region of memory has not been touched between the
+load and store instructions. An example of this is ARM's ldrex/strex
+pair where the strex instruction will return a flag indicating a
+successful store only if no other CPU has accessed the memory region
+since the ldrex.
+
+Traditionally TCG has generated a series of operations that work
+because they are within the context of a single translation block so
+will have completed before another CPU is scheduled. However with
+the ability to have multiple threads running to emulate multiple CPUs
+we will need to explicitly expose these semantics.
+
+DESIGN REQUIREMENTS:
+ - Support classic atomic instructions
+ - Support load/store exclusive (or load link/store conditional) pairs
+ - Generic enough infrastructure to support all guest architectures
+CURRENT OPEN QUESTIONS:
+ - How problematic is the ABA problem in general?
+
+(Current solution)
+
+The TCG provides a number of atomic helpers (tcg_gen_atomic_*) which
+can be used directly or combined to emulate other instructions like
+ARM's ldrex/strex instructions. While they are susceptible to the ABA
+problem so far common guests have not implemented patterns where
+this may be a problem - typically presenting a locking ABI which
+assumes cmpxchg like semantics.
+
+The code also includes a fall-back for cases where multi-threaded TCG
+ops can't work (e.g. guest atomic width > host atomic width). In this
+case an EXCP_ATOMIC exit occurs and the instruction is emulated with
+an exclusive lock which ensures all emulation is serialised.
+
+While the atomic helpers look good enough for now there may be a need
+to look at solutions that can more closely model the guest
+architectures semantics.
+
+==========
+
+[1] https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/plain/Documentation/memory-barriers.txt
diff --git a/docs/devel/multiple-iothreads.txt b/docs/devel/multiple-iothreads.txt
new file mode 100644
index 0000000000..e4d340bbb7
--- /dev/null
+++ b/docs/devel/multiple-iothreads.txt
@@ -0,0 +1,137 @@
+Copyright (c) 2014 Red Hat Inc.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later. See
+the COPYING file in the top-level directory.
+
+
+This document explains the IOThread feature and how to write code that runs
+outside the QEMU global mutex.
+
+The main loop and IOThreads
+---------------------------
+QEMU is an event-driven program that can do several things at once using an
+event loop. The VNC server and the QMP monitor are both processed from the
+same event loop, which monitors their file descriptors until they become
+readable and then invokes a callback.
+
+The default event loop is called the main loop (see main-loop.c). It is
+possible to create additional event loop threads using -object
+iothread,id=my-iothread.
+
+Side note: The main loop and IOThread are both event loops but their code is
+not shared completely. Sometimes it is useful to remember that although they
+are conceptually similar they are currently not interchangeable.
+
+Why IOThreads are useful
+------------------------
+IOThreads allow the user to control the placement of work. The main loop is a
+scalability bottleneck on hosts with many CPUs. Work can be spread across
+several IOThreads instead of just one main loop. When set up correctly this
+can improve I/O latency and reduce jitter seen by the guest.
+
+The main loop is also deeply associated with the QEMU global mutex, which is a
+scalability bottleneck in itself. vCPU threads and the main loop use the QEMU
+global mutex to serialize execution of QEMU code. This mutex is necessary
+because a lot of QEMU's code historically was not thread-safe.
+
+The fact that all I/O processing is done in a single main loop and that the
+QEMU global mutex is contended by all vCPU threads and the main loop explain
+why it is desirable to place work into IOThreads.
+
+The experimental virtio-blk data-plane implementation has been benchmarked and
+shows these effects:
+ftp://public.dhe.ibm.com/linux/pdfs/KVM_Virtualized_IO_Performance_Paper.pdf
+
+How to program for IOThreads
+----------------------------
+The main difference between legacy code and new code that can run in an
+IOThread is dealing explicitly with the event loop object, AioContext
+(see include/block/aio.h). Code that only works in the main loop
+implicitly uses the main loop's AioContext. Code that supports running
+in IOThreads must be aware of its AioContext.
+
+AioContext supports the following services:
+ * File descriptor monitoring (read/write/error on POSIX hosts)
+ * Event notifiers (inter-thread signalling)
+ * Timers
+ * Bottom Halves (BH) deferred callbacks
+
+There are several old APIs that use the main loop AioContext:
+ * LEGACY qemu_aio_set_fd_handler() - monitor a file descriptor
+ * LEGACY qemu_aio_set_event_notifier() - monitor an event notifier
+ * LEGACY timer_new_ms() - create a timer
+ * LEGACY qemu_bh_new() - create a BH
+ * LEGACY qemu_aio_wait() - run an event loop iteration
+
+Since they implicitly work on the main loop they cannot be used in code that
+runs in an IOThread. They might cause a crash or deadlock if called from an
+IOThread since the QEMU global mutex is not held.
+
+Instead, use the AioContext functions directly (see include/block/aio.h):
+ * aio_set_fd_handler() - monitor a file descriptor
+ * aio_set_event_notifier() - monitor an event notifier
+ * aio_timer_new() - create a timer
+ * aio_bh_new() - create a BH
+ * aio_poll() - run an event loop iteration
+
+The AioContext can be obtained from the IOThread using
+iothread_get_aio_context() or for the main loop using qemu_get_aio_context().
+Code that takes an AioContext argument works both in IOThreads or the main
+loop, depending on which AioContext instance the caller passes in.
+
+How to synchronize with an IOThread
+-----------------------------------
+AioContext is not thread-safe so some rules must be followed when using file
+descriptors, event notifiers, timers, or BHs across threads:
+
+1. AioContext functions can always be called safely. They handle their
+own locking internally.
+
+2. Other threads wishing to access the AioContext must use
+aio_context_acquire()/aio_context_release() for mutual exclusion. Once the
+context is acquired no other thread can access it or run event loop iterations
+in this AioContext.
+
+aio_context_acquire()/aio_context_release() calls may be nested. This
+means you can call them if you're not sure whether #2 applies.
+
+There is currently no lock ordering rule if a thread needs to acquire multiple
+AioContexts simultaneously. Therefore, it is only safe for code holding the
+QEMU global mutex to acquire other AioContexts.
+
+Side note: the best way to schedule a function call across threads is to call
+aio_bh_schedule_oneshot(). No acquire/release or locking is needed.
+
+AioContext and the block layer
+------------------------------
+The AioContext originates from the QEMU block layer, even though nowadays
+AioContext is a generic event loop that can be used by any QEMU subsystem.
+
+The block layer has support for AioContext integrated. Each BlockDriverState
+is associated with an AioContext using bdrv_set_aio_context() and
+bdrv_get_aio_context(). This allows block layer code to process I/O inside the
+right AioContext. Other subsystems may wish to follow a similar approach.
+
+Block layer code must therefore expect to run in an IOThread and avoid using
+old APIs that implicitly use the main loop. See the "How to program for
+IOThreads" above for information on how to do that.
+
+If main loop code such as a QMP function wishes to access a BlockDriverState
+it must first call aio_context_acquire(bdrv_get_aio_context(bs)) to ensure
+that callbacks in the IOThread do not run in parallel.
+
+Code running in the monitor typically needs to ensure that past
+requests from the guest are completed. When a block device is running
+in an IOThread, the IOThread can also process requests from the guest
+(via ioeventfd). To achieve both objects, wrap the code between
+bdrv_drained_begin() and bdrv_drained_end(), thus creating a "drained
+section". The functions must be called between aio_context_acquire()
+and aio_context_release(). You can freely release and re-acquire the
+AioContext within a drained section.
+
+Long-running jobs (usually in the form of coroutines) are best scheduled in
+the BlockDriverState's AioContext to avoid the need to acquire/release around
+each bdrv_*() call. The functions bdrv_add/remove_aio_context_notifier,
+or alternatively blk_add/remove_aio_context_notifier if you use BlockBackends,
+can be used to get a notification whenever bdrv_set_aio_context() moves a
+BlockDriverState to a different AioContext.
diff --git a/docs/devel/qapi-code-gen.txt b/docs/devel/qapi-code-gen.txt
new file mode 100644
index 0000000000..52e3874efe
--- /dev/null
+++ b/docs/devel/qapi-code-gen.txt
@@ -0,0 +1,1310 @@
+= How to use the QAPI code generator =
+
+Copyright IBM Corp. 2011
+Copyright (C) 2012-2016 Red Hat, Inc.
+
+This work is licensed under the terms of the GNU GPL, version 2 or
+later. See the COPYING file in the top-level directory.
+
+== Introduction ==
+
+QAPI is a native C API within QEMU which provides management-level
+functionality to internal and external users. For external
+users/processes, this interface is made available by a JSON-based wire
+format for the QEMU Monitor Protocol (QMP) for controlling qemu, as
+well as the QEMU Guest Agent (QGA) for communicating with the guest.
+The remainder of this document uses "Client JSON Protocol" when
+referring to the wire contents of a QMP or QGA connection.
+
+To map Client JSON Protocol interfaces to the native C QAPI
+implementations, a JSON-based schema is used to define types and
+function signatures, and a set of scripts is used to generate types,
+signatures, and marshaling/dispatch code. This document will describe
+how the schemas, scripts, and resulting code are used.
+
+
+== QMP/Guest agent schema ==
+
+A QAPI schema file is designed to be loosely based on JSON
+(http://www.ietf.org/rfc/rfc7159.txt) with changes for quoting style
+and the use of comments; a QAPI schema file is then parsed by a python
+code generation program. A valid QAPI schema consists of a series of
+top-level expressions, with no commas between them. Where
+dictionaries (JSON objects) are used, they are parsed as python
+OrderedDicts so that ordering is preserved (for predictable layout of
+generated C structs and parameter lists). Ordering doesn't matter
+between top-level expressions or the keys within an expression, but
+does matter within dictionary values for 'data' and 'returns' members
+of a single expression. QAPI schema input is written using 'single
+quotes' instead of JSON's "double quotes" (in contrast, Client JSON
+Protocol uses no comments, and while input accepts 'single quotes' as
+an extension, output is strict JSON using only "double quotes"). As
+in JSON, trailing commas are not permitted in arrays or dictionaries.
+Input must be ASCII (although QMP supports full Unicode strings, the
+QAPI parser does not). At present, there is no place where a QAPI
+schema requires the use of JSON numbers or null.
+
+
+=== Comments ===
+
+Comments are allowed; anything between an unquoted # and the following
+newline is ignored.
+
+A multi-line comment that starts and ends with a '##' line is a
+documentation comment. These are parsed by the documentation
+generator, which recognizes certain markup detailed below.
+
+
+==== Documentation markup ====
+
+Comment text starting with '=' is a section title:
+
+ # = Section title
+
+Double the '=' for a subsection title:
+
+ # == Subection title
+
+'|' denotes examples:
+
+ # | Text of the example, may span
+ # | multiple lines
+
+'*' starts an itemized list:
+
+ # * First item, may span
+ # multiple lines
+ # * Second item
+
+You can also use '-' instead of '*'.
+
+A decimal number followed by '.' starts a numbered list:
+
+ # 1. First item, may span
+ # multiple lines
+ # 2. Second item
+
+The actual number doesn't matter. You could even use '*' instead of
+'2.' for the second item.
+
+Lists can't be nested. Blank lines are currently not supported within
+lists.
+
+Additional whitespace between the initial '#' and the comment text is
+permitted.
+
+*foo* and _foo_ are for strong and emphasis styles respectively (they
+do not work over multiple lines). @foo is used to reference a name in
+the schema.
+
+Example:
+
+##
+# = Section
+# == Subsection
+#
+# Some text foo with *strong* and _emphasis_
+# 1. with a list
+# 2. like that
+#
+# And some code:
+# | $ echo foo
+# | -> do this
+# | <- get that
+#
+##
+
+
+==== Expression documentation ====
+
+Each expression that isn't an include directive may be preceded by a
+documentation block. Such blocks are called expression documentation
+blocks.
+
+When documentation is required (see pragma 'doc-required'), expression
+documentation blocks are mandatory.
+
+The documentation block consists of a first line naming the
+expression, an optional overview, a description of each argument (for
+commands and events) or member (for structs, unions and alternates),
+and optional tagged sections.
+
+FIXME: the parser accepts these things in almost any order.
+
+Extensions added after the expression was first released carry a
+'(since x.y.z)' comment.
+
+A tagged section starts with one of the following words:
+"Note:"/"Notes:", "Since:", "Example"/"Examples", "Returns:", "TODO:".
+The section ends with the start of a new section.
+
+A 'Since: x.y.z' tagged section lists the release that introduced the
+expression.
+
+For example:
+
+##
+# @BlockStats:
+#
+# Statistics of a virtual block device or a block backing device.
+#
+# @device: If the stats are for a virtual block device, the name
+# corresponding to the virtual block device.
+#
+# @node-name: The node name of the device. (since 2.3)
+#
+# ... more members ...
+#
+# Since: 0.14.0
+##
+{ 'struct': 'BlockStats',
+ 'data': {'*device': 'str', '*node-name': 'str',
+ ... more members ... } }
+
+##
+# @query-blockstats:
+#
+# Query the @BlockStats for all virtual block devices.
+#
+# @query-nodes: If true, the command will query all the
+# block nodes ... explain, explain ... (since 2.3)
+#
+# Returns: A list of @BlockStats for each virtual block devices.
+#
+# Since: 0.14.0
+#
+# Example:
+#
+# -> { "execute": "query-blockstats" }
+# <- {
+# ... lots of output ...
+# }
+#
+##
+{ 'command': 'query-blockstats',
+ 'data': { '*query-nodes': 'bool' },
+ 'returns': ['BlockStats'] }
+
+==== Free-form documentation ====
+
+A documentation block that isn't an expression documentation block is
+a free-form documentation block. These may be used to provide
+additional text and structuring content.
+
+
+=== Schema overview ===
+
+The schema sets up a series of types, as well as commands and events
+that will use those types. Forward references are allowed: the parser
+scans in two passes, where the first pass learns all type names, and
+the second validates the schema and generates the code. This allows
+the definition of complex structs that can have mutually recursive
+types, and allows for indefinite nesting of Client JSON Protocol that
+satisfies the schema. A type name should not be defined more than
+once. It is permissible for the schema to contain additional types
+not used by any commands or events in the Client JSON Protocol, for
+the side effect of generated C code used internally.
+
+There are eight top-level expressions recognized by the parser:
+'include', 'pragma', 'command', 'struct', 'enum', 'union',
+'alternate', and 'event'. There are several groups of types: simple
+types (a number of built-in types, such as 'int' and 'str'; as well as
+enumerations), complex types (structs and two flavors of unions), and
+alternate types (a choice between other types). The 'command' and
+'event' expressions can refer to existing types by name, or list an
+anonymous type as a dictionary. Listing a type name inside an array
+refers to a single-dimension array of that type; multi-dimension
+arrays are not directly supported (although an array of a complex
+struct that contains an array member is possible).
+
+All names must begin with a letter, and contain only ASCII letters,
+digits, hyphen, and underscore. There are two exceptions: enum values
+may start with a digit, and names that are downstream extensions (see
+section Downstream extensions) start with underscore.
+
+Names beginning with 'q_' are reserved for the generator, which uses
+them for munging QMP names that resemble C keywords or other
+problematic strings. For example, a member named "default" in qapi
+becomes "q_default" in the generated C code.
+
+Types, commands, and events share a common namespace. Therefore,
+generally speaking, type definitions should always use CamelCase for
+user-defined type names, while built-in types are lowercase.
+
+Type names ending with 'Kind' or 'List' are reserved for the
+generator, which uses them for implicit union enums and array types,
+respectively.
+
+Command names, and member names within a type, should be all lower
+case with words separated by a hyphen. However, some existing older
+commands and complex types use underscore; when extending such
+expressions, consistency is preferred over blindly avoiding
+underscore.
+
+Event names should be ALL_CAPS with words separated by underscore.
+
+Member names starting with 'has-' or 'has_' are reserved for the
+generator, which uses them for tracking optional members.
+
+Any name (command, event, type, member, or enum value) beginning with
+"x-" is marked experimental, and may be withdrawn or changed
+incompatibly in a future release.
+
+Pragma 'name-case-whitelist' lets you violate the rules on use of
+upper and lower case. Use for new code is strongly discouraged.
+
+In the rest of this document, usage lines are given for each
+expression type, with literal strings written in lower case and
+placeholders written in capitals. If a literal string includes a
+prefix of '*', that key/value pair can be omitted from the expression.
+For example, a usage statement that includes '*base':STRUCT-NAME
+means that an expression has an optional key 'base', which if present
+must have a value that forms a struct name.
+
+
+=== Built-in Types ===
+
+The following types are predefined, and map to C as follows:
+
+ Schema C JSON
+ str char * any JSON string, UTF-8
+ number double any JSON number
+ int int64_t a JSON number without fractional part
+ that fits into the C integer type
+ int8 int8_t likewise
+ int16 int16_t likewise
+ int32 int32_t likewise
+ int64 int64_t likewise
+ uint8 uint8_t likewise
+ uint16 uint16_t likewise
+ uint32 uint32_t likewise
+ uint64 uint64_t likewise
+ size uint64_t like uint64_t, except StringInputVisitor
+ accepts size suffixes
+ bool bool JSON true or false
+ any QObject * any JSON value
+ QType QType JSON string matching enum QType values
+
+
+=== Include directives ===
+
+Usage: { 'include': STRING }
+
+The QAPI schema definitions can be modularized using the 'include' directive:
+
+ { 'include': 'path/to/file.json' }
+
+The directive is evaluated recursively, and include paths are relative to the
+file using the directive. Multiple includes of the same file are
+idempotent. No other keys should appear in the expression, and the include
+value should be a string.
+
+As a matter of style, it is a good idea to have all files be
+self-contained, but at the moment, nothing prevents an included file
+from making a forward reference to a type that is only introduced by
+an outer file. The parser may be made stricter in the future to
+prevent incomplete include files.
+
+
+=== Pragma directives ===
+
+Usage: { 'pragma': DICT }
+
+The pragma directive lets you control optional generator behavior.
+The dictionary's entries are pragma names and values.
+
+Pragma's scope is currently the complete schema. Setting the same
+pragma to different values in parts of the schema doesn't work.
+
+Pragma 'doc-required' takes a boolean value. If true, documentation
+is required. Default is false.
+
+Pragma 'returns-whitelist' takes a list of command names that may
+violate the rules on permitted return types. Default is none.
+
+Pragma 'name-case-whitelist' takes a list of names that may violate
+rules on use of upper- vs. lower-case letters. Default is none.
+
+
+=== Struct types ===
+
+Usage: { 'struct': STRING, 'data': DICT, '*base': STRUCT-NAME }
+
+A struct is a dictionary containing a single 'data' key whose value is
+a dictionary; the dictionary may be empty. This corresponds to a
+struct in C or an Object in JSON. Each value of the 'data' dictionary
+must be the name of a type, or a one-element array containing a type
+name. An example of a struct is:
+
+ { 'struct': 'MyType',
+ 'data': { 'member1': 'str', 'member2': 'int', '*member3': 'str' } }
+
+The use of '*' as a prefix to the name means the member is optional in
+the corresponding JSON protocol usage.
+
+The default initialization value of an optional argument should not be changed
+between versions of QEMU unless the new default maintains backward
+compatibility to the user-visible behavior of the old default.
+
+With proper documentation, this policy still allows some flexibility; for
+example, documenting that a default of 0 picks an optimal buffer size allows
+one release to declare the optimal size at 512 while another release declares
+the optimal size at 4096 - the user-visible behavior is not the bytes used by
+the buffer, but the fact that the buffer was optimal size.
+
+On input structures (only mentioned in the 'data' side of a command), changing
+from mandatory to optional is safe (older clients will supply the option, and
+newer clients can benefit from the default); changing from optional to
+mandatory is backwards incompatible (older clients may be omitting the option,
+and must continue to work).
+
+On output structures (only mentioned in the 'returns' side of a command),
+changing from mandatory to optional is in general unsafe (older clients may be
+expecting the member, and could crash if it is missing), although it
+can be done if the only way that the optional argument will be omitted
+is when it is triggered by the presence of a new input flag to the
+command that older clients don't know to send. Changing from optional
+to mandatory is safe.
+
+A structure that is used in both input and output of various commands
+must consider the backwards compatibility constraints of both directions
+of use.
+
+A struct definition can specify another struct as its base.
+In this case, the members of the base type are included as top-level members
+of the new struct's dictionary in the Client JSON Protocol wire
+format. An example definition is:
+
+ { 'struct': 'BlockdevOptionsGenericFormat', 'data': { 'file': 'str' } }
+ { 'struct': 'BlockdevOptionsGenericCOWFormat',
+ 'base': 'BlockdevOptionsGenericFormat',
+ 'data': { '*backing': 'str' } }
+
+An example BlockdevOptionsGenericCOWFormat object on the wire could use
+both members like this:
+
+ { "file": "/some/place/my-image",
+ "backing": "/some/place/my-backing-file" }
+
+
+=== Enumeration types ===
+
+Usage: { 'enum': STRING, 'data': ARRAY-OF-STRING }
+ { 'enum': STRING, '*prefix': STRING, 'data': ARRAY-OF-STRING }
+
+An enumeration type is a dictionary containing a single 'data' key
+whose value is a list of strings. An example enumeration is:
+
+ { 'enum': 'MyEnum', 'data': [ 'value1', 'value2', 'value3' ] }
+
+Nothing prevents an empty enumeration, although it is probably not
+useful. The list of strings should be lower case; if an enum name
+represents multiple words, use '-' between words. The string 'max' is
+not allowed as an enum value, and values should not be repeated.
+
+The enum constants will be named by using a heuristic to turn the
+type name into a set of underscore separated words. For the example
+above, 'MyEnum' will turn into 'MY_ENUM' giving a constant name
+of 'MY_ENUM_VALUE1' for the first value. If the default heuristic
+does not result in a desirable name, the optional 'prefix' member
+can be used when defining the enum.
+
+The enumeration values are passed as strings over the Client JSON
+Protocol, but are encoded as C enum integral values in generated code.
+While the C code starts numbering at 0, it is better to use explicit
+comparisons to enum values than implicit comparisons to 0; the C code
+will also include a generated enum member ending in _MAX for tracking
+the size of the enum, useful when using common functions for
+converting between strings and enum values. Since the wire format
+always passes by name, it is acceptable to reorder or add new
+enumeration members in any location without breaking clients of Client
+JSON Protocol; however, removing enum values would break
+compatibility. For any struct that has a member that will only contain
+a finite set of string values, using an enum type for that member is
+better than open-coding the member to be type 'str'.
+
+
+=== Union types ===
+
+Usage: { 'union': STRING, 'data': DICT }
+or: { 'union': STRING, 'data': DICT, 'base': STRUCT-NAME-OR-DICT,
+ 'discriminator': ENUM-MEMBER-OF-BASE }
+
+Union types are used to let the user choose between several different
+variants for an object. There are two flavors: simple (no
+discriminator or base), and flat (both discriminator and base). A union
+type is defined using a data dictionary as explained in the following
+paragraphs. The data dictionary for either type of union must not
+be empty.
+
+A simple union type defines a mapping from automatic discriminator
+values to data types like in this example:
+
+ { 'struct': 'BlockdevOptionsFile', 'data': { 'filename': 'str' } }
+ { 'struct': 'BlockdevOptionsQcow2',
+ 'data': { 'backing': 'str', '*lazy-refcounts': 'bool' } }
+
+ { 'union': 'BlockdevOptionsSimple',
+ 'data': { 'file': 'BlockdevOptionsFile',
+ 'qcow2': 'BlockdevOptionsQcow2' } }
+
+In the Client JSON Protocol, a simple union is represented by a
+dictionary that contains the 'type' member as a discriminator, and a
+'data' member that is of the specified data type corresponding to the
+discriminator value, as in these examples:
+
+ { "type": "file", "data": { "filename": "/some/place/my-image" } }
+ { "type": "qcow2", "data": { "backing": "/some/place/my-image",
+ "lazy-refcounts": true } }
+
+The generated C code uses a struct containing a union. Additionally,
+an implicit C enum 'NameKind' is created, corresponding to the union
+'Name', for accessing the various branches of the union. No branch of
+the union can be named 'max', as this would collide with the implicit
+enum. The value for each branch can be of any type.
+
+A flat union definition avoids nesting on the wire, and specifies a
+set of common members that occur in all variants of the union. The
+'base' key must specify either a type name (the type must be a
+struct, not a union), or a dictionary representing an anonymous type.
+All branches of the union must be complex types, and the top-level
+members of the union dictionary on the wire will be combination of
+members from both the base type and the appropriate branch type (when
+merging two dictionaries, there must be no keys in common). The
+'discriminator' member must be the name of a non-optional enum-typed
+member of the base struct.
+
+The following example enhances the above simple union example by
+adding an optional common member 'read-only', renaming the
+discriminator to something more applicable than the simple union's
+default of 'type', and reducing the number of {} required on the wire:
+
+ { 'enum': 'BlockdevDriver', 'data': [ 'file', 'qcow2' ] }
+ { 'union': 'BlockdevOptions',
+ 'base': { 'driver': 'BlockdevDriver', '*read-only': 'bool' },
+ 'discriminator': 'driver',
+ 'data': { 'file': 'BlockdevOptionsFile',
+ 'qcow2': 'BlockdevOptionsQcow2' } }
+
+Resulting in these JSON objects:
+
+ { "driver": "file", "read-only": true,
+ "filename": "/some/place/my-image" }
+ { "driver": "qcow2", "read-only": false,
+ "backing": "/some/place/my-image", "lazy-refcounts": true }
+
+Notice that in a flat union, the discriminator name is controlled by
+the user, but because it must map to a base member with enum type, the
+code generator can ensure that branches exist for all values of the
+enum (although the order of the keys need not match the declaration of
+the enum). In the resulting generated C data types, a flat union is
+represented as a struct with the base members included directly, and
+then a union of structures for each branch of the struct.
+
+A simple union can always be re-written as a flat union where the base
+class has a single member named 'type', and where each branch of the
+union has a struct with a single member named 'data'. That is,
+
+ { 'union': 'Simple', 'data': { 'one': 'str', 'two': 'int' } }
+
+is identical on the wire to:
+
+ { 'enum': 'Enum', 'data': ['one', 'two'] }
+ { 'struct': 'Branch1', 'data': { 'data': 'str' } }
+ { 'struct': 'Branch2', 'data': { 'data': 'int' } }
+ { 'union': 'Flat': 'base': { 'type': 'Enum' }, 'discriminator': 'type',
+ 'data': { 'one': 'Branch1', 'two': 'Branch2' } }
+
+
+=== Alternate types ===
+
+Usage: { 'alternate': STRING, 'data': DICT }
+
+An alternate type is one that allows a choice between two or more JSON
+data types (string, integer, number, or object, but currently not
+array) on the wire. The definition is similar to a simple union type,
+where each branch of the union names a QAPI type. For example:
+
+ { 'alternate': 'BlockdevRef',
+ 'data': { 'definition': 'BlockdevOptions',
+ 'reference': 'str' } }
+
+Unlike a union, the discriminator string is never passed on the wire
+for the Client JSON Protocol. Instead, the value's JSON type serves
+as an implicit discriminator, which in turn means that an alternate
+can only express a choice between types represented differently in
+JSON. If a branch is typed as the 'bool' built-in, the alternate
+accepts true and false; if it is typed as any of the various numeric
+built-ins, it accepts a JSON number; if it is typed as a 'str'
+built-in or named enum type, it accepts a JSON string; and if it is
+typed as a complex type (struct or union), it accepts a JSON object.
+Two different complex types, for instance, aren't permitted, because
+both are represented as a JSON object.
+
+The example alternate declaration above allows using both of the
+following example objects:
+
+ { "file": "my_existing_block_device_id" }
+ { "file": { "driver": "file",
+ "read-only": false,
+ "filename": "/tmp/mydisk.qcow2" } }
+
+
+=== Commands ===
+
+Usage: { 'command': STRING, '*data': COMPLEX-TYPE-NAME-OR-DICT,
+ '*returns': TYPE-NAME, '*boxed': true,
+ '*gen': false, '*success-response': false }
+
+Commands are defined by using a dictionary containing several members,
+where three members are most common. The 'command' member is a
+mandatory string, and determines the "execute" value passed in a
+Client JSON Protocol command exchange.
+
+The 'data' argument maps to the "arguments" dictionary passed in as
+part of a Client JSON Protocol command. The 'data' member is optional
+and defaults to {} (an empty dictionary). If present, it must be the
+string name of a complex type, or a dictionary that declares an
+anonymous type with the same semantics as a 'struct' expression.
+
+The 'returns' member describes what will appear in the "return" member
+of a Client JSON Protocol reply on successful completion of a command.
+The member is optional from the command declaration; if absent, the
+"return" member will be an empty dictionary. If 'returns' is present,
+it must be the string name of a complex or built-in type, a
+one-element array containing the name of a complex or built-in type.
+To return anything else, you have to list the command in pragma
+'returns-whitelist'. If you do this, the command cannot be extended
+to return additional information in the future. Use of
+'returns-whitelist' for new commands is strongly discouraged.
+
+All commands in Client JSON Protocol use a dictionary to report
+failure, with no way to specify that in QAPI. Where the error return
+is different than the usual GenericError class in order to help the
+client react differently to certain error conditions, it is worth
+documenting this in the comments before the command declaration.
+
+Some example commands:
+
+ { 'command': 'my-first-command',
+ 'data': { 'arg1': 'str', '*arg2': 'str' } }
+ { 'struct': 'MyType', 'data': { '*value': 'str' } }
+ { 'command': 'my-second-command',
+ 'returns': [ 'MyType' ] }
+
+which would validate this Client JSON Protocol transaction:
+
+ => { "execute": "my-first-command",
+ "arguments": { "arg1": "hello" } }
+ <= { "return": { } }
+ => { "execute": "my-second-command" }
+ <= { "return": [ { "value": "one" }, { } ] }
+
+The generator emits a prototype for the user's function implementing
+the command. Normally, 'data' is a dictionary for an anonymous type,
+or names a struct type (possibly empty, but not a union), and its
+members are passed as separate arguments to this function. If the
+command definition includes a key 'boxed' with the boolean value true,
+then 'data' is instead the name of any non-empty complex type
+(struct, union, or alternate), and a pointer to that QAPI type is
+passed as a single argument.
+
+The generator also emits a marshalling function that extracts
+arguments for the user's function out of an input QDict, calls the
+user's function, and if it succeeded, builds an output QObject from
+its return value.
+
+In rare cases, QAPI cannot express a type-safe representation of a
+corresponding Client JSON Protocol command. You then have to suppress
+generation of a marshalling function by including a key 'gen' with
+boolean value false, and instead write your own function. Please try
+to avoid adding new commands that rely on this, and instead use
+type-safe unions. For an example of this usage:
+
+ { 'command': 'netdev_add',
+ 'data': {'type': 'str', 'id': 'str'},
+ 'gen': false }
+
+Normally, the QAPI schema is used to describe synchronous exchanges,
+where a response is expected. But in some cases, the action of a
+command is expected to change state in a way that a successful
+response is not possible (although the command will still return a
+normal dictionary error on failure). When a successful reply is not
+possible, the command expression should include the optional key
+'success-response' with boolean value false. So far, only QGA makes
+use of this member.
+
+
+=== Events ===
+
+Usage: { 'event': STRING, '*data': COMPLEX-TYPE-NAME-OR-DICT,
+ '*boxed': true }
+
+Events are defined with the keyword 'event'. It is not allowed to
+name an event 'MAX', since the generator also produces a C enumeration
+of all event names with a generated _MAX value at the end. When
+'data' is also specified, additional info will be included in the
+event, with similar semantics to a 'struct' expression. Finally there
+will be C API generated in qapi-event.h; when called by QEMU code, a
+message with timestamp will be emitted on the wire.
+
+An example event is:
+
+{ 'event': 'EVENT_C',
+ 'data': { '*a': 'int', 'b': 'str' } }
+
+Resulting in this JSON object:
+
+{ "event": "EVENT_C",
+ "data": { "b": "test string" },
+ "timestamp": { "seconds": 1267020223, "microseconds": 435656 } }
+
+The generator emits a function to send the event. Normally, 'data' is
+a dictionary for an anonymous type, or names a struct type (possibly
+empty, but not a union), and its members are passed as separate
+arguments to this function. If the event definition includes a key
+'boxed' with the boolean value true, then 'data' is instead the name of
+any non-empty complex type (struct, union, or alternate), and a
+pointer to that QAPI type is passed as a single argument.
+
+
+=== Downstream extensions ===
+
+QAPI schema names that are externally visible, say in the Client JSON
+Protocol, need to be managed with care. Names starting with a
+downstream prefix of the form __RFQDN_ are reserved for the downstream
+who controls the valid, reverse fully qualified domain name RFQDN.
+RFQDN may only contain ASCII letters, digits, hyphen and period.
+
+Example: Red Hat, Inc. controls redhat.com, and may therefore add a
+downstream command __com.redhat_drive-mirror.
+
+
+== Client JSON Protocol introspection ==
+
+Clients of a Client JSON Protocol commonly need to figure out what
+exactly the server (QEMU) supports.
+
+For this purpose, QMP provides introspection via command
+query-qmp-schema. QGA currently doesn't support introspection.
+
+While Client JSON Protocol wire compatibility should be maintained
+between qemu versions, we cannot make the same guarantees for
+introspection stability. For example, one version of qemu may provide
+a non-variant optional member of a struct, and a later version rework
+the member to instead be non-optional and associated with a variant.
+Likewise, one version of qemu may list a member with open-ended type
+'str', and a later version could convert it to a finite set of strings
+via an enum type; or a member may be converted from a specific type to
+an alternate that represents a choice between the original type and
+something else.
+
+query-qmp-schema returns a JSON array of SchemaInfo objects. These
+objects together describe the wire ABI, as defined in the QAPI schema.
+There is no specified order to the SchemaInfo objects returned; a
+client must search for a particular name throughout the entire array
+to learn more about that name, but is at least guaranteed that there
+will be no collisions between type, command, and event names.
+
+However, the SchemaInfo can't reflect all the rules and restrictions
+that apply to QMP. It's interface introspection (figuring out what's
+there), not interface specification. The specification is in the QAPI
+schema. To understand how QMP is to be used, you need to study the
+QAPI schema.
+
+Like any other command, query-qmp-schema is itself defined in the QAPI
+schema, along with the SchemaInfo type. This text attempts to give an
+overview how things work. For details you need to consult the QAPI
+schema.
+
+SchemaInfo objects have common members "name" and "meta-type", and
+additional variant members depending on the value of meta-type.
+
+Each SchemaInfo object describes a wire ABI entity of a certain
+meta-type: a command, event or one of several kinds of type.
+
+SchemaInfo for commands and events have the same name as in the QAPI
+schema.
+
+Command and event names are part of the wire ABI, but type names are
+not. Therefore, the SchemaInfo for types have auto-generated
+meaningless names. For readability, the examples in this section use
+meaningful type names instead.
+
+To examine a type, start with a command or event using it, then follow
+references by name.
+
+QAPI schema definitions not reachable that way are omitted.
+
+The SchemaInfo for a command has meta-type "command", and variant
+members "arg-type" and "ret-type". On the wire, the "arguments"
+member of a client's "execute" command must conform to the object type
+named by "arg-type". The "return" member that the server passes in a
+success response conforms to the type named by "ret-type".
+
+If the command takes no arguments, "arg-type" names an object type
+without members. Likewise, if the command returns nothing, "ret-type"
+names an object type without members.
+
+Example: the SchemaInfo for command query-qmp-schema
+
+ { "name": "query-qmp-schema", "meta-type": "command",
+ "arg-type": "q_empty", "ret-type": "SchemaInfoList" }
+
+ Type "q_empty" is an automatic object type without members, and type
+ "SchemaInfoList" is the array of SchemaInfo type.
+
+The SchemaInfo for an event has meta-type "event", and variant member
+"arg-type". On the wire, a "data" member that the server passes in an
+event conforms to the object type named by "arg-type".
+
+If the event carries no additional information, "arg-type" names an
+object type without members. The event may not have a data member on
+the wire then.
+
+Each command or event defined with dictionary-valued 'data' in the
+QAPI schema implicitly defines an object type.
+
+Example: the SchemaInfo for EVENT_C from section Events
+
+ { "name": "EVENT_C", "meta-type": "event",
+ "arg-type": "q_obj-EVENT_C-arg" }
+
+ Type "q_obj-EVENT_C-arg" is an implicitly defined object type with
+ the two members from the event's definition.
+
+The SchemaInfo for struct and union types has meta-type "object".
+
+The SchemaInfo for a struct type has variant member "members".
+
+The SchemaInfo for a union type additionally has variant members "tag"
+and "variants".
+
+"members" is a JSON array describing the object's common members, if
+any. Each element is a JSON object with members "name" (the member's
+name), "type" (the name of its type), and optionally "default". The
+member is optional if "default" is present. Currently, "default" can
+only have value null. Other values are reserved for future
+extensions. The "members" array is in no particular order; clients
+must search the entire object when learning whether a particular
+member is supported.
+
+Example: the SchemaInfo for MyType from section Struct types
+
+ { "name": "MyType", "meta-type": "object",
+ "members": [
+ { "name": "member1", "type": "str" },
+ { "name": "member2", "type": "int" },
+ { "name": "member3", "type": "str", "default": null } ] }
+
+"tag" is the name of the common member serving as type tag.
+"variants" is a JSON array describing the object's variant members.
+Each element is a JSON object with members "case" (the value of type
+tag this element applies to) and "type" (the name of an object type
+that provides the variant members for this type tag value). The
+"variants" array is in no particular order, and is not guaranteed to
+list cases in the same order as the corresponding "tag" enum type.
+
+Example: the SchemaInfo for flat union BlockdevOptions from section
+Union types
+
+ { "name": "BlockdevOptions", "meta-type": "object",
+ "members": [
+ { "name": "driver", "type": "BlockdevDriver" },
+ { "name": "read-only", "type": "bool", "default": null } ],
+ "tag": "driver",
+ "variants": [
+ { "case": "file", "type": "BlockdevOptionsFile" },
+ { "case": "qcow2", "type": "BlockdevOptionsQcow2" } ] }
+
+Note that base types are "flattened": its members are included in the
+"members" array.
+
+A simple union implicitly defines an enumeration type for its implicit
+discriminator (called "type" on the wire, see section Union types).
+
+A simple union implicitly defines an object type for each of its
+variants.
+
+Example: the SchemaInfo for simple union BlockdevOptionsSimple from section
+Union types
+
+ { "name": "BlockdevOptionsSimple", "meta-type": "object",
+ "members": [
+ { "name": "type", "type": "BlockdevOptionsSimpleKind" } ],
+ "tag": "type",
+ "variants": [
+ { "case": "file", "type": "q_obj-BlockdevOptionsFile-wrapper" },
+ { "case": "qcow2", "type": "q_obj-BlockdevOptionsQcow2-wrapper" } ] }
+
+ Enumeration type "BlockdevOptionsSimpleKind" and the object types
+ "q_obj-BlockdevOptionsFile-wrapper", "q_obj-BlockdevOptionsQcow2-wrapper"
+ are implicitly defined.
+
+The SchemaInfo for an alternate type has meta-type "alternate", and
+variant member "members". "members" is a JSON array. Each element is
+a JSON object with member "type", which names a type. Values of the
+alternate type conform to exactly one of its member types. There is
+no guarantee on the order in which "members" will be listed.
+
+Example: the SchemaInfo for BlockdevRef from section Alternate types
+
+ { "name": "BlockdevRef", "meta-type": "alternate",
+ "members": [
+ { "type": "BlockdevOptions" },
+ { "type": "str" } ] }
+
+The SchemaInfo for an array type has meta-type "array", and variant
+member "element-type", which names the array's element type. Array
+types are implicitly defined. For convenience, the array's name may
+resemble the element type; however, clients should examine member
+"element-type" instead of making assumptions based on parsing member
+"name".
+
+Example: the SchemaInfo for ['str']
+
+ { "name": "[str]", "meta-type": "array",
+ "element-type": "str" }
+
+The SchemaInfo for an enumeration type has meta-type "enum" and
+variant member "values". The values are listed in no particular
+order; clients must search the entire enum when learning whether a
+particular value is supported.
+
+Example: the SchemaInfo for MyEnum from section Enumeration types
+
+ { "name": "MyEnum", "meta-type": "enum",
+ "values": [ "value1", "value2", "value3" ] }
+
+The SchemaInfo for a built-in type has the same name as the type in
+the QAPI schema (see section Built-in Types), with one exception
+detailed below. It has variant member "json-type" that shows how
+values of this type are encoded on the wire.
+
+Example: the SchemaInfo for str
+
+ { "name": "str", "meta-type": "builtin", "json-type": "string" }
+
+The QAPI schema supports a number of integer types that only differ in
+how they map to C. They are identical as far as SchemaInfo is
+concerned. Therefore, they get all mapped to a single type "int" in
+SchemaInfo.
+
+As explained above, type names are not part of the wire ABI. Not even
+the names of built-in types. Clients should examine member
+"json-type" instead of hard-coding names of built-in types.
+
+
+== Code generation ==
+
+Schemas are fed into five scripts to generate all the code/files that,
+paired with the core QAPI libraries, comprise everything required to
+take JSON commands read in by a Client JSON Protocol server, unmarshal
+the arguments into the underlying C types, call into the corresponding
+C function, map the response back to a Client JSON Protocol response
+to be returned to the user, and introspect the commands.
+
+As an example, we'll use the following schema, which describes a
+single complex user-defined type, along with command which takes a
+list of that type as a parameter, and returns a single element of that
+type. The user is responsible for writing the implementation of
+qmp_my_command(); everything else is produced by the generator.
+
+ $ cat example-schema.json
+ { 'struct': 'UserDefOne',
+ 'data': { 'integer': 'int', '*string': 'str' } }
+
+ { 'command': 'my-command',
+ 'data': { 'arg1': ['UserDefOne'] },
+ 'returns': 'UserDefOne' }
+
+ { 'event': 'MY_EVENT' }
+
+For a more thorough look at generated code, the testsuite includes
+tests/qapi-schema/qapi-schema-tests.json that covers more examples of
+what the generator will accept, and compiles the resulting C code as
+part of 'make check-unit'.
+
+=== scripts/qapi-types.py ===
+
+Used to generate the C types defined by a schema, along with
+supporting code. The following files are created:
+
+$(prefix)qapi-types.h - C types corresponding to types defined in
+ the schema you pass in
+$(prefix)qapi-types.c - Cleanup functions for the above C types
+
+The $(prefix) is an optional parameter used as a namespace to keep the
+generated code from one schema/code-generation separated from others so code
+can be generated/used from multiple schemas without clobbering previously
+created code.
+
+Example:
+
+ $ python scripts/qapi-types.py --output-dir="qapi-generated" \
+ --prefix="example-" example-schema.json
+ $ cat qapi-generated/example-qapi-types.h
+[Uninteresting stuff omitted...]
+
+ #ifndef EXAMPLE_QAPI_TYPES_H
+ #define EXAMPLE_QAPI_TYPES_H
+
+[Built-in types omitted...]
+
+ typedef struct UserDefOne UserDefOne;
+
+ typedef struct UserDefOneList UserDefOneList;
+
+ struct UserDefOne {
+ int64_t integer;
+ bool has_string;
+ char *string;
+ };
+
+ void qapi_free_UserDefOne(UserDefOne *obj);
+
+ struct UserDefOneList {
+ UserDefOneList *next;
+ UserDefOne *value;
+ };
+
+ void qapi_free_UserDefOneList(UserDefOneList *obj);
+
+ #endif
+ $ cat qapi-generated/example-qapi-types.c
+[Uninteresting stuff omitted...]
+
+ void qapi_free_UserDefOne(UserDefOne *obj)
+ {
+ Visitor *v;
+
+ if (!obj) {
+ return;
+ }
+
+ v = qapi_dealloc_visitor_new();
+ visit_type_UserDefOne(v, NULL, &obj, NULL);
+ visit_free(v);
+ }
+
+ void qapi_free_UserDefOneList(UserDefOneList *obj)
+ {
+ Visitor *v;
+
+ if (!obj) {
+ return;
+ }
+
+ v = qapi_dealloc_visitor_new();
+ visit_type_UserDefOneList(v, NULL, &obj, NULL);
+ visit_free(v);
+ }
+
+=== scripts/qapi-visit.py ===
+
+Used to generate the visitor functions used to walk through and
+convert between a native QAPI C data structure and some other format
+(such as QObject); the generated functions are named visit_type_FOO()
+and visit_type_FOO_members().
+
+The following files are generated:
+
+$(prefix)qapi-visit.c: visitor function for a particular C type, used
+ to automagically convert QObjects into the
+ corresponding C type and vice-versa, as well
+ as for deallocating memory for an existing C
+ type
+
+$(prefix)qapi-visit.h: declarations for previously mentioned visitor
+ functions
+
+Example:
+
+ $ python scripts/qapi-visit.py --output-dir="qapi-generated"
+ --prefix="example-" example-schema.json
+ $ cat qapi-generated/example-qapi-visit.h
+[Uninteresting stuff omitted...]
+
+ #ifndef EXAMPLE_QAPI_VISIT_H
+ #define EXAMPLE_QAPI_VISIT_H
+
+[Visitors for built-in types omitted...]
+
+ void visit_type_UserDefOne_members(Visitor *v, UserDefOne *obj, Error **errp);
+ void visit_type_UserDefOne(Visitor *v, const char *name, UserDefOne **obj, Error **errp);
+ void visit_type_UserDefOneList(Visitor *v, const char *name, UserDefOneList **obj, Error **errp);
+
+ #endif
+ $ cat qapi-generated/example-qapi-visit.c
+[Uninteresting stuff omitted...]
+
+ void visit_type_UserDefOne_members(Visitor *v, UserDefOne *obj, Error **errp)
+ {
+ Error *err = NULL;
+
+ visit_type_int(v, "integer", &obj->integer, &err);
+ if (err) {
+ goto out;
+ }
+ if (visit_optional(v, "string", &obj->has_string)) {
+ visit_type_str(v, "string", &obj->string, &err);
+ if (err) {
+ goto out;
+ }
+ }
+
+ out:
+ error_propagate(errp, err);
+ }
+
+ void visit_type_UserDefOne(Visitor *v, const char *name, UserDefOne **obj, Error **errp)
+ {
+ Error *err = NULL;
+
+ visit_start_struct(v, name, (void **)obj, sizeof(UserDefOne), &err);
+ if (err) {
+ goto out;
+ }
+ if (!*obj) {
+ goto out_obj;
+ }
+ visit_type_UserDefOne_members(v, *obj, &err);
+ if (err) {
+ goto out_obj;
+ }
+ visit_check_struct(v, &err);
+ out_obj:
+ visit_end_struct(v, (void **)obj);
+ if (err && visit_is_input(v)) {
+ qapi_free_UserDefOne(*obj);
+ *obj = NULL;
+ }
+ out:
+ error_propagate(errp, err);
+ }
+
+ void visit_type_UserDefOneList(Visitor *v, const char *name, UserDefOneList **obj, Error **errp)
+ {
+ Error *err = NULL;
+ UserDefOneList *tail;
+ size_t size = sizeof(**obj);
+
+ visit_start_list(v, name, (GenericList **)obj, size, &err);
+ if (err) {
+ goto out;
+ }
+
+ for (tail = *obj; tail;
+ tail = (UserDefOneList *)visit_next_list(v, (GenericList *)tail, size)) {
+ visit_type_UserDefOne(v, NULL, &tail->value, &err);
+ if (err) {
+ break;
+ }
+ }
+
+ visit_end_list(v, (void **)obj);
+ if (err && visit_is_input(v)) {
+ qapi_free_UserDefOneList(*obj);
+ *obj = NULL;
+ }
+ out:
+ error_propagate(errp, err);
+ }
+
+=== scripts/qapi-commands.py ===
+
+Used to generate the marshaling/dispatch functions for the commands
+defined in the schema. The generated code implements
+qmp_marshal_COMMAND() (registered automatically), and declares
+qmp_COMMAND() that the user must implement. The following files are
+generated:
+
+$(prefix)qmp-marshal.c: command marshal/dispatch functions for each
+ QMP command defined in the schema. Functions
+ generated by qapi-visit.py are used to
+ convert QObjects received from the wire into
+ function parameters, and uses the same
+ visitor functions to convert native C return
+ values to QObjects from transmission back
+ over the wire.
+
+$(prefix)qmp-commands.h: Function prototypes for the QMP commands
+ specified in the schema.
+
+Example:
+
+ $ python scripts/qapi-commands.py --output-dir="qapi-generated"
+ --prefix="example-" example-schema.json
+ $ cat qapi-generated/example-qmp-commands.h
+[Uninteresting stuff omitted...]
+
+ #ifndef EXAMPLE_QMP_COMMANDS_H
+ #define EXAMPLE_QMP_COMMANDS_H
+
+ #include "example-qapi-types.h"
+ #include "qapi/qmp/qdict.h"
+ #include "qapi/error.h"
+
+ UserDefOne *qmp_my_command(UserDefOneList *arg1, Error **errp);
+
+ #endif
+ $ cat qapi-generated/example-qmp-marshal.c
+[Uninteresting stuff omitted...]
+
+ static void qmp_marshal_output_UserDefOne(UserDefOne *ret_in, QObject **ret_out, Error **errp)
+ {
+ Error *err = NULL;
+ Visitor *v;
+
+ v = qobject_output_visitor_new(ret_out);
+ visit_type_UserDefOne(v, "unused", &ret_in, &err);
+ if (!err) {
+ visit_complete(v, ret_out);
+ }
+ error_propagate(errp, err);
+ visit_free(v);
+ v = qapi_dealloc_visitor_new();
+ visit_type_UserDefOne(v, "unused", &ret_in, NULL);
+ visit_free(v);
+ }
+
+ static void qmp_marshal_my_command(QDict *args, QObject **ret, Error **errp)
+ {
+ Error *err = NULL;
+ UserDefOne *retval;
+ Visitor *v;
+ UserDefOneList *arg1 = NULL;
+
+ v = qobject_input_visitor_new(QOBJECT(args));
+ visit_start_struct(v, NULL, NULL, 0, &err);
+ if (err) {
+ goto out;
+ }
+ visit_type_UserDefOneList(v, "arg1", &arg1, &err);
+ if (!err) {
+ visit_check_struct(v, &err);
+ }
+ visit_end_struct(v, NULL);
+ if (err) {
+ goto out;
+ }
+
+ retval = qmp_my_command(arg1, &err);
+ if (err) {
+ goto out;
+ }
+
+ qmp_marshal_output_UserDefOne(retval, ret, &err);
+
+ out:
+ error_propagate(errp, err);
+ visit_free(v);
+ v = qapi_dealloc_visitor_new();
+ visit_start_struct(v, NULL, NULL, 0, NULL);
+ visit_type_UserDefOneList(v, "arg1", &arg1, NULL);
+ visit_end_struct(v, NULL);
+ visit_free(v);
+ }
+
+ static void qmp_init_marshal(void)
+ {
+ qmp_register_command("my-command", qmp_marshal_my_command, QCO_NO_OPTIONS);
+ }
+
+ qapi_init(qmp_init_marshal);
+
+=== scripts/qapi-event.py ===
+
+Used to generate the event-related C code defined by a schema, with
+implementations for qapi_event_send_FOO(). The following files are
+created:
+
+$(prefix)qapi-event.h - Function prototypes for each event type, plus an
+ enumeration of all event names
+$(prefix)qapi-event.c - Implementation of functions to send an event
+
+Example:
+
+ $ python scripts/qapi-event.py --output-dir="qapi-generated"
+ --prefix="example-" example-schema.json
+ $ cat qapi-generated/example-qapi-event.h
+[Uninteresting stuff omitted...]
+
+ #ifndef EXAMPLE_QAPI_EVENT_H
+ #define EXAMPLE_QAPI_EVENT_H
+
+ #include "qapi/error.h"
+ #include "qapi/qmp/qdict.h"
+ #include "example-qapi-types.h"
+
+
+ void qapi_event_send_my_event(Error **errp);
+
+ typedef enum example_QAPIEvent {
+ EXAMPLE_QAPI_EVENT_MY_EVENT = 0,
+ EXAMPLE_QAPI_EVENT__MAX = 1,
+ } example_QAPIEvent;
+
+ extern const char *const example_QAPIEvent_lookup[];
+
+ #endif
+ $ cat qapi-generated/example-qapi-event.c
+[Uninteresting stuff omitted...]
+
+ void qapi_event_send_my_event(Error **errp)
+ {
+ QDict *qmp;
+ Error *err = NULL;
+ QMPEventFuncEmit emit;
+ emit = qmp_event_get_func_emit();
+ if (!emit) {
+ return;
+ }
+
+ qmp = qmp_event_build_dict("MY_EVENT");
+
+ emit(EXAMPLE_QAPI_EVENT_MY_EVENT, qmp, &err);
+
+ error_propagate(errp, err);
+ QDECREF(qmp);
+ }
+
+ const char *const example_QAPIEvent_lookup[] = {
+ [EXAMPLE_QAPI_EVENT_MY_EVENT] = "MY_EVENT",
+ [EXAMPLE_QAPI_EVENT__MAX] = NULL,
+ };
+
+=== scripts/qapi-introspect.py ===
+
+Used to generate the introspection C code for a schema. The following
+files are created:
+
+$(prefix)qmp-introspect.c - Defines a string holding a JSON
+ description of the schema.
+$(prefix)qmp-introspect.h - Declares the above string.
+
+Example:
+
+ $ python scripts/qapi-introspect.py --output-dir="qapi-generated"
+ --prefix="example-" example-schema.json
+ $ cat qapi-generated/example-qmp-introspect.h
+[Uninteresting stuff omitted...]
+
+ #ifndef EXAMPLE_QMP_INTROSPECT_H
+ #define EXAMPLE_QMP_INTROSPECT_H
+
+ extern const char example_qmp_schema_json[];
+
+ #endif
+ $ cat qapi-generated/example-qmp-introspect.c
+[Uninteresting stuff omitted...]
+
+ const char example_qmp_schema_json[] = "["
+ "{\"arg-type\": \"0\", \"meta-type\": \"event\", \"name\": \"MY_EVENT\"}, "
+ "{\"arg-type\": \"1\", \"meta-type\": \"command\", \"name\": \"my-command\", \"ret-type\": \"2\"}, "
+ "{\"members\": [], \"meta-type\": \"object\", \"name\": \"0\"}, "
+ "{\"members\": [{\"name\": \"arg1\", \"type\": \"[2]\"}], \"meta-type\": \"object\", \"name\": \"1\"}, "
+ "{\"members\": [{\"name\": \"integer\", \"type\": \"int\"}, {\"default\": null, \"name\": \"string\", \"type\": \"str\"}], \"meta-type\": \"object\", \"name\": \"2\"}, "
+ "{\"element-type\": \"2\", \"meta-type\": \"array\", \"name\": \"[2]\"}, "
+ "{\"json-type\": \"int\", \"meta-type\": \"builtin\", \"name\": \"int\"}, "
+ "{\"json-type\": \"string\", \"meta-type\": \"builtin\", \"name\": \"str\"}]";
diff --git a/docs/devel/rcu.txt b/docs/devel/rcu.txt
new file mode 100644
index 0000000000..c84e7f42b2
--- /dev/null
+++ b/docs/devel/rcu.txt
@@ -0,0 +1,390 @@
+Using RCU (Read-Copy-Update) for synchronization
+================================================
+
+Read-copy update (RCU) is a synchronization mechanism that is used to
+protect read-mostly data structures. RCU is very efficient and scalable
+on the read side (it is wait-free), and thus can make the read paths
+extremely fast.
+
+RCU supports concurrency between a single writer and multiple readers,
+thus it is not used alone. Typically, the write-side will use a lock to
+serialize multiple updates, but other approaches are possible (e.g.,
+restricting updates to a single task). In QEMU, when a lock is used,
+this will often be the "iothread mutex", also known as the "big QEMU
+lock" (BQL). Also, restricting updates to a single task is done in
+QEMU using the "bottom half" API.
+
+RCU is fundamentally a "wait-to-finish" mechanism. The read side marks
+sections of code with "critical sections", and the update side will wait
+for the execution of all *currently running* critical sections before
+proceeding, or before asynchronously executing a callback.
+
+The key point here is that only the currently running critical sections
+are waited for; critical sections that are started _after_ the beginning
+of the wait do not extend the wait, despite running concurrently with
+the updater. This is the reason why RCU is more scalable than,
+for example, reader-writer locks. It is so much more scalable that
+the system will have a single instance of the RCU mechanism; a single
+mechanism can be used for an arbitrary number of "things", without
+having to worry about things such as contention or deadlocks.
+
+How is this possible? The basic idea is to split updates in two phases,
+"removal" and "reclamation". During removal, we ensure that subsequent
+readers will not be able to get a reference to the old data. After
+removal has completed, a critical section will not be able to access
+the old data. Therefore, critical sections that begin after removal
+do not matter; as soon as all previous critical sections have finished,
+there cannot be any readers who hold references to the data structure,
+and these can now be safely reclaimed (e.g., freed or unref'ed).
+
+Here is a picture:
+
+ thread 1 thread 2 thread 3
+ ------------------- ------------------------ -------------------
+ enter RCU crit.sec.
+ | finish removal phase
+ | begin wait
+ | | enter RCU crit.sec.
+ exit RCU crit.sec | |
+ complete wait |
+ begin reclamation phase |
+ exit RCU crit.sec.
+
+
+Note how thread 3 is still executing its critical section when thread 2
+starts reclaiming data. This is possible, because the old version of the
+data structure was not accessible at the time thread 3 began executing
+that critical section.
+
+
+RCU API
+=======
+
+The core RCU API is small:
+
+ void rcu_read_lock(void);
+
+ Used by a reader to inform the reclaimer that the reader is
+ entering an RCU read-side critical section.
+
+ void rcu_read_unlock(void);
+
+ Used by a reader to inform the reclaimer that the reader is
+ exiting an RCU read-side critical section. Note that RCU
+ read-side critical sections may be nested and/or overlapping.
+
+ void synchronize_rcu(void);
+
+ Blocks until all pre-existing RCU read-side critical sections
+ on all threads have completed. This marks the end of the removal
+ phase and the beginning of reclamation phase.
+
+ Note that it would be valid for another update to come while
+ synchronize_rcu is running. Because of this, it is better that
+ the updater releases any locks it may hold before calling
+ synchronize_rcu. If this is not possible (for example, because
+ the updater is protected by the BQL), you can use call_rcu.
+
+ void call_rcu1(struct rcu_head * head,
+ void (*func)(struct rcu_head *head));
+
+ This function invokes func(head) after all pre-existing RCU
+ read-side critical sections on all threads have completed. This
+ marks the end of the removal phase, with func taking care
+ asynchronously of the reclamation phase.
+
+ The foo struct needs to have an rcu_head structure added,
+ perhaps as follows:
+
+ struct foo {
+ struct rcu_head rcu;
+ int a;
+ char b;
+ long c;
+ };
+
+ so that the reclaimer function can fetch the struct foo address
+ and free it:
+
+ call_rcu1(&foo.rcu, foo_reclaim);
+
+ void foo_reclaim(struct rcu_head *rp)
+ {
+ struct foo *fp = container_of(rp, struct foo, rcu);
+ g_free(fp);
+ }
+
+ For the common case where the rcu_head member is the first of the
+ struct, you can use the following macro.
+
+ void call_rcu(T *p,
+ void (*func)(T *p),
+ field-name);
+ void g_free_rcu(T *p,
+ field-name);
+
+ call_rcu1 is typically used through these macro, in the common case
+ where the "struct rcu_head" is the first field in the struct. If
+ the callback function is g_free, in particular, g_free_rcu can be
+ used. In the above case, one could have written simply:
+
+ g_free_rcu(&foo, rcu);
+
+ typeof(*p) atomic_rcu_read(p);
+
+ atomic_rcu_read() is similar to atomic_mb_read(), but it makes
+ some assumptions on the code that calls it. This allows a more
+ optimized implementation.
+
+ atomic_rcu_read assumes that whenever a single RCU critical
+ section reads multiple shared data, these reads are either
+ data-dependent or need no ordering. This is almost always the
+ case when using RCU, because read-side critical sections typically
+ navigate one or more pointers (the pointers that are changed on
+ every update) until reaching a data structure of interest,
+ and then read from there.
+
+ RCU read-side critical sections must use atomic_rcu_read() to
+ read data, unless concurrent writes are prevented by another
+ synchronization mechanism.
+
+ Furthermore, RCU read-side critical sections should traverse the
+ data structure in a single direction, opposite to the direction
+ in which the updater initializes it.
+
+ void atomic_rcu_set(p, typeof(*p) v);
+
+ atomic_rcu_set() is also similar to atomic_mb_set(), and it also
+ makes assumptions on the code that calls it in order to allow a more
+ optimized implementation.
+
+ In particular, atomic_rcu_set() suffices for synchronization
+ with readers, if the updater never mutates a field within a
+ data item that is already accessible to readers. This is the
+ case when initializing a new copy of the RCU-protected data
+ structure; just ensure that initialization of *p is carried out
+ before atomic_rcu_set() makes the data item visible to readers.
+ If this rule is observed, writes will happen in the opposite
+ order as reads in the RCU read-side critical sections (or if
+ there is just one update), and there will be no need for other
+ synchronization mechanism to coordinate the accesses.
+
+The following APIs must be used before RCU is used in a thread:
+
+ void rcu_register_thread(void);
+
+ Mark a thread as taking part in the RCU mechanism. Such a thread
+ will have to report quiescent points regularly, either manually
+ or through the QemuCond/QemuSemaphore/QemuEvent APIs.
+
+ void rcu_unregister_thread(void);
+
+ Mark a thread as not taking part anymore in the RCU mechanism.
+ It is not a problem if such a thread reports quiescent points,
+ either manually or by using the QemuCond/QemuSemaphore/QemuEvent
+ APIs.
+
+Note that these APIs are relatively heavyweight, and should _not_ be
+nested.
+
+
+DIFFERENCES WITH LINUX
+======================
+
+- Waiting on a mutex is possible, though discouraged, within an RCU critical
+ section. This is because spinlocks are rarely (if ever) used in userspace
+ programming; not allowing this would prevent upgrading an RCU read-side
+ critical section to become an updater.
+
+- atomic_rcu_read and atomic_rcu_set replace rcu_dereference and
+ rcu_assign_pointer. They take a _pointer_ to the variable being accessed.
+
+- call_rcu is a macro that has an extra argument (the name of the first
+ field in the struct, which must be a struct rcu_head), and expects the
+ type of the callback's argument to be the type of the first argument.
+ call_rcu1 is the same as Linux's call_rcu.
+
+
+RCU PATTERNS
+============
+
+Many patterns using read-writer locks translate directly to RCU, with
+the advantages of higher scalability and deadlock immunity.
+
+In general, RCU can be used whenever it is possible to create a new
+"version" of a data structure every time the updater runs. This may
+sound like a very strict restriction, however:
+
+- the updater does not mean "everything that writes to a data structure",
+ but rather "everything that involves a reclamation step". See the
+ array example below
+
+- in some cases, creating a new version of a data structure may actually
+ be very cheap. For example, modifying the "next" pointer of a singly
+ linked list is effectively creating a new version of the list.
+
+Here are some frequently-used RCU idioms that are worth noting.
+
+
+RCU list processing
+-------------------
+
+TBD (not yet used in QEMU)
+
+
+RCU reference counting
+----------------------
+
+Because grace periods are not allowed to complete while there is an RCU
+read-side critical section in progress, the RCU read-side primitives
+may be used as a restricted reference-counting mechanism. For example,
+consider the following code fragment:
+
+ rcu_read_lock();
+ p = atomic_rcu_read(&foo);
+ /* do something with p. */
+ rcu_read_unlock();
+
+The RCU read-side critical section ensures that the value of "p" remains
+valid until after the rcu_read_unlock(). In some sense, it is acquiring
+a reference to p that is later released when the critical section ends.
+The write side looks simply like this (with appropriate locking):
+
+ qemu_mutex_lock(&foo_mutex);
+ old = foo;
+ atomic_rcu_set(&foo, new);
+ qemu_mutex_unlock(&foo_mutex);
+ synchronize_rcu();
+ free(old);
+
+If the processing cannot be done purely within the critical section, it
+is possible to combine this idiom with a "real" reference count:
+
+ rcu_read_lock();
+ p = atomic_rcu_read(&foo);
+ foo_ref(p);
+ rcu_read_unlock();
+ /* do something with p. */
+ foo_unref(p);
+
+The write side can be like this:
+
+ qemu_mutex_lock(&foo_mutex);
+ old = foo;
+ atomic_rcu_set(&foo, new);
+ qemu_mutex_unlock(&foo_mutex);
+ synchronize_rcu();
+ foo_unref(old);
+
+or with call_rcu:
+
+ qemu_mutex_lock(&foo_mutex);
+ old = foo;
+ atomic_rcu_set(&foo, new);
+ qemu_mutex_unlock(&foo_mutex);
+ call_rcu(foo_unref, old, rcu);
+
+In both cases, the write side only performs removal. Reclamation
+happens when the last reference to a "foo" object is dropped.
+Using synchronize_rcu() is undesirably expensive, because the
+last reference may be dropped on the read side. Hence you can
+use call_rcu() instead:
+
+ foo_unref(struct foo *p) {
+ if (atomic_fetch_dec(&p->refcount) == 1) {
+ call_rcu(foo_destroy, p, rcu);
+ }
+ }
+
+
+Note that the same idioms would be possible with reader/writer
+locks:
+
+ read_lock(&foo_rwlock); write_mutex_lock(&foo_rwlock);
+ p = foo; p = foo;
+ /* do something with p. */ foo = new;
+ read_unlock(&foo_rwlock); free(p);
+ write_mutex_unlock(&foo_rwlock);
+ free(p);
+
+ ------------------------------------------------------------------
+
+ read_lock(&foo_rwlock); write_mutex_lock(&foo_rwlock);
+ p = foo; old = foo;
+ foo_ref(p); foo = new;
+ read_unlock(&foo_rwlock); foo_unref(old);
+ /* do something with p. */ write_mutex_unlock(&foo_rwlock);
+ read_lock(&foo_rwlock);
+ foo_unref(p);
+ read_unlock(&foo_rwlock);
+
+foo_unref could use a mechanism such as bottom halves to move deallocation
+out of the write-side critical section.
+
+
+RCU resizable arrays
+--------------------
+
+Resizable arrays can be used with RCU. The expensive RCU synchronization
+(or call_rcu) only needs to take place when the array is resized.
+The two items to take care of are:
+
+- ensuring that the old version of the array is available between removal
+ and reclamation;
+
+- avoiding mismatches in the read side between the array data and the
+ array size.
+
+The first problem is avoided simply by not using realloc. Instead,
+each resize will allocate a new array and copy the old data into it.
+The second problem would arise if the size and the data pointers were
+two members of a larger struct:
+
+ struct mystuff {
+ ...
+ int data_size;
+ int data_alloc;
+ T *data;
+ ...
+ };
+
+Instead, we store the size of the array with the array itself:
+
+ struct arr {
+ int size;
+ int alloc;
+ T data[];
+ };
+ struct arr *global_array;
+
+ read side:
+ rcu_read_lock();
+ struct arr *array = atomic_rcu_read(&global_array);
+ x = i < array->size ? array->data[i] : -1;
+ rcu_read_unlock();
+ return x;
+
+ write side (running under a lock):
+ if (global_array->size == global_array->alloc) {
+ /* Creating a new version. */
+ new_array = g_malloc(sizeof(struct arr) +
+ global_array->alloc * 2 * sizeof(T));
+ new_array->size = global_array->size;
+ new_array->alloc = global_array->alloc * 2;
+ memcpy(new_array->data, global_array->data,
+ global_array->alloc * sizeof(T));
+
+ /* Removal phase. */
+ old_array = global_array;
+ atomic_rcu_set(&new_array->data, new_array);
+ synchronize_rcu();
+
+ /* Reclamation phase. */
+ free(old_array);
+ }
+
+
+SOURCES
+=======
+
+* Documentation/RCU/ from the Linux kernel
diff --git a/docs/devel/tracing.txt b/docs/devel/tracing.txt
new file mode 100644
index 0000000000..8c0029beca
--- /dev/null
+++ b/docs/devel/tracing.txt
@@ -0,0 +1,442 @@
+= Tracing =
+
+== Introduction ==
+
+This document describes the tracing infrastructure in QEMU and how to use it
+for debugging, profiling, and observing execution.
+
+== Quickstart ==
+
+1. Build with the 'simple' trace backend:
+
+ ./configure --enable-trace-backends=simple
+ make
+
+2. Create a file with the events you want to trace:
+
+ echo bdrv_aio_readv > /tmp/events
+ echo bdrv_aio_writev >> /tmp/events
+
+3. Run the virtual machine to produce a trace file:
+
+ qemu -trace events=/tmp/events ... # your normal QEMU invocation
+
+4. Pretty-print the binary trace file:
+
+ ./scripts/simpletrace.py trace-events-all trace-* # Override * with QEMU <pid>
+
+== Trace events ==
+
+=== Sub-directory setup ===
+
+Each directory in the source tree can declare a set of static trace events
+in a local "trace-events" file. All directories which contain "trace-events"
+files must be listed in the "trace-events-subdirs" make variable in the top
+level Makefile.objs. During build, the "trace-events" file in each listed
+subdirectory will be processed by the "tracetool" script to generate code for
+the trace events.
+
+The individual "trace-events" files are merged into a "trace-events-all" file,
+which is also installed into "/usr/share/qemu" with the name "trace-events".
+This merged file is to be used by the "simpletrace.py" script to later analyse
+traces in the simpletrace data format.
+
+In the sub-directory the following files will be automatically generated
+
+ - trace.c - the trace event state declarations
+ - trace.h - the trace event enums and probe functions
+ - trace-dtrace.h - DTrace event probe specification
+ - trace-dtrace.dtrace - DTrace event probe helper declaration
+ - trace-dtrace.o - binary DTrace provider (generated by dtrace)
+ - trace-ust.h - UST event probe helper declarations
+
+Source files in the sub-directory should #include the local 'trace.h' file,
+without any sub-directory path prefix. eg io/channel-buffer.c would do
+
+ #include "trace.h"
+
+To access the 'io/trace.h' file. While it is possible to include a trace.h
+file from outside a source files' own sub-directory, this is discouraged in
+general. It is strongly preferred that all events be declared directly in
+the sub-directory that uses them. The only exception is where there are some
+shared trace events defined in the top level directory trace-events file.
+The top level directory generates trace files with a filename prefix of
+"trace-root" instead of just "trace". This is to avoid ambiguity between
+a trace.h in the current directory, vs the top level directory.
+
+=== Using trace events ===
+
+Trace events are invoked directly from source code like this:
+
+ #include "trace.h" /* needed for trace event prototype */
+
+ void *qemu_vmalloc(size_t size)
+ {
+ void *ptr;
+ size_t align = QEMU_VMALLOC_ALIGN;
+
+ if (size < align) {
+ align = getpagesize();
+ }
+ ptr = qemu_memalign(align, size);
+ trace_qemu_vmalloc(size, ptr);
+ return ptr;
+ }
+
+=== Declaring trace events ===
+
+The "tracetool" script produces the trace.h header file which is included by
+every source file that uses trace events. Since many source files include
+trace.h, it uses a minimum of types and other header files included to keep the
+namespace clean and compile times and dependencies down.
+
+Trace events should use types as follows:
+
+ * Use stdint.h types for fixed-size types. Most offsets and guest memory
+ addresses are best represented with uint32_t or uint64_t. Use fixed-size
+ types over primitive types whose size may change depending on the host
+ (32-bit versus 64-bit) so trace events don't truncate values or break
+ the build.
+
+ * Use void * for pointers to structs or for arrays. The trace.h header
+ cannot include all user-defined struct declarations and it is therefore
+ necessary to use void * for pointers to structs.
+
+ * For everything else, use primitive scalar types (char, int, long) with the
+ appropriate signedness.
+
+Format strings should reflect the types defined in the trace event. Take
+special care to use PRId64 and PRIu64 for int64_t and uint64_t types,
+respectively. This ensures portability between 32- and 64-bit platforms.
+
+Each event declaration will start with the event name, then its arguments,
+finally a format string for pretty-printing. For example:
+
+ qemu_vmalloc(size_t size, void *ptr) "size %zu ptr %p"
+ qemu_vfree(void *ptr) "ptr %p"
+
+
+=== Hints for adding new trace events ===
+
+1. Trace state changes in the code. Interesting points in the code usually
+ involve a state change like starting, stopping, allocating, freeing. State
+ changes are good trace events because they can be used to understand the
+ execution of the system.
+
+2. Trace guest operations. Guest I/O accesses like reading device registers
+ are good trace events because they can be used to understand guest
+ interactions.
+
+3. Use correlator fields so the context of an individual line of trace output
+ can be understood. For example, trace the pointer returned by malloc and
+ used as an argument to free. This way mallocs and frees can be matched up.
+ Trace events with no context are not very useful.
+
+4. Name trace events after their function. If there are multiple trace events
+ in one function, append a unique distinguisher at the end of the name.
+
+== Generic interface and monitor commands ==
+
+You can programmatically query and control the state of trace events through a
+backend-agnostic interface provided by the header "trace/control.h".
+
+Note that some of the backends do not provide an implementation for some parts
+of this interface, in which case QEMU will just print a warning (please refer to
+header "trace/control.h" to see which routines are backend-dependent).
+
+The state of events can also be queried and modified through monitor commands:
+
+* info trace-events
+ View available trace events and their state. State 1 means enabled, state 0
+ means disabled.
+
+* trace-event NAME on|off
+ Enable/disable a given trace event or a group of events (using wildcards).
+
+The "-trace events=<file>" command line argument can be used to enable the
+events listed in <file> from the very beginning of the program. This file must
+contain one event name per line.
+
+If a line in the "-trace events=<file>" file begins with a '-', the trace event
+will be disabled instead of enabled. This is useful when a wildcard was used
+to enable an entire family of events but one noisy event needs to be disabled.
+
+Wildcard matching is supported in both the monitor command "trace-event" and the
+events list file. That means you can enable/disable the events having a common
+prefix in a batch. For example, virtio-blk trace events could be enabled using
+the following monitor command:
+
+ trace-event virtio_blk_* on
+
+== Trace backends ==
+
+The "tracetool" script automates tedious trace event code generation and also
+keeps the trace event declarations independent of the trace backend. The trace
+events are not tightly coupled to a specific trace backend, such as LTTng or
+SystemTap. Support for trace backends can be added by extending the "tracetool"
+script.
+
+The trace backends are chosen at configure time:
+
+ ./configure --enable-trace-backends=simple
+
+For a list of supported trace backends, try ./configure --help or see below.
+If multiple backends are enabled, the trace is sent to them all.
+
+If no backends are explicitly selected, configure will default to the
+"log" backend.
+
+The following subsections describe the supported trace backends.
+
+=== Nop ===
+
+The "nop" backend generates empty trace event functions so that the compiler
+can optimize out trace events completely. This imposes no performance
+penalty.
+
+Note that regardless of the selected trace backend, events with the "disable"
+property will be generated with the "nop" backend.
+
+=== Log ===
+
+The "log" backend sends trace events directly to standard error. This
+effectively turns trace events into debug printfs.
+
+This is the simplest backend and can be used together with existing code that
+uses DPRINTF().
+
+=== Simpletrace ===
+
+The "simple" backend supports common use cases and comes as part of the QEMU
+source tree. It may not be as powerful as platform-specific or third-party
+trace backends but it is portable. This is the recommended trace backend
+unless you have specific needs for more advanced backends.
+
+=== Ftrace ===
+
+The "ftrace" backend writes trace data to ftrace marker. This effectively
+sends trace events to ftrace ring buffer, and you can compare qemu trace
+data and kernel(especially kvm.ko when using KVM) trace data.
+
+if you use KVM, enable kvm events in ftrace:
+
+ # echo 1 > /sys/kernel/debug/tracing/events/kvm/enable
+
+After running qemu by root user, you can get the trace:
+
+ # cat /sys/kernel/debug/tracing/trace
+
+Restriction: "ftrace" backend is restricted to Linux only.
+
+=== Syslog ===
+
+The "syslog" backend sends trace events using the POSIX syslog API. The log
+is opened specifying the LOG_DAEMON facility and LOG_PID option (so events
+are tagged with the pid of the particular QEMU process that generated
+them). All events are logged at LOG_INFO level.
+
+NOTE: syslog may squash duplicate consecutive trace events and apply rate
+ limiting.
+
+Restriction: "syslog" backend is restricted to POSIX compliant OS.
+
+==== Monitor commands ====
+
+* trace-file on|off|flush|set <path>
+ Enable/disable/flush the trace file or set the trace file name.
+
+==== Analyzing trace files ====
+
+The "simple" backend produces binary trace files that can be formatted with the
+simpletrace.py script. The script takes the "trace-events-all" file and the
+binary trace:
+
+ ./scripts/simpletrace.py trace-events-all trace-12345
+
+You must ensure that the same "trace-events-all" file was used to build QEMU,
+otherwise trace event declarations may have changed and output will not be
+consistent.
+
+=== LTTng Userspace Tracer ===
+
+The "ust" backend uses the LTTng Userspace Tracer library. There are no
+monitor commands built into QEMU, instead UST utilities should be used to list,
+enable/disable, and dump traces.
+
+Package lttng-tools is required for userspace tracing. You must ensure that the
+current user belongs to the "tracing" group, or manually launch the
+lttng-sessiond daemon for the current user prior to running any instance of
+QEMU.
+
+While running an instrumented QEMU, LTTng should be able to list all available
+events:
+
+ lttng list -u
+
+Create tracing session:
+
+ lttng create mysession
+
+Enable events:
+
+ lttng enable-event qemu:g_malloc -u
+
+Where the events can either be a comma-separated list of events, or "-a" to
+enable all tracepoint events. Start and stop tracing as needed:
+
+ lttng start
+ lttng stop
+
+View the trace:
+
+ lttng view
+
+Destroy tracing session:
+
+ lttng destroy
+
+Babeltrace can be used at any later time to view the trace:
+
+ babeltrace $HOME/lttng-traces/mysession-<date>-<time>
+
+=== SystemTap ===
+
+The "dtrace" backend uses DTrace sdt probes but has only been tested with
+SystemTap. When SystemTap support is detected a .stp file with wrapper probes
+is generated to make use in scripts more convenient. This step can also be
+performed manually after a build in order to change the binary name in the .stp
+probes:
+
+ scripts/tracetool.py --backends=dtrace --format=stap \
+ --binary path/to/qemu-binary \
+ --target-type system \
+ --target-name x86_64 \
+ <trace-events-all >qemu.stp
+
+== Trace event properties ==
+
+Each event in the "trace-events-all" file can be prefixed with a space-separated
+list of zero or more of the following event properties.
+
+=== "disable" ===
+
+If a specific trace event is going to be invoked a huge number of times, this
+might have a noticeable performance impact even when the event is
+programmatically disabled.
+
+In this case you should declare such event with the "disable" property. This
+will effectively disable the event at compile time (by using the "nop" backend),
+thus having no performance impact at all on regular builds (i.e., unless you
+edit the "trace-events-all" file).
+
+In addition, there might be cases where relatively complex computations must be
+performed to generate values that are only used as arguments for a trace
+function. In these cases you can use the macro 'TRACE_${EVENT_NAME}_ENABLED' to
+guard such computations and avoid its compilation when the event is disabled:
+
+ #include "trace.h" /* needed for trace event prototype */
+
+ void *qemu_vmalloc(size_t size)
+ {
+ void *ptr;
+ size_t align = QEMU_VMALLOC_ALIGN;
+
+ if (size < align) {
+ align = getpagesize();
+ }
+ ptr = qemu_memalign(align, size);
+ if (TRACE_QEMU_VMALLOC_ENABLED) { /* preprocessor macro */
+ void *complex;
+ /* some complex computations to produce the 'complex' value */
+ trace_qemu_vmalloc(size, ptr, complex);
+ }
+ return ptr;
+ }
+
+You can check both if the event has been disabled and is dynamically enabled at
+the same time using the 'trace_event_get_state' routine (see header
+"trace/control.h" for more information).
+
+=== "tcg" ===
+
+Guest code generated by TCG can be traced by defining an event with the "tcg"
+event property. Internally, this property generates two events:
+"<eventname>_trans" to trace the event at translation time, and
+"<eventname>_exec" to trace the event at execution time.
+
+Instead of using these two events, you should instead use the function
+"trace_<eventname>_tcg" during translation (TCG code generation). This function
+will automatically call "trace_<eventname>_trans", and will generate the
+necessary TCG code to call "trace_<eventname>_exec" during guest code execution.
+
+Events with the "tcg" property can be declared in the "trace-events" file with a
+mix of native and TCG types, and "trace_<eventname>_tcg" will gracefully forward
+them to the "<eventname>_trans" and "<eventname>_exec" events. Since TCG values
+are not known at translation time, these are ignored by the "<eventname>_trans"
+event. Because of this, the entry in the "trace-events" file needs two printing
+formats (separated by a comma):
+
+ tcg foo(uint8_t a1, TCGv_i32 a2) "a1=%d", "a1=%d a2=%d"
+
+For example:
+
+ #include "trace-tcg.h"
+
+ void some_disassembly_func (...)
+ {
+ uint8_t a1 = ...;
+ TCGv_i32 a2 = ...;
+ trace_foo_tcg(a1, a2);
+ }
+
+This will immediately call:
+
+ void trace_foo_trans(uint8_t a1);
+
+and will generate the TCG code to call:
+
+ void trace_foo(uint8_t a1, uint32_t a2);
+
+=== "vcpu" ===
+
+Identifies events that trace vCPU-specific information. It implicitly adds a
+"CPUState*" argument, and extends the tracing print format to show the vCPU
+information. If used together with the "tcg" property, it adds a second
+"TCGv_env" argument that must point to the per-target global TCG register that
+points to the vCPU when guest code is executed (usually the "cpu_env" variable).
+
+The "tcg" and "vcpu" properties are currently only honored in the root
+./trace-events file.
+
+The following example events:
+
+ foo(uint32_t a) "a=%x"
+ vcpu bar(uint32_t a) "a=%x"
+ tcg vcpu baz(uint32_t a) "a=%x", "a=%x"
+
+Can be used as:
+
+ #include "trace-tcg.h"
+
+ CPUArchState *env;
+ TCGv_ptr cpu_env;
+
+ void some_disassembly_func(...)
+ {
+ /* trace emitted at this point */
+ trace_foo(0xd1);
+ /* trace emitted at this point */
+ trace_bar(ENV_GET_CPU(env), 0xd2);
+ /* trace emitted at this point (env) and when guest code is executed (cpu_env) */
+ trace_baz_tcg(ENV_GET_CPU(env), cpu_env, 0xd3);
+ }
+
+If the translating vCPU has address 0xc1 and code is later executed by vCPU
+0xc2, this would be an example output:
+
+ // at guest code translation
+ foo a=0xd1
+ bar cpu=0xc1 a=0xd2
+ baz_trans cpu=0xc1 a=0xd3
+ // at guest code execution
+ baz_exec cpu=0xc2 a=0xd3
diff --git a/docs/devel/virtio-migration.txt b/docs/devel/virtio-migration.txt
new file mode 100644
index 0000000000..98a6b0ffb5
--- /dev/null
+++ b/docs/devel/virtio-migration.txt
@@ -0,0 +1,108 @@
+Virtio devices and migration
+============================
+
+Copyright 2015 IBM Corp.
+
+This work is licensed under the terms of the GNU GPL, version 2 or later. See
+the COPYING file in the top-level directory.
+
+Saving and restoring the state of virtio devices is a bit of a twisty maze,
+for several reasons:
+- state is distributed between several parts:
+ - virtio core, for common fields like features, number of queues, ...
+ - virtio transport (pci, ccw, ...), for the different proxy devices and
+ transport specific state (msix vectors, indicators, ...)
+ - virtio device (net, blk, ...), for the different device types and their
+ state (mac address, request queue, ...)
+- most fields are saved via the stream interface; subsequently, subsections
+ have been added to make cross-version migration possible
+
+This file attempts to document the current procedure and point out some
+caveats.
+
+
+Save state procedure
+====================
+
+virtio core virtio transport virtio device
+----------- ---------------- -------------
+
+ save() function registered
+ via VMState wrapper on
+ device class
+virtio_save() <----------
+ ------> save_config()
+ - save proxy device
+ - save transport-specific
+ device fields
+- save common device
+ fields
+- save common virtqueue
+ fields
+ ------> save_queue()
+ - save transport-specific
+ virtqueue fields
+ ------> save_device()
+ - save device-specific
+ fields
+- save subsections
+ - device endianness,
+ if changed from
+ default endianness
+ - 64 bit features, if
+ any high feature bit
+ is set
+ - virtio-1 virtqueue
+ fields, if VERSION_1
+ is set
+
+
+Load state procedure
+====================
+
+virtio core virtio transport virtio device
+----------- ---------------- -------------
+
+ load() function registered
+ via VMState wrapper on
+ device class
+virtio_load() <----------
+ ------> load_config()
+ - load proxy device
+ - load transport-specific
+ device fields
+- load common device
+ fields
+- load common virtqueue
+ fields
+ ------> load_queue()
+ - load transport-specific
+ virtqueue fields
+- notify guest
+ ------> load_device()
+ - load device-specific
+ fields
+- load subsections
+ - device endianness
+ - 64 bit features
+ - virtio-1 virtqueue
+ fields
+- sanitize endianness
+- sanitize features
+- virtqueue index sanity
+ check
+ - feature-dependent setup
+
+
+Implications of this setup
+==========================
+
+Devices need to be careful in their state processing during load: The
+load_device() procedure is invoked by the core before subsections have
+been loaded. Any code that depends on information transmitted in subsections
+therefore has to be invoked in the device's load() function _after_
+virtio_load() returned (like e.g. code depending on features).
+
+Any extension of the state being migrated should be done in subsections
+added to the core for compatibility reasons. If transport or device specific
+state is added, core needs to invoke a callback from the new subsection.
diff --git a/docs/devel/writing-qmp-commands.txt b/docs/devel/writing-qmp-commands.txt
new file mode 100644
index 0000000000..1e6375495b
--- /dev/null
+++ b/docs/devel/writing-qmp-commands.txt
@@ -0,0 +1,607 @@
+= How to write QMP commands using the QAPI framework =
+
+This document is a step-by-step guide on how to write new QMP commands using
+the QAPI framework. It also shows how to implement new style HMP commands.
+
+This document doesn't discuss QMP protocol level details, nor does it dive
+into the QAPI framework implementation.
+
+For an in-depth introduction to the QAPI framework, please refer to
+docs/qapi-code-gen.txt. For documentation about the QMP protocol,
+start with docs/qmp-intro.txt.
+
+== Overview ==
+
+Generally speaking, the following steps should be taken in order to write a
+new QMP command.
+
+1. Write the command's and type(s) specification in the QAPI schema file
+ (qapi-schema.json in the root source directory)
+
+2. Write the QMP command itself, which is a regular C function. Preferably,
+ the command should be exported by some QEMU subsystem. But it can also be
+ added to the qmp.c file
+
+3. At this point the command can be tested under the QMP protocol
+
+4. Write the HMP command equivalent. This is not required and should only be
+ done if it does make sense to have the functionality in HMP. The HMP command
+ is implemented in terms of the QMP command
+
+The following sections will demonstrate each of the steps above. We will start
+very simple and get more complex as we progress.
+
+=== Testing ===
+
+For all the examples in the next sections, the test setup is the same and is
+shown here.
+
+First, QEMU should be started as:
+
+# /path/to/your/source/qemu [...] \
+ -chardev socket,id=qmp,port=4444,host=localhost,server \
+ -mon chardev=qmp,mode=control,pretty=on
+
+Then, in a different terminal:
+
+$ telnet localhost 4444
+Trying 127.0.0.1...
+Connected to localhost.
+Escape character is '^]'.
+{
+ "QMP": {
+ "version": {
+ "qemu": {
+ "micro": 50,
+ "minor": 15,
+ "major": 0
+ },
+ "package": ""
+ },
+ "capabilities": [
+ ]
+ }
+}
+
+The above output is the QMP server saying you're connected. The server is
+actually in capabilities negotiation mode. To enter in command mode type:
+
+{ "execute": "qmp_capabilities" }
+
+Then the server should respond:
+
+{
+ "return": {
+ }
+}
+
+Which is QMP's way of saying "the latest command executed OK and didn't return
+any data". Now you're ready to enter the QMP example commands as explained in
+the following sections.
+
+== Writing a command that doesn't return data ==
+
+That's the most simple QMP command that can be written. Usually, this kind of
+command carries some meaningful action in QEMU but here it will just print
+"Hello, world" to the standard output.
+
+Our command will be called "hello-world". It takes no arguments, nor does it
+return any data.
+
+The first step is to add the following line to the bottom of the
+qapi-schema.json file:
+
+{ 'command': 'hello-world' }
+
+The "command" keyword defines a new QMP command. It's an JSON object. All
+schema entries are JSON objects. The line above will instruct the QAPI to
+generate any prototypes and the necessary code to marshal and unmarshal
+protocol data.
+
+The next step is to write the "hello-world" implementation. As explained
+earlier, it's preferable for commands to live in QEMU subsystems. But
+"hello-world" doesn't pertain to any, so we put its implementation in qmp.c:
+
+void qmp_hello_world(Error **errp)
+{
+ printf("Hello, world!\n");
+}
+
+There are a few things to be noticed:
+
+1. QMP command implementation functions must be prefixed with "qmp_"
+2. qmp_hello_world() returns void, this is in accordance with the fact that the
+ command doesn't return any data
+3. It takes an "Error **" argument. This is required. Later we will see how to
+ return errors and take additional arguments. The Error argument should not
+ be touched if the command doesn't return errors
+4. We won't add the function's prototype. That's automatically done by the QAPI
+5. Printing to the terminal is discouraged for QMP commands, we do it here
+ because it's the easiest way to demonstrate a QMP command
+
+You're done. Now build qemu, run it as suggested in the "Testing" section,
+and then type the following QMP command:
+
+{ "execute": "hello-world" }
+
+Then check the terminal running qemu and look for the "Hello, world" string. If
+you don't see it then something went wrong.
+
+=== Arguments ===
+
+Let's add an argument called "message" to our "hello-world" command. The new
+argument will contain the string to be printed to stdout. It's an optional
+argument, if it's not present we print our default "Hello, World" string.
+
+The first change we have to do is to modify the command specification in the
+schema file to the following:
+
+{ 'command': 'hello-world', 'data': { '*message': 'str' } }
+
+Notice the new 'data' member in the schema. It's an JSON object whose each
+element is an argument to the command in question. Also notice the asterisk,
+it's used to mark the argument optional (that means that you shouldn't use it
+for mandatory arguments). Finally, 'str' is the argument's type, which
+stands for "string". The QAPI also supports integers, booleans, enumerations
+and user defined types.
+
+Now, let's update our C implementation in qmp.c:
+
+void qmp_hello_world(bool has_message, const char *message, Error **errp)
+{
+ if (has_message) {
+ printf("%s\n", message);
+ } else {
+ printf("Hello, world\n");
+ }
+}
+
+There are two important details to be noticed:
+
+1. All optional arguments are accompanied by a 'has_' boolean, which is set
+ if the optional argument is present or false otherwise
+2. The C implementation signature must follow the schema's argument ordering,
+ which is defined by the "data" member
+
+Time to test our new version of the "hello-world" command. Build qemu, run it as
+described in the "Testing" section and then send two commands:
+
+{ "execute": "hello-world" }
+{
+ "return": {
+ }
+}
+
+{ "execute": "hello-world", "arguments": { "message": "We love qemu" } }
+{
+ "return": {
+ }
+}
+
+You should see "Hello, world" and "we love qemu" in the terminal running qemu,
+if you don't see these strings, then something went wrong.
+
+=== Errors ===
+
+QMP commands should use the error interface exported by the error.h header
+file. Basically, most errors are set by calling the error_setg() function.
+
+Let's say we don't accept the string "message" to contain the word "love". If
+it does contain it, we want the "hello-world" command to return an error:
+
+void qmp_hello_world(bool has_message, const char *message, Error **errp)
+{
+ if (has_message) {
+ if (strstr(message, "love")) {
+ error_setg(errp, "the word 'love' is not allowed");
+ return;
+ }
+ printf("%s\n", message);
+ } else {
+ printf("Hello, world\n");
+ }
+}
+
+The first argument to the error_setg() function is the Error pointer
+to pointer, which is passed to all QMP functions. The next argument is a human
+description of the error, this is a free-form printf-like string.
+
+Let's test the example above. Build qemu, run it as defined in the "Testing"
+section, and then issue the following command:
+
+{ "execute": "hello-world", "arguments": { "message": "all you need is love" } }
+
+The QMP server's response should be:
+
+{
+ "error": {
+ "class": "GenericError",
+ "desc": "the word 'love' is not allowed"
+ }
+}
+
+As a general rule, all QMP errors should use ERROR_CLASS_GENERIC_ERROR
+(done by default when using error_setg()). There are two exceptions to
+this rule:
+
+ 1. A non-generic ErrorClass value exists* for the failure you want to report
+ (eg. DeviceNotFound)
+
+ 2. Management applications have to take special action on the failure you
+ want to report, hence you have to add a new ErrorClass value so that they
+ can check for it
+
+If the failure you want to report falls into one of the two cases above,
+use error_set() with a second argument of an ErrorClass value.
+
+ * All existing ErrorClass values are defined in the qapi-schema.json file
+
+=== Command Documentation ===
+
+There's only one step missing to make "hello-world"'s implementation complete,
+and that's its documentation in the schema file.
+
+This is very important. No QMP command will be accepted in QEMU without proper
+documentation.
+
+There are many examples of such documentation in the schema file already, but
+here goes "hello-world"'s new entry for the qapi-schema.json file:
+
+##
+# @hello-world
+#
+# Print a client provided string to the standard output stream.
+#
+# @message: string to be printed
+#
+# Returns: Nothing on success.
+#
+# Notes: if @message is not provided, the "Hello, world" string will
+# be printed instead
+#
+# Since: <next qemu stable release, eg. 1.0>
+##
+{ 'command': 'hello-world', 'data': { '*message': 'str' } }
+
+Please, note that the "Returns" clause is optional if a command doesn't return
+any data nor any errors.
+
+=== Implementing the HMP command ===
+
+Now that the QMP command is in place, we can also make it available in the human
+monitor (HMP).
+
+With the introduction of the QAPI, HMP commands make QMP calls. Most of the
+time HMP commands are simple wrappers. All HMP commands implementation exist in
+the hmp.c file.
+
+Here's the implementation of the "hello-world" HMP command:
+
+void hmp_hello_world(Monitor *mon, const QDict *qdict)
+{
+ const char *message = qdict_get_try_str(qdict, "message");
+ Error *err = NULL;
+
+ qmp_hello_world(!!message, message, &err);
+ if (err) {
+ monitor_printf(mon, "%s\n", error_get_pretty(err));
+ error_free(err);
+ return;
+ }
+}
+
+Also, you have to add the function's prototype to the hmp.h file.
+
+There are three important points to be noticed:
+
+1. The "mon" and "qdict" arguments are mandatory for all HMP functions. The
+ former is the monitor object. The latter is how the monitor passes
+ arguments entered by the user to the command implementation
+2. hmp_hello_world() performs error checking. In this example we just print
+ the error description to the user, but we could do more, like taking
+ different actions depending on the error qmp_hello_world() returns
+3. The "err" variable must be initialized to NULL before performing the
+ QMP call
+
+There's one last step to actually make the command available to monitor users,
+we should add it to the hmp-commands.hx file:
+
+ {
+ .name = "hello-world",
+ .args_type = "message:s?",
+ .params = "hello-world [message]",
+ .help = "Print message to the standard output",
+ .cmd = hmp_hello_world,
+ },
+
+STEXI
+@item hello_world @var{message}
+@findex hello_world
+Print message to the standard output
+ETEXI
+
+To test this you have to open a user monitor and issue the "hello-world"
+command. It might be instructive to check the command's documentation with
+HMP's "help" command.
+
+Please, check the "-monitor" command-line option to know how to open a user
+monitor.
+
+== Writing a command that returns data ==
+
+A QMP command is capable of returning any data the QAPI supports like integers,
+strings, booleans, enumerations and user defined types.
+
+In this section we will focus on user defined types. Please, check the QAPI
+documentation for information about the other types.
+
+=== User Defined Types ===
+
+FIXME This example needs to be redone after commit 6d32717
+
+For this example we will write the query-alarm-clock command, which returns
+information about QEMU's timer alarm. For more information about it, please
+check the "-clock" command-line option.
+
+We want to return two pieces of information. The first one is the alarm clock's
+name. The second one is when the next alarm will fire. The former information is
+returned as a string, the latter is an integer in nanoseconds (which is not
+very useful in practice, as the timer has probably already fired when the
+information reaches the client).
+
+The best way to return that data is to create a new QAPI type, as shown below:
+
+##
+# @QemuAlarmClock
+#
+# QEMU alarm clock information.
+#
+# @clock-name: The alarm clock method's name.
+#
+# @next-deadline: The time (in nanoseconds) the next alarm will fire.
+#
+# Since: 1.0
+##
+{ 'type': 'QemuAlarmClock',
+ 'data': { 'clock-name': 'str', '*next-deadline': 'int' } }
+
+The "type" keyword defines a new QAPI type. Its "data" member contains the
+type's members. In this example our members are the "clock-name" and the
+"next-deadline" one, which is optional.
+
+Now let's define the query-alarm-clock command:
+
+##
+# @query-alarm-clock
+#
+# Return information about QEMU's alarm clock.
+#
+# Returns a @QemuAlarmClock instance describing the alarm clock method
+# being currently used by QEMU (this is usually set by the '-clock'
+# command-line option).
+#
+# Since: 1.0
+##
+{ 'command': 'query-alarm-clock', 'returns': 'QemuAlarmClock' }
+
+Notice the "returns" keyword. As its name suggests, it's used to define the
+data returned by a command.
+
+It's time to implement the qmp_query_alarm_clock() function, you can put it
+in the qemu-timer.c file:
+
+QemuAlarmClock *qmp_query_alarm_clock(Error **errp)
+{
+ QemuAlarmClock *clock;
+ int64_t deadline;
+
+ clock = g_malloc0(sizeof(*clock));
+
+ deadline = qemu_next_alarm_deadline();
+ if (deadline > 0) {
+ clock->has_next_deadline = true;
+ clock->next_deadline = deadline;
+ }
+ clock->clock_name = g_strdup(alarm_timer->name);
+
+ return clock;
+}
+
+There are a number of things to be noticed:
+
+1. The QemuAlarmClock type is automatically generated by the QAPI framework,
+ its members correspond to the type's specification in the schema file
+2. As specified in the schema file, the function returns a QemuAlarmClock
+ instance and takes no arguments (besides the "errp" one, which is mandatory
+ for all QMP functions)
+3. The "clock" variable (which will point to our QAPI type instance) is
+ allocated by the regular g_malloc0() function. Note that we chose to
+ initialize the memory to zero. This is recommended for all QAPI types, as
+ it helps avoiding bad surprises (specially with booleans)
+4. Remember that "next_deadline" is optional? All optional members have a
+ 'has_TYPE_NAME' member that should be properly set by the implementation,
+ as shown above
+5. Even static strings, such as "alarm_timer->name", should be dynamically
+ allocated by the implementation. This is so because the QAPI also generates
+ a function to free its types and it cannot distinguish between dynamically
+ or statically allocated strings
+6. You have to include the "qmp-commands.h" header file in qemu-timer.c,
+ otherwise qemu won't build
+
+Time to test the new command. Build qemu, run it as described in the "Testing"
+section and try this:
+
+{ "execute": "query-alarm-clock" }
+{
+ "return": {
+ "next-deadline": 2368219,
+ "clock-name": "dynticks"
+ }
+}
+
+==== The HMP command ====
+
+Here's the HMP counterpart of the query-alarm-clock command:
+
+void hmp_info_alarm_clock(Monitor *mon)
+{
+ QemuAlarmClock *clock;
+ Error *err = NULL;
+
+ clock = qmp_query_alarm_clock(&err);
+ if (err) {
+ monitor_printf(mon, "Could not query alarm clock information\n");
+ error_free(err);
+ return;
+ }
+
+ monitor_printf(mon, "Alarm clock method in use: '%s'\n", clock->clock_name);
+ if (clock->has_next_deadline) {
+ monitor_printf(mon, "Next alarm will fire in %" PRId64 " nanoseconds\n",
+ clock->next_deadline);
+ }
+
+ qapi_free_QemuAlarmClock(clock);
+}
+
+It's important to notice that hmp_info_alarm_clock() calls
+qapi_free_QemuAlarmClock() to free the data returned by qmp_query_alarm_clock().
+For user defined types, the QAPI will generate a qapi_free_QAPI_TYPE_NAME()
+function and that's what you have to use to free the types you define and
+qapi_free_QAPI_TYPE_NAMEList() for list types (explained in the next section).
+If the QMP call returns a string, then you should g_free() to free it.
+
+Also note that hmp_info_alarm_clock() performs error handling. That's not
+strictly required if you're sure the QMP function doesn't return errors, but
+it's good practice to always check for errors.
+
+Another important detail is that HMP's "info" commands don't go into the
+hmp-commands.hx. Instead, they go into the info_cmds[] table, which is defined
+in the monitor.c file. The entry for the "info alarmclock" follows:
+
+ {
+ .name = "alarmclock",
+ .args_type = "",
+ .params = "",
+ .help = "show information about the alarm clock",
+ .cmd = hmp_info_alarm_clock,
+ },
+
+To test this, run qemu and type "info alarmclock" in the user monitor.
+
+=== Returning Lists ===
+
+For this example, we're going to return all available methods for the timer
+alarm, which is pretty much what the command-line option "-clock ?" does,
+except that we're also going to inform which method is in use.
+
+This first step is to define a new type:
+
+##
+# @TimerAlarmMethod
+#
+# Timer alarm method information.
+#
+# @method-name: The method's name.
+#
+# @current: true if this alarm method is currently in use, false otherwise
+#
+# Since: 1.0
+##
+{ 'type': 'TimerAlarmMethod',
+ 'data': { 'method-name': 'str', 'current': 'bool' } }
+
+The command will be called "query-alarm-methods", here is its schema
+specification:
+
+##
+# @query-alarm-methods
+#
+# Returns information about available alarm methods.
+#
+# Returns: a list of @TimerAlarmMethod for each method
+#
+# Since: 1.0
+##
+{ 'command': 'query-alarm-methods', 'returns': ['TimerAlarmMethod'] }
+
+Notice the syntax for returning lists "'returns': ['TimerAlarmMethod']", this
+should be read as "returns a list of TimerAlarmMethod instances".
+
+The C implementation follows:
+
+TimerAlarmMethodList *qmp_query_alarm_methods(Error **errp)
+{
+ TimerAlarmMethodList *method_list = NULL;
+ const struct qemu_alarm_timer *p;
+ bool current = true;
+
+ for (p = alarm_timers; p->name; p++) {
+ TimerAlarmMethodList *info = g_malloc0(sizeof(*info));
+ info->value = g_malloc0(sizeof(*info->value));
+ info->value->method_name = g_strdup(p->name);
+ info->value->current = current;
+
+ current = false;
+
+ info->next = method_list;
+ method_list = info;
+ }
+
+ return method_list;
+}
+
+The most important difference from the previous examples is the
+TimerAlarmMethodList type, which is automatically generated by the QAPI from
+the TimerAlarmMethod type.
+
+Each list node is represented by a TimerAlarmMethodList instance. We have to
+allocate it, and that's done inside the for loop: the "info" pointer points to
+an allocated node. We also have to allocate the node's contents, which is
+stored in its "value" member. In our example, the "value" member is a pointer
+to an TimerAlarmMethod instance.
+
+Notice that the "current" variable is used as "true" only in the first
+iteration of the loop. That's because the alarm timer method in use is the
+first element of the alarm_timers array. Also notice that QAPI lists are handled
+by hand and we return the head of the list.
+
+Now Build qemu, run it as explained in the "Testing" section and try our new
+command:
+
+{ "execute": "query-alarm-methods" }
+{
+ "return": [
+ {
+ "current": false,
+ "method-name": "unix"
+ },
+ {
+ "current": true,
+ "method-name": "dynticks"
+ }
+ ]
+}
+
+The HMP counterpart is a bit more complex than previous examples because it
+has to traverse the list, it's shown below for reference:
+
+void hmp_info_alarm_methods(Monitor *mon)
+{
+ TimerAlarmMethodList *method_list, *method;
+ Error *err = NULL;
+
+ method_list = qmp_query_alarm_methods(&err);
+ if (err) {
+ monitor_printf(mon, "Could not query alarm methods\n");
+ error_free(err);
+ return;
+ }
+
+ for (method = method_list; method; method = method->next) {
+ monitor_printf(mon, "%c %s\n", method->value->current ? '*' : ' ',
+ method->value->method_name);
+ }
+
+ qapi_free_TimerAlarmMethodList(method_list);
+}