| Here's how mount actually works: |
| |
| The mount comand calls the mount system call, which has five arguments you |
| can see on the "man 2 mount" page: |
| |
| int mount(const char *source, const char *target, const char *filesystemtype, |
| unsigned long mountflags, const void *data); |
| |
| The command "mount -t ext2 /dev/sda1 /path/to/mntpoint -o ro,noatime", |
| parses its command line arguments to feed them into those five system call |
| arguments. In this example, the source is "/dev/sda1", the target is |
| "/path/to/mountpoint", and the filesystemtype is "ext2". |
| |
| The other two syscall arguments (mountflags and data) come from the |
| "-o option,option,option" argument. The mountflags argument goes to the VFS |
| (explained below), and the data argument is passed to the filesystem driver. |
| |
| The mount command's options string is a list of comma separated values. If |
| there's more than one -o argument on the mount command line, they get glued |
| together (in order) with a comma. The mount command also checks the file |
| /etc/fstab for default options, and the options you specify on the command |
| line get appended to those defaults (if any). Most other command line mount |
| flags are just synonyms for adding option flags (for example |
| "mount -o remount -w" is equivalent to "mount -o remount,rw"). Behind the |
| scenes they all get appended to the -o string and fed to a common parser. |
| |
| VFS stands for "Virtual File System" and is the common infrastructure shared |
| by different filesystems. It handles common things like making the filesystem |
| read only. The mount command assembles an option string to supply to the "data" |
| argument of the option syscall, but first it parses it for VFS options |
| (ro,noexec,nodev,nosuid,noatime...) each of which corresponds to a flag |
| from #include <sys/mount.h>. The mount command removes those options from the |
| sting and sets the corresponding bit in mountflags, then the remaining options |
| (if any) form the data argument for the filesystem driver. |
| |
| A few quick implementation details: the mountflag MS_SILENCE gets set by |
| default even if there's nothing in /etc/fstab. Some actions (such as --bind |
| and --move mounts, I.E. -o bind and -o move) are just VFS actions and don't |
| require any specific filesystem at all. The "-o remount" flag requires looking |
| up the filesystem in /proc/mounts and reassembling the full option string |
| because you don't _just_ pass in the changed flags but have to reassemble |
| the complete new filesystem state to give the system call. Some of the options |
| in /etc/fstab are for the mount command (such as "user" which only does |
| anything if the mount command has the suid bit set) and don't get passed |
| through to the system call. |
| |
| When mounting a new filesystem, the "filesystem" argument to the mount system |
| call specifies which filesystem driver to use. All the loaded drivers are |
| listed in /proc/filesystems, but calling mount can also trigger a module load |
| request to add another. A filesystem driver is responsible for putting files |
| and subdirectories under the mount point: any time you open, close, read, |
| write, truncate, list the contents of a directory, move, or delete a file, |
| you're talking to a filesystem driver to do it. (Or when you call |
| ioctl(), stat(), statvfs(), utime()...) |
| |
| Different drivers implement different filesystems, which have four categories: |
| |
| 1) Block device backed filesystems, such as ext2 and vfat. |
| |
| This kind of filesystem driver acts as a lens to look at a block device |
| through. The source argument for block backed filesystems is a path to a |
| block device, such as "/dev/hda1", which stores the contents of the |
| filesystem in a fixed block of sequential storage, and there's a seperate |
| driver providing that block device. |
| |
| Block backed filesystems are the "conventional" filesystem type most people |
| think of when they mount things. The name means that the "backing store" |
| (where the data lives when the system is switched off) is on a block device. |
| |
| 2) Server backed filesystems, such as cifs/samba or fuse. |
| |
| These drivers convert filesystem operations into a sequential stream of |
| bytes, which it can send through a pipe to talk to a program. The filesystem |
| server could be a local Filesystem in Userspace daemon (connected to a local |
| process through a pipe filehandle), behind a network socket (CIFS and v9fs), |
| behind a char device (/dev/ttyS0), and so on. The common attribute is there's |
| some program on the other end sending and receiving a sequential bytestream. |
| The backing store is a server somewhere, and the filesystem driver is talking |
| to a process that reads and writes data in some known protocol. |
| |
| The source argument for these filesystems indicates where the filesystem lives. It's often in a URL-like format for network filesystems, but it's really just a blob of data that the filesystem driver understands. |
| |
| A lot of server backed filesystems want to open their own connection so they |
| don't have to pass their data through a persistent local userspace process, |
| not really for performance reasons but because in low memory situations a |
| chicken-and-egg situation can develop where all the process's pages have |
| been swapped out but the filesystem needs to write data to its backing |
| store in order to free up memory so it can swap the process's pages back in. |
| If this mechanism is providing the root filesystem, this can deadlock and |
| freeze the system solid. So while you _can_ pass some of them a filehandle, |
| more often than not you don't. |
| |
| These are also known as "pipe backed" filesystems (or "network filesystems" |
| because that's a common case, although a network doesn't need to be inolved). |
| Conceptually they're char device backed filesystems (analogus to the block |
| device backed ones), but you don't commonly specify a character device in |
| /dev when mounting them because you're talking to a specific server process, |
| not a whole machine. |
| |
| 3) Ram backed filesystems, such as ramfs and tmpfs. |
| |
| These are very simple filesystems that don't implement a backing store. Data |
| written to these gets stored in the disk cache, and the driver ignores requests |
| to flush it to backing store (reporting all the pages as pinned and |
| unfreeable). |
| |
| These drivers essentially mount the VFS's page/dentry cache as if it was a |
| filesystem. (Page cache stores file contents, dentry cache stores directory |
| entries.) They grow and shrink dynamically, as needed: when you write files |
| into them they allocate more memory to store it, and when you delete files |
| the memory is freed. |
| |
| There's a simple one (ramfs) that does only that, and a more complex one (tmpfs) |
| which adds a size limitation (by default 50%, but it's adjustable as a mount |
| option) so the system doesn't run out of memory and lock up if you |
| "cat /dev/zero > file", and can also report how much space is remaining |
| when asked (ramfs always says 0 bytes free). The other thing tmpfs does |
| is write its data out to swap space (like processes do) when the system |
| is under memory proessure. |
| |
| Note that "ramdisk" is not the same as "ramfs". The ramdisk driver uses a |
| chunk of memory to implement a block device, and then you can format that |
| block device and mount it with a block device backed filesystem driver. |
| (This is the same "two device drivers" approach you always have with block |
| backed filesystems: one driver provides /dev/ram0 and the second driver mounts |
| it as vfat.) Ram disks are significantly less efficient than ramfs, |
| allocating a fixed amount of memory up front for the block device instead of |
| dynamically resizing itself as files are written into an deleted from the |
| page and dentry caches the way ramfs does. |
| |
| Note: initramfs cpio, tmpfs as rootfs. |
| |
| 4) Synthetic filesystems, such as proc, sysfs, devpts... |
| |
| These filesystems don't have any backing store either, because they don't |
| store arbitrary data the way the first three types of filesystems do. |
| |
| Instead they present artificial contents, which can represent processes or |
| hardware or anything the driver writer wants them to show. Listing or reading |
| from these files calls a driver function that produces whatever output it's |
| programmed to, and writing to these files submits data to the driver which |
| can do anything it wants with it. |
| |
| Synthetic ilesystems are often implemented to provide monitoring and control |
| knobs for parts of the operating system. It's an alternative to adding more |
| system calls (or ioctl, sysctl, etc), and provides a more human friendly user |
| interface which programs can use but which users can also interact with |
| directly from the command line via "cat" and redirecting the output of |
| "echo" into special files. |
| |
| |
| Those are the four types of filesystems: backing store can be a fixed length |
| block of storage, backing store can be some server the driver connects to, |
| backing store can not exist and the files merely reside in the disk cache, |
| or the filesystem driver can just make up its contents programmatically. |
| |
| And that's how filesystems get mounted, using the mount system call which has |
| five arguments. The "filesystem" argument specifies the driver implementing |
| one of those filesystems, and the "source" and "data" arguments get fed to |
| that driver. The "target" and "mountflags" arguments get parsed (and handled) |
| by the generic VFS infrastructure. (The filesystem driver can peek at the |
| VFS data, but generally doesn't need to care. The VFS tells the filesystem |
| what to do, in response to what userspace said to do.) |