writev() の原子性がどのように保証されるかを知りたいです。

Question

writev（）システムコールを使用してSCTPソケットに書き込むマルチスレッドLinux x86_64ユーザープログラムがあります。 writev() システムコールの原子性を確認したいと思います。

writev() のマニュアルページには次のように指定されています。

ssize_t writev(int fd, const struct iovec *iov, int iovcnt);

The data transfers performed by readv() and writev() are atomic: the data written by writev()
is written as a single block that is not intermingled with output from writes in other processes
(but see pipe(7) for an exception); analogously, readv() is guaranteed to read a contiguous
block of data from the file, regardless of read operations performed in other threads or processes
that have file descriptors referring to the same open file description (see open(2)).

だからwritev（）の実装を見ると、ロックが明らかに見えるようです。 writev() 実装でロックが見えなかったときに呼び出しを追跡し始めました。私が見つけたものは次のとおりです。 Linuxカーネルのソースコードを初めて見てみるのに誤解があることをご了承ください。

分析されたLinuxカーネルはx86から4.4.0です。

writev() の実装は fs/read_write.c:896 から始まります。

SYSCALL_DEFINE3(writev, unsigned long, fd, const struct iovec __user *, vec,u nsigned long, vlen)

そして、同じファイルfs / read_write.c：863で定義されているvfs_writev（）を呼び出します。

ssize_t vfs_writev(struct file *file, const struct iovec __user *vec,
           unsigned long vlen, loff_t *pos)
{
    if (!(file->f_mode & FMODE_WRITE))
        return -EBADF;
    if (!(file->f_mode & FMODE_CAN_WRITE))
        return -EINVAL;

    return do_readv_writev(WRITE, file, vec, vlen, pos);
}

ここで do_readv_writev() は fs/read_write.c:798 にもあり、WRITE タイプの場合に実行されます。

fn = (io_fn_t)file->f_op->write;
iter_fn = file->f_op->write_iter;
file_start_write(file);

file_start_write() は include/linux/fs.h:2512 のインライン関数です。

static inline void file_start_write(struct file *file)
{
    if (!S_ISREG(file_inode(file)->i_mode))
        return;
    __sb_start_write(file_inode(file)->i_sb, SB_FREEZE_WRITE, true);
}

S_ISREG() は include/uapi/linux/stat.h:20 で定義されており、記述子が通常のファイルであることを確認するために使用されます。

そして__sb_start_writeはfs / super.c：1252で定義されています。

/*
 * This is an internal function, please use sb_start_{write,pagefault,intwrite}
 * instead.
 */
int __sb_start_write(struct super_block *sb, int level, bool wait)
{
    bool force_trylock = false;
    int ret = 1;

#ifdef CONFIG_LOCKDEP
    /*
     * We want lockdep to tell us about possible deadlocks with freezing
     * but it's it bit tricky to properly instrument it. Getting a freeze
     * protection works as getting a read lock but there are subtle
     * problems. XFS for example gets freeze protection on internal level
     * twice in some cases, which is OK only because we already hold a
     * freeze protection also on higher level. Due to these cases we have
     * to use wait == F (trylock mode) which must not fail.
     */
    if (wait) {
        int i;

        for (i = 0; i < level - 1; i++)
            if (percpu_rwsem_is_held(sb->s_writers.rw_sem + i)) {
                force_trylock = true;
                break;
            }
    }
#endif
    if (wait && !force_trylock)
        percpu_down_read(sb->s_writers.rw_sem + level-1);
    else
        ret = percpu_down_read_trylock(sb->s_writers.rw_sem + level-1);

    WARN_ON(force_trylock & !ret);
    return ret;
}
EXPORT_SYMBOL(__sb_start_write);

私はカーネルがこれに基づいてCONFIG_LOCKDEPでコンパイルされたとは思わない。これ

ファイルシステムのロックは、fs/super.c:1322 で始まるコメントに記載されています。

/**
 * freeze_super - lock the filesystem and force it into a consistent state
 * @sb: the super to lock
 *
 * Syncs the super to make sure the filesystem is consistent and calls the fs's
 * freeze_fs.  Subsequent calls to this without first thawing the fs will return
 * -EBUSY.
 *
 * During this function, sb->s_writers.frozen goes through these values:
 *
 * SB_UNFROZEN: File system is normal, all writes progress as usual.
 *
 * SB_FREEZE_WRITE: The file system is in the process of being frozen.  New
 * writes should be blocked, though page faults are still allowed. We wait for
 * all writes to complete and then proceed to the next stage.
 *
 * SB_FREEZE_PAGEFAULT: Freezing continues. Now also page faults are blocked
 * but internal fs threads can still modify the filesystem (although they
 * should not dirty new pages or inodes), writeback can run etc. After waiting
 * for all running page faults we sync the filesystem which will clean all
 * dirty pages and inodes (no new dirty pages or inodes can be created when
 * sync is running).
 *
 * SB_FREEZE_FS: The file system is frozen. Now all internal sources of fs
 * modification are blocked (e.g. XFS preallocation truncation on inode
 * reclaim). This is usually implemented by blocking new transactions for
 * filesystems that have them and need this additional guard. After all
 * internal writers are finished we call ->freeze_fs() to finish filesystem
 * freezing. Then we transition to SB_FREEZE_COMPLETE state. This state is
 * mostly auxiliary for filesystems to verify they do not modify frozen fs.
 *
 * sb->s_writers.frozen is protected by sb->s_umount.
 */

最後に、kernel/locking/percpu-rwsem.c:70から

/*
 * Like the normal down_read() this is not recursive, the writer can
 * come after the first percpu_down_read() and create the deadlock.
 *
 * Note: returns with lock_is_held(brw->rw_sem) == T for lockdep,
 * percpu_up_read() does rwsem_release(). This pairs with the usage
 * of ->rw_sem in percpu_down/up_write().
 */
void percpu_down_read(struct percpu_rw_semaphore *brw)
{
    might_sleep();
    rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 0, _RET_IP_);

    if (likely(update_fast_ctr(brw, +1)))
        return;

    /* Avoid rwsem_acquire_read() and rwsem_release() */
    __down_read(&brw->rw_sem);
    atomic_inc(&brw->slow_read_ctr);
    __up_read(&brw->rw_sem);
}
EXPORT_SYMBOL_GPL(percpu_down_read);

さて、これはロックです。

Answer 1