Why does this code deadlock?

Question

linux-kernel kernel deadlock watchdog spinlock

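I created 2 Linux kernel threads in a loadable module and bound them to separate CPU cores on a dual-core Android device. After a few runs, I noticed that the device reboots with a hardware watchdog timer reset. I hit this issue consistently. What could be causing the deadlock?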

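Basically, what I need is to make sure both threads run do_something() at the same time on different cores without anybody stealing CPU cycles (i.e. with interrupts disabled). I am using a spinlock and a volatile variable for this. I also have a semaphore on which the parent thread waits for the child thread.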

#define CPU_COUNT 2

/* Globals */
spinlock_t lock;
struct semaphore sem;
volatile unsigned long count;

/* Thread util function for binding the thread to CPU*/
struct task_struct* thread_init(kthread_fn fn, void* data, int cpu)
{
    struct task_struct *ts;

    ts=kthread_create(fn, data, "per_cpu_thread");
    kthread_bind(ts, cpu);
    if (!IS_ERR(ts)) {
        wake_up_process(ts);
    }
    else {
        ERR("Failed to bind thread to CPU %d\n", cpu);
    }
    return ts;
}

/* Sync both threads */
void thread_sync()
{   
    spin_lock(&lock);
    ++count;
    spin_unlock(&lock); 

    while (count != CPU_COUNT);
}

void do_something()
{
}

/* Child thread */
int per_cpu_thread_fn(void* data)
{
    int i = 0;
    unsigned long flags = 0;
    int cpu = smp_processor_id();

    DBG("per_cpu_thread entering (cpu:%d)...\n", cpu);

    /* Disable local interrupts */
    local_irq_save(flags);

    /* sync threads */
    thread_sync();

    /* Do something */
    do_something();

    /* Enable interrupts */
    local_irq_restore(flags);

    /* Notify parent about exit */
    up(&sem);
    DBG("per_cpu_thread exiting (cpu:%d)...\n", cpu);
    return value;
}

/* Main thread */
int main_thread()
{
    int cpuB;
    int cpu = smp_processor_id();
    unsigned long flags = 0;

    DBG("main thread running (cpu:%d)...\n", cpu);

    /* Init globals*/
    sema_init(&sem, 0);
    spin_lock_init(&lock);
    count = 0;

    /* Launch child thread and bind to the other CPU core */
    if (cpu == 0) cpuB = 1; else cpuB = 0;        
    thread_init(per_cpu_thread_fn, NULL, cpuB);

    /* Disable local interrupts */
    local_irq_save(flags);

    /* thread sync */
    thread_sync();

    /* Do something here */
    do_something();

    /* Enable interrupts */
    local_irq_restore(flags);

    /* Wait for child to join */
    DBG("main thread waiting for all child threads to finish ...\n");
    down_interruptible(&sem);
}

- Gupta

1 Answer

- Alexey Shmalko · Answer 1

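I'm not sure this is the real cause, but your code contains some serious errors. First, in "while (count != CPU_COUNT);" you must not read a shared variable without holding the lock unless the read is atomic. For "count", that is not guaranteed. You must protect the read of "count" with the lock. You can replace the while loop with the following: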

unsigned long local_count;
do {
    spin_lock(&lock);
    local_count = count;
    spin_unlock(&lock);
} while (local_count != CPU_COUNT);

Alternatively, you can use atomic types. Note that no locking is needed:

atomic_t count = ATOMIC_INIT(0);

...

void thread_sync() {
    atomic_inc(&count);
    while (atomic_read(&count) != CPU_COUNT);
}
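
For readers who want to try the rendezvous pattern outside the kernel, here is a minimal userspace sketch of the same idea. It is only an analogy: it assumes C11 <stdatomic.h> and POSIX threads instead of the kernel's atomic_t and kthreads, and the names arrived, worker and run_rendezvous are made up for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define THREAD_COUNT 2

/* Userspace stand-in for the kernel's atomic_t counter (assumption). */
static atomic_int arrived;

/* Announce arrival, then spin until every participant has arrived. */
static void thread_sync(void)
{
    atomic_fetch_add(&arrived, 1);
    while (atomic_load(&arrived) != THREAD_COUNT)
        ; /* busy-wait; every read is atomic, so no lock is needed */
}

/* Child thread: only takes part in the rendezvous. */
static void *worker(void *arg)
{
    (void)arg;
    thread_sync();
    return NULL;
}

/* Hypothetical helper: runs one rendezvous between the caller and one
 * child thread. Returns the final counter value, or -1 if the child
 * thread could not be created. */
static int run_rendezvous(void)
{
    pthread_t child;
    int rc;

    atomic_store(&arrived, 0);
    if (pthread_create(&child, NULL, worker, NULL) != 0)
        return -1;

    thread_sync();                  /* the parent takes part too */

    rc = pthread_join(child, NULL);
    assert(rc == 0);
    (void)rc;                       /* silence unused warning under NDEBUG */
    return atomic_load(&arrived);
}
```

Calling run_rendezvous() should return THREAD_COUNT once both threads have passed the barrier. Unlike the kernel version, nothing here disables interrupts or preemption, so it only demonstrates the counting logic.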

The second problem is with interrupts. I don't think you understand what you are doing.

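local_irq_save() saves the interrupt state and disables interrupts. Then you disable interrupts again with local_irq_disable(). After doing some work, you restore the previous state with local_irq_restore() and then enable interrupts with local_irq_enable(). That final enable is completely wrong: you enable interrupts regardless of their previous state.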
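
To make the difference concrete, here is a small userspace model of the save/restore semantics. It is a simulation only: sim_irqs_enabled and the sim_* macros are invented for this sketch and merely mimic the calling convention of the real local_irq_save(flags) and local_irq_restore(flags) macros, which take the flags variable by name.

```c
#include <assert.h>
#include <stdbool.h>

/* Simulated interrupt-enable flag for one CPU. NOT the kernel API. */
static bool sim_irqs_enabled = true;

/* Save the current state into flags, then disable "interrupts". */
#define sim_irq_save(flags) \
    do { (flags) = sim_irqs_enabled ? 1UL : 0UL; sim_irqs_enabled = false; } while (0)

/* Put back exactly the state that was saved in flags. */
#define sim_irq_restore(flags) \
    do { sim_irqs_enabled = ((flags) != 0UL); } while (0)

/* Enable "interrupts" unconditionally, ignoring any earlier state. */
#define sim_irq_enable() \
    do { sim_irqs_enabled = true; } while (0)

/* Correct pairing: the caller gets back the state it came in with. */
static bool critical_section_restore(bool enabled_on_entry)
{
    unsigned long flags;

    sim_irqs_enabled = enabled_on_entry;
    sim_irq_save(flags);
    assert(!sim_irqs_enabled);      /* work would happen here */
    sim_irq_restore(flags);
    return sim_irqs_enabled;
}

/* Buggy pairing: the extra enable clobbers the caller's state. */
static bool critical_section_enable(bool enabled_on_entry)
{
    unsigned long flags;

    sim_irqs_enabled = enabled_on_entry;
    sim_irq_save(flags);
    assert(!sim_irqs_enabled);      /* work would happen here */
    sim_irq_restore(flags);
    sim_irq_enable();               /* wrong: ignores the saved state */
    return sim_irqs_enabled;
}
```

If the caller enters with interrupts already off, critical_section_restore(false) leaves them off, while critical_section_enable(false) turns them on behind the caller's back. That is the failure mode described above.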

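The third problem: if the main thread is not bound to a CPU, you should not use smp_processor_id() unless you are sure the kernel will not reschedule right after you get the CPU number. It is better to use get_cpu(), which disables kernel preemption and then returns the CPU id. When you are done, call put_cpu().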

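However, creating and running other threads between get_cpu() and put_cpu() is itself a bug, because preemption is disabled there and kthread_create() may sleep. That is why you should set the affinity of the main thread instead.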

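Fourth: the local_irq_save() and local_irq_restore() macros take a variable, not a pointer to unsigned long. (I got an error and some warnings when passing pointers there. I wonder how you compiled your code.) Remove the referencing.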

The final code is available here: http://pastebin.com/Ven6wqWf