Little's MCAR test in Python

8

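How can I run Little's test for MCAR (missing completely at random) in Python? I have looked at the R packages for this test, but I want to do it in Python. Is there any other way to test for MCAR?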


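What about the impyute library? Little's MCAR test (WIP) is in its feature list. - Istrel
@Istrel The impyute library doesn't explain how to do this (as far as I can tell). Could you detail the steps or link to the relevant documentation? - Kiran
The impyute library has a ticket for implementing Little's MCAR test, but there has been no progress on it so far: https://github.com/eltonlaw/impyute/issues/71 - skeller88
4 Answers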

3

You can use rpy2 to call the MCAR test from R. Note that using rpy2 requires writing some R code.

Setting up rpy2 in Google Colab

# rpy2 libraries
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects import globalenv

# Import R's base package
base = importr("base")

# Import R's utility packages
utils = importr("utils")

# Select mirror 
utils.chooseCRANmirror(ind=1)

# For automatic translation of Pandas objects to R
pandas2ri.activate()

# Enable R magic
%load_ext rpy2.ipython

# Make your Pandas dataframe accessible to R
globalenv["r_df"] = df

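You can now use R magics to access R functionality from your Python environment. Use %R to run a single line of R code and %%R to run a whole cell of R code.

To install an R package, use: utils.install_packages("package_name")

You may also need to load it before use: %R library(package_name)

For Little's MCAR test, we need to install the naniar package. Its installation is slightly more involved, because we also need to install remotes in order to download it from GitHub; for other packages the general procedure should be enough.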

utils.install_packages("remotes")
%R remotes::install_github("njtierney/naniar")

Load the naniar package:

%R library(naniar)

Pass your r_df to the mcar_test function:
# mcar_test on whole df
%R mcar_test(r_df)

If an error occurs, try including only the columns that have missing data:
%%R
# mcar_test on columns with missing data
r_dfMissing <- r_df[c("col1", "col2", "col3")]
mcar_test(r_dfMissing)
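
The column names in the R snippet above are placeholders. On the Python side, a minimal pandas sketch for finding which columns actually contain missing values could look like this. The toy DataFrame and the names `cols_with_missing` and `df_missing` are made up for illustration and are not part of the original answer:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; replace with your own DataFrame.
df = pd.DataFrame({
    "col1": [1.0, np.nan, 3.0, 4.0],
    "col2": [np.nan, 2.0, 3.0, 4.0],
    "col3": [1.0, 2.0, 3.0, 4.0],  # fully observed
})

# Names of the columns that contain at least one missing value
cols_with_missing = df.columns[df.isna().any()].tolist()

# Subset that could be handed to R instead of the full frame
df_missing = df[cols_with_missing]
print(cols_with_missing)
```

The resulting list can then be used in place of the hard-coded c("col1", "col2", "col3") when subsetting r_df.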

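OK. Could you briefly explain why only the variables with missing data should be included? I thought the idea was to assess differences between variables grouped by missing/non-missing, and I can't imagine the method working if we drop the columns without missing values. - Johan
1
That's a good question. The only reason I suggested including just the variables with missing data is that the mcar_test() function raised an error otherwise. I'm not sure whether this happens in every case or only with the data I tried. - Akis Hadjimpalasis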
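
To illustrate the point raised in this exchange, here is a small pandas sketch that groups rows by missingness pattern and compares the means of the observed values. The toy data and the variable names are hypothetical. It only shows the grouping idea that Little's test builds on, not the test itself:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "x" is fully observed, "y" has gaps.
df = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "y": [10.0, np.nan, 30.0, np.nan, 50.0, np.nan],
})

# One label per row describing which cells are missing (1 = missing)
pattern = df.isna().apply(
    lambda row: "".join("1" if missing else "0" for missing in row), axis=1
)

# Mean of every column within each missingness pattern
pattern_means = df.groupby(pattern).mean()
print(pattern_means)
```

In this toy example the rows where "y" is missing can only be compared with the others through "x". If the fully observed column were dropped, those rows would have nothing left to compare, which seems to be the concern in the comment above.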

2

You can use this function to run Little's MCAR test directly, without using any R code:

import numpy as np
import pandas as pd
from scipy.stats import chi2

def little_mcar_test(data, alpha=0.05):
    """
    Performs Little's MCAR (Missing Completely At Random) test on a dataset with missing values.
    
    Parameters:
    data (DataFrame): A pandas DataFrame with n observations and p variables, where some values are missing.
    alpha (float): The significance level for the hypothesis test (default is 0.05).
    
    Returns:
    A tuple containing:
    - A matrix of missing values that represents the pattern of missingness in the dataset.
    - A p-value representing the significance of the MCAR test.
    """
    
    # Calculate the proportion of missing values in each variable
    p_m = data.isnull().mean()
    
    # Calculate the proportion of complete cases for each variable
    p_c = data.dropna().shape[0] / data.shape[0]
    
    # Calculate the correlation matrix for all pairs of variables that have complete cases
    R_c = data.dropna().corr()
    
    # Calculate the correlation matrix for all pairs of variables using all observations
    R_all = data.corr()
    
    # Calculate the difference between the two correlation matrices
    R_diff = R_all - R_c
    
    # Calculate the variance of the R_diff matrix
    V_Rdiff = np.var(R_diff, ddof=1)
    
    # Calculate the expected value of V_Rdiff under the null hypothesis that the missing data is MCAR
    E_Rdiff = (1 - p_c) / (1 - p_m).sum()
    
    # Calculate the test statistic
    T = np.trace(R_diff) / np.sqrt(V_Rdiff * E_Rdiff)
    
    # Calculate the degrees of freedom
    df = data.shape[1] * (data.shape[1] - 1) / 2
    
    # Calculate the p-value using a chi-squared distribution with df degrees of freedom and the test statistic T
    p_value = 1 - chi2.cdf(T ** 2, df)
    
    # Create a matrix of missing values that represents the pattern of missingness in the dataset
    missingness_matrix = data.isnull().astype(int)
    
    # Return the missingness matrix and the p-value
    return missingness_matrix, p_value


Cool. What form of input do you expect? Also, I think Little's test should return a single test statistic and a single p-value, not one per column. - Johan
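
For reference, Little's statistic is a single number: rows are grouped by missingness pattern j, and d² = Σ_j n_j (ȳ_j − μ_j)ᵀ Σ_j⁻¹ (ȳ_j − μ_j) is compared with a chi-squared distribution with Σ_j p_j − p degrees of freedom, where p_j is the number of observed variables in pattern j and p is the total number of variables. The following is only a rough sketch of that idea, not the code from the answer above. It assumes all columns are numeric, and it substitutes available-case estimates of the mean and covariance for the EM maximum-likelihood estimates used in Little's original procedure, so its results can differ from those of established implementations. The function name `little_mcar_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def little_mcar_sketch(data):
    """Simplified Little's MCAR test: returns (d2, dof, p_value)."""
    values = pd.DataFrame(data).to_numpy(dtype=float)
    observed = ~np.isnan(values)
    n_vars = values.shape[1]

    # Assumption: available-case estimates stand in for the EM
    # maximum-likelihood estimates of Little's original test.
    mu = np.nanmean(values, axis=0)
    sigma = pd.DataFrame(values).cov().to_numpy()  # pairwise-complete

    # Collect the row indices of each distinct missingness pattern
    groups = {}
    for i, row in enumerate(observed):
        groups.setdefault(tuple(row), []).append(i)

    d2, dof = 0.0, -n_vars
    for pattern, idx in groups.items():
        obs = np.array(pattern)
        if not obs.any():
            continue  # rows with nothing observed carry no information
        block = values[idx][:, obs]
        diff = block.mean(axis=0) - mu[obs]
        sub_sigma = sigma[np.ix_(obs, obs)]
        d2 += len(idx) * diff @ np.linalg.solve(sub_sigma, diff)
        dof += int(obs.sum())

    p_value = chi2.sf(d2, dof) if dof > 0 else float("nan")
    return d2, dof, p_value


# Hypothetical demo: values are removed by row position, unrelated to the data
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(300, 3)), columns=["a", "b", "c"])
demo.loc[:49, "a"] = np.nan
demo.loc[50:99, "b"] = np.nan
d2, dof, p_value = little_mcar_sketch(demo)
print(d2, dof, p_value)
```

The input is a DataFrame (or anything pandas can turn into one) with NaN marking missing cells, and the output is one statistic, its degrees of freedom, and one p-value. With only one missingness pattern the degrees of freedom are zero and the sketch returns NaN for the p-value.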

0
The comments suggest using an existing package. Here is an example taken directly from pyampute:
import pandas as pd
from pyampute.exploration.mcar_statistical_tests import MCARTest
data_mcar = pd.read_table("data/missingdata_mcar.csv")
mt = MCARTest(method="little")
print(mt.little_mcar_test(data_mcar))
0.17365464213775494

0
import numpy as np
import pandas as pd
from scipy.stats import chi2

def little_mcar_test(data, alpha=0.05):
    """
    Performs Little's MCAR (Missing Completely At Random) test on a dataset with missing values.
    """
    data = pd.DataFrame(data)
    data.columns = ['x' + str(i) for i in range(data.shape[1])]
    data['missing'] = np.sum(data.isnull(), axis=1)
    n = data.shape[0]
    k = data.shape[1] - 1
    df = k * (k - 1) / 2
    chi2_crit = chi2.ppf(1 - alpha, df)
    chi2_val = ((n - 1 - (k - 1) / 2) ** 2) / (k - 1) / ((n - k) * np.mean(data['missing']))
    p_val = 1 - chi2.cdf(chi2_val, df)
    if chi2_val > chi2_crit:
        print(
            'Reject null hypothesis: Data is not MCAR (p-value={:.4f}, chi-square={:.4f})'.format(p_val, chi2_val)
        )
    else:
        print(
            'Do not reject null hypothesis: Data is MCAR (p-value={:.4f}, chi-square={:.4f})'.format(p_val, chi2_val)
        )

1
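As it's currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. - Community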
