Little's MCAR test in Python

Question

python-3.x statistics missing-data imputation hypothesis-test

8

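How can I perform Little's test for MCAR (missing completely at random) in Python? I have looked at the R packages for this test, but I want to run it in Python. Are there any other ways to test for MCAR?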

- Saurabh Verma

What about the impyute library? Little's MCAR test (WIP) is in its feature list. - Istrel

@Istrel As far as I can tell, the impyute library does not explain how to do this. Could you describe the steps or link to the relevant documentation? - Kiran

The impyute library has a ticket for implementing Little's MCAR test, but there has been no progress on it so far: https://github.com/eltonlaw/impyute/issues/71 - skeller88

4 Answers

2

You can use this function directly to run Little's MCAR test, without having to use R code:

import numpy as np
import pandas as pd
from scipy.stats import chi2

def little_mcar_test(data, alpha=0.05):
    """
    Performs Little's MCAR (Missing Completely At Random) test on a dataset with missing values.
    
    Parameters:
    data (DataFrame): A pandas DataFrame with n observations and p variables, where some values are missing.
    alpha (float): The significance level for the hypothesis test (default is 0.05).
    
    Returns:
    A tuple containing:
    - A matrix of missing values that represents the pattern of missingness in the dataset.
    - A p-value representing the significance of the MCAR test.
    """
    
    # Calculate the proportion of missing values in each variable
    p_m = data.isnull().mean()
    
    # Calculate the proportion of complete cases for each variable
    p_c = data.dropna().shape[0] / data.shape[0]
    
    # Calculate the correlation matrix for all pairs of variables that have complete cases
    R_c = data.dropna().corr()
    
    # Calculate the correlation matrix for all pairs of variables using all observations
    R_all = data.corr()
    
    # Calculate the difference between the two correlation matrices
    R_diff = R_all - R_c
    
    # Calculate the variance of the R_diff matrix
    V_Rdiff = np.var(R_diff, ddof=1)
    
    # Calculate the expected value of V_Rdiff under the null hypothesis that the missing data is MCAR
    E_Rdiff = (1 - p_c) / (1 - p_m).sum()
    
    # Calculate the test statistic
    T = np.trace(R_diff) / np.sqrt(V_Rdiff * E_Rdiff)
    
    # Calculate the degrees of freedom
    df = data.shape[1] * (data.shape[1] - 1) / 2
    
    # Calculate the p-value using a chi-squared distribution with df degrees of freedom and the test statistic T
    p_value = 1 - chi2.cdf(T ** 2, df)
    
    # Create a matrix of missing values that represents the pattern of missingness in the dataset
    missingness_matrix = data.isnull().astype(int)
    
    # Return the missingness matrix and the p-value
    return missingness_matrix, p_value

- Sadegh

Cool. What form of input do you expect? Also, I think Little's test should return a single test statistic and a single p-value, not one per column. - Johan
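
For reference, Little's test yields one chi-square statistic for the whole dataset. For each missingness pattern j with n_j rows, it compares the means of the observed variables in that pattern with the overall mean estimate: d2 = sum_j n_j (ybar_j - mu_j)' Sigma_j^-1 (ybar_j - mu_j), with degrees of freedom sum_j p_j - p, where p_j is the number of observed variables in pattern j and p is the total number of variables. By contrast, `np.trace(R_diff)` in the function above is the trace of the difference of two correlation matrices, whose diagonals are normally both 1, so that statistic comes out as zero regardless of the data. The following is a minimal sketch, not a definitive implementation: it assumes numeric columns and at least two missingness patterns, and it uses pandas' available-case mean and covariance in place of the maximum-likelihood (EM) estimates the test formally calls for. The name `little_mcar_sketch` is hypothetical.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def little_mcar_sketch(data: pd.DataFrame):
    """Return (d2, df, p_value) for an approximate Little's MCAR test.

    Assumption: available-case mean/covariance stand in for ML (EM) estimates.
    """
    cols = data.columns
    grand_mean = data.mean()   # per-column mean over observed values
    grand_cov = data.cov()     # pairwise-complete covariance matrix
    observed = data.notna()    # True where a value is present

    d2 = 0.0
    n_observed_total = 0
    # Visit each distinct pattern of observed/missing columns once
    for _, pattern in observed.drop_duplicates().iterrows():
        seen = cols[pattern.to_numpy()]
        if len(seen) == 0:
            continue  # rows with nothing observed carry no information
        in_pattern = (observed == pattern).all(axis=1)
        block = data.loc[in_pattern, seen]
        diff = (block.mean() - grand_mean[seen]).to_numpy()
        cov = grand_cov.loc[seen, seen].to_numpy()
        # Contribution of this pattern: n_j * diff' * cov^-1 * diff
        d2 += len(block) * (diff @ np.linalg.solve(cov, diff))
        n_observed_total += len(seen)

    df = n_observed_total - len(cols)
    p_value = chi2.sf(d2, df)  # upper-tail probability of chi-square(df)
    return d2, df, p_value
```

The input is a DataFrame with `NaN` marking missing cells, and the result is a single statistic, its degrees of freedom, and a single p-value. A small p-value is evidence against MCAR.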

0

The comments suggest using an existing package. Here is an example taken directly from pyampute.

import pandas as pd
from pyampute.exploration.mcar_statistical_tests import MCARTest
data_mcar = pd.read_table("data/missingdata_mcar.csv")
mt = MCARTest(method="little")
print(mt.little_mcar_test(data_mcar))
0.17365464213775494

- Johan
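
If the example CSV is not at hand, the same call can be tried on synthetic data. The sketch below is only an illustration: the column names, sample size, and 10% missingness rate are arbitrary assumptions, and the pyampute call is guarded so the data-generation part still runs when the package is not installed.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Assumed toy data: 500 rows, 4 numeric columns
data_mcar = pd.DataFrame(
    rng.normal(size=(500, 4)), columns=["x0", "x1", "x2", "x3"]
)

# MCAR mechanism: every cell is dropped with the same probability,
# independently of any observed or unobserved value
mask = rng.random(data_mcar.shape) < 0.1
data_mcar = data_mcar.mask(mask)

try:
    from pyampute.exploration.mcar_statistical_tests import MCARTest
except ImportError:
    MCARTest = None  # pyampute not installed; skip the test call

if MCARTest is not None:
    # Same call as in the answer above; returns a single p-value
    p_value = MCARTest(method="little").little_mcar_test(data_mcar)
    print(f"p-value: {p_value:.4f}")
```

Because the missingness here is generated independently of the data, a large p-value (no rejection of MCAR) would be the typical outcome.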

0

import numpy as np
import pandas as pd
from scipy.stats import chi2

def little_mcar_test(data, alpha=0.05):
    """
    Performs Little's MCAR (Missing Completely At Random) test on a dataset with missing values.
    """
    data = pd.DataFrame(data)
    data.columns = ['x' + str(i) for i in range(data.shape[1])]
    data['missing'] = np.sum(data.isnull(), axis=1)
    n = data.shape[0]
    k = data.shape[1] - 1
    df = k * (k - 1) / 2
    chi2_crit = chi2.ppf(1 - alpha, df)
    chi2_val = ((n - 1 - (k - 1) / 2) ** 2) / (k - 1) / ((n - k) * np.mean(data['missing']))
    p_val = 1 - chi2.cdf(chi2_val, df)
    if chi2_val > chi2_crit:
        print(
            'Reject null hypothesis: Data is not MCAR (p-value={:.4f}, chi-square={:.4f})'.format(p_val, chi2_val)
        )
    else:
        print(
            'Do not reject null hypothesis: Data is MCAR (p-value={:.4f}, chi-square={:.4f})'.format(p_val, chi2_val)
        )

- Tamunoala

1

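As it’s currently written, your answer is unclear. Please edit to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers in the help center. - Community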

- Akis Hadjimpalasis · Accepted Answer

You can use rpy2 to call the MCAR test from R. Note that using rpy2 requires some R coding.

Setting up rpy2 in Google Colab

# rpy2 libraries
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects import globalenv

# Import R's base package
base = importr("base")

# Import R's utility packages
utils = importr("utils")

# Select mirror 
utils.chooseCRANmirror(ind=1)

# For automatic translation of Pandas objects to R
pandas2ri.activate()

# Enable R magic
%load_ext rpy2.ipython

# Make your Pandas dataframe accessible to R
globalenv["r_df"] = df

Now you can use R magics to access R functionality from your Python environment. Use %R to run a single line of R code and %%R to run a whole cell of R code.

To install an R package, use: utils.install_packages("package_name")

You may also need to load it before use: %R library(package_name)

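For Little's MCAR test, we need to install the naniar package. Installing it is slightly more involved, because we also need to install remotes in order to download it from GitHub; for other packages the general procedure above should be enough.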

utils.install_packages("remotes")
%R remotes::install_github("njtierney/naniar")

Load the naniar package:

%R library(naniar)

Pass your r_df to the mcar_test function:

# mcar_test on whole df
%R mcar_test(r_df)

If an error occurs, try including only the columns with missing data:

%%R
# mcar_test on columns with missing data
r_dfMissing <- r_df[c("col1", "col2", "col3")]
mcar_test(r_dfMissing)
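
Rather than typing the column names by hand, the columns that contain missing values can also be picked out on the pandas side before the frame is handed to R. This is a minimal sketch; the helper name `columns_with_missing` and the toy frame are illustrative assumptions, not part of rpy2 or naniar.

```python
import numpy as np
import pandas as pd


def columns_with_missing(frame: pd.DataFrame) -> list:
    """Return the names of columns that contain at least one missing value."""
    return frame.columns[frame.isna().any()].tolist()


# Toy frame for illustration: col3 is fully observed
example = pd.DataFrame(
    {
        "col1": [1.0, np.nan, 3.0],
        "col2": [np.nan, 2.0, 3.0],
        "col3": [1.0, 2.0, 3.0],
    }
)

missing_cols = columns_with_missing(example)
print(missing_cols)  # ['col1', 'col2']
```

With the setup above, the filtered frame could then be exposed to R with `globalenv["r_df"] = df[columns_with_missing(df)]` before calling `mcar_test(r_df)`.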