How can I perform Little's test for MCAR in Python? I have looked at the R packages for this test, but I would like to run it in Python. Are there other ways to test for MCAR?
You can use rpy2 to call the MCAR test from R. Note that using rpy2 requires some R coding.
Setting up rpy2 in Google Colab:
# rpy2 libraries
import rpy2.robjects as robjects
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
from rpy2.robjects import globalenv
# Import R's base package
base = importr("base")
# Import R's utility packages
utils = importr("utils")
# Select mirror
utils.chooseCRANmirror(ind=1)
# For automatic translation of Pandas objects to R
pandas2ri.activate()
# Enable R magic
%load_ext rpy2.ipython
# Make your Pandas dataframe accessible to R
globalenv["r_df"] = df
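You can now use R magics to access R functionality from your Python environment. Use %R to run a single line of R code and %%R to run a whole cell of R code.
To install an R package, use: utils.install_packages("package_name")
You may also need to load it before use: %R library(package_name)
For Little's MCAR test, we need to install the naniar package. Its installation is slightly more involved, because we also need to install remotes in order to download it from GitHub; for other packages the general procedure should be enough.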
utils.install_packages("remotes")
%R remotes::install_github("njtierney/naniar")
Load the naniar package:
%R library(naniar)
Pass r_df to the mcar_test function:
# mcar_test on whole df
%R mcar_test(r_df)
%%R
# mcar_test on columns with missing data
r_dfMissing <- r_df[c("col1", "col2", "col3")]
mcar_test(r_dfMissing)
You can use this function directly to run Little's MCAR test without writing any R code:
import numpy as np
import pandas as pd
from scipy.stats import chi2
def little_mcar_test(data, alpha=0.05):
    """
    Performs Little's MCAR (Missing Completely At Random) test on a dataset with missing values.

    Parameters:
    data (DataFrame): A pandas DataFrame with n observations and p variables, where some values are missing.
    alpha (float): The significance level for the hypothesis test (default is 0.05).

    Returns:
    A tuple containing:
    - A matrix of missing values that represents the pattern of missingness in the dataset.
    - A p-value representing the significance of the MCAR test.
    """
    # Calculate the proportion of missing values in each variable
    p_m = data.isnull().mean()
    # Calculate the proportion of complete cases for each variable
    p_c = data.dropna().shape[0] / data.shape[0]
    # Calculate the correlation matrix for all pairs of variables that have complete cases
    R_c = data.dropna().corr()
    # Calculate the correlation matrix for all pairs of variables using all observations
    R_all = data.corr()
    # Calculate the difference between the two correlation matrices
    R_diff = R_all - R_c
    # Calculate the variance of the R_diff matrix
    V_Rdiff = np.var(R_diff, ddof=1)
    # Calculate the expected value of V_Rdiff under the null hypothesis that the missing data is MCAR
    E_Rdiff = (1 - p_c) / (1 - p_m).sum()
    # Calculate the test statistic
    T = np.trace(R_diff) / np.sqrt(V_Rdiff * E_Rdiff)
    # Calculate the degrees of freedom
    df = data.shape[1] * (data.shape[1] - 1) / 2
    # Calculate the p-value using a chi-squared distribution with df degrees of freedom and the test statistic T
    p_value = 1 - chi2.cdf(T ** 2, df)
    # Create a matrix of missing values that represents the pattern of missingness in the dataset
    missingness_matrix = data.isnull().astype(int)
    # Return the missingness matrix and the p-value
    return missingness_matrix, p_value
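For reference, Little's statistic is built from mean differences per missingness pattern: rows are grouped by which variables are observed, and for each group the observed-variable means are compared with the overall mean estimate, weighted by the inverse covariance. The sum is referred to a chi-squared distribution with (total number of observed variables across patterns) minus (number of variables) degrees of freedom. It is worth checking any pure-Python implementation against this definition. Below is a minimal sketch of that computation. The name `little_mcar_sketch` is made up for illustration, all columns are assumed numeric, and it uses available-case means and pandas' pairwise covariance in place of the EM maximum-likelihood estimates from Little's paper, so the p-value is only approximate and may differ from what naniar reports.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2


def little_mcar_sketch(data: pd.DataFrame):
    """Return (d2, dof, p_value) for a simplified Little's MCAR test.

    Assumption: overall mean/covariance are available-case estimates,
    not the EM maximum-likelihood estimates used in the original test.
    """
    values = data.to_numpy(dtype=float)
    mask = np.isnan(values)
    grand_mean = np.nanmean(values, axis=0)   # available-case means
    grand_cov = data.cov().to_numpy()         # pairwise-complete covariance

    # Identify the distinct missingness patterns and which row has which pattern
    patterns, inverse = np.unique(mask, axis=0, return_inverse=True)
    inverse = inverse.ravel()

    d2 = 0.0
    n_observed_vars = 0
    for j, pattern in enumerate(patterns):
        observed = ~pattern
        if not observed.any():
            continue  # rows with nothing observed carry no information
        rows = values[inverse == j][:, observed]
        diff = rows.mean(axis=0) - grand_mean[observed]
        cov = grand_cov[np.ix_(observed, observed)]
        # n_j * diff' * inv(cov) * diff, summed over patterns
        d2 += rows.shape[0] * float(diff @ np.linalg.solve(cov, diff))
        n_observed_vars += int(observed.sum())

    dof = n_observed_vars - data.shape[1]
    p_value = float(chi2.sf(d2, dof))
    return d2, dof, p_value
```

A small p-value is evidence against MCAR. With no missing values at all there is a single pattern, the degrees of freedom are zero, and the p-value is undefined (NaN).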
An example taken from pyampute:
import pandas as pd
from pyampute.exploration.mcar_statistical_tests import MCARTest
data_mcar = pd.read_table("data/missingdata_mcar.csv")
mt = MCARTest(method="little")
print(mt.little_mcar_test(data_mcar))
0.17365464213775494
import numpy as np
import pandas as pd
from scipy.stats import chi2
def little_mcar_test(data, alpha=0.05):
    """
    Performs Little's MCAR (Missing Completely At Random) test on a dataset with missing values.
    """
    data = pd.DataFrame(data)
    data.columns = ['x' + str(i) for i in range(data.shape[1])]
    data['missing'] = np.sum(data.isnull(), axis=1)
    n = data.shape[0]
    k = data.shape[1] - 1
    df = k * (k - 1) / 2
    chi2_crit = chi2.ppf(1 - alpha, df)
    chi2_val = ((n - 1 - (k - 1) / 2) ** 2) / (k - 1) / ((n - k) * np.mean(data['missing']))
    p_val = 1 - chi2.cdf(chi2_val, df)
    if chi2_val > chi2_crit:
        print(
            'Reject null hypothesis: Data is not MCAR (p-value={:.4f}, chi-square={:.4f})'.format(p_val, chi2_val)
        )
    else:
        print(
            'Do not reject null hypothesis: Data is MCAR (p-value={:.4f}, chi-square={:.4f})'.format(p_val, chi2_val)
        )
What about the impyute library? Little's MCAR test (WIP) is in its feature list. - Istrel
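On the "other ways to test MCAR" part of the question: a common informal check is to turn each variable's missingness into an indicator and compare the other variables between the "missing" and "observed" groups, for example with Welch t-tests. Below is a minimal sketch under that assumption. The function name `missingness_ttests` and the `min_group_size` cutoff are invented for illustration. Because it runs many tests, the p-values should be read with a multiple-comparison correction in mind, and a non-significant result does not prove MCAR.

```python
import pandas as pd
from scipy.stats import ttest_ind


def missingness_ttests(data: pd.DataFrame, min_group_size: int = 5) -> pd.DataFrame:
    """Welch t-test of each numeric column, split by whether another column is missing."""
    numeric_cols = list(data.select_dtypes(include="number").columns)
    cols_with_missing = [c for c in data.columns if data[c].isnull().any()]
    records = []
    for indicator_col in cols_with_missing:
        is_missing = data[indicator_col].isnull()
        for value_col in numeric_cols:
            if value_col == indicator_col:
                continue
            group_missing = data.loc[is_missing, value_col].dropna()
            group_observed = data.loc[~is_missing, value_col].dropna()
            # Skip comparisons where either group is too small to be meaningful
            if min(len(group_missing), len(group_observed)) < min_group_size:
                continue
            t_stat, p_value = ttest_ind(group_missing, group_observed, equal_var=False)
            records.append(
                {"missing_in": indicator_col, "compared_on": value_col,
                 "t": float(t_stat), "p_value": float(p_value)}
            )
    return pd.DataFrame(records, columns=["missing_in", "compared_on", "t", "p_value"])
```

Each row of the result says whether the values of `compared_on` differ depending on whether `missing_in` is missing; a clear difference suggests the data are not MCAR.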