Python - Encoding genomic data in a dataframe

Question

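Hello, I'm trying to encode genome strings stored in a CSV file. Right now I want to split each string in the "Genome" column of the dataframe into a list of its base pairs, e.g. turn ('acgt...') into ('a', 'c', 'g', 't'...), and then convert each base pair to a float (0.25, 0.50, 0.75, 1.00).

I tried using the split function to break each string into characters, but nothing seems to work on the dataframe, even after converting to a string with .tostring.

Here is my most recent code: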

import re
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder


def string_to_array(my_string):
    my_string = my_string.lower()
    my_string = re.sub('[^acgt]', 'z', my_string)
    my_array = np.array(list(my_string))
    return my_array

label_encoder = LabelEncoder()
label_encoder.fit(np.array(['a','g','c','t','z']))

def ordinal_encoder(my_array):
    integer_encoded = label_encoder.transform(my_array)
    float_encoded = integer_encoded.astype(float)
    float_encoded[float_encoded == 0] = 0.25  # A
    float_encoded[float_encoded == 1] = 0.50  # C
    float_encoded[float_encoded == 2] = 0.75  # G
    float_encoded[float_encoded == 3] = 1.00  # T
    float_encoded[float_encoded == 4] = 0.00  # anything else, z
    return float_encoded



dfpath = 'C:\\Users\\CAAVR\\Desktop\\Ison.csv'
dataframe = pd.read_csv(dfpath)

df = ordinal_encoder(string_to_array(dataframe[['Genome']].values.tostring()))
print(df)

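I tried writing the functions myself, but I don't really understand how they work. Everything I try points to the data not being processable while it sits in a numpy array, and nothing I've tried converts the data to another type.

Thanks for any tips!

Edit: here is the printout of the dataframe: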
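A minimal sketch of the splitting step in isolation, assuming the column holds plain Python strings. With no arguments, str.split splits on whitespace, so a genome string with no spaces comes back as a single piece, whereas list() yields one element per character. The toy dataframe and its values are made up for illustration.

```python
import pandas as pd

genome = "acgt"

# str.split() with no arguments splits on whitespace only,
# so a string without spaces is returned as one piece.
whole = genome.split()

# list() iterates over the string and yields one element per character.
bases = list(genome)

# Hypothetical toy data standing in for the real CSV.
toy = pd.DataFrame({"Genome": ["acgt", "ggca"]})

# Work on the column's string values one row at a time,
# rather than serializing the whole array first.
per_row_bases = [list(sequence) for sequence in toy["Genome"]]

print(whole)
print(bases)
print(per_row_bases)
```

Iterating over the column directly keeps each row's sequence separate, which is what the later encoding step needs.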

 Antibiotic  ...                                             Genome
0       isoniazid  ...  ccctgacacatcacggcgcctgaccgacgagcagaagatccagctc...
1       isoniazid  ...  gggggtgctggcggggccggcgccgataaccccaccggcatcggcg...
2       isoniazid  ...  aatcacaccccgcgcgattgctagcatcctcggacacactgcacgc...
3       isoniazid  ...  gttgttgttgccgagattcgcaatgcccaggttgttgttgccgaga...
4       isoniazid  ...  ttgaccgatgaccccggttcaggcttcaccacagtgtggaacgcgg...

There are 5 columns, and "Genome" is the 5th in the list. I don't know why 1. .head() doesn't work, and 2. print() doesn't show me all of the columns...

- Scott Valentine

Could you post a small representative sample of the dataframe? Ideally the output of df.head() or df.head().to_dict(). - Peter Leimbigler
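On the side question about print() hiding columns: the "..." in the printout is pandas abbreviating the display, not missing data. A rough sketch of one way to see everything, assuming pandas 1.0 or later. The three middle column names here are hypothetical placeholders, since the original printout elides them.

```python
import pandas as pd

# Toy frame with 5 columns; "col2".."col4" are hypothetical placeholders
# for the columns hidden behind "..." in the printout above.
df = pd.DataFrame({
    "Antibiotic": ["isoniazid", "isoniazid"],
    "col2": [1, 2],
    "col3": [3, 4],
    "col4": [5, 6],
    "Genome": [
        "ccctgacacatcacggcgcctgaccgacgagcagaagatccagctcccctgacacatcac",
        "gggggtgctggcggggccggcgccgataaccccaccggcatcggcggggggtgctggcgg",
    ],
})

# The column labels are always available, regardless of display settings.
column_names = df.columns.tolist()

# Temporarily lift the display limits and render the full frame as text.
with pd.option_context("display.max_columns", None,
                       "display.max_colwidth", None):
    full_text = df.to_string()

print(column_names)
print(full_text)
```

The same options can be set globally with pd.set_option if the full view is wanted for the whole session.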

1 Answer

- Dave · Accepted Answer

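I don't think LabelEncoder is what you want here. This is just a simple transformation, so I recommend doing it directly. Start with a lookup for your base-pair mapping: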

lookup = {
  'a': 0.25,
  'c': 0.50,  # values follow the a, c, g, t order given in the question
  'g': 0.75,
  't': 1.00
  # anything else (the question's 'z'): 0.00
}

Then apply the lookup to the values of the "Genome" column. The values attribute returns the resulting dataframe as an ndarray.

dataframe['Genome'].apply(lambda bps: pd.Series([lookup[bp] if bp in lookup else 0.0 for bp in bps.lower()])).values
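For reference, here is a small self-contained sketch of the same direct-lookup idea on made-up data. It is not the original poster's dataset: the toy sequences, the encode_genome helper, and the NaN padding for rows of unequal length are all assumptions for illustration. The mapping uses the a, c, g, t order from the question.

```python
import numpy as np
import pandas as pd

# Mapping in the a, c, g, t order the question asks for.
LOOKUP = {"a": 0.25, "c": 0.50, "g": 0.75, "t": 1.00}


def encode_genome(sequence):
    """Map each base to its float code; any other character becomes 0.0."""
    return np.array([LOOKUP.get(base, 0.0) for base in sequence.lower()],
                    dtype=float)


# Hypothetical toy data with mixed case, an unknown base, and a short row.
toy = pd.DataFrame({"Genome": ["acgt", "ACGN", "tt"]})

# One float array per row of the "Genome" column.
encoded_rows = [encode_genome(sequence) for sequence in toy["Genome"]]

# Pad the rows into one 2D matrix; shorter sequences are filled with NaN.
max_len = max(len(row) for row in encoded_rows)
matrix = np.full((len(encoded_rows), max_len), np.nan)
for i, row in enumerate(encoded_rows):
    matrix[i, :len(row)] = row

print(matrix)
```

Keeping a list of per-row arrays avoids padding altogether; the matrix form is only needed if a downstream model expects one rectangular array.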