Most efficient way to split a long string into substrings in MATLAB

Question


string matlab split substring

3

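I'm writing a MATLAB function that compares two gene sequences and determines how similar they are. To do that, I split both sequences into smaller substrings by stepping through each one with a for loop, moving one nucleotide at a time, and appending each substring to a cell array.

For example, with a substring length of 4, the string ATGCAAAT is not split into ATGC, AAAT, but into ATGC, TGCA, GCAA, CAAA, AAAT.

I'm trying to speed up the function, and since these two for loops account for almost 90% of the execution time, I'd like to know whether MATLAB offers a faster way to do this.

Here is the code I'm currently using: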

 SubstrSequence1 = {};                                                
 SubstrSequence2 = {};
 for i = 1:length(Sequence1)-(SubstringLength-1)                
     SubstrSequence1 = [SubstrSequence1, Sequence1(i:i+SubstringLength-1)];
 end

 for i = 1:length(Sequence2)-(SubstringLength-1)                
     SubstrSequence2 = [SubstrSequence2, Sequence2(i:i+SubstringLength-1)]; 
 end

- dacm

3 Answers

2

Here's an approach that uses hankel to get SubstrSequence1 -

A = 1:numel(Sequence1);
out = cellstr(Sequence1(hankel(A(1:SubstringLength),A(SubstringLength:end)).'))

You can follow the same steps to get SubstrSequence2.

Sample run -

>> Sequence1 = 'ATGCAAAT';
>> SubstringLength = 4;
>> A = 1:numel(Sequence1);
>> cellstr(Sequence1(hankel(A(1:SubstringLength),A(SubstringLength:end)).'))
ans = 
    'ATGC'
    'TGCA'
    'GCAA'
    'CAAA'
    'AAAT'
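
To see why the hankel call works, it may help to look at the index matrix on its own. The sketch below reuses the variables from the sample run; the name idx is introduced here only for illustration. Each row of idx holds the positions of one sliding window, so indexing the sequence with it yields one substring per row.

```matlab
Sequence1 = 'ATGCAAAT';
SubstringLength = 4;
A = 1:numel(Sequence1);

% hankel(c, r) builds a matrix with first column c and last row r whose
% anti-diagonals are constant. Transposing it puts one window per row.
idx = hankel(A(1:SubstringLength), A(SubstringLength:end)).';
% idx is 5-by-4:
%   1 2 3 4
%   2 3 4 5
%   3 4 5 6
%   4 5 6 7
%   5 6 7 8

% Indexing a char vector with an index matrix returns a char matrix of
% the same shape; cellstr then turns each row into one cell.
out = cellstr(Sequence1(idx));
```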

- Divakar

I started with hankel, but couldn't get it to work! - Luis Mendo

@LuisMendo I started with bsxfun, but was too late! :) - Divakar

I'm sure you did :-P - Luis Mendo

1

One approach is to generate a matrix of indices that extracts the desired substrings:

>> sequence = 'ATGCAAAT';
>> subSequenceLength = 4;
>> numSubSequence = length(sequence) - subSequenceLength + 1;
>> idx = repmat((1:numSubSequence)', 1, subSequenceLength) + repmat(0:subSequenceLength-1, numSubSequence, 1);
>> result = sequence(idx)

    result =

        ATGC
        TGCA
        GCAA
        CAAA
        AAAT
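
On MATLAB R2016b or later, the two repmat calls can probably be dropped, since adding a column vector to a row vector expands both operands implicitly. A minimal sketch under that version assumption, using the same variable names as above:

```matlab
sequence = 'ATGCAAAT';
subSequenceLength = 4;
numSubSequence = length(sequence) - subSequenceLength + 1;

% Column of start positions plus row of offsets. Implicit expansion
% (assumes R2016b or later) yields a numSubSequence-by-subSequenceLength
% index matrix, the same one the repmat version builds.
idx = (1:numSubSequence).' + (0:subSequenceLength-1);

% Each row of result is one substring.
result = sequence(idx);
```

On older releases the repmat form, or bsxfun(@plus, ...), is needed instead.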

- b3.


- Luis Mendo · Accepted Answer

How about this?

str = 'ATGCAAAT';
n = 4;
strs = str(bsxfun(@plus, 1:n, (0:numel(str)-n).'));

The result is a 2D char array:

strs =
ATGC
TGCA
GCAA
CAAA
AAAT

So the substrings are strs(1,:), strs(2,:), etc.

If you want the result as a cell array of strings, add this at the end:

strs = cellstr(strs);

which produces

strs = 
    'ATGC'
    'TGCA'
    'GCAA'
    'CAAA'
    'AAAT'

The substrings are then strs{1}, strs{2}, etc.
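
Since the question needs this for two sequences, the bsxfun expression could be wrapped in an anonymous function and applied to each. This is only a sketch: slidingSubstrings is a hypothetical name, and the value of Sequence2 is made up for illustration.

```matlab
% Hypothetical helper: all length-n sliding windows of seq, as a cell array.
slidingSubstrings = @(seq, n) cellstr(seq(bsxfun(@plus, 1:n, (0:numel(seq)-n).')));

Sequence1 = 'ATGCAAAT';
Sequence2 = 'GCAAATTG';   % made-up second sequence, for illustration only
SubstringLength = 4;

SubstrSequence1 = slidingSubstrings(Sequence1, SubstringLength);
SubstrSequence2 = slidingSubstrings(Sequence2, SubstringLength);
```

Note that cellstr returns a column cell array, whereas the loop in the question builds a row; transpose with .' if the orientation matters downstream.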