How can I use non-default delimiters when reading a text file with std::fstream?

Question


18

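In my C++ code, I want to read each entry from a text file (*.txt) and tokenize it. More specifically, I want to be able to read individual words from the file, such as "format", "stack", "Jason", "europe", and so on.

I chose fstream for this task, but I don't know how to set its delimiters to the ones I want (space, \n, hyphens, and even apostrophes, as in "Mcdonal's"). I assume space and \n are default delimiters and hyphens are not, but I want to treat hyphens as delimiters too, so that parsing "blah blah xxx animal--cat" gives me the words "blah", "blah", "xxx", "animal", "cat".

That is, I want to get two strings out of each of "stack-overflow", "you're", and so on, while still keeping \n and space as delimiters.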

- FrozenLand

getline(stream,variable,delimiter); - Trevor Hickey

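Do you want to filter out "animal--cat" because it contains hyphens? That doesn't sound like tokenizing to me. - johnsyweb

I'm not trying to filter them out; I'm trying to read animal and cat as two separate words. - FrozenLand

Got it! I've edited your question to make it clearer. - johnsyweb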

2 Answers

2

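You can use

istream::getline(char* buffer, streamsize maxchars, char delim)

although this only supports a single delimiter. To split the lines further on different delimiters, you can use

char* strtok(char* inString, const char* delims)

which accepts multiple delimiters. When you use strtok, you only pass the address of the buffer the first time. After that you pass in a null pointer, and it gives you the next token after the previous one, returning a null pointer when there are no more tokens.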

Edit: a concrete implementation would look something like this:

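// needs <cstring> for strtok and <iostream> for cout; myIstream is an already-opened input stream
char buffer[120]; // this size depends on the longest line you expect the file to contain
// Test the read itself rather than eofbit: the loop stops at end of file or on a failed read.
// Note that a line longer than the buffer sets failbit and also ends the loop.
while (myIstream.getline(buffer, sizeof buffer)) // using the default delimiter of \n
{
    char* tokBuffer = strtok(buffer, "'- "); // split on apostrophe, hyphen and space
    while (tokBuffer != nullptr) {
        std::cout << "token is: " << tokBuffer << "\n";
        tokBuffer = strtok(nullptr, "'- "); // no need to pass the buffer again; strtok remembers it from the first call
    }
}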

- QuantumRipple

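Could you be more specific? Say I want to split stack-overflow into the two words stack and overflow; how should I do that? (I also need space and \n as delimiters at the same time.) Likewise, Let's should be split into let and s. Thanks! - FrozenLand

The edited version should tokenize on \n, ', - and space. - QuantumRipple

Looks good, but what if my file is a 1MB *.txt file? What should I replace 120 with? - FrozenLand

Is there a line-length limit in your code? (Or you may want to have getLine do the splitting on spaces.) - QuantumRipple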


- Jerry Coffin · Accepted Answer

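An istream treats white space as a delimiter. It uses a locale to tell it which characters are white space. A locale, in turn, includes a ctype facet that classifies character types. Such a facet could look something like this: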

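#include <locale>
#include <iostream>
#include <algorithm>
#include <iterator>
#include <vector>
#include <sstream>
#include <string>

class my_ctype : public std::ctype<char>
{
    mask my_table[table_size]; // our own copy of the character classification table
public:
    my_ctype(size_t refs = 0)
        : std::ctype<char>(&my_table[0], false, refs) // false: the facet does not delete the table
    {
        // start from the classic "C" table, then reclassify '-' and '\'' as white space
        std::copy_n(classic_table(), table_size, my_table);
        my_table['-'] = (mask)space;
        my_table['\''] = (mask)space;
    }
};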

Here is a small test program to show that it works:

int main() {
    std::istringstream input("This is some input from McDonald's and Burger-King.");
    std::locale x(std::locale::classic(), new my_ctype);
    input.imbue(x);

    std::copy(std::istream_iterator<std::string>(input),
        std::istream_iterator<std::string>(),
        std::ostream_iterator<std::string>(std::cout, "\n"));

    return 0;
}

Result:

This
is
some
input
from
McDonald
s
and
Burger
King.

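istream_iterator<string> uses >> to read the individual strings from the stream, so if you use >> directly you should get the same results. The parts you need to include are creating the locale and using imbue to make the stream use that locale.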