Is it possible to read a whole file by calling fseek() to SEEK_END and getting the file size from ftell()?

8

Am I right in thinking that this code invokes undefined behavior?

#include <stdio.h>
#include <stdlib.h>

FILE *f = fopen("textfile.txt", "rb");
fseek(f, 0, SEEK_END);
long fsize = ftell(f);
fseek(f, 0, SEEK_SET);  //same as rewind(f);

char *string = malloc(fsize + 1);
fread(string, fsize, 1, f);
fclose(f);

string[fsize] = 0;

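The reason I ask is that this code was posted as an accepted and highly upvoted answer to the following question: C Programming: How to read the whole file contents into a buffer. However, according to the following article: How to read an entire file into memory in C++ (which, despite its title, also deals with C, so bear with me):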

Suppose you were writing C, and you had a FILE* (that you know points to a file stream, or at least a seekable stream), and you wanted to determine how many characters to allocate in a buffer to store the entire contents of the stream. Your first instinct would probably be to write code like this:

// Bad code; undefined behaviour
fseek(p_file, 0, SEEK_END);
long file_size = ftell(p_file);

Seems legit. But then you start getting weirdness. Sometimes the reported size is bigger than the actual file size on disk. Sometimes it’s the same as the actual file size, but the number of characters you read in is different. What the hell is going on?

There are two answers, because it depends on whether the file has been opened in text mode or binary mode.

Just in case you don't know the difference: in the default mode – text mode – on certain platforms, certain characters get translated in various ways during reading. The most well-known is that on Windows, newlines get translated to \r\n when written to a file, and translated the other way when read. In other words, if the file contains Hello\r\nWorld, it will be read as Hello\nWorld; the file size is 12 characters, the string size is 11. Less well-known is that 0x1A (or Ctrl-Z) is interpreted as the end of the file, so if the file contains Hello\x1AWorld, it will be read as Hello. Also, if the string in memory is Hello\x1AWorld and you write it to a file in text mode, the file will be Hello. In binary mode, no translations are done – whatever is in the file gets read in to your program, and vice versa.

Immediately you can guess that text mode is going to be a headache – on Windows, at least. More generally, according to the C standard:

The ftell function obtains the current value of the file position indicator for the stream pointed to by stream. For a binary stream, the value is the number of characters from the beginning of the file. For a text stream, its file position indicator contains unspecified information, usable by the fseek function for returning the file position indicator for the stream to its position at the time of the ftell call; the difference between two such return values is not necessarily a meaningful measure of the number of characters written or read.

In other words, when you’re dealing with a file opened in text mode, the value that ftell() returns is useless… except in calls to fseek(). In particular, it doesn’t necessarily tell you how many characters are in the stream up to the current point.

So you can’t use the return value from ftell() to tell you the size of the file, the number of characters in the file, or for anything (except in a later call to fseek()). So you can’t get the file size that way.

Okay, so to hell with text mode. What say we work in binary mode only? As the C standard says: "For a binary stream, the value is the number of characters from the beginning of the file." That sounds promising.

And, indeed, it is. If you are at the end of the file, and you call ftell(), you will find the number of bytes in the file. Huzzah! Success! All we need to do now is get to the end of the file. And to do that, all you need to do is fseek() with SEEK_END, right?

Wrong.

Once again, from the C standard:

Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state.

To understand why this is the case: Some platforms store files as fixed-size records. If the file is shorter than the record size, the rest of the block is padded. When you seek to the “end”, for efficiency’s sake it just jumps you right to the end of the last block… possibly long after the actual end of the data, after a bunch of padding.

So, here’s the situation in C:

  • You can’t get the number of characters with ftell() in text mode.
  • You can get the number of characters with ftell() in binary mode… but you can’t seek to the end of the file with fseek(p_file, 0, SEEK_END).

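I don't have enough knowledge to judge who is right here, so I would like to ask for clarification: does the accepted answer mentioned above conflict with this article?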


1
One thing: you did not check the return value of malloc(). If it fails, you will have undefined behavior. - Sourav Ghosh
1
@SouravGhosh Sure, but that is not the core issue. - user4385532
2
Right, which is why it is a comment and not an answer. :) - Sourav Ghosh
Please see this answer. It is undefined behavior, and therefore not portable. - BLUEPIXY
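The most robust and portable method is still to read characters until EOF and count them. (While doing so, you can store them into an array and resize the array as needed.) - joop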
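
The approach joop describes could be sketched roughly as follows. This is a minimal illustration, not code from the thread: the helper name `read_all`, the initial capacity of 4096 bytes, and the doubling growth policy are all assumptions made for the example.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads everything from the current position of f
 * up to EOF into a heap buffer, growing the buffer as needed.
 * Returns a NUL-terminated buffer (caller frees) and stores the number
 * of bytes read in *out_len, or returns NULL on error. */
char *read_all(FILE *f, size_t *out_len)
{
    size_t cap = 4096;          /* assumed starting capacity */
    size_t len = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    size_t n;
    /* Always leave one byte free for the terminating '\0'. */
    while ((n = fread(buf + len, 1, cap - 1 - len, f)) > 0) {
        len += n;
        if (len == cap - 1) {   /* buffer full: double it */
            if (cap > SIZE_MAX / 2) {
                free(buf);
                return NULL;
            }
            cap *= 2;
            char *tmp = realloc(buf, cap);
            if (tmp == NULL) {
                free(buf);
                return NULL;
            }
            buf = tmp;
        }
    }

    if (ferror(f)) {            /* fread stopped because of an error, not EOF */
        free(buf);
        return NULL;
    }

    buf[len] = '\0';
    *out_len = len;
    return buf;
}
```

Because it never asks the stream for its size, this sketch does not depend on fseek() or ftell() at all, so it also works on streams that are not seekable.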
1 Answer

4
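What the author of the article maliciously omitted is the context of the quotation. From the C11 draft standard n1570, non-normative footnote 268:

"Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream (because of possible trailing null characters) or for any stream with state-dependent encoding that does not assuredly end in the initial shift state."

The normative part of the standard that refers to this footnote is 7.21.3 Files:

"9 Although both text and binary wide-oriented streams are conceptually sequences of wide characters, the external file associated with a wide-oriented stream is a sequence of multibyte characters, generalized as follows:

  • Multibyte encodings within files may contain embedded null bytes (unlike multibyte encodings valid for use internal to the program).
  • A file need not begin nor end in the initial shift state. (268)"

Note that this concerns wide-oriented streams.

Now, in 7.21.9.2 The fseek function, the last sentence uses considerably milder language: "A binary stream need not meaningfully support fseek calls with a whence value of SEEK_END."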
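
For illustration, here is a minimal sketch of the fseek()/ftell() idiom from the question with every call checked. It assumes an implementation that does meaningfully support SEEK_END on binary streams (POSIX systems, for example); on an implementation that does not, the reported size may include padding. The helper name `slurp` is hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads the whole file at path into a heap buffer.
 * Assumption: fseek(f, 0, SEEK_END) on a binary stream is meaningfully
 * supported by the implementation, which the C standard does not require.
 * Returns a NUL-terminated buffer (caller frees) and stores the size
 * in *out_size, or returns NULL on any failure. */
char *slurp(const char *path, long *out_size)
{
    char *buf = NULL;
    long size;

    FILE *f = fopen(path, "rb");    /* binary mode: no newline translation */
    if (f == NULL)
        return NULL;

    if (fseek(f, 0, SEEK_END) != 0)
        goto done;
    size = ftell(f);                /* -1L signals failure */
    if (size < 0)
        goto done;
    if (fseek(f, 0, SEEK_SET) != 0)
        goto done;

    buf = malloc((size_t)size + 1);
    if (buf == NULL)
        goto done;

    /* Request size items of 1 byte so a short read is detectable. */
    if (fread(buf, 1, (size_t)size, f) != (size_t)size) {
        free(buf);
        buf = NULL;
        goto done;
    }

    buf[size] = '\0';
    *out_size = size;

done:
    fclose(f);
    return buf;
}
```

Compared with the code in the question, this version checks fopen(), fseek(), ftell(), malloc(), and fread(), and passes the arguments to fread() so that its return value is the number of bytes actually read.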

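C was designed to be implementable even on operating systems that do weird and wacky things with files. If a file system does not keep precise track of file sizes, requiring implementations to do so could make it impossible for them to exchange data with other programs. The authors of the Standard therefore allow for implementations where binary files may have no real concept of "EOF". That does not mean that any quality implementation running on a file system that naturally tracks file sizes should behave in anything other than the obviously useful way. - supercat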
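The notion that a quality implementation should treat undefined behavior as meaning "throw the laws of time and causality out the window" rather than "behave during translation or program execution in a documented manner characteristic of the environment", even in cases where the environment has clearly documented behavior, may be fashionable, but should be recognized as foolish and destructive. - supercat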
1
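I have to disagree with your last point. Given that implementation-defined and unspecified behavior explicitly exist, implementations are not required to treat undefined behavior as implementation-defined. If anything, the standard should perhaps be amended to specify more things as implementation-defined. - EOF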
