Split a text file into smaller files based on file size (Windows)

4
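Log (.txt) files are sometimes created that are too large to open (5GB+), and I need a solution that splits them into smaller, readable chunks for use in WordPad. This is on Windows Server 2008 R2.
The solution needs to be a batch file, PowerShell, or something similar. Ideally it should be hard-coded so that each text file is no larger than 999MB, and it should not stop in the middle of a line.
I found a solution similar to what I need that sometimes works, splitting by line count, at: https://gallery.technet.microsoft.com/scriptcenter/PowerShell-Split-large-log-6f2c4da0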
############################################# 
# Split a log/text file into smaller chunks # 
############################################# 

# WARNING: This will take a long while with extremely large files and uses lots of memory to stage the file 

# Set the baseline counters  
# Set the line counter to 0  
$linecount = 0 

# Set the file counter to 1. This is used for the naming of the log files      
$filenumber = 1

# Prompt user for the path  
$sourcefilename = Read-Host "What is the full path and name of the log file to split? (e.g. D:\mylogfiles\mylog.txt)"   

# Prompt user for the destination folder to create the chunk files      
$destinationfolderpath = Read-Host "What is the path where you want to extract the content? (e.g. d:\yourpath\)"    
Write-Host "Please wait while the line count is calculated. This may take a while. No really, it could take a long time." 

# Find the current line count to present to the user before asking the new line count for chunk files  
Get-Content $sourcefilename | Measure-Object | ForEach-Object { $sourcelinecount = $_.Count }   

#Tell the user how large the current file is  
Write-Host "Your current file size is $sourcelinecount lines long"   

# Prompt user for the size of the new chunk files  
$destinationfilesize = Read-Host "How many lines will be in each new split file?"   

# the new size is a string, so we convert to integer and up 
# Set the upper boundary (maximum line count to write to each file)    
$maxsize = [int]$destinationfilesize     
Write-Host File is $sourcefilename - destination is $destinationfolderpath - new file line count will be $destinationfilesize 

# The process reads each line of the source file, writes it to the target log file and increments the line counter. When it reaches 100000 (approximately 50 MB of text data)  
$content = get-content $sourcefilename | % {
Add-Content $destinationfolderpath\splitlog$filenumber.txt "$_"    
$linecount ++   
If ($linecount -eq $maxsize) { 
    $filenumber++ 
    $linecount = 0    }  }   
# Clean up after your pet  
[gc]::collect()   
[gc]::WaitForPendingFinalizers()

However, when I run this code, I get many errors like the following in PowerShell:

Add-Content : The process cannot access the file 'C:\Desktop\splitlog1.txt' 
because it is being used by another process...

So I am asking for help fixing the code above, or for a different/better solution.


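To avoid log files this huge, you might be interested in LogRotateWin... - aschipfl
@aschipfl, thanks for the suggestion, but it doesn't really help in my case. - JavaBeast
I regularly use a script derived from the same article and have never had a problem with it. Judging by the error you're seeing, you may have the destination file open somewhere else. Are you running Get-Content split-log1.txt -tail in another shell? - E.Z. Hart
2 Answers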

5

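OK, challenge accepted. Here is a function that should work for your needs. It splits text files by line and puts as many complete input lines as possible into each output file without exceeding the specified file size.

Note: the output file size limit cannot be strictly enforced.

Example: the input file contains two very long strings, 1MB each. If you try to split this file into 512KB chunks, the resulting files will be 1MB each.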
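The reason the limit cannot always be met can be seen from byte counts alone. The following is a minimal sketch, not the function from this answer: `Get-ChunkPlan` and its parameters are hypothetical names, and it only models how whole lines are grouped into chunks, without touching any files.

```powershell
# Hypothetical helper: groups line sizes (in bytes) into chunks no larger
# than $MaxBytes, except when a single line is itself larger than the limit.
function Get-ChunkPlan
{
    param
    (
        [long[]]$LineBytes,
        [long]$MaxBytes
    )

    $chunks  = @()
    $current = 0L

    foreach ($size in $LineBytes)
    {
        # Start a new chunk if this line would push the current one over the limit
        if ($current -gt 0 -and ($current + $size) -gt $MaxBytes)
        {
            $chunks += $current
            $current = 0L
        }

        # A line is never cut, so an oversized line makes an oversized chunk
        $current += $size
    }

    if ($current -gt 0) { $chunks += $current }
    return ,$chunks
}

# Two 1MB lines with a 512KB limit: each line ends up alone in a 1MB chunk
$plan = Get-ChunkPlan -LineBytes 1MB, 1MB -MaxBytes 512KB
```

Because a line is never cut, any line longer than the limit produces a chunk longer than the limit.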

The Split-FileByLine function:

<#
.Synopsis
    Split text file(s) by lines, put into each output file as many complete lines of input as possible without exceeding size bytes.

.Description
    Split text file(s) by lines, put into each output file as many complete lines of input as possible without exceeding size bytes.
    Note that the output file size limit can't be strictly enforced. Example: the input file contains two very long strings, 1MB each.
    If you try to split this file into 512KB chunks, the resulting files will be 1MB each.

    Split files will have the original file's name, followed by the "_part_" string and a counter. Example:
    Original file: large.log
    Split files: large_part_0.log, large_part_1.log, large_part_2.log, etc.

.Parameter FileName
    Array of strings, mandatory. Filename(s) to split.

.Parameter OutPath
    String. Folder where the split files will be stored; defaults to the current location. Will be created if it does not exist.

.Parameter MaxFileSize
    Long, mandatory. Maximum output file size. When output file reaches this size, new file will be created.
    You can use PowerShell's multipliers: KB, MB, GB, TB, PB

.Parameter Encoding
    String. If not specified, script will use system's current ANSI code page to read the files.
    You can get other valid encodings for your system in PowerShell console like this:

    [System.Text.Encoding]::GetEncodings()

    Example:

    Unicode (UTF-7): utf-7
    Unicode (UTF-8): utf-8
    Western European (Windows): Windows-1252

.Example
    Split-FileByLine -FileName '.\large.log' -OutPath '.\splitted' -MaxFileSize 100MB -Verbose

    Split file "large.log" in current folder, write resulting files in subfolder "splitted", limit output file size to 100Mb, be verbose.

.Example
    Split-FileByLine -FileName '.\large.log' -OutPath '.\splitted' -MaxFileSize 100MB -Encoding 'utf-8'

    Split file "large.log" in current folder, write resulting files in subfolder "splitted", limit output file size to 100Mb, use UTF-8 encoding.

.Example
    Split-FileByLine -FileName '.\large_1.log', '.\large_2.log' -OutPath '.\splitted' -MaxFileSize 999MB

    Split files "large_1.log" and "large_2.log" in current folder, write resulting files in subfolder "splitted", limit output file size to 999MB.

.Example
    '.\large_1.log', '.\large_2.log' | Split-FileByLine -OutPath '.\splitted' -MaxFileSize 999MB

    Same as the previous example, but the file names are passed through the pipeline.

#>
function Split-FileByLine
{
    [CmdletBinding()]
    Param
    (
        [Parameter(Mandatory = $true, ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true)]
        [string[]]$FileName,

        [Parameter(ValueFromPipelineByPropertyName = $true)]
        [string]$OutPath = (Get-Location -PSProvider FileSystem).Path,

        [Parameter(Mandatory = $true, ValueFromPipelineByPropertyName = $true)]
        [long]$MaxFileSize,

        [Parameter(ValueFromPipelineByPropertyName = $true)]
        [string]$Encoding = 'Default'
    )

    Begin
    {
        # Scriptblocks for common tasks
        $DisposeInFile = {
            Write-Verbose 'Disposing StreamReader'
            $InFile.Close()
            $InFile.Dispose()
        }

        $DisposeOutFile = {
            Write-Verbose 'Disposing StreamWriter'
            $OutFile.Flush()
            $OutFile.Close()
            $OutFile.Dispose()
        }

        $NewStreamWriter = {
            Write-Verbose 'Creating StreamWriter'
            $OutFileName = Join-Path -Path $OutPath -ChildPath (
                '{0}_part_{1}{2}' -f [System.IO.Path]::GetFileNameWithoutExtension($_), $Counter, [System.IO.Path]::GetExtension($_)
            )

            $OutFile = New-Object -TypeName System.IO.StreamWriter -ArgumentList (
                $OutFileName,
                $false,
                $FileEncoding
            ) -ErrorAction Stop
            $OutFile.AutoFlush = $true
            Write-Verbose "Writing new file: $OutFileName"
        }
    }

    Process
    {
        if($Encoding -eq 'Default')
        {
            # Set default encoding
            $FileEncoding = [System.Text.Encoding]::Default
        }
        else
        {
            # Try to set user-specified encoding
            try
            {
                $FileEncoding = [System.Text.Encoding]::GetEncoding($Encoding)
            }
            catch
            {
                throw "Not valid encoding: $Encoding"
            }
        }

        Write-Verbose "Input file: $FileName"
        Write-Verbose "Output folder: $OutPath"

        if(!(Test-Path -Path $OutPath -PathType Container)){
            Write-Verbose "Folder doesn't exist, creating: $OutPath"
            $null = New-Item -Path $OutPath -ItemType Directory -ErrorAction Stop
        }

        $FileName | ForEach-Object {
            # Open input file
            $InFile = New-Object -TypeName System.IO.StreamReader -ArgumentList (
                $_,
                $FileEncoding
            ) -ErrorAction Stop
            Write-Verbose "Current file: $_"

            $Counter = 0
            $OutFile = $null

            # Read lines from input file
            while(($line = $InFile.ReadLine()) -ne $null)
            {
                if($OutFile -eq $null)
                {
                    # No output file, create StreamWriter
                    . $NewStreamWriter
                }
                else
                {
                    if($OutFile.BaseStream.Length -ge $MaxFileSize)
                    {
                        # Output file reached size limit, closing
                        Write-Verbose "OutFile length: $($OutFile.BaseStream.Length)"
                        . $DisposeOutFile
                        $Counter++
                        . $NewStreamWriter
                    }
                }

                # Write line to the output file
                $OutFile.WriteLine($line)
            }

            Write-Verbose "Finished processing file: $_"
            # Close open files and cleanup objects
            . $DisposeOutFile
            . $DisposeInFile
        }
    }
}

You can use it in your script like this:
function Split-FileByLine
{
    # function body here
}

$InputFile = 'c:\log\large.log'
$OutputDir = 'c:\log_split'

Split-FileByLine -FileName $InputFile -OutPath $OutputDir -MaxFileSize 999MB
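The function above measures each output file through `BaseStream.Length`, with `AutoFlush` enabled so that the length is current after every line. A possible alternative, sketched here under assumptions, is to count the bytes of each line with `Encoding.GetByteCount` and roll over to a new file before the limit would be exceeded. `Split-FileBySize` and its parameter names are hypothetical, it handles a single file only, and it has not been benchmarked against the original.

```powershell
# Hypothetical variant, not the answer's function: it keeps its own byte
# count, so the StreamWriter can buffer instead of flushing on every line.
function Split-FileBySize
{
    param
    (
        [Parameter(Mandatory = $true)][string]$Path,
        [Parameter(Mandatory = $true)][string]$OutPath,
        [Parameter(Mandatory = $true)][long]$MaxFileSize,
        [System.Text.Encoding]$Encoding = [System.Text.Encoding]::Default
    )

    if (!(Test-Path -Path $OutPath -PathType Container))
    {
        $null = New-Item -Path $OutPath -ItemType Directory -ErrorAction Stop
    }

    # .NET resolves relative paths against the process directory, not the
    # PowerShell location, so resolve both paths first
    $Path    = (Resolve-Path -Path $Path).ProviderPath
    $OutPath = (Resolve-Path -Path $OutPath).ProviderPath

    $baseName     = [System.IO.Path]::GetFileNameWithoutExtension($Path)
    $extension    = [System.IO.Path]::GetExtension($Path)
    $newLineBytes = $Encoding.GetByteCount([Environment]::NewLine)

    $reader  = New-Object System.IO.StreamReader -ArgumentList $Path, $Encoding
    $writer  = $null
    $counter = 0
    $written = 0L

    try
    {
        while ($null -ne ($line = $reader.ReadLine()))
        {
            $lineBytes = $Encoding.GetByteCount($line) + $newLineBytes

            # Roll over before writing if this line would exceed the limit
            if ($null -ne $writer -and ($written + $lineBytes) -gt $MaxFileSize)
            {
                $writer.Dispose()
                $writer = $null
                $counter++
            }

            if ($null -eq $writer)
            {
                $outFile = Join-Path -Path $OutPath -ChildPath ('{0}_part_{1}{2}' -f $baseName, $counter, $extension)
                $writer  = New-Object System.IO.StreamWriter -ArgumentList $outFile, $false, $Encoding
                # Count the byte order mark, if the encoding writes one
                $written = $Encoding.GetPreamble().Length
            }

            $writer.WriteLine($line)
            $written += $lineBytes
        }
    }
    finally
    {
        # Always release the file handles, even if an error occurs
        if ($null -ne $writer) { $writer.Dispose() }
        $reader.Dispose()
    }
}
```

It would be called in the same way, for example `Split-FileBySize -Path 'c:\log\large.log' -OutPath 'c:\log_split' -MaxFileSize 999MB`. A single line larger than the limit still produces an oversized file, since lines are never cut.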

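Something seems off... I used it to split a 983,336KB file (200MB max per file) and it gave me 4 files (204,801KB/204,801/204,801/164,136)... note that they add up to less than 983. Does this mean data was lost? If I split a file manually, the sizes do add up to the original file size. - JavaBeast
@JavaBeast Strange, I'll check. - beatcracker
@JavaBeast Yes, a bug in the counter caused the first split file to be overwritten. Please check the updated version. - beatcracker
I know I'm supposed to avoid comments like "Thanks!", but I can't help myself. Thank you! This function works perfectly and saved me a lot of time. @beatcracker wins the internet today. - Rocky
@beatcracker Yes, very strange indeed! I ran the function wrapped in a Measure-Command cmdlet and it took 6 hours, 8 minutes, and 11 seconds. Another method took only 46 seconds. Both ran on the same machine with the same file as input, produced 10 output files, and used 100MB as the split criterion. - Chip Wood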

1
You could try the split utility from CoreUtils for Windows with the --line-bytes option:

--line-bytes=size

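Put into each output file as many complete lines of the input as possible without exceeding size bytes. Individual lines or records longer than size bytes are broken into multiple files. size has the same format as for the --bytes option. If --separator is specified, then lines determines the number of records.

Example: split --line-bytes=999MB c:\logs\biglog.txt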


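Thanks, but I can't add or install any tools on the client workstations. I need a solution in the form of a single script file that I can simply hand to users. - JavaBeast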

Content provided by Stack Overflow.