Convert a fixed-width text file to CSV

Question

12

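I have a large data file in text format and want to convert it to CSV by specifying the length of each column.

Number of columns = 5

Length of each column:

[4 2 5 1 1]

Sample observations: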

aasdfh9013512
ajshdj 2445df

Expected output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f

- Ashish

6 Answers

5

I would use sed and match groups of the given lengths:

$ sed -r 's/^(.{4})(.{2})(.{5})(.{1})(.{1})$/\1,\2,\3,\4,\5/' file
aasd,fh,90135,1,2
ajsh,dj, 2445,d,f

- fedorqui

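First of all, thanks for your answer. But in the actual file I need to split each line into 80 columns, and the sed command only works for up to 9 columns. Please help. - Ashish

@AshishKumar Then you will probably have to use Thor's awk answer. - fedorqui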
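
The 9-column ceiling mentioned in the comment comes from sed only offering the backreferences \1 through \9. For wide records such as the 80-column case, slicing by a list of widths avoids that limit. The following is a minimal Python sketch of the idea; the function name `split_fixed_width` is made up for illustration and does not come from any answer in this thread.

```python
from itertools import accumulate


def split_fixed_width(line, widths):
    """Cut `line` into consecutive fields whose lengths are given by `widths`."""
    # Running totals give the end offset of every field, e.g. [4, 6, 11, 12, 13]
    ends = list(accumulate(widths))
    # Each field starts where the previous one ended
    starts = [0] + ends[:-1]
    return [line[start:end] for start, end in zip(starts, ends)]


if __name__ == "__main__":
    widths = [4, 2, 5, 1, 1]  # the list can be any length, e.g. 80 entries
    for record in ("aasdfh9013512", "ajshdj 2445df"):
        print(",".join(split_fixed_width(record, widths)))
```

Run on the two sample records, this prints the expected output from the question. The number of columns is limited only by the length of the widths list.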

3

Here is a solution that works with plain awk (gawk is not required).

awk -v OFS=',' '{print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1)}'

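It uses awk's substr function to give the start position and length of each field. OFS sets the output field separator (a comma in this case).

(Side note: this only works if the source data contains no commas. If the data does contain commas, they must be escaped to produce valid CSV, which is beyond the scope of this question.)

Demo: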

echo 'aasdfh9013512
ajshdj 2445df' | 
awk -v OFS=',' '{print substr($0,1,4), substr($0,5,2), substr($0,7,5), substr($0,12,1), substr($0,13,1)}'

Output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
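
The side note about commas in the data can be made concrete. One way to get correct quoting is to let a CSV library handle it. The sketch below uses Python's standard `csv` module, which wraps any field containing the delimiter in double quotes. The helper name `to_csv_line` and the second sample record (with a comma in it) are invented for illustration.

```python
import csv
import io


def to_csv_line(line, widths):
    """Slice a fixed-width record and return it as one properly quoted CSV line."""
    fields, pos = [], 0
    for width in widths:
        fields.append(line[pos:pos + width])
        pos += width
    buffer = io.StringIO()
    # csv.writer quotes fields that contain the delimiter or a quote character
    csv.writer(buffer, lineterminator="\n").writerow(fields)
    return buffer.getvalue().rstrip("\n")


if __name__ == "__main__":
    widths = [4, 2, 5, 1, 1]
    print(to_csv_line("aasdfh9013512", widths))  # no special characters
    print(to_csv_line("aasdfh9,13512", widths))  # third field contains a comma
```

The first record comes out unchanged as `aasd,fh,90135,1,2`, while the second becomes `aasd,fh,"9,135",1,2`, so a CSV reader still sees five fields.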

- wisbucky

1

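Here is a generic alternative to the FIELDWIDTHS option: instead of hard-coding substr positions in awk, insert commas at positions derived from the column lengths the user supplies. The following was written and tested with GNU awk. To use it, pass the column lengths (as in the sample), separated by spaces, in the awk variable colLength.

awk -v colLength="4 2 5 1 1" '
BEGIN{
  # arr[j] holds the length of column j; num is the number of columns
  num=split(colLength,arr,OFS)
}
{
  j=sum=0
  # stop before the last column so that no trailing comma is added
  while(++j<num){
    if(length($0)>sum){
      # put a comma after the first arr[j]+sum characters
      sub("^.{"arr[j]+sum"}","&,")
    }
    # +1 accounts for the comma that was just inserted
    sum+=arr[j]+1
  }
}
1
' Input_file

Explanation: the awk variable colLength holds the column lengths, which determine where the commas go. The BEGIN section splits it into the array arr, so arr[j] is the length of column j and num is the number of columns.

In the main block, j and sum are reset to zero for every line. The while loop then runs for j = 1 up to num - 1, because no comma is needed after the last column. On each pass, if the current line is longer than sum, sub matches the first arr[j]+sum characters of the line and replaces them with the match followed by a comma (&,). sum is then increased by arr[j]+1, where the extra 1 accounts for the comma just inserted. For the sample, the regex is ^.{4} on the first pass, then ^.{7}, then ^.{13}, and so on. Finally, the trailing 1 prints the line, whether it was edited or not.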
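
The offset bookkeeping in this approach (each inserted comma shifts every later position by one) can be sketched in Python as well. This is an illustration of the idea, not a translation of the author's script: the function name `insert_commas` is hypothetical, and the sketch deliberately stops before the last column so that no trailing comma is produced.

```python
def insert_commas(line, widths):
    """Return `line` with a comma inserted after every column except the last."""
    offset = 0  # where the current column starts in the already-edited line
    for width in widths[:-1]:
        cut = offset + width
        if len(line) < cut:
            break  # line too short for this column; leave the rest untouched
        line = line[:cut] + "," + line[cut:]
        offset = cut + 1  # the extra 1 accounts for the comma just inserted
    return line


if __name__ == "__main__":
    print(insert_commas("aasdfh9013512", [4, 2, 5, 1, 1]))
```

For the first sample record the cut positions are 4, 7, 13 and 15, the same sequence of lengths the regex in the awk version grows through.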

- RavinderSingh13

0

In case anyone is still looking for a solution, I have developed a small Python script. It is easy to use as long as you have Python 3.5.

https://github.com/just10minutes/FixedWidthToDelimited/blob/master/FixedWidthToDelimiter.py

"""
This script will convert Fixed width File into Delimiter File, tried on Python 3.5 only
Sample run (the order of the arguments doesn't matter):
python FixedWidthToDelimiter.py -i SrcFile.txt -o TrgFile.txt -c Config.txt -d "|"
Inputs are as follows
1. Input File - Mandatory (Argument -i) - File which has fixed-width data in it
2. Config File - Optional (Argument -c, if not provided will look for Config.txt file on same path, if not present script will not run)
    Should have format as
    FieldName,fieldLength
    eg:
    FirstName,10
    SecondName,8
    Address,30
    etc:
3. Output File - Optional (Argument -o, if not provided the input file name plus Delimited.txt is used)
4. Delimiter - Optional (Argument -d, if not provided default value is "|" (pipe))
"""
from collections import OrderedDict
import argparse
from argparse import ArgumentParser
import os.path
import sys


def slices(s, args):
    position = 0
    for length in args:
        length = int(length)
        yield s[position:position + length]
        position += length

def extant_file(x):
    """
    'Type' for argparse - checks that file exists but does not open.
    """
    if not os.path.exists(x):
        # Argparse uses the ArgumentTypeError to give a rejection message like:
        # error: argument input: x does not exist
        raise argparse.ArgumentTypeError("{0} does not exist".format(x))
    return x





parser = ArgumentParser(description="Please provide your Inputs as -i InputFile -o OutPutFile -c ConfigFile")
parser.add_argument("-i", dest="InputFile", required=True,    help="Provide your Input file name here, if file is on different path than where this script resides then provide full path of the file", metavar="FILE", type=extant_file)
parser.add_argument("-o", dest="OutputFile", required=False,    help="Provide your Output file name here, if file is on different path than where this script resides then provide full path of the file", metavar="FILE")
parser.add_argument("-c", dest="ConfigFile", required=False,   help="Provide your Config file name here,File should have value as fieldName,fieldLength. if file is on different path than where this script resides then provide full path of the file", metavar="FILE",type=extant_file)
parser.add_argument("-d", dest="Delimiter", required=False,   help="Provide the delimiter string you want",metavar="STRING", default="|")

args = parser.parse_args()

# Input file is mandatory
InputFile = args.InputFile
#Delimiter by default "|"
DELIMITER = args.Delimiter

#Output file checks
if args.OutputFile is None:
    OutputFile = str(InputFile) + "Delimited.txt"
    print("Setting Output file as " + OutputFile)
else:
    OutputFile = args.OutputFile

#Config file check
if args.ConfigFile is None:
    if not os.path.exists("Config.txt"):
        print ("There is no Config File provided exiting the script")
        sys.exit()
    else:
        ConfigFile = "Config.txt"
        print ("Taking Config.txt file on this path as Default Config File")
else:
    ConfigFile = args.ConfigFile

fieldNames = []
fieldLength = []
myvars = OrderedDict()


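with open(ConfigFile) as myfile:
    for line in myfile:
        if not line.strip():
            continue  # skip blank lines, which would otherwise break int()
        # Each config line has the form FieldName,fieldLength
        name, var = line.partition(",")[::2]
        myvars[name.strip()] = int(var)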
for key,value in myvars.items():
    fieldNames.append(key)
    fieldLength.append(value)

with open(OutputFile, 'w') as f1:
    # Header row built from the field names in the config file
    f1.write(DELIMITER.join(fieldNames) + "\n")
    with open(InputFile, 'r') as f:
        for line in f:
            # Drop the line ending so it cannot leak into the last field
            rec = list(slices(line.rstrip("\n"), fieldLength))
            f1.write(DELIMITER.join(rec) + "\n")

- just10minutes

0

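Portable `awk`

Generate an awk script containing the appropriate substr calls. The column widths are stored one per line in a file, here called cols (for the sample data: 4, 2, 5, 1 and 1):

cat cols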

<cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1

Output:

substr($0,1,4)
substr($0,5,2)
substr($0,7,5)
substr($0,12,1)
substr($0,13,1)

Combine these lines into a valid awk script:

<cols awk '{ print "substr($0,"p","$1")"; cs+=$1; p=cs+1 }' p=1 |
paste -sd, | sed 's/^/{ print /; s/$/ }/'

Output:

{ print substr($0,1,4),substr($0,5,2),substr($0,7,5),substr($0,12,1),substr($0,13,1) }

Redirect the above into a file, e.g. /tmp/t.awk, and run it on the input file:

<infile awk -f /tmp/t.awk

Output:

aasd fh 90135 1 2
ajsh dj  2445 d f

Or with a comma as the output separator:

<infile awk -f /tmp/t.awk OFS=,

Output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f
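
The script-generation step in this answer (turning a list of widths into substr calls with 1-based start positions) could also be done in Python. The sketch below is one possible equivalent of the awk, paste and sed pipeline, not part of the original answer; the function name `build_awk_program` is an assumption made for illustration.

```python
from itertools import accumulate


def build_awk_program(widths):
    """Return an awk program that prints one substr() call per column."""
    # awk positions are 1-based: 1, then 1 + the sum of all earlier widths
    starts = [1] + [1 + total for total in accumulate(widths[:-1])]
    calls = [f"substr($0,{start},{width})" for start, width in zip(starts, widths)]
    return "{ print " + ",".join(calls) + " }"


if __name__ == "__main__":
    print(build_awk_program([4, 2, 5, 1, 1]))
```

For the widths 4, 2, 5, 1, 1 this prints the same one-line script as the pipeline, which can then be saved and run with awk -f.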

- Thor

- Thor · Accepted Answer

31

GNU awk (gawk) supports this directly with FIELDWIDTHS, for example:

gawk '$1=$1' FIELDWIDTHS='4 2 5 1 1' OFS=, infile

Output:

aasd,fh,90135,1,2
ajsh,dj, 2445,d,f

- Thor

3

Nice! I did not know about this feature. Very good one! Related link: Reading Fixed-Width Data. - fedorqui

FIELDWIDTHS only works for me if I install and use gawk; this is on Ubuntu 14.04.3. - Arthur

1

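@Arthur: According to GNU awk's feature history, FIELDWIDTHS has been available since gawk 2.13. - Thor

@Thor Yes, I'm sure that is right. But it makes no difference if gawk is not installed. At least for me, Ubuntu 14.04.3 came with awk but not gawk. - Arthur

@Arthur: Yes, this answer is specific to GNU awk (gawk); I will make that clearer. Many Debian-based systems ship mawk as the default awk because it is faster. - Thor

Note that this solution works on Windows, but you need to use double quotes instead of apostrophes: gawk "$1=$1" FIELDWIDTHS="1 4 8 5 3" OFS=, sample-fixed.csv - Stephen Pace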
