Django bulk_create raises "the database system is in recovery mode"


Requirements:

  1. Upload a ~15 MB CSV file, then insert it with the bulk_create method.
  2. Split one task's id across 100,000 records.
  3. On the next round, delete the existing records and run the INSERT INTO again.

My guess:
I suspect a sequence_id overflow is the root cause, because the upload used to work before, but now it fails and cannot be run again.
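
To check that guess, the sequence counter can be read directly from a Django shell; a minimal sketch, assuming the table is sales_sales (as in the log below) and Django's default <table>_id_seq sequence name:

# Hypothetical diagnostic for the sequence-overflow guess; the sequence name
# assumes Django's default "<table>_id_seq" naming for AutoField primary keys.
from django.db import connection

with connection.cursor() as cursor:
    cursor.execute("SELECT last_value FROM sales_sales_id_seq")
    # A default AutoField is a 32-bit integer, so the sequence only overflows
    # past 2147483647, far beyond a handful of 100k-row uploads.
    print(cursor.fetchone()[0])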

Here is my postgres log:

2017-06-23 04:55:21.087 UTC [27896] LOG:  server process (PID 20529) was terminated by signal 9: Killed
2017-06-23 04:55:21.087 UTC [27896] DETAIL:  Failed process was running: INSERT INTO "sales_sales" ("imc", "order_number", "original_order_date", "count") VALUES ('1049129', '415000458', '2017-03-01T03:00:00+00:00'::timestamptz, 1), ('1113804', '415000457', '2017-03-01T03:00:00+00:00'::timestamptz, 1), ('1151620', '415000460', '2017-03-01T03:00:00+00:00'::timestamptz, 1), ('1522771', '415000462', '2017-03-01T03:00:00+00:00'::timestamptz, 1), ('2280038', '415000459', '2017-03-01T03:00:00+00:00'::timestamptz, 1), ('7374979', '415000461', '2017-03-01T03:00:00+00:00'::timestamptz, 1), ('399428', '415000618', '2017-03-01T03:02:00+00:00'::timestamptz, 1), ('399428', '415000619', '2017-03-01T03:02:00+00:00'::timestamptz, 1), ('1049129', '415000614', '2017-03-01T03:02:00+00:00'::timestamptz, 1), ('1059455', '415000636', '2017-03-01T03:02:00+00:00'::timestamptz, 1), ('1059455', '415000638', '2017-03-01T03:02:00+00:00'::timestamptz, 1), ('1075963', '415000605', '2017-03-01T03:02:00+00:00'::timestamptz, 1), ('1113804', '415000607', '2017-03-01T03:02:00+00:00'::timestamptz, 1), ('1137600', '
2017-06-23 04:55:21.090 UTC [27896] LOG:  terminating any other active server processes
2017-06-23 04:55:21.100 UTC [19656] WARNING:  terminating connection because of crash of another server process
2017-06-23 04:55:21.100 UTC [19656] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-06-23 04:55:21.100 UTC [19656] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2017-06-23 04:55:21.134 UTC [27896] LOG:  all server processes terminated; reinitializing
2017-06-23 04:55:21.183 UTC [20539] LOG:  database system was interrupted; last known up at 2017-06-23 04:51:40 UTC
2017-06-23 04:55:21.202 UTC [20540] postgres@uihspot FATAL:  the database system is in recovery mode
2017-06-23 04:55:21.211 UTC [20541] postgres@uihspot FATAL:  the database system is in recovery mode

Update on my situation, which comes down to using COPY. Solution: pip install django-postgres-copy

@transaction.atomic
def postgres_copy(instance: UploadedFile):
    """
    Use COPY to do bulk INSERT INTO
    :param instance:
    :return:
    """
    import time  # PyCharm Bug 30May2017 It optimized and removed my line
    start_time = time.time()

    bkk = timezone(settings.TIME_ZONE)
    urllib.request.urlretrieve(instance.file.url, "original.csv")

    Sales.objects.all().delete()
    with open("original.csv", 'rb') as source_file:
        with open("utf8.tsv", 'w+b') as dest_file:
            contents = source_file.read()
            dest_file.write(contents.decode('utf-16').encode('utf-8'))

    # Convert the tab-separated dump to a plain CSV; the context managers make
    # sure both handles are closed before CopyMapping reads the result.
    with open('./utf8.tsv', 'r') as tsv_in, open('./utf8.csv', 'w') as csv_out:
        out_csv = csv.writer(csv_out)
        out_csv.writerows(csv.reader(tsv_in, delimiter='\t'))

    copy_mapping = CopyMapping(
        Sales,
        "./utf8.csv",
        dict(
            imc='IMC Number',
            order_number='Order Number',
            original_order_date='Original Order Date',
            count='Demand Order Count'
        )
    )
    copy_mapping.save()
    result = time.time() - start_time
    logger.info(msg=f"Total Execution save_sale_records time --- {result} seconds ---")

And the original version:

@transaction.atomic
def save_sale_records(instance: UploadedFile):
    """
    The file is downloaded from minio first: a TemporaryUploadedFile is not a
    File object, so the csv reader cannot consume it directly; it has to read
    from a real file on disk.
    :param instance:
    :return:
    """
    import time  # PyCharm Bug 30May2017 It optimized and removed my line
    start_time = time.time()

    bkk = timezone(settings.TIME_ZONE)
    urllib.request.urlretrieve(instance.file.url, "original.csv")

    Sales.objects.all().delete()
    with open("original.csv", 'rb') as source_file:
        with open("utf8.csv", 'w+b') as dest_file:
            contents = source_file.read()
            dest_file.write(contents.decode('utf-16').encode('utf-8'))

    sales = []
    with open("utf8.csv") as csv_file:
        reader = csv.reader(csv_file, dialect="excel-tab")
        for index, row in enumerate(reader):
            """
            OrderedDict([
            ('\ufeffWarehouse Code', '41CL'), 
            ('Warehouse Desc', 'แอมเวย์ ช็อป สีลม'), 
            ('IMC Number', '1113804'), 
            ('Order Number', '415000457'), 
            ('Original Order Date', '2017-03-01 00:00:00'), 
            ('Order 24 Hour Min', '09:42'), 
            ('Demand Order Count', '1')])
            """
            if index == 0:
                continue
            # Kept as separate steps for readability
            order_date = row[4].split(" ")[0]
            order_time = row[5]
            order_datetime = order_date + "-" + order_time
            # pytz timezones must be attached with localize(); replace(tzinfo=bkk)
            # would silently pick up the zone's historical LMT offset.
            date_obj = bkk.localize(datetime.strptime(order_datetime, "%m/%d/%y-%H:%M"))
            utc_date = date_obj.astimezone(pytz.utc)
            sale = Sales(
                imc=row[2],
                order_number=row[3],
                original_order_date=utc_date,
                count=row[6]
            )
            sales.append(sale)

    Sales.objects.bulk_create(sales)
    result = time.time() - start_time
    logger.info(msg=f"Total Execution save_sale_records time --- {result} seconds ---")

2 Answers


For anyone arriving here from Google: I ran into this when trying to use bulk_update on a large number of entries (bulk_create can likely trigger it as well). I solved it by specifying a batch_size:

MyModel.objects.bulk_update(models, fields, batch_size=1000)
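
Applied to the code in the question, the same idea would look roughly like this (a sketch; batch_size=1000 is only an illustrative value, not something from the original post):

# bulk_create also accepts batch_size; Django then issues one INSERT per
# batch instead of a single statement carrying all 100k rows at once.
Sales.objects.bulk_create(sales, batch_size=1000)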


Clearly, the error log shows that this is not your problem: the backend running your giant INSERT was killed with signal 9, and the recovery-mode messages are just the server restarting afterwards.

2017-06-23 04:55:21.100 UTC [19656] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2017-06-23 04:55:21.100 UTC [19656] HINT:  In a moment you should be able to reconnect to the database and repeat your command.

Emphasis mine. But you are still going about this the wrong way! The right way to load large amounts of data into PostgreSQL is with COPY:

COPY moves data between PostgreSQL tables and standard file-system files. COPY TO copies the contents of a table to a file, while COPY FROM copies data from a file to a table (appending the data to whatever is in the table already). COPY TO can also copy the results of a SELECT query.
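
As a concrete illustration (not part of the answer): a minimal sketch that drives COPY FROM through Django's connection, assuming the psycopg2 driver and a CSV file already trimmed to the four sales_sales columns shown in the log:

from django.db import connection

def copy_sales(csv_path="./utf8.csv"):
    # COPY streams the whole file through one command, so PostgreSQL parses
    # rows server-side instead of receiving one enormous multi-row INSERT.
    with open(csv_path) as f, connection.cursor() as cursor:
        cursor.copy_expert(
            "COPY sales_sales (imc, order_number, original_order_date, count) "
            "FROM STDIN WITH (FORMAT csv, HEADER true)",
            f,
        )

This is roughly what django-postgres-copy's CopyMapping does in the posted solution, with the added convenience of mapping CSV header names onto model fields.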


Thank you for the reply. I am still working through your information. - joe
Trying this one: https://github.com/california-civic-data-coalition/django-postgres-copy - joe
You don't need that. You have shown in your other questions that you know how to use the psql console. That is all you need. - e4c5
Glad to be able to help. - e4c5
