R memory not released in Windows

Question

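I'm using RStudio on Windows 7 and I have a problem releasing memory back to the OS. Below is my code, which runs in a for loop:

I read data through API calls to the Census.gov website using the acs package, and save it to .csv files via a temporary object, table.
I delete table (usually a few MB) and check memory usage with the pryr package.

According to mem_used(), R always returns to a constant memory usage after table is deleted. According to Windows Task Manager, however, the memory allocated to rsession.exe (not RStudio) grows on every iteration and eventually crashes the rsession. Calling gc() does not help. I have read many similar questions, and the only way to free the memory seems to be restarting the R session, which seems silly. Any suggestions?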

   library(acs)
   library(pryr)
   # for loop to fetch tables from the API and save them as .csv files
   for (i in 128:length(tablecodes)) {
     tryCatch({
       table <- acs.fetch(table.number = tablecodes[i], endyear = 2014, span = 5,
                          geography = geo.make(state = "NY", county = "*", tract = "*"),
                          key = "e24539dfe0e8a5c5bf99d78a2bb8138abaa3b851", col.names = "pretty")
     }, error = function(e) { print("Table skipped") })

     # if the table was actually fetched, save it
     if (exists("table", mode = "S4")) {
       print(paste("Table", i, "fetched"))
       if (!is.na(table)) {
         write.csv(estimate(table), paste("./CENSUS_tables/NY/", tablecodes[i], ".csv", sep = ""))
       }
       print(mem_used())
       print(mem_change(rm(table)))
       gc()
     }
   }

- fbarian

Try calling gc() after the loop finishes. - JKJ

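This is probably caused by a memory leak: a package that allocates memory directly from the operating system (rather than through R's allocator) may not show up in mem_used but will show up in the system monitor. As far as I know, acs does not contain any C/C++ code, but it uses the XML package. Either acs is not freeing memory allocated by XML, or the XML package itself leaks memory (reportedly a known issue on Windows: http://www.omegahat.net/RSXML/). - Jan van der Laan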

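I'd suggest calling the API directly with httr. If you construct the API calls yourself, you shouldn't have any memory leak problems. I ran into the same issue with the acs package. - troh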

Have you tried replacing gc() with gc(T)? - Zeinab Ghaffarnasab

This might help: http://www.matthewckeller.com/html/memory.html - Zeinab Ghaffarnasab

1 Answer

Content provided by Stack Overflow.

- Technophobe01 · Accepted Answer

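I can confirm that the memory issue exists on Windows 7 (running via VMware Fusion on MacOSX). It also appears to exist on MacOSX, although there the memory usage grows quite gradually (unconfirmed, but indicative of a memory leak). It is slightly tricky to observe on MacOSX because the OS compresses memory when it sees high usage.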
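To check from outside R whether the process footprint really grows while mem_used() stays flat, one option on macOS or Linux is to sample the resident set size of the R process with ps. The following is only a rough sketch and was not part of the test described here: rss_kb and watch_rss are hypothetical helper names, and the PID is assumed to be whatever Sys.getpid() reports inside the R session. On Windows, Task Manager shows the same figure for rsession.exe.

```shell
#!/bin/bash
# Print the resident set size (in KB) of a process, as the OS sees it.
# Assumes a ps that supports "-o rss=" (macOS, Linux procps).
rss_kb() {
  ps -o rss= -p "$1" | tr -d ' '
}

# Print a timestamped RSS sample every INTERVAL seconds until the
# process exits. Usage: watch_rss PID [INTERVAL]
watch_rss() {
  pid=$1
  interval=${2:-5}
  # kill -0 sends no signal; it only tests whether the process exists.
  while kill -0 "$pid" 2>/dev/null; do
    echo "$(date +%T) $(rss_kb "$pid") KB"
    sleep "$interval"
  done
}
```

Running watch_rss with the R session's PID in a second terminal while the download loop runs would show whether the resident size climbs between iterations even though R reports a constant heap.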

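Given the above, my suggestion is to split the download set into smaller groups when you download the tables from the US Census Bureau. Why? Well, looking at your code, you are downloading the data only to store it in .csv files. So a short-term workaround is to split up the list of tables you are downloading. Your program should then be able to complete successfully over a set of runs.

One option is to create a wrapper script that performs N runs, each of which invokes a separate R session. That is, the wrapper calls N R sessions in series, and each session downloads N files.

Based on your code and the memory usage you observed, my sense is that you are downloading a lot of tables, so splitting the work across R sessions may be the best option.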

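Note: the following should work under Cygwin on Windows 7.

Calling script example: loops over primary tables 01 to 27; if a table does not exist, it is skipped...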

#!/bin/bash

#Ref: https://censusreporter.org/topics/table-codes/
# Params: Primary Table Year Span

for CensusTableCode in $(seq -w 1 27)
do
  R --no-save -q --slave < ./PullCensus.R --args B"$CensusTableCode"001 2014 5
done
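
The wrapper above starts one R session per table code. A variation that is closer to the "each session downloads N files" idea is to hand each fresh session a small batch of codes. The following is a minimal sketch using xargs, under stated assumptions: run_in_batches is a hypothetical helper name, and PullCensusBatch.R (used in the usage note) is a hypothetical variant of PullCensus.R that treats every trailing argument as a table code.

```shell
#!/bin/bash
# Read table codes from stdin (whitespace separated) and run COMMAND
# once per batch of at most BATCH_SIZE codes, appending the codes as
# arguments. Each invocation is a separate process.
# Usage: run_in_batches BATCH_SIZE COMMAND [ARG...]
run_in_batches() {
  batch_size=$1
  shift
  # xargs -n N passes at most N input items to each invocation.
  xargs -n "$batch_size" "$@"
}
```

For example, printf '%s\n' B01001 B02001 B19001 B19001A | run_in_batches 2 Rscript ./PullCensusBatch.R would start two R sessions with two table codes each. Whatever a session leaked is returned to the OS when that session exits.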

PullCensus.R

# Install each package if it is missing, then attach it
if (!require(acs)) { install.packages("acs"); library(acs) }
if (!require(pryr)) { install.packages("pryr"); library(pryr) }

# You can obtain a US Census key from the developer site
# "e24539dfe0e8a5c5bf99d78a2bb8138abaa3b851"
api.key.install(key = "** Secret**")

setwd("~/dev/stackoverflow/37264919")
    
# Extract Table Structure
#
# B = Detailed Column Breakdown
# 19 = Income (Households and Families)
# 001 =
# A - I = Race
#

args <- commandArgs(trailingOnly = TRUE) # trailingOnly = TRUE returns only the arguments after --args

if (length(args) != 0) {
  # Command-line arguments arrive as strings, so convert the numeric ones
  tableCodes <- args[1]
  defEndYear <- as.integer(args[2])
  defSpan <- as.integer(args[3])
} else {
  tableCodes <- c("B02001")
  defEndYear <- 2014
  defSpan <- 5
}

# for loop to fetch tables from the API and save them as .csv files
for (i in seq_along(tableCodes))
{
  tryCatch(
    table <- acs.fetch(table.number = tableCodes[i],
                       endyear = defEndYear,
                       span = defSpan,
                       geography = geo.make(state = "NY",
                                            county = "*",
                                            tract = "*"),
                       col.names = "pretty"),
    error = function(e) { print("Table skipped")} )

  # if the table is actually fetched then we save it
  if (exists("table", mode = "S4"))
  {
    print(paste("Table", i, "fetched"))
    if (!is.na(table))
    {
      write.csv(estimate(table), paste(defEndYear,"_",tableCodes[i], ".csv", sep = ""))
    }
    print(mem_used())
    print(mem_change(rm(table)))
    gc(reset = TRUE)
    print(mem_used())
  }
}

I hope the above helps as an example of one approach. It is an approach. ;-)

T.

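Next steps:

I will take a look at the package source to see what is actually going wrong. Alternatively, you may be able to narrow it down yourself and file a bug report against the package.

Background / working example:

My sense is that a working code example may help to frame the workaround above. Why? The aim is to give people an example they can use to test and reason about what is happening. It also makes it easier to understand your question and intent.

Essentially, (as I understand it) you are bulk downloading US Census data from the US Census website. The table codes specify which data to download. OK, so I just created a set of table codes and tested the memory usage to see whether memory is consumed in the way you described.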

library(acs)
library(pryr)
library(tigris)
library(stringr)  # to pad fips codes
library(maptools)

# You can obtain a US Census key from the developer site
# "e24539dfe0e8a5c5bf99d78a2bb8138abaa3b851"
api.key.install(key = "<INSERT KEY HERE>")

# Table Codes
#
# While Census Reporter hopes to save you from the details, you may be
# interested to understand some of the rationale behind American Community
# Survey table identifiers.
#
# Detailed Tables
#
# The bulk of the American Community Survey is the over 1400 detailed data
# tables. These tables have reference codes, and knowing how the codes are
# structured can be helpful in knowing which table to use.
#
# Codes start with either the letter B or C, followed by two digits for the
# table subject, then 3 digits that uniquely identify the table. (For a small
# number of technical tables the unique identifier is 4 digits.) In some cases
# additional letters for racial iterations and Puerto Rico-specific tables.
#
# Full and Collapsed Tables
#
# Tables beginning with B have the most detailed column breakdown, while a
# C table for the same numbers will have fewer columns. For example, the
# B02003 table ("Detailed Race") has 71 columns, while the "collapsed
# version," C02003 has only 19 columns. While your instinct may be to want
# as much data as possible, sometimes choosing the C table can simplify
# your analysis.
#
# Table subjects
#
# The first two digits after B/C indicate the broad subject of a table.
# Note that many tables have more than one subject, but this reflects the
# main subject.
#
# 01 Age and Sex
# 02 Race
# 03 Hispanic Origin
# 04 Ancestry
# 05 Foreign Born; Citizenship; Year of Entry; Nativity
# 06 Place of Birth
# 07 Residence 1 Year Ago; Migration
# 08 Journey to Work; Workers' Characteristics; Commuting
# 09 Children; Household Relationship
# 10 Grandparents; Grandchildren
# 11 Household Type; Family Type; Subfamilies
# 12 Marital Status and History
# 13 Fertility
# 14 School Enrollment
# 15 Educational Attainment
# 16 Language Spoken at Home and Ability to Speak English
# 17 Poverty
# 18 Disability
# 19 Income (Households and Families)
# 20 Earnings (Individuals)
# 21 Veteran Status
# 22 Transfer Programs (Public Assistance)
# 23 Employment Status; Work Experience; Labor Force
# 24 Industry; Occupation; Class of Worker
# 25 Housing Characteristics
# 26 Group Quarters
# 27 Health Insurance
#
# Three groups of tables reflect technical details about how the Census is
# administered. In general, you probably don't need to look at these too
# closely, but if you need to check for possible weaknesses in your data
# analysis, they may come into play.
#
# 00 Unweighted Count
# 98 Quality Measures
# 99 Imputations
#
# Race and Latino Origin
#
# Many tables are provided in multiple racial tabulations. If a table code
# ends in a letter from A-I, that code indicates that the table universe is
# restricted to a subset based on responses to the race or
# Hispanic/Latino-origin questions.
#
# Here is a guide to those codes:
#
#   A White alone
#   B Black or African American Alone
#   C American Indian and Alaska Native Alone
#   D Asian Alone
#   E Native Hawaiian and Other Pacific Islander Alone
#   F Some Other Race Alone
#   G Two or More Races
#   H White Alone, Not Hispanic or Latino
#   I Hispanic or Latino


setwd("~/dev/stackoverflow/37264919")

# Extract Table Structure
#
# B = Detailed Column Breakdown
# 19 = Income (Households and Families)
# 001 =
# A - I = Race
#
tablecodes <- c("B19001", "B19001A", "B19001B", "B19001C", "B19001D",
                "B19001E", "B19001F", "B19001G", "B19001H", "B19001I" )

# for loop to fetch tables from the API and save them as .csv files
for (i in 1:length(tablecodes))
{
  print(tablecodes[i])
  tryCatch(
    table <- acs.fetch(table.number = tablecodes[i],
                       endyear = 2014,
                       span = 5,
                       geography = geo.make(state = "NY",
                                            county = "*",
                                            tract = "*"),
                       col.names = "pretty"),
    error = function(e) { print("Table skipped")} )

  # if the table is actually fetched then we save it
  if (exists("table", mode="S4"))
  {
    print(paste("Table", i, "fetched"))
    if (!is.na(table))
    {
      write.csv(estimate(table), paste("T",tablecodes[i], ".csv", sep = ""))
    }
    print(mem_used())
    print(mem_change(rm(table)))
    gc()
    print(mem_used())
  }
}

Runtime output

> library(acs)
> library(pryr)
> library(tigris)
> library(stringr)  # to pad fips codes
> library(maptools)
> # You can obtain a US Census key from the developer site
> # "e24539dfe0e8a5c5bf99d78a2bb8138abaa3b851"
> api.key.install(key = "...secret...")
> 
...
> setwd("~/dev/stackoverflow/37264919")
> 
> # Extract Table Structure
> #
> # B = Detailed Column Breakdown
> # 19 = Income (Households and Families)
> # 001 =
> # A - I = Race
> #
> tablecodes <- c("B19001", "B19001A", "B19001B", "B19001C", "B19001D",
+                 "B19001E", "B19001F", "B19001G", "B19001H", "B19001I" )
> 
> # for loop to extract tables from API and save them on API
> for (i in 1:length(tablecodes))
+ {
+   print(tablecodes[i])
+   tryCatch(
+     table <- acs.fetch(table.number = tablecodes[i],
+                        endyear = 2014,
+                        span = 5,
+                        geography = geo.make(state = "NY",
+                                             county = "*",
+                                             tract = "*"),
+                        col.names = "pretty"),
+     error = function(e) { print("Table skipped")} )
+ 
+   # if the table is actually fetched then we save it
+   if (exists("table", mode="S4"))
+   {
+     print(paste("Table", i, "fetched"))
+     if (!is.na(table))
+     {
+       write.csv(estimate(table), paste("T",tablecodes[i], ".csv", sep = ""))
+     }
+     print(mem_used())
+     print(mem_change(rm(table)))
+     gc()
+     print(mem_used())
+   }
+ }
[1] "B19001"
[1] "Table 1 fetched"
95.4 MB
-1.88 MB
93.6 MB
[1] "B19001A"
[1] "Table 2 fetched"
95.4 MB
-1.88 MB
93.6 MB
[1] "B19001B"
[1] "Table 3 fetched"
95.5 MB
-1.88 MB
93.6 MB
[1] "B19001C"
[1] "Table 4 fetched"
95.5 MB
-1.88 MB
93.6 MB
[1] "B19001D"
[1] "Table 5 fetched"
95.5 MB
-1.88 MB
93.6 MB
[1] "B19001E"
[1] "Table 6 fetched"
95.5 MB
-1.88 MB
93.6 MB
[1] "B19001F"
[1] "Table 7 fetched"
95.5 MB
-1.88 MB
93.6 MB
[1] "B19001G"
[1] "Table 8 fetched"
95.5 MB
-1.88 MB
93.6 MB
[1] "B19001H"
[1] "Table 9 fetched"
95.5 MB
-1.88 MB
93.6 MB
[1] "B19001I"
[1] "Table 10 fetched"
95.5 MB
-1.88 MB
93.6 MB

Output files

>ll
total 8520
drwxr-xr-x@ 13 hidden  staff   442B Oct 17 20:41 .
drwxr-xr-x@ 40 hidden  staff   1.3K Oct 17 23:17 ..
-rw-r--r--@  1 hidden  staff   4.4K Oct 17 23:43 37264919.R
-rw-r--r--@  1 hidden  staff   492K Oct 17 23:50 TB19001.csv
-rw-r--r--@  1 hidden  staff   472K Oct 17 23:51 TB19001A.csv
-rw-r--r--@  1 hidden  staff   414K Oct 17 23:51 TB19001B.csv
-rw-r--r--@  1 hidden  staff   387K Oct 17 23:51 TB19001C.csv
-rw-r--r--@  1 hidden  staff   403K Oct 17 23:51 TB19001D.csv
-rw-r--r--@  1 hidden  staff   386K Oct 17 23:51 TB19001E.csv
-rw-r--r--@  1 hidden  staff   402K Oct 17 23:51 TB19001F.csv
-rw-r--r--@  1 hidden  staff   393K Oct 17 23:52 TB19001G.csv
-rw-r--r--@  1 hidden  staff   465K Oct 17 23:44 TB19001H.csv
-rw-r--r--@  1 hidden  staff   417K Oct 17 23:44 TB19001I.csv