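Contents

Memory usage statistics
NumPy memory optimization
- Converting data types
- Using sparse matrices
Pandas memory optimization
Model memory optimization
Summary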

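This article is reposted from the WeChat public account Coggle数据科学 (Coggle Data Science). Copyright belongs to the original author.

On Kaggle and in everyday work, memory is always limited. How do we get code to run within that limit? This article collects some approaches.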

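Memory usage statistics

Before optimizing anything, measure how much memory the current process is using. The following function reports it in GB.

import os

import numpy as np
import psutil

def cpu_stats():
    pid = os.getpid()
    py = psutil.Process(pid)
    # memory_info()[0] is the resident set size in bytes; convert to GB
    memory_use = py.memory_info()[0] / 2. ** 30
    return 'memory GB:' + str(np.round(memory_use, 2))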
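As a complement to the process-level figure above, the standard library's tracemalloc module can report the peak of Python-level allocations around one specific piece of code. The following is a minimal sketch, not part of the original article: the helper name `peak_allocated_mb` and the `build_list` workload are assumptions made up for illustration.

```python
import tracemalloc


def peak_allocated_mb(func, *args, **kwargs):
    """Run func and return (result, peak traced allocation in MB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        # get_traced_memory() returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 2.0 ** 20


def build_list(n):
    # Hypothetical workload: a list of n Python integers.
    return list(range(n))


values, peak_mb = peak_allocated_mb(build_list, 100_000)
print("peak MB:", round(peak_mb, 2))
```

Unlike the psutil figure, this only counts allocations made through Python's allocator while tracing is active, so the two numbers are not directly comparable.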

For data loaded with pandas, the following calls show memory usage:

# Overall memory usage
df.info(memory_usage="deep")

# Per-column memory usage
df.memory_usage()
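The difference between the default and the "deep" measurement matters most for string columns. The sketch below uses a made-up two-column frame (the column names and values are assumptions for illustration) to show that the default count covers only the 8-byte pointers of an object column, while `deep=True` also counts the Python string objects they point to.

```python
import pandas as pd

# Hypothetical example frame: one numeric column and one string column.
df = pd.DataFrame({
    "id": range(1000),
    "city": pd.Series(["Shanghai", "Beijing"] * 500, dtype=object),
})

shallow = df.memory_usage()          # object columns: pointer size only
deep = df.memory_usage(deep=True)    # also counts the Python string objects

print(shallow["city"], deep["city"])
```

For numeric columns the two figures are identical, so the deep measurement is only needed when object columns are present.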

For a whole application, the filprofiler tool can be used to find where peak memory usage occurs.

https://github.com/pythonspeed/filprofiler

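NumPy memory optimization

Converting data types

NumPy supports many data types, and their memory footprints differ considerably. A uint64 array takes four times the memory of a uint16 array of the same shape.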

>>> import numpy as np
>>> uint64arr = np.ones((1024, 1024), dtype=np.uint64)
>>> uint64arr.nbytes
8388608

>>> uint16arr = np.ones((1024, 1024), dtype=np.uint16)
>>> uint16arr.nbytes
2097152

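Choose the data type according to the range of values in the array. For integers, pick the narrowest of the following common types that covers your data.

Type	Range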
int8	(-128 to 127)
int16	(-32768 to 32767)
int32	(-2147483648 to 2147483647)
int64	(-9223372036854775808 to 9223372036854775807)
uint8	(0 to 255)
uint16	(0 to 65535)
uint32	(0 to 4294967295)
uint64	(0 to 18446744073709551615)
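The table lookup can be automated with `np.iinfo`, which reports the limits of each integer type. The following is a minimal sketch under the assumption that the data contains only integers; the helper name `smallest_int_dtype` and the sample `ages` array are hypothetical. Checking the range first matters because a plain `astype` does not raise on overflow and silently wraps values instead.

```python
import numpy as np

# Candidate dtypes ordered from narrowest to widest; the unsigned type comes
# first at each width so that non-negative data prefers it.
_CANDIDATES = (np.uint8, np.int8, np.uint16, np.int16,
               np.uint32, np.int32, np.uint64, np.int64)


def smallest_int_dtype(values):
    """Return the narrowest integer dtype that holds every value."""
    values = np.asarray(values)
    lo, hi = int(values.min()), int(values.max())
    for dtype in _CANDIDATES:
        info = np.iinfo(dtype)
        if info.min <= lo and hi <= info.max:
            return np.dtype(dtype)
    raise ValueError("values do not fit in any 64-bit integer dtype")


ages = np.array([0, 17, 42, 101])
compact = ages.astype(smallest_int_dtype(ages))
print(compact.dtype, compact.nbytes)  # uint8 4
```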

For floating-point data, consider float16, float32, or float64. See the NumPy documentation for the full list of supported data types.

https://numpy.org/devdocs/user/basics.types.html
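Narrower floats save memory at the cost of range and precision, and `np.finfo` shows what each type can represent. The short sketch below is an illustration added here, not code from the original article; it prints the limits of the three types and shows one concrete precision loss in float32.

```python
import numpy as np

for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    # bits: storage width, precision: approximate decimal digits,
    # max: largest finite value of the type
    print(dtype.__name__, info.bits, info.precision, info.max)

# float32 has a 24-bit significand, so integers above 2**24 are not all exact.
big = np.float32(2 ** 24 + 1)
print(big == np.float32(2 ** 24))  # True
```

If a column stores large identifiers or timestamps as floats, this kind of rounding is a reason to keep it in float64 or convert it to an integer type instead.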

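Using sparse matrices

If most entries of a matrix are zero, consider storing it as a sparse matrix. LightGBM and XGBoost can both train on sparse matrices.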

>>> import sparse; import numpy as np
>>> arr = np.random.random((1024, 1024))
>>> arr[arr < 0.9] = 0
>>> sparse_arr = sparse.COO(arr)

>>> arr.nbytes
8388608

>>> sparse_arr.nbytes
2514648
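The example above uses the `sparse` package. A similar measurement can be made with SciPy's CSR format, which is a different library swapped in here because SciPy sparse matrices are a common input format for gradient boosting libraries. This is a sketch under that assumption; the fixed seed and variable names are chosen for illustration.

```python
import numpy as np
from scipy import sparse as sp

rng = np.random.default_rng(0)
dense = rng.random((1024, 1024))
dense[dense < 0.9] = 0            # keep roughly 10% of the entries

csr = sp.csr_matrix(dense)

# CSR stores the non-zero values plus two index arrays.
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
print(dense.nbytes, csr_bytes)
```

The saving depends on the density: with most entries non-zero, the index arrays can make the sparse form larger than the dense one.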

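Pandas memory optimization

Reading in chunks

If the data file is very large, read it in batches and set chunksize to control the number of rows per batch. read_csv then returns an iterator of DataFrames rather than a single DataFrame.

import pandas as pd

df = pd.read_csv(path, chunksize=1000000)

for chunk in df:
    # process the data one chunk at a time
    pass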
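Chunked reading pays off when each chunk can be reduced to a small result before the next one is loaded. The sketch below illustrates this with a running sum; the in-memory CSV text stands in for a large file on disk and is an assumption made so the example is self-contained.

```python
import io

import pandas as pd

# Hypothetical stand-in for a large file on disk.
csv_text = "a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10))

total = 0
rows = 0
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=4):
    # Each chunk is an ordinary DataFrame with at most 4 rows.
    total += chunk["b"].sum()
    rows += len(chunk)

print(rows, total)  # 10 90
```

Only one chunk and the two running totals are alive at any time, so peak memory is bounded by the chunk size rather than the file size.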

Reading only selected columns

df = pd.read_csv(path, usecols=["a"])

Setting column types up front

df = pd.read_csv(path, dtype={"a":"int8"})
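The two options can be combined in a single call. The following sketch uses a tiny in-memory CSV (its contents and column names are made up for illustration) to show that the unused column is dropped and the remaining ones are parsed directly into narrow types.

```python
import io

import pandas as pd

csv_text = "a,b,c\n1,0.5,x\n2,1.5,y\n3,2.5,z\n"

df = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["a", "b"],                    # column "c" is not loaded
    dtype={"a": "int8", "b": "float32"},   # parsed straight into narrow types
)
print(df.dtypes)
print(df.memory_usage(index=False).sum())  # 15 bytes: 3 * 1 + 3 * 4
```

Setting the types at read time avoids first building a wider int64/float64 frame and converting it afterwards.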

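Converting categorical columns to the category type

df['a'] = df['a'].astype('category')

This is very effective for columns with few distinct values, where the compression ratio can be large. In addition, once a column has the category type, LightGBM can treat it as a categorical feature during training.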
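The size of the effect can be checked with `memory_usage(deep=True)`. The sketch below uses a hypothetical column of 100,000 rows with only three distinct labels; the data is made up for illustration.

```python
import pandas as pd

# Hypothetical column: 100,000 rows but only three distinct labels.
labels = pd.Series(["red", "green", "blue", "red"] * 25000, dtype=object)

as_object = labels.memory_usage(deep=True)
as_category = labels.astype("category").memory_usage(deep=True)
print(as_object, as_category)
```

A category column stores each distinct label once plus one small integer code per row, so the saving shrinks as the number of distinct values approaches the number of rows.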

Automatically detecting and converting types

import numpy as np

def reduce_mem_usage(props):
    start_mem_usg = props.memory_usage().sum() / 1024**2
    print("Memory usage of properties dataframe is :", start_mem_usg, " MB")
    NAlist = []  # Keeps track of columns that have missing values filled in.
    for col in props.columns:
        if props[col].dtype != object:  # Exclude strings
            # Print current column type
            print("")
            print("Column: ", col)
            print("dtype before: ", props[col].dtype)
            # make variables for Int, max and min
            IsInt = False
            mx = props[col].max()
            mn = props[col].min()
            # Integer does not support NA, therefore, NA needs to be filled
            if not np.isfinite(props[col]).all():
                NAlist.append(col)
                # Assign the result back instead of filling in place, and keep
                # mn in sync with the fill value so the range checks stay valid.
                mn = mn - 1
                props[col] = props[col].fillna(mn)
            # test if column can be converted to an integer
            asint = props[col].fillna(0).astype(np.int64)
            result = (props[col] - asint)
            result = result.sum()
            if result > -0.01 and result < 0.01:
                IsInt = True

            # Make Integer/unsigned Integer datatypes
            if IsInt:
                if mn >= 0:
                    if mx < 255:
                        props[col] = props[col].astype(np.uint8)
                    elif mx < 65535:
                        props[col] = props[col].astype(np.uint16)
                    elif mx < 4294967295:
                        props[col] = props[col].astype(np.uint32)
                    else:
                        props[col] = props[col].astype(np.uint64)
                else:
                    if mn > np.iinfo(np.int8).min and mx < np.iinfo(np.int8).max:
                        props[col] = props[col].astype(np.int8)
                    elif mn > np.iinfo(np.int16).min and mx < np.iinfo(np.int16).max:
                        props[col] = props[col].astype(np.int16)
                    elif mn > np.iinfo(np.int32).min and mx < np.iinfo(np.int32).max:
                        props[col] = props[col].astype(np.int32)
                    elif mn > np.iinfo(np.int64).min and mx < np.iinfo(np.int64).max:
                        props[col] = props[col].astype(np.int64)
            # Make float datatypes 32 bit
            else:
                props[col] = props[col].astype(np.float32)

            # Print new column type
            print("dtype after: ", props[col].dtype)
            print("")

    # Print final result
    print("___MEMORY USAGE AFTER COMPLETION:___")
    mem_usg = props.memory_usage().sum() / 1024**2
    print("Memory usage is: ", mem_usg, " MB")
    print("This is ", 100 * mem_usg / start_mem_usg, "% of the initial size")
    return props, NAlist

https://www.kaggle.com/arjanso/reducing-dataframe-memory-size-by-65
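For frames without missing values, a shorter route to a similar result is pandas' own `pd.to_numeric` with its `downcast` argument. The sketch below is an alternative added for illustration, not the method from the linked notebook; the helper name `downcast_numeric` and the sample frame are assumptions, and unlike the function above it returns a copy and does not fill missing values.

```python
import numpy as np
import pandas as pd


def downcast_numeric(df):
    """Return a copy of df with numeric columns downcast where possible."""
    out = df.copy()
    for col in out.columns:
        series = out[col]
        if pd.api.types.is_integer_dtype(series):
            # Unsigned types cover twice the positive range at the same width.
            kind = "unsigned" if series.min() >= 0 else "integer"
            out[col] = pd.to_numeric(series, downcast=kind)
        elif pd.api.types.is_float_dtype(series):
            out[col] = pd.to_numeric(series, downcast="float")
    return out


df = pd.DataFrame({
    "count": np.arange(1000, dtype=np.int64),
    "delta": np.array([-5, 5] * 500, dtype=np.int64),
    "score": np.linspace(0.0, 1.0, 1000),
    "name": ["x"] * 1000,
})
small = downcast_numeric(df)
print(small.dtypes)
```

Non-numeric columns such as `name` are left untouched; they are better handled with the category conversion described earlier.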

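Combining with numpy.memmap

numpy.memmap lets you reserve space for an array on disk ahead of time without loading it into memory, and it supports repeated writes.

You can therefore process the data one column at a time, store each result on disk, and read the array back into memory once processing is finished.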

https://www.kaggle.com/c/talkingdata-adtracking-fraud-detection/discussion/56105

https://numpy.org/doc/stable/reference/generated/numpy.memmap.html
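The column-by-column workflow can be sketched as follows. The file name, array shape, and the constant values written to each column are assumptions made for illustration; in practice each column would come from your own preprocessing step.

```python
import os
import tempfile

import numpy as np

n_rows, n_cols = 1000, 3
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "features.dat")  # hypothetical file name

# mode="w+" creates the file and reserves n_rows * n_cols * 4 bytes on disk.
store = np.memmap(path, dtype=np.float32, mode="w+", shape=(n_rows, n_cols))
for j in range(n_cols):
    # Build one column at a time and write it into its slot on disk.
    column = np.full(n_rows, j, dtype=np.float32)
    store[:, j] = column
store.flush()
del store  # release the writable mapping

# Reopen read-only. Shape and dtype must be supplied again because the raw
# file stores no metadata.
loaded = np.memmap(path, dtype=np.float32, mode="r", shape=(n_rows, n_cols))
print(loaded[:2])
```

Because the file carries no header, the shape and dtype need to be recorded somewhere else, for example in a small side file next to the data.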

Model memory optimization

XGBoost

You can store the dataset in libsvm format and train with the External Memory Version, or train from the command line.

https://xgboost.readthedocs.io/en/latest/tutorials/external_memory.html
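The libsvm text format is one line per row, written as `label index:value ...` with zero entries omitted. The following is a minimal sketch of a writer, not XGBoost's own tooling: the helper name `write_libsvm` is hypothetical, and whether feature indices start at 0 or 1 is a convention that must match the consumer, so it is exposed here as the assumed parameter `index_base`. How the resulting file is then passed to XGBoost is version dependent and is covered by the tutorial linked above.

```python
import io

import numpy as np


def write_libsvm(X, y, fh, index_base=0):
    """Write rows as 'label index:value ...', skipping zero entries."""
    for row, label in zip(X, y):
        nonzero = np.flatnonzero(row)
        pairs = " ".join(
            f"{j + index_base}:{float(row[j])!r}" for j in nonzero
        )
        fh.write(f"{float(label)!r} {pairs}".rstrip() + "\n")


X = np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 3.0]])
y = np.array([1, 0])
buf = io.StringIO()
write_libsvm(X, y, buf)
print(buf.getvalue())
```

Writing to an open file handle instead of the `StringIO` buffer lets the rows be streamed to disk one at a time.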

LightGBM

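Letting LightGBM's own Dataset read the training file directly works better than passing NumPy or Pandas data. Converting in-memory data to a Dataset also helps to some extent.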

https://lightgbm.readthedocs.io/en/latest/Python-Intro.html

Set the histogram_pool_size parameter to limit memory use. Reducing num_leaves and max_bin also lowers memory consumption.

https://lightgbm.readthedocs.io/en/latest/FAQ.html?highlight=Multiple#when-running-lightgbm-on-a-large-dataset-my-computer-runs-out-of-ram
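Put together, the three settings might look like the parameter dictionary below. The specific values, the objective, and the file path in the comment are assumptions chosen for illustration; suitable values depend on the dataset and the available memory.

```python
# Hypothetical starting values; tune them for your own data and memory budget.
params = {
    "objective": "binary",
    "histogram_pool_size": 1024,  # cap on the histogram cache, in MB
    "num_leaves": 15,             # fewer leaves than the default of 31
    "max_bin": 63,                # fewer feature bins than the default of 255
}

# With LightGBM installed, the dict would be passed along these lines:
#   import lightgbm as lgb
#   train_set = lgb.Dataset("train.txt")   # hypothetical path
#   booster = lgb.train(params, train_set)
print(params)
```

Smaller num_leaves and max_bin values can reduce accuracy, so it is worth checking validation scores after each change.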

Deep learning models

If you use a deep learning model, consider a DataLoader-style approach that reads the data into memory batch by batch.
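The core idea can be sketched without any deep learning framework. The generator below is a simplified stand-in for a framework DataLoader (no shuffling, no worker processes); the name `iter_batches` and the sample array are assumptions for illustration.

```python
import numpy as np


def iter_batches(data, batch_size):
    """Yield consecutive slices of data, batch_size rows at a time."""
    for start in range(0, len(data), batch_size):
        # Slicing an ndarray or memmap returns a view, so with a memmap only
        # the rows that are actually touched need to be read from disk.
        yield data[start:start + batch_size]


data = np.arange(10 * 4, dtype=np.float32).reshape(10, 4)
shapes = [batch.shape for batch in iter_batches(data, batch_size=4)]
print(shapes)  # [(4, 4), (4, 4), (2, 4)]
```

Passing a numpy.memmap as `data` combines this with the on-disk storage described earlier, so the full dataset never has to fit in memory.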

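Summary

Inspect the rows and columns, and read only the data you need;
Check the data types and convert them to narrower ones;
Process the data in batches or with the help of disk storage.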

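This article was reposted by Molly on the 国际竞赛联盟 (International Competition Alliance) platform. Copyright belongs to the original author; in case of infringement, please contact us directly via WeChat.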
