上一篇
本文目录导读:
🎲 数据库随机抽取大揭秘!从n个数中秒选m个的N种姿势 🎲
MySQL/PostgreSQL
-- 简单粗暴版(小数据量友好) SELECT * FROM your_table ORDER BY RAND()/RANDOM() LIMIT m;
⚠️ 注意:大数据量慎用!RAND()会全表排序,百万级数据直接卡成PPT🐢
SQL Server
-- GUID大法(效率王者👑) SELECT TOP m * FROM your_table ORDER BY NEWID();
Oracle
-- 11g+专属姿势 SELECT * FROM your_table ORDER BY DBMS_RANDOM.VALUE FETCH FIRST m ROWS ONLY;
分区+随机:先随机选分区,再抽样
-- 示例:按日期分区后抽样 WITH partitions AS ( SELECT *, DATE_TRUNC('day', create_time) AS p_day FROM your_table ) SELECT * FROM partitions WHERE p_day = (SELECT p_day FROM partitions ORDER BY RAND() LIMIT 1) LIMIT m;
索引加速:对自增主键表
-- 生成随机ID范围查询 SELECT * FROM your_table WHERE id BETWEEN FLOOR(RAND() * (SELECT MAX(id) FROM your_table)) AND m;
洗牌算法(Fisher-Yates)
Java版示例:
List<Integer> list = new ArrayList<>(n); for(int i=0; i<n; i++) list.add(i); Collections.shuffle(list); // 一键洗牌! List<Integer> result = list.subList(0, m);
水库抽样(Reservoir Sampling)
适用场景:流式数据/超大n值
def reservoir_sample(stream, m): reservoir = [] for i, num in enumerate(stream): if i < m: reservoir.append(num) else: j = random.randint(0, i) if j < m: reservoir[j] = num return reservoir
Pandas实战:
# 按'category'列分层,每层抽m/n比例 stratified_sample = df.groupby('category').apply(lambda x: x.sample(frac=m/n)).reset_index(drop=True)
SQL实现:
-- 使用NTILE分桶后抽样 WITH numbered AS ( SELECT *, NTILE(100) OVER (ORDER BY RAND()) AS bucket FROM your_table ) SELECT * FROM numbered WHERE bucket <= m;
REPLACE=False
(Pandas)或WITHOUT REPLACEMENT
(SQL)np.random.seed(42)
保证结果可复现📌 :小数据量直接ORDER BY RAND()
,大数据量用分区/索引,超大规模选水库抽样,不均衡数据必上分层法!🚀
本文由 业务大全 于2025-08-15发表在【云服务器提供商】,文中图片由(业务大全)上传,本平台仅提供信息存储服务;作者观点、意见不代表本站立场,如有侵权,请联系我们删除;若有图片侵权,请您准备原始证明材料和公证书后联系我方删除!
本文链接:https://xdh.7tqx.com/wenda/621963.html
发表评论