Implementing Normality Tests for Data in Python

Date: 2020-05-26


In data analysis and statistics, we often need to check whether data are normally distributed, because many methods, such as the t-test, assume normality.

In Python, the main ways to test for normality are:

1. scipy.stats.shapiro (Shapiro-Wilk test): a function dedicated to normality testing. Its null hypothesis is that the sample data follow a normal distribution.

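Note: suited to small samples. The SciPy documentation warns that for more than 5000 observations the p-value may not be accurate.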

Its function definition is:

def shapiro(x):
  """
  Perform the Shapiro-Wilk test for normality.

  The Shapiro-Wilk test tests the null hypothesis that the
  data was drawn from a normal distribution.

  Parameters
  ----------
  x : array_like
    Array of sample data.

  Returns
  -------
  W : float
    The test statistic.
  p-value : float
    The p-value for the hypothesis test.

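The parameter x is the sequence of sample values. The first return value is the test statistic and the second is the p-value. When the p-value is greater than the chosen significance level, the null hypothesis cannot be rejected. This means the data are consistent with normality, not that normality has been proven.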
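The decision rule above can be sketched as follows. This is a minimal illustration, not part of the original article: the sample, the fixed seed, and the significance level alpha = 0.05 are all assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

# Hypothetical sample: 50 draws from a normal distribution (seed fixed for repeatability)
rng = np.random.default_rng(0)
sample = rng.normal(loc=0, scale=2, size=50)

alpha = 0.05  # assumed significance level
w_stat, p_value = stats.shapiro(sample)

# Compare the p-value with alpha to decide whether H0 (normality) is rejected
if p_value > alpha:
    verdict = "fail to reject H0: no evidence against normality"
else:
    verdict = "reject H0: the sample does not look normal"
print(f"W = {w_stat:.4f}, p = {p_value:.4f} -> {verdict}")
```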

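2. scipy.stats.kstest (Kolmogorov-Smirnov test): can test against many distributions, not only the normal distribution. Its null hypothesis is that the data follow the specified distribution, which is the normal distribution when cdf='norm'.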

Its function definition is:

def kstest(rvs, cdf, args=(), N=20, alternative='two-sided', mode='approx'):
  """
  Perform the Kolmogorov-Smirnov test for goodness of fit.

  This performs a test of the distribution G(x) of an observed
  random variable against a given distribution F(x). Under the null
  hypothesis the two distributions are identical, G(x)=F(x). The
  alternative hypothesis can be either 'two-sided' (default), 'less'
  or 'greater'. The KS test is only valid for continuous distributions.

  Parameters
  ----------
  rvs : str, array or callable
    If a string, it should be the name of a distribution in `scipy.stats`.
    If an array, it should be a 1-D array of observations of random
    variables.
    If a callable, it should be a function to generate random variables;
    it is required to have a keyword argument `size`.
  cdf : str or callable
    If a string, it should be the name of a distribution in `scipy.stats`.
    If `rvs` is a string then `cdf` can be False or the same as `rvs`.
    If a callable, that callable is used to calculate the cdf.
  args : tuple, sequence, optional
    Distribution parameters, used if `rvs` or `cdf` are strings.
  N : int, optional
    Sample size if `rvs` is string or callable. Default is 20.
  alternative : {'two-sided', 'less','greater'}, optional
    Defines the alternative hypothesis (see explanation above).
    Default is 'two-sided'.
  mode : 'approx' (default) or 'asymp', optional
    Defines the distribution used for calculating the p-value.

     - 'approx' : use approximation to exact distribution of test statistic
     - 'asymp' : use asymptotic distribution of test statistic

  Returns
  -------
  statistic : float
    KS test statistic, either D, D+ or D-.
  pvalue : float
    One-tailed or two-tailed p-value.

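The parameters are:

rvs: the data to test.

cdf: the distribution to test against, for example 'norm', 'expon', 'rayleigh' or 'gamma'. Setting it to 'norm' selects the normal distribution; unless args supplies a mean and standard deviation, this is the standard normal distribution (mean 0, standard deviation 1).

alternative: two-sided by default; set it to 'less' or 'greater' for a one-sided test.

mode: 'approx' (the default) uses an approximation to the exact distribution of the test statistic; 'asymp' uses the asymptotic distribution of the test statistic.

The first return value is the statistic and the second is the p-value.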
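The effect of the cdf and args parameters can be illustrated with a short sketch. The sample below is hypothetical (500 draws with mean 0 and standard deviation 2, fixed seed); the point is that 'norm' alone tests against the standard normal, so data from a wider normal distribution is rejected unless the parameters are passed through args.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.normal(loc=0, scale=2, size=500)  # hypothetical sample from N(0, 2^2)

# cdf='norm' with no args means the standard normal N(0, 1)
d_std, p_std = stats.kstest(data, 'norm')

# args=(loc, scale) is forwarded to the named distribution
d_fit, p_fit = stats.kstest(data, 'norm', args=(0, 2))

print(f"against N(0, 1):   D = {d_std:.4f}, p = {p_std:.4g}")
print(f"against N(0, 2^2): D = {d_fit:.4f}, p = {p_fit:.4g}")
```

Here the parameters are known in advance. If the mean and standard deviation are instead estimated from the same sample, the K-S p-value is no longer valid as computed and tends to be too large.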

3. scipy.stats.normaltest: a normality test whose null hypothesis is that the sample comes from a normal distribution.

Its function definition is:

def normaltest(a, axis=0, nan_policy='propagate'):
  """
  Test whether a sample differs from a normal distribution.

  This function tests the null hypothesis that a sample comes
  from a normal distribution. It is based on D'Agostino and
  Pearson's [1]_, [2]_ test that combines skew and kurtosis to
  produce an omnibus test of normality.


  Parameters
  ----------
  a : array_like
    The array containing the sample to be tested.
  axis : int or None, optional
    Axis along which to compute test. Default is 0. If None,
    compute over the whole array `a`.
  nan_policy : {'propagate', 'raise', 'omit'}, optional
    Defines how to handle when input contains nan. 'propagate' returns nan,
    'raise' throws an error, 'omit' performs the calculations ignoring nan
    values. Default is 'propagate'.

  Returns
  -------
  statistic : float or array
    ``s^2 + k^2``, where ``s`` is the z-score returned by `skewtest` and
    ``k`` is the z-score returned by `kurtosistest`.
  pvalue : float or array
    A 2-sided chi squared probability for the hypothesis test.

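Its parameters:

axis: the axis along which the test is computed. The default is 0; axis=None runs the test over the whole array.

nan_policy: how to handle nan in the input. 'propagate' returns nan, 'raise' throws an error, and 'omit' ignores the nan values.

The first return value is the statistic and the second is the p-value.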
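A small sketch of how axis and nan_policy change the result. The two-column array is a made-up example (one normal column, one uniform column), not data from the article.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: 100 rows; column 0 is normal, column 1 is uniform
table = np.column_stack([rng.normal(size=100), rng.uniform(size=100)])

# axis=0 (default): one test per column, so two statistics and two p-values
stat_cols, p_cols = stats.normaltest(table, axis=0)
print(p_cols.shape)  # (2,)

# axis=None: a single test over the flattened array
stat_all, p_all = stats.normaltest(table, axis=None)

# nan_policy: 'propagate' returns nan, 'omit' ignores the missing value
with_nan = np.append(table[:, 0], np.nan)
_, p_propagate = stats.normaltest(with_nan)
_, p_omit = stats.normaltest(with_nan, nan_policy='omit')
print(p_propagate, p_omit)
```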

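4. scipy.stats.anderson (Anderson-Darling test): a modification of the Kolmogorov-Smirnov test (scipy.stats.kstest) that tests whether a sample comes from a particular distribution (normal, exponential, logistic, or Gumbel).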

Its function definition is:

def anderson(x, dist='norm'):
  """
  Anderson-Darling test for data coming from a particular distribution

  The Anderson-Darling tests the null hypothesis that a sample is
  drawn from a population that follows a particular distribution.
  For the Anderson-Darling test, the critical values depend on
  which distribution is being tested against. This function works
  for normal, exponential, logistic, or Gumbel (Extreme Value
  Type I) distributions.

  Parameters
  ----------
  x : array_like
    array of sample data
  dist : {'norm','expon','logistic','gumbel','gumbel_l','gumbel_r',
    'extreme1'}, optional
    the type of distribution to test against. The default is 'norm'
    and 'extreme1', 'gumbel_l' and 'gumbel' are synonyms.

  Returns
  -------
  statistic : float
    The Anderson-Darling test statistic
  critical_values : list
    The critical values for this distribution
  significance_level : list
    The significance levels for the corresponding critical values
    in percents. The function returns critical values for a
    differing set of significance levels depending on the
    distribution that is being tested against.

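Its parameters:

x and dist are the sample data and the distribution to test against.

There are three return values: the test statistic, the critical values, and the significance levels. Each critical value corresponds to the significance level at the same position.

The available significance levels differ between distributions.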

Critical values provided are for the following significance levels:

  normal/exponential
    15%, 10%, 5%, 2.5%, 1%
  logistic
    25%, 10%, 5%, 2.5%, 1%, 0.5%
  Gumbel
    25%, 10%, 5%, 2.5%, 1%

Comparing the statistic with the critical values: if the returned statistic is larger than a critical value, then at the corresponding significance level the null hypothesis that the data come from the chosen distribution can be rejected.
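The comparison between the statistic and each critical value can be sketched as follows. The evenly spaced sample is an assumption chosen for illustration; the attribute names statistic, critical_values and significance_level are those of the result object returned by scipy.stats.anderson.

```python
import numpy as np
from scipy import stats

# Hypothetical non-normal sample: 100 evenly spaced values
x = np.linspace(0, 10, 100)

result = stats.anderson(x, dist='norm')

# Pair each critical value with its significance level (given in percent)
for crit, level in zip(result.critical_values, result.significance_level):
    decision = "reject H0" if result.statistic > crit else "fail to reject H0"
    print(f"{level:>4}%: critical value = {crit:.3f} -> {decision}")
```

Unlike the other tests in this article, this call is read by comparing against a table of critical values instead of reading off a single p-value.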

5. skewtest and kurtosistest: test whether the sample's skewness and kurtosis match those of a normal distribution, which has skewness 0 and kurtosis 3.

Skewness: the standardized third central moment of the sample.


Kurtosis: the standardized fourth central moment of the sample.
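The two definitions can be checked numerically. The sketch below uses a hypothetical sample and computes the moments by hand (biased estimators, dividing by n), then compares them with scipy.stats.skew and scipy.stats.kurtosis before running the two tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
sample = rng.normal(size=500)  # hypothetical sample

# Central moments of the sample (biased estimators, dividing by n)
centered = sample - sample.mean()
m2 = np.mean(centered**2)
m3 = np.mean(centered**3)
m4 = np.mean(centered**4)

skewness = m3 / m2**1.5  # standardized third central moment (0 for a normal distribution)
kurt = m4 / m2**2        # standardized fourth central moment (3 for a normal distribution)

# SciPy computes the same quantities; fisher=False keeps the "normal = 3" convention
assert np.isclose(skewness, stats.skew(sample))
assert np.isclose(kurt, stats.kurtosis(sample, fisher=False))

# Test each moment against the value expected under normality
z_skew, p_skew = stats.skewtest(sample)
z_kurt, p_kurt = stats.kurtosistest(sample)
print(f"skewness = {skewness:.3f} (p = {p_skew:.3f})")
print(f"kurtosis = {kurt:.3f} (p = {p_kurt:.3f})")
```

Note that scipy.stats.kurtosis defaults to fisher=True, which subtracts 3 so that a normal distribution scores 0.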


6. The code is as follows:

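import numpy as np
from scipy import stats

# a: 50 draws from a normal distribution with mean 0 and standard deviation 2
a = np.random.normal(0, 2, 50)
# b: 100 evenly spaced values, which are not normally distributed
b = np.linspace(0, 10, 100)

# Shapiro-Wilk test
S, p = stats.shapiro(a)
print('the shapiro test result is:', S, ',', p)

# Kolmogorov-Smirnov test; args gives the mean and standard deviation of the
# reference normal distribution (without it, 'norm' means the standard normal)
K, p = stats.kstest(a, 'norm', args=(0, 2))
print(K, p)

# normaltest (D'Agostino-Pearson)
N, p = stats.normaltest(b)
print(N, p)

# Anderson-Darling test: returns the statistic, the critical values and the
# matching significance levels (in percent), not a p-value
A, C, sig = stats.anderson(b, dist='norm')
print(A, C, sig)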

That is all for this article. I hope it helps with your studies.
