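I'm trying to come up with some code that lets me search through my ArrayList and detect any values that fall outside the common range of "good values".
Example: 100 105 102 13 104 22 101
How should I write the code to detect that (in this case) 13 and 22 don't fall within the "good values" range of around 100?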
package test;

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        List<Double> data = new ArrayList<Double>();
        data.add((double) 20);
        data.add((double) 65);
        data.add((double) 72);
        data.add((double) 75);
        data.add((double) 77);
        data.add((double) 78);
        data.add((double) 80);
        data.add((double) 81);
        data.add((double) 82);
        data.add((double) 83);
        // getOutliers() relies on the list being sorted in ascending order
        Collections.sort(data);
        System.out.println(getOutliers(data));
    }

    public static List<Double> getOutliers(List<Double> input) {
        List<Double> output = new ArrayList<Double>();
        List<Double> data1 = new ArrayList<Double>();
        List<Double> data2 = new ArrayList<Double>();
        // split the sorted list into a lower and an upper half;
        // for an odd size the middle element belongs to neither half
        if (input.size() % 2 == 0) {
            data1 = input.subList(0, input.size() / 2);
            data2 = input.subList(input.size() / 2, input.size());
        } else {
            data1 = input.subList(0, input.size() / 2);
            data2 = input.subList(input.size() / 2 + 1, input.size());
        }
        double q1 = getMedian(data1);
        double q3 = getMedian(data2);
        double iqr = q3 - q1;
        double lowerFence = q1 - 1.5 * iqr;
        double upperFence = q3 + 1.5 * iqr;
        for (int i = 0; i < input.size(); i++) {
            if (input.get(i) < lowerFence || input.get(i) > upperFence)
                output.add(input.get(i));
        }
        return output;
    }

    // median of an already sorted list
    private static double getMedian(List<Double> data) {
        if (data.size() % 2 == 0)
            return (data.get(data.size() / 2) + data.get(data.size() / 2 - 1)) / 2;
        else
            return data.get(data.size() / 2);
    }
}
Output: [20.0]
Explanation:
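Reading the numbers off the code above, the rule it applies is what is usually called Tukey's fences. As a worked check for the sample data (the lower half is 20, 65, 72, 75, 77 and the upper half is 78, 80, 81, 82, 83, so the medians are 72 and 81):

```latex
\begin{aligned}
\mathrm{IQR} &= Q_3 - Q_1 = 81 - 72 = 9 \\
\text{lowerFence} &= Q_1 - 1.5\,\mathrm{IQR} = 72 - 13.5 = 58.5 \\
\text{upperFence} &= Q_3 + 1.5\,\mathrm{IQR} = 81 + 13.5 = 94.5
\end{aligned}
```

Only 20 lies outside the interval [58.5, 94.5], which matches the printed result.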
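There are several criteria for detecting outliers. The simplest ones, such as Chauvenet's criterion, use the mean and standard deviation calculated from the sample to determine a "normal" range for values. Any value outside this range is deemed an outlier.
Other criteria include Grubbs' test and Dixon's Q test, which may give better results than Chauvenet's criterion, for example if the sample comes from a skewed distribution.
An implementation of Grubbs' test can be found in MathUtil.java. It finds a single outlier, which you can remove from the list, repeating the process until all outliers have been removed.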
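It depends on commons-math, so if you use Gradle:
dependencies {
    // older builds used the `compile` configuration here; it was removed in Gradle 7
    implementation 'org.apache.commons:commons-math:2.2'
}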
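The remove-one-and-repeat loop described above can be sketched without any dependency, under one loud assumption: the critical value is supplied by the caller. In a real Grubbs' test it is derived from Student's t distribution for the current sample size; here `criticalValue` is a hypothetical lookup, and the constant 1.9 in `main` is only a placeholder, not a tabulated value. The class and method names are also made up for this illustration and are not taken from MathUtil.java.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntToDoubleFunction;

public class GrubbsStyleSketch {

    /**
     * Repeatedly removes the single most extreme value while its statistic
     * G = |x - mean| / stdDev exceeds criticalValue(n).
     * criticalValue is a hypothetical, caller-supplied lookup by sample size.
     */
    public static List<Double> removeOutliers(List<Double> input, IntToDoubleFunction criticalValue) {
        List<Double> data = new ArrayList<>(input);
        while (data.size() > 2) {
            double mean = data.stream().mapToDouble(Double::doubleValue).average().orElse(0.0);
            double sumSq = data.stream().mapToDouble(v -> (v - mean) * (v - mean)).sum();
            double stdDev = Math.sqrt(sumSq / (data.size() - 1)); // sample standard deviation
            if (stdDev == 0.0) {
                break; // all values identical, nothing to test
            }
            // locate the value farthest from the mean
            int worst = 0;
            for (int i = 1; i < data.size(); i++) {
                if (Math.abs(data.get(i) - mean) > Math.abs(data.get(worst) - mean)) {
                    worst = i;
                }
            }
            double g = Math.abs(data.get(worst) - mean) / stdDev;
            if (g <= criticalValue.applyAsDouble(data.size())) {
                break; // the most extreme value is not significant, stop
            }
            data.remove(worst); // remove by index, then re-test the reduced sample
        }
        return data;
    }

    public static void main(String[] args) {
        List<Double> sample = List.of(10.0, 11.0, 12.0, 11.0, 10.0, 50.0);
        // placeholder threshold for illustration only
        System.out.println(removeOutliers(sample, n -> 1.9));
    }
}
```

Recomputing the mean and standard deviation after every removal matters, because a large outlier inflates both and can hide a second one.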
Use this algorithm (shown here in C#). It uses the mean and the standard deviation. The factor 2 in 2 * standardDeviation is an adjustable threshold.
public static List<int> StatisticalOutLierAnalysis(List<int> allNumbers)
{
    if (allNumbers.Count == 0)
        return null;
    List<int> normalNumbers = new List<int>();
    List<int> outLierNumbers = new List<int>();
    double avg = allNumbers.Average();
    double standardDeviation = Math.Sqrt(allNumbers.Average(v => Math.Pow(v - avg, 2)));
    foreach (int number in allNumbers)
    {
        if ((Math.Abs(number - avg)) > (2 * standardDeviation))
            outLierNumbers.Add(number);
        else
            normalNumbers.Add(number);
    }
    return normalNumbers;
}
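As Joni already pointed out, you can eliminate outliers with the help of the standard deviation and the mean. Here is my code, which you can use for your purposes.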
import java.util.ArrayList;
import java.util.List;

public class OutlierElimination {
    public static void main(String[] args) {
        List<Integer> values = new ArrayList<>();
        values.add(100);
        values.add(105);
        values.add(102);
        values.add(13);
        values.add(104);
        values.add(22);
        values.add(101);
        System.out.println("Before: " + values);
        System.out.println("After: " + eliminateOutliers(values, 1.5f));
    }

    protected static double getMean(List<Integer> values) {
        // accumulate in a double so the division is not truncated to an int
        double sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum / values.size();
    }

    public static double getVariance(List<Integer> values) {
        double mean = getMean(values);
        // accumulate in a double; an int accumulator silently drops the fractions
        double temp = 0;
        for (int a : values) {
            temp += (a - mean) * (a - mean);
        }
        // sample variance (n - 1 in the denominator)
        return temp / (values.size() - 1);
    }

    public static double getStdDev(List<Integer> values) {
        return Math.sqrt(getVariance(values));
    }

    public static List<Integer> eliminateOutliers(List<Integer> values, float scaleOfElimination) {
        double mean = getMean(values);
        double stdDev = getStdDev(values);
        final List<Integer> newList = new ArrayList<>();
        for (int value : values) {
            boolean isLessThanLowerBound = value < mean - stdDev * scaleOfElimination;
            boolean isGreaterThanUpperBound = value > mean + stdDev * scaleOfElimination;
            boolean isOutOfBounds = isLessThanLowerBound || isGreaterThanUpperBound;
            if (!isOutOfBounds) {
                newList.add(value);
            }
        }
        int countOfOutliers = values.size() - newList.size();
        if (countOfOutliers == 0) {
            return values;
        }
        // something was removed: recompute mean and deviation on the reduced list
        return eliminateOutliers(newList, scaleOfElimination);
    }
}
Output of the code:
Before: [100, 105, 102, 13, 104, 22, 101]
After: [100, 105, 102, 104, 101]
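It may help to see why the elimination has to be repeated. With a scale of 1.5, the first pass over the sample gives a mean of about 78.1 and a sample standard deviation of about 41.5, so the bounds are roughly 15.8 to 140.5 and only 13 is removed. Once 13 is gone, the bounds tighten enough for 22 to fall outside them. The sketch below records what each pass removes; the class and method names are hypothetical and it is an iterative restatement of the idea, not the answer's own code.

```java
import java.util.ArrayList;
import java.util.List;

public class PassTraceSketch {

    /** Returns the values removed in each pass; stops at the first pass that removes nothing. */
    public static List<List<Integer>> removedPerPass(List<Integer> values, double scale) {
        List<List<Integer>> trace = new ArrayList<>();
        List<Integer> current = new ArrayList<>(values);
        while (current.size() > 1) {
            double mean = current.stream().mapToInt(Integer::intValue).average().orElse(0.0);
            double sumSq = current.stream().mapToDouble(v -> (v - mean) * (v - mean)).sum();
            double stdDev = Math.sqrt(sumSq / (current.size() - 1)); // sample standard deviation
            double lower = mean - scale * stdDev;
            double upper = mean + scale * stdDev;
            List<Integer> removed = new ArrayList<>();
            List<Integer> kept = new ArrayList<>();
            for (int v : current) {
                if (v < lower || v > upper) {
                    removed.add(v);
                } else {
                    kept.add(v);
                }
            }
            if (removed.isEmpty()) {
                break; // bounds are stable, nothing more to eliminate
            }
            trace.add(removed);
            current = kept; // next pass recomputes the statistics on the reduced list
        }
        return trace;
    }

    public static void main(String[] args) {
        List<Integer> values = List.of(100, 105, 102, 13, 104, 22, 101);
        System.out.println(removedPerPass(values, 1.5)); // prints [[13], [22]]
    }
}
```

A single pass would have left 22 in the list, because the first outlier inflates the standard deviation.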
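Map n numbers, ensuring the distance is not unfair.
eliminateOutliers() actually returns the outliers, rather than the list with them removed. The isOutOfBounds() method is also confusing, because it actually returns TRUE when the value is within bounds. Below is my update with some (in my opinion) improvements: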
Code:
/**
* Implements an outlier removal algorithm based on https://www.itl.nist.gov/div898/software/dataplot/refman1/auxillar/dixon.htm#:~:text=It%20can%20be%20used%20to,but%20one%20or%20two%20observations).
* Original Java code by Emil Wozniak at https://dev59.com/dXbZa4cB1Zd3GeqPKtgf
*
* Reorganized, made more robust, and clarified many of the methods.
*/
import java.util.List;
import java.util.stream.Collectors;
public class DixonTest {
protected List<Double> criticalValues =
List.of( // Taken from https://sebastianraschka.com/Articles/2014_dixon_test.html#2-calculate-q
// Alpha level of 0.1 (90% confidence)
0.941, // N=3
0.765, // N=4
0.642, // ...
0.56,
0.507,
0.468,
0.437,
0.412,
0.392,
0.376,
0.361,
0.349,
0.338,
0.329,
0.32,
0.313,
0.306,
0.3,
0.295,
0.29,
0.285,
0.281,
0.277,
0.273,
0.269,
0.266,
0.263,
0.26 // N=30
);
// Stats calculated on original input data (including outliers)
private double scaleOfElimination;
private double mean;
private double stdDev;
private double UB;
private double LB;
private List<Double> input;
/**
* Ctor taking a list of values to be analyzed.
* @param input
*/
public DixonTest(List<Double> input) {
this.input = input;
// Create statistics on the original input data
calcStats();
}
/**
* Utility method returns the mean of a list of values.
* @param valueList
* @return
*/
public static double getMean(final List<Double> valueList) {
double sum = valueList.stream()
.mapToDouble(value -> value)
.sum();
return (sum / valueList.size());
}
/**
* Utility method returns the variance of a list of values.
* @param valueList
* @return
*/
public static double getVariance(List<Double> valueList) {
double listMean = getMean(valueList);
double temp = valueList.stream()
.mapToDouble(a -> a)
.map(a -> (a - listMean) * (a - listMean))
.sum();
return temp / (valueList.size() - 1);
}
/**
* Utility method returns the std deviation of a list of values.
* @param valueList
* @return
*/
public static double getStdDev(List<Double> valueList) {
return Math.sqrt(getVariance(valueList));
}
/**
* Calculate statistics and bounds from the input values and store
* them in class variables.
* @param input
*/
private void calcStats() {
int N = Math.min(Math.max(0, input.size() - 3), criticalValues.size()-1); // Changed to protect against too-small or too-large lists
scaleOfElimination = criticalValues.get(N); // unboxed directly; floatValue() would lose precision
mean = getMean(input);
stdDev = getStdDev(input);
UB = mean + stdDev * scaleOfElimination;
LB = mean - stdDev * scaleOfElimination;
}
/**
* Returns the input values with outliers removed.
* @param input
* @return
*/
public List<Double> eliminateOutliers() {
return input.stream()
.filter(value -> value>=LB && value <=UB)
.collect(Collectors.toList());
}
/**
* Returns the outliers found in the input list.
* @param input
* @return
*/
public List<Double> getOutliers() {
return input.stream()
.filter(value -> value<LB || value>UB)
.collect(Collectors.toList());
}
/**
* Test and sample usage
* @param args
*/
public static void main(String[] args) {
List<Double> testValues = List.of(1200.0,1205.0,1220.0,1194.0,1212.0);
DixonTest outlierDetector = new DixonTest(testValues);
List<Double> goodValues = outlierDetector.eliminateOutliers();
List<Double> badValues = outlierDetector.getOutliers();
System.out.println(goodValues.size()+ " good values:");
for (double v: goodValues) {
System.out.println(v);
}
System.out.println(badValues.size()+" outliers detected:");
for (double v: badValues) {
System.out.println(v);
}
// Get stats on remaining (good) values
System.out.println("\nMean of good values is "+DixonTest.getMean(goodValues));
}
}
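For comparison, the class above uses the tabulated critical value as a multiplier on the standard deviation. The textbook form of Dixon's Q statistic is different: it divides the gap between the suspect extreme value and its nearest neighbour by the range of the sample, and compares that ratio with the critical value. A minimal sketch of that form follows, reusing the first eight entries (N = 3 to 10) of the 90% table listed above; the class name, method name and the sample data in `main` are assumptions made for this illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class DixonQSketch {

    // Critical values for N = 3..10 at the 90% level, copied from the table above
    private static final double[] Q90 = {0.941, 0.765, 0.642, 0.56, 0.507, 0.468, 0.437, 0.412};

    /** Returns the suspect extreme value if its Q exceeds the critical value, otherwise null. */
    public static Double findOutlier(List<Double> input) {
        int n = input.size();
        if (n < 3 || n > 10) {
            throw new IllegalArgumentException("this sketch only covers 3 to 10 values");
        }
        List<Double> sorted = new ArrayList<>(input);
        Collections.sort(sorted);
        double range = sorted.get(n - 1) - sorted.get(0);
        if (range == 0.0) {
            return null; // all values identical
        }
        // Q = gap to the nearest neighbour / range, for each end of the sorted sample
        double qLow = (sorted.get(1) - sorted.get(0)) / range;
        double qHigh = (sorted.get(n - 1) - sorted.get(n - 2)) / range;
        if (Math.max(qLow, qHigh) <= Q90[n - 3]) {
            return null; // neither end is significant at this level
        }
        return qLow >= qHigh ? sorted.get(0) : sorted.get(n - 1);
    }

    public static void main(String[] args) {
        // illustrative sample: 0.167 sits well below the rest
        List<Double> sample = List.of(0.189, 0.167, 0.187, 0.183, 0.186,
                                      0.182, 0.181, 0.184, 0.181, 0.177);
        System.out.println(findOutlier(sample));
    }
}
```

The test flags at most one value per call, and two outliers on the same side (such as 13 and 22 in the question) can mask each other, since the gap between them is small compared with the range.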
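Many thanks to Valiyev; his solution helped me a lot. I want to share my little SRP take on his work.
Please note that I use List.of() to store Dixon's critical values, so a Java version higher than 8 is required.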
import java.util.List;
import java.util.stream.Collectors;

public class DixonTest {
    // critical values for N = 3..9
    protected List<Double> criticalValues =
            List.of(0.941, 0.765, 0.642, 0.56, 0.507, 0.468, 0.437);
    private double scaleOfElimination;
    private double mean;
    private double stdDev;

    private double getMean(final List<Double> input) {
        double sum = input.stream()
                .mapToDouble(value -> value)
                .sum();
        return (sum / input.size());
    }

    private double getVariance(List<Double> input) {
        double mean = getMean(input);
        double temp = input.stream()
                .mapToDouble(a -> a)
                .map(a -> (a - mean) * (a - mean))
                .sum();
        return temp / (input.size() - 1);
    }

    private double getStdDev(List<Double> input) {
        return Math.sqrt(getVariance(input));
    }

    protected List<Double> eliminateOutliers(List<Double> input) {
        // the index is only valid for lists of 3 to 9 values
        int N = input.size() - 3;
        scaleOfElimination = criticalValues.get(N).floatValue();
        mean = getMean(input);
        stdDev = getStdDev(input);
        return input.stream()
                .filter(this::isOutOfBounds)
                .collect(Collectors.toList());
    }

    // despite its name, this returns true when the value lies within the bounds
    private boolean isOutOfBounds(Double value) {
        return !(isLessThanLowerBound(value)
                || isGreaterThanUpperBound(value));
    }

    private boolean isGreaterThanUpperBound(Double value) {
        return value > mean + stdDev * scaleOfElimination;
    }

    private boolean isLessThanLowerBound(Double value) {
        return value < mean - stdDev * scaleOfElimination;
    }
}
I hope it helps someone else.
Best regards
This is just a very simple implementation that collects the numbers that are not in range:
List<Integer> notInRangeNumbers = new ArrayList<Integer>();
for (Integer number : numbers) {
    // call with a predefined factor value, here example value = 5
    if (!isInRange(number, 5)) {
        notInRangeNumbers.add(number);
    }
}
Inside the isInRange method you have to define what you mean by "good values". Below is an example implementation.
private boolean isInRange(Integer number, int aroundFactor) {
    // TODO: implement your own 'in range' condition
    // example implementation: within aroundFactor of 100
    return number <= 100 + aroundFactor && number >= 100 - aroundFactor;
}
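If the centre of the "good values" is not known in advance, one possible variation is to take it from the data instead of hard-coding 100. The sketch below uses the median as the centre, which is an assumption of this example rather than part of the answer above; the median is chosen because a few extreme values barely move it. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MedianRangeSketch {

    /** Median of the list; the input itself is left unsorted. */
    static double median(List<Integer> numbers) {
        List<Integer> sorted = new ArrayList<>(numbers);
        Collections.sort(sorted);
        int mid = sorted.size() / 2;
        return sorted.size() % 2 == 0
                ? (sorted.get(mid - 1) + sorted.get(mid)) / 2.0
                : sorted.get(mid);
    }

    /** Collects the numbers farther than aroundFactor from the median. */
    static List<Integer> notInRange(List<Integer> numbers, int aroundFactor) {
        List<Integer> result = new ArrayList<>();
        if (numbers.isEmpty()) {
            return result; // nothing to check
        }
        double center = median(numbers);
        for (Integer number : numbers) {
            if (number > center + aroundFactor || number < center - aroundFactor) {
                result.add(number);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> numbers = List.of(100, 105, 102, 13, 104, 22, 101);
        System.out.println(notInRange(numbers, 5)); // prints [13, 22]
    }
}
```

For the question's sample the median is 101, so with a factor of 5 the accepted range is 96 to 106.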
The if statement is easy to implement. - user1231232141214124