Advanced Analytics with Spark: Patterns for Learning from Data at Scale

Sandy Ryza, Uri Laserson, Sean Owen, Josh Wills

Language: English

Pages: 276

ISBN: 1491912766

Format: PDF / Kindle (mobi) / ePub

In this practical book, four Cloudera data scientists present a set of self-contained patterns for performing large-scale data analysis with Spark. The authors bring Spark, statistical methods, and real-world data sets together to teach you how to approach analytics problems by example.

You’ll start with an introduction to Spark and its ecosystem, and then dive into patterns that apply common techniques—classification, collaborative filtering, and anomaly detection among others—to fields such as genomics, security, and finance. If you have an entry-level understanding of machine learning and statistics, and you program in Java, Python, or Scala, you’ll find these patterns useful for working on your own data applications.

Patterns include:

Recommending music and the Audioscrobbler data set
Predicting forest cover with decision trees
Anomaly detection in network traffic with K-means clustering
Understanding Wikipedia with Latent Semantic Analysis
Analyzing co-occurrence networks with GraphX
Geospatial and temporal data analysis on the New York City Taxi Trips data
Estimating financial risk through Monte Carlo simulation
Analyzing genomics data and the BDG project
Analyzing neuroimaging data with PySpark and Thunder

learning algorithms. A system needs to support more flexible transformations than turning a 2D array of doubles into a mathematical model. Second, iteration is a fundamental part of data science. Modeling and analysis typically require multiple passes over the same data. One aspect of this lies within machine learning algorithms and statistical procedures. Popular optimization procedures like stochastic gradient descent and expectation maximization involve repeated scans over their inputs to

represents a document. Loosely, the value at each position should correspond to the importance of the row’s term to the column’s document. A few weighting schemes have been proposed, but by far the most common is term frequency times inverse document frequency, commonly abbreviated as TF-IDF:

    def termDocWeight(termFrequencyInDoc: Int, totalTermsInDoc: Int,
        termFreqInCorpus: Int, totalDocs: Int): Double = {
      val tf = termFrequencyInDoc.toDouble / totalTermsInDoc
      val docFreq = totalDocs.toDouble /
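The TF-IDF weighting described in this excerpt can be sketched in a few lines of Python. The excerpt's own function is cut off, so this is not a transcription of it: it assumes the common definition in which the inverse document frequency is the natural log of the total document count divided by the number of documents containing the term. The function and parameter names here are illustrative only.

```python
import math


def term_doc_weight(term_frequency_in_doc, total_terms_in_doc,
                    docs_containing_term, total_docs):
    """TF-IDF weight of one term in one document (assumed standard form)."""
    # Term frequency: share of the document's terms that are this term.
    tf = term_frequency_in_doc / total_terms_in_doc
    # Inverse document frequency (assumption: natural log of the ratio).
    idf = math.log(total_docs / docs_containing_term)
    return tf * idf


# A term that appears 3 times in a 100-term document and occurs in
# 10 of 1000 documents in the corpus.
weight = term_doc_weight(3, 100, 10, 1000)
print(round(weight, 4))
```

Under this definition a term that occurs in every document gets a weight of zero, which is the intended effect: words that appear everywhere say little about any one document.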

    print seriesRDD.rdd.takeSample(False, 1, 0)[0]
    ...
    ((30, 84, 1), array([35, 35, 35, 35, 35, 35, 35, 35, 34, 34, 34, 35, 35, 35, 35, 35, 35, 35, 35, 35], dtype=uint8))

The Series API offers many methods for performing computations across the time series, either at the per-series level or across all series. For example:

    print seriesRDD.max()
    ...
    array([158, 152, 145, 143, 142, 141, 140, 140, 139, 139, 140, 140, 142, 144, 153, 168, 179, 185, 185, 182], dtype=uint8)

computes the maximum value across
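The result printed above has one value per time point, which suggests a maximum taken position by position across all series rather than one maximum per series. The difference can be illustrated with plain Python lists, under that assumption. This sketch does not use the Thunder API; `elementwise_max`, `per_series_max`, and the toy data are hypothetical names and values made up for illustration.

```python
def elementwise_max(series_list):
    """For each time point, return the largest value across all series."""
    if not series_list:
        raise ValueError("need at least one series")
    length = len(series_list[0])
    if any(len(s) != length for s in series_list):
        raise ValueError("all series must have the same length")
    # zip(*...) regroups the values by time point instead of by series.
    return [max(values) for values in zip(*series_list)]


def per_series_max(series_list):
    """Return one maximum per series (the per-series level)."""
    return [max(s) for s in series_list]


# Three toy series of four time points each (made-up values).
toy_series = [
    [35, 35, 34, 36],
    [10, 50, 20, 30],
    [40, 12, 33, 31],
]

across_all = elementwise_max(toy_series)  # [40, 50, 34, 36]
per_series = per_series_max(toy_series)   # [36, 50, 40]
print(across_all)
print(per_series)
```

The first result has the length of a single series, while the second has one entry per series.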

RDD and filter only collection elements that meet a certain criterion, like minimum standard deviation by default. To choose a good value for the threshold, let’s first compute the stddev of each series and plot a histogram of a 10% sample of the values (see Figure 11-6):

    stddevs = (normalizedRDD
        .seriesStdev()
        .values()
        .sample(False, 0.1, 0)
        .collect())
    plt.hist(stddevs, bins=20)

Figure 11-6. Distribution of the standard deviations of the voxels

With this in mind, we’ll choose a threshold of
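The workflow in this excerpt (compute a standard deviation per series, look at the distribution of a sample, then keep only series above a threshold) can be sketched locally with the Python standard library. This is a single-machine illustration, not the Thunder or Spark code: all function names are hypothetical, the population standard deviation is an assumption, and the threshold of 2.0 is a made-up value because the excerpt is cut off before stating the one it chose.

```python
import random
import statistics


def series_stddevs(series_list):
    """Standard deviation of each series (assumption: population form)."""
    return [statistics.pstdev(s) for s in series_list]


def sample_fraction(values, fraction, seed):
    """Keep each value independently with the given probability."""
    rng = random.Random(seed)
    return [v for v in values if rng.random() < fraction]


def histogram_counts(values, bins):
    """Count values in equal-width bins between min(values) and max(values)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [len(values)] + [0] * (bins - 1)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The largest value belongs in the last bin, not past it.
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def filter_by_min_stddev(series_list, threshold):
    """Keep only series whose standard deviation reaches the threshold."""
    return [s for s in series_list if statistics.pstdev(s) >= threshold]


# Made-up series: one flat, one strongly varying, one mildly varying.
toy_series = [[5, 5, 5, 5], [0, 10, 0, 10], [4, 6, 4, 6]]
stddevs = series_stddevs(toy_series)
counts = histogram_counts(stddevs, bins=5)
kept = filter_by_min_stddev(toy_series, threshold=2.0)  # hypothetical threshold
print(stddevs, counts, kept)
```

On real data the histogram would be built from `sample_fraction(stddevs, 0.1, seed)` rather than from every value, mirroring the 10% sample in the excerpt; the toy data here is too small for sampling to be useful.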
