ScriptPython_1_PySpark_SergioMaestu

### PySpark Documentation
https://spark.apache.org/docs/latest/api/python/index.html
https://docs.databricks.com/applications/mlflow/quick-start-python.html
https://spark.apache.org/docs/latest/sql-programming-guide.html
https://towardsdatascience.com/machine-learning-with-pyspark-and-mllib-solving-a-binary-classification-problem-96396065d2aa
# PySpark is an interface for Apache Spark in Python. It not only allows you to write Spark applications 
# using Python APIs, but also provides the PySpark shell for interactively analyzing your data in a 
# distributed environment. PySpark supports most of Spark’s features such as Spark SQL, DataFrame, Streaming, 
# MLlib (Machine Learning) and Spark Core.
# PySpark Components
# Spark SQL and DataFrame
# Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called
# DataFrame and can also act as a distributed SQL query engine.
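# A minimal sketch of the DataFrame and Spark SQL APIs (the column names and sample
# rows below are purely illustrative, assuming a local SparkSession):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql_example").getOrCreate()

# Build a small DataFrame and query it with SQL through a temporary view.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Carol", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")
spark.sql("SELECT name, age FROM people WHERE age > 30").show()
spark.stop()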
# Streaming
# Running on top of Spark, the streaming feature in Apache Spark enables powerful interactive and analytical
# applications across both streaming and historical data, while inheriting Spark’s ease of use and fault
# tolerance characteristics.
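# A minimal Structured Streaming sketch using the built-in "rate" source (the
# rows-per-second setting and the short timeout are illustrative choices, not
# part of the original script):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming_example").getOrCreate()

# The "rate" source generates rows of (timestamp, value) for testing.
stream_df = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Print each micro-batch to the console, then stop after about 10 seconds.
query = stream_df.writeStream.format("console").outputMode("append").start()
query.awaitTermination(10)
query.stop()
spark.stop()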
# MLlib
# Built on top of Spark, MLlib is a scalable machine learning library that provides a uniform set of 
# high-level APIs that help users create and tune practical machine learning pipelines.
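# A minimal MLlib pipeline sketch for binary classification (the toy data, column
# names, and choice of LogisticRegression are assumptions made for illustration):
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("mllib_example").getOrCreate()

# Toy training data: two numeric features and a 0.0/1.0 label.
train_df = spark.createDataFrame(
    [(0.0, 1.1, 0.0), (2.0, 1.0, 1.0), (2.1, 3.2, 1.0), (0.3, 0.5, 0.0)],
    ["f1", "f2", "label"],
)

# Assemble the feature columns into a vector, then fit a logistic regression.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train_df)
model.transform(train_df).select("features", "label", "prediction").show()
spark.stop()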
# Spark Core
# Spark Core is the underlying general execution engine for the Spark platform that all other functionality 
# is built on top of. It provides an RDD (Resilient Distributed Dataset) and in-memory computing capabilities.
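# A minimal RDD sketch showing Spark Core's low-level API (the numbers are just
# an example; real workloads would typically read data from storage):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd_example").getOrCreate()
sc = spark.sparkContext

# Parallelize a local collection into an RDD, transform it, and reduce it.
rdd = sc.parallelize(range(1, 11))
sum_of_squares = rdd.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(sum_of_squares)  # 385
spark.stop()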
## Installation Using Conda
# Conda is an open-source package management and environment management system which is a part of the 
# Anaconda distribution. It is both cross-platform and language agnostic. In practice, Conda can replace 
# both pip and virtualenv.
# Create new virtual environment from your terminal as shown below:
## conda create -n pyspark_env
# After the virtual environment is created, it should be visible under the list of Conda environments 
# which can be seen using the following command:
## conda env list
# Now activate the newly created environment with the following command:
## conda activate pyspark_env
# You can install PySpark from PyPI in the newly created environment, for example as shown below.
# This installs PySpark inside the pyspark_env virtual environment created above.
## pip install pyspark
# Alternatively, you can install PySpark from Conda itself as below:
## conda install pyspark
# However, note that the PySpark package on Conda is not necessarily synced with the PySpark release
# cycle, because it is maintained separately by the community.
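# A quick sanity check of the installation (assuming the pyspark_env environment
# created above is active; the app name below is arbitrary):
import pyspark
from pyspark.sql import SparkSession

print(pyspark.__version__)
spark = SparkSession.builder.master("local[*]").appName("install_check").getOrCreate()
print(spark.range(5).count())  # prints 5
spark.stop()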
