
A Top Down Approach to Machine Learning

by

Marty Jacobs

Contents
Introduction
Chapter 1: Foundational Learning
Chapter 2: Supervised Learning
Chapter 3: Unsupervised Learning
Chapter 4: Reinforcement Learning
Chapter 5: Intermission
Chapter 6: Machine Learning with TensorFlow
Introduction

Machine Learning (ML) has some hefty gravitational force in the software development world at the moment. But what exactly is it? In this book I'll take a top-down approach, attempting to make it crystal clear what it is and what it can be used for in the real world. Machine Learning is a branch of Artificial Intelligence. Fundamentally, it is software that works a little like our brain: it learns from information (data), then applies what it has learnt to make smart decisions. Machine Learning algorithms can improve software (a robot) and its ability to solve problems through gaining experience, somewhat like human memory. Whether you know it or not, you're probably already using applications that leverage Machine Learning algorithms.
Applications might be monitoring your behaviour to give you more personalised content. A simple example: Google uses Machine Learning in their Search product to predict what you might want to search for next. Remember, too, that the suggestions it responds with are sometimes entirely inaccurate or unhelpful. This is the nature of using a probabilistic approach. Sometimes you hit, and sometimes you miss.
More and more people are becoming interested in Machine Learning. Companies are adopting it to gain a better understanding of their clients, which results in better customer service. It is being used in gambling and stock market applications to predict rises and falls of stock prices. For software developers in particular, the demand for skills in the AI and Machine Learning realm has become more prominent, and it doesn't look like this trend is slowing down anytime soon. Here is a snapshot of the world's growing interest in ML over the last 5 years…
	
Chapter 1: Foundational Learning
	
	
The term Agent is commonly used in AI to describe a type of computer program. What makes it different from other computer programs? It is a program that gathers information on a particular environment, then autonomously takes action(s) using the gathered information. This could be a web crawler, a stock-trading platform, or any other program that can make informed decisions.
How do we define an Agent?
State space
The set of all possible states that the agent can be in. Example: the light switch can only ever be "on" or "off".
Action space
The set of all possible actions that the agent can perform. Example: the light switch can only ever be "flicked up" or "flicked down".
Percept space
The set of all possible things the agent can perceive in the world. Example: Fog-of-War in a gaming context, where you can only see what is visible on the map.
World dynamics
The change from one state to another, given a particular action. Example: performing the light switch action "flicked up" in the state "off" will result in a change of state to "on".
Percept function
Maps a state to a perception of the world. Example: in a gaming context, moving into the enemy base will show you enemy resources.
Utility function
The Utility function is used to assign a value to a state. This can be used to ensure your agent performs an action to land in the best possible state.
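To make this concrete, here is a minimal Python sketch of how these six pieces can fit together for the light switch example. The class and method names are hypothetical, chosen purely for illustration.

# A minimal agent sketch for the light switch example (illustrative, not a standard API).
class LightSwitchAgent:
    states = ["on", "off"]                    # state space
    actions = ["flicked up", "flicked down"]  # action space

    def world_dynamics(self, state, action):
        """Return the next state, given a state and an action."""
        return "on" if action == "flicked up" else "off"

    def percept(self, state):
        """Percept function: what the agent observes in a given state."""
        return "light is " + state  # the percept space is the set of such observations

    def utility(self, state):
        """Assign a value to a state; here we prefer the light on."""
        return 1 if state == "on" else 0

    def best_action(self, state):
        """Pick the action whose resulting state has the highest utility."""
        return max(self.actions, key=lambda a: self.utility(self.world_dynamics(state, a)))

agent = LightSwitchAgent()
print(agent.best_action("off"))  # "flicked up", because the "on" state has higher utility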
This design can be used as a 'structure' to work from when building an AI agent, as sketched above. But you might ask, how does it even relate to Machine Learning? Machine Learning algorithms can enhance the agent so that it learns better and performs smarter actions. This is achieved by providing the algorithm with data to learn from, so it can make smart estimates/predictions. You might be thinking… but wait, can't we just feed it all of the data on the internet to teach it everything? It doesn't exactly work like that. For an ML algorithm to learn properly, we need to provide it with the right combination of data, in the right amount. With the wrong kind of data (or a model too complex for it), we might run into an overfitting problem, where the model memorises its training data instead of generalising. Too little data, and we might have a shitty model that doesn't provide decent predictions.
I also invite you to take a look at the Tree data structure if you aren't familiar with it already. Trees are used throughout many modern-day applications.
Ok, let's dive head first into the 3 major types of algorithms in the field of Machine Learning: Supervised learning, Unsupervised learning and Reinforcement learning.
http://machinelearningmastery.com/overfitting-and-underfitting-with-machine-learning-algorithms/
https://en.wikipedia.org/wiki/Tree_(data_structure)
Chapter 2: Supervised Learning
	
	
Supervised learning is the name for ML algorithms that learn from examples. This means we must provide the algorithm with training data prior to running the algorithm. An example of this was hilariously shown in the TV show "Silicon Valley", where a mobile app 'Not hot dog' made media headlines. In the show, Jian Yang had to provide training data for his ML algorithm to learn what hotdogs look like. The goal was to classify whether an image of a hotdog was in fact a hotdog, or it was… not a hotdog. How did he do this? In the show, he had to manually scrape the internet for many images of hotdogs (a.k.a. the training data).

This technique is called Boolean (binary) classification, as the result itself is a binary value. The algorithm can make a prediction that there is a hotdog in the picture; by analysing a large quantity of hotdog pictures, it has learnt to identify what looks like a hotdog. Statistical classification is used in Supervised learning, where the training data is a set of correctly labelled observations. For example, in the hotdog scenario, every training image labelled "hotdog" must actually contain a hotdog.
One of the simplest Supervised learning algorithms to implement is the Decision Tree. In a Decision Tree, the leaf nodes are the results, the non-leaf nodes are the attributes, and the edges of the tree are the values. The Decision Tree analyses the attributes and returns a result that has been filtered down through the tree.
https://en.wikipedia.org/wiki/Statistical_classification
Note: As you can see, the result in the Decision Tree above can be either "Yes" or "No".
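To make the idea concrete, here is a minimal sketch using scikit-learn (an assumed dependency; the rest of this book uses TensorFlow). The attributes, values and results are invented: each observation is [temperature in degrees, is it the weekend?] and the result is whether to go outside.

# A minimal Decision Tree sketch with scikit-learn (assumed available).
from sklearn.tree import DecisionTreeClassifier

X = [[30, 1], [25, 1], [10, 0], [5, 0], [22, 0], [8, 1]]  # attributes: [temperature, weekend]
y = [1, 1, 0, 0, 1, 0]                                    # results: 1 = "Yes", 0 = "No"

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[28, 1]]))  # a warm weekend day -> likely [1], i.e. "Yes"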
Another technique classified as Supervised learning is regression. The simplest form of regression is linear regression, where intuitively we just draw a straight line through some data to infer a trend. This could be used in a gambling scenario, analysing a history of chosen numbers. For example, performing regression on this history might show that numbers '5' and '7' are chosen more than numbers '3' and '2'. Below is an example of performing regression analysis on some data points…
	
http://www.investopedia.com/terms/r/regression.asp
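As a quick sketch of the idea, here is a least-squares straight-line fit using numpy; the data points are made up to show an upward trend.

# A minimal linear regression sketch: fit y = m*x + c to some toy data.
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])  # invented data points with an upward trend

m, c = np.polyfit(x, y, deg=1)  # least-squares fit of a straight line
print("slope: {:.2f}, intercept: {:.2f}".format(m, c))
print("prediction at x=6: {:.2f}".format(m * 6 + c))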
Chapter 3: Unsupervised Learning
	
	
Unsupervised learning is where the results of the training data are not known. Simply put, we can give the ML algorithm some training data and it can respond with what it has found. Sounds exciting! We might receive completely new insights into the data that we would never expect to observe. How is it done? Unsupervised learning commonly uses clustering techniques, which aim to find patterns in the data.
One common clustering technique is called k-means clustering. One common clustering problem is spam filtering. Spam emails can sometimes be tricky to identify, and might get through to your email inbox (instead of the junk folder). The k-means approach aims to partition N observations into K clusters. Essentially, it just moves the spam into the spam cluster, and the real emails into the inbox cluster. Here, N is the number of emails, and K is the number of clusters (two, in this case). Still interested? Here is a study showing that k-means can be a better approach than Support Vector Machines (SVM) in a spam filtering context.
https://en.wikipedia.org/wiki/Cluster_analysis
https://pdfs.semanticscholar.org/0239/e41c47a55fa4f4b1092e90af585bf6d4e134.pdf
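Here is a minimal sketch of the idea using scikit-learn's KMeans (an assumed dependency). The two features per email are invented for illustration: a count of links and a count of shouty words like "FREE".

# A minimal k-means sketch: partition N=6 emails into K=2 clusters.
import numpy as np
from sklearn.cluster import KMeans

emails = np.array([[12, 9], [10, 8], [11, 10],  # spammy-looking emails
                   [1, 0], [0, 1], [2, 0]])     # normal-looking emails

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(emails)
print(kmeans.labels_)  # e.g. [1 1 1 0 0 0] - one cluster for spam, one for the inbox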
	
Hybrid Supervised/Unsupervised learning
Some ML algorithms can be used for both Supervised and Unsupervised learning. After all, the only dramatic difference between the two is whether the end result is known. The commonly used methods that are relevant for both cases are Bayesian Networks, Neural Networks and… Decision Trees! Yes, that's right, a Supervised learning algorithm can also be used for Unsupervised learning. This is straight up magic. What we're actually trying to do is run a Supervised algorithm and find an Unsupervised result (a completely new / unexpected result). To do this, we must provide the algorithm with a second group of observations, so it can recognise the difference between the two observation groups. As a result, the Decision Tree can find new clusters by having additional observation groups.
Bayesian Networks utilise graphs, probability theory and statistics to model real-world situations and infer data insights. They are very compact and can be used to model a solution quite quickly. How do we create one? Well, we need…
1. An acyclic graph; and
2. Conditional Probability Tables (CPTs)
Shown below is an example graph, with the CPTs given according to the node placements on the graph.
Here we can make some simple inferences from the CPTs, such as: when it is raining and the sprinkler is turned on, there is a 99% chance that the grass is wet. Sure, it's a silly example, but the point is that we can apply this to more valuable use cases that yield greater results. These networks can get more complicated when extra parent nodes are added into the equation, and in some cases the probability values have to be estimated.
http://www.zeroequalsfalse.press/2017/03/11/graphs/
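To make the inference concrete, here is a sketch of the rain/sprinkler network as plain Python dictionaries. Only the 99% figure comes from the example above; the remaining probabilities are invented for illustration.

# A sketch of the rain/sprinkler network, with CPTs as plain dicts.
p_rain = 0.2                            # P(Rain = true), invented
p_sprinkler = {True: 0.01, False: 0.4}  # P(Sprinkler = true | Rain), invented
p_wet = {                               # P(GrassWet = true | Sprinkler, Rain)
    (True, True): 0.99,                 # the 99% case from the text
    (True, False): 0.9,
    (False, True): 0.8,
    (False, False): 0.0,
}

# Marginalise over Rain and Sprinkler to infer P(GrassWet = true).
total = 0.0
for rain in (True, False):
    pr = p_rain if rain else 1 - p_rain
    for sprinkler in (True, False):
        ps = p_sprinkler[rain] if sprinkler else 1 - p_sprinkler[rain]
        total += pr * ps * p_wet[(sprinkler, rain)]
print("P(grass wet) = {:.3f}".format(total))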
A Neural Network is a Machine Learning algorithm capable of simulating the human brain. A Neural network is made up of interconnected artificial neurons.
A neuron is basically just a function applied to a linear combination of inputs. Each input is multiplied by a weight, which is essentially a measurement of how strong that input is for determining the output. Written out, the linear combination is Y = w_1*X_1 + w_2*X_2 + … + w_n*X_n, where Y is the linear combination of inputs, and A = f(Y) is the result of the activation function. Hmm… can we input just any data into the neuron? Not really. Neural Networks only work with numerical data, which means you cannot initialize a variable like X_1 = "Apple". However, it is possible to get around this if we are trying to make predictions on natural language. This gets complicated fast, but the idea is that we encode the string so it can be fed into the Neural network. Here is an example of the "Bag-of-words" model, used in Natural Language Processing to encode an array of words as numbers.
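Here is a rough sketch of such an encoding: build a vocabulary, then represent each sentence by its word counts. The sentences are made up for illustration.

# A minimal bag-of-words sketch: encode sentences as word-count vectors.
sentences = ["the dog ate the food", "the cat ate"]

vocab = sorted({word for s in sentences for word in s.split()})
# vocab: ['ate', 'cat', 'dog', 'food', 'the']

for s in sentences:
    print([s.split().count(word) for word in vocab])
# [1, 0, 1, 1, 2]
# [1, 1, 0, 0, 1]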
Now, back to the neuron itself. A typical Neural network diagram can look intimidating, but don't freak out about what everything means. Let us deduce it. The left-hand side is the input range X_1…X_n, and the "transfer function" is simply the function that combines all the inputs. It hands the result, in our case Y, to the activation function, which computes A.
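As a sketch, here is that single neuron in numpy. The inputs and weights are made-up values, and the sigmoid is just one possible choice of activation function.

# A single artificial neuron: Y is the linear combination, A the activation.
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

x = np.array([0.5, 0.3, 0.2])   # inputs X_1..X_3 (invented)
w = np.array([0.4, 0.7, -0.2])  # one weight per input (invented)

Y = np.dot(w, x)                # the "transfer function": combine all the inputs
A = sigmoid(Y)                  # the activation function computes A
print(Y, A)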
Another aspect to take into account when building a Neural network is the technique used for learning the weights. Calibrating the weights of the Neural network is the "training process". This is done by alternating between two techniques, "Forward propagation" and "Back propagation". Forward propagation is how we approached the above equation: applying the weights to the input data before computing the activation function. We received the output and could compare it to a real value to get a margin of error (checking if it is what we wanted). Backpropagation is the process of going backwards through the network to reduce that margin of error.
https://en.wikipedia.org/wiki/Bag-of-words_model
https://en.wikipedia.org/wiki/Natural_language_processing
https://en.wikipedia.org/wiki/Backpropagation
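Here is a sketch of one such training step for the single neuron from before. The squared-error loss and the 0.5 learning rate are my own assumptions, for illustration.

# One forward/backward pass for a single sigmoid neuron.
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

x = np.array([0.5, 0.3, 0.2])   # inputs (invented)
w = np.array([0.4, 0.7, -0.2])  # weights (invented)
target = 1.0                    # the "real value" we compare the output against

# Forward propagation: apply the weights, then the activation function.
A = sigmoid(np.dot(w, x))

# Back propagation: the chain rule gives the gradient of the squared
# error (A - target)**2 with respect to Y, then to each weight.
grad_Y = 2 * (A - target) * A * (1 - A)
learning_rate = 0.5
w = w - learning_rate * grad_Y * x  # nudge each weight to reduce the error
print(w)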
Here we can see a more complicated Neural network with many hidden layers. What are these extra layers even for? Well… our last example only had one layer, which really computed one specific function (it was task specific). Adding extra layers to the Neural network allows us to learn much more than a specific result; it allows us to classify things from raw data. This process is called Feature learning, and it can be used to analyse unstructured data such as images, video and sensor data. You have probably used applications that implement this technique before. Feature learning can be used to identify people, places, things, you name it. There have been great technological advancements in Deep learning and Feature learning recently, especially after the rise of Web 2.0.
https://en.wikipedia.org/wiki/Feature_learning
Chapter 4: Reinforcement Learning
	
	
Reinforcement learning is the learning-by-doing approach. To solve a Reinforcement learning problem, we must have the agent perform actions in any given situation to maximise its reward. There are two main strategies used in Reinforcement learning, which are…
1. Model-based; and
2. Model-free

Model-based is the strategy where the agent learns the "model" in order to produce the best action at any given time. This is done by finding the probability of landing in the desired states, and the rewards for doing so. How is it done? Keep a record of all the states the agent has been in when performing an action, and update a table of probabilities for landing in each state. Ah yeah… also keep a record of the rewards. That's how we determine the best action to take (the one with the highest expected reward). A sketch of this record keeping is shown below.
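Here is a minimal sketch of that record keeping in plain Python. The function names and the light switch example are illustrative, not a standard API.

# Model-based record keeping: count transitions and rewards.
from collections import defaultdict

transition_counts = defaultdict(int)  # (state, action, next_state) -> count
action_counts = defaultdict(int)      # (state, action) -> count
reward_totals = defaultdict(float)    # (state, action) -> summed reward

def record(state, action, next_state, reward):
    transition_counts[(state, action, next_state)] += 1
    action_counts[(state, action)] += 1
    reward_totals[(state, action)] += reward

def estimated_probability(state, action, next_state):
    """Estimated probability of landing in next_state after an action."""
    n = action_counts[(state, action)]
    return transition_counts[(state, action, next_state)] / n if n else 0.0

record("off", "flicked up", "on", reward=1.0)
record("off", "flicked up", "on", reward=1.0)
print(estimated_probability("off", "flicked up", "on"))  # 1.0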
Model-free is the strategy where the agent learns how to make great actions without knowing anything about the probability of landing in some state. How is it done? Q-Learning is one way. The agent learns an action-value function and uses it to perform the best action at every state. Shit, sounds pretty good! The action-value function simply assigns every action the agent can take a specific value, and the agent then chooses the action with the highest value.
https://en.wikipedia.org/wiki/Q-learning
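Here is a minimal sketch of tabular Q-Learning, reusing the light switch example. The learning rate and discount factor are arbitrary illustrative choices.

# Tabular Q-Learning. The standard update rule is:
# Q(s, a) <- Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a))
from collections import defaultdict

Q = defaultdict(float)   # the action-value function: (state, action) -> value
actions = ["flicked up", "flicked down"]
alpha, gamma = 0.1, 0.9  # learning rate and discount factor (arbitrary)

def update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

def best_action(state):
    """Choose the action with the highest learned value in this state."""
    return max(actions, key=lambda a: Q[(state, a)])

update("off", "flicked up", reward=1.0, next_state="on")
print(best_action("off"))  # "flicked up"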
Chapter 5: Intermission
	
Machine learning has already shown some insanely good results thus far. There is also an increasingly large number of people flocking to the field of AI, which should give birth to better designs for agents and ML algorithms. The rise of Deep learning has brought applications that almost have a mind of their own. If you're looking to use these algorithms in your application, but think it will be too complicated… fear not. Large companies (Google, Amazon etc.) provide cloud services with already-built ML algorithms that can be used quite easily. There are also ML libraries out there for integrating ML algorithms into your existing application; TensorFlow is just one. Machine Learning is incredibly interesting, and there is still so much more to come. The best thing about ML: there are always new and exciting concepts to learn!
You have made it halfway! Get yourself a drink, and be ready for the demonstration :)
https://www.tensorflow.org/
Chapter 6: Machine Learning with TensorFlow
	
	
Let's take a different approach: a more practical approach. This will be for those who are keen to improve their Machine Learning skills in the real world. So what will we build? Hmmm… let's build a Convolutional Neural Network (CNN). The Neural Network will be multi-layered, and we will use Python and Google's open-source library, TensorFlow.
We'll be using the MNIST dataset, as we can train our model on it without the need for a GPU. What is MNIST? It is an image database filled with hand-written digits. Ok… let's build a simple two-layer convolutional neural network, with maxpooling, dropout, and a couple of fully connected layers. We will also set up a log directory where we can catch log data from both the training and validation sets. This will help us monitor the performance graphically (using TensorBoard), rather than with plain old print statements.
Preliminaries

Python version 3.6 - Python can be found here
TensorFlow version 1.1.0 - you can install TensorFlow here
Import the following libraries:

import os
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt
https://www.python.org/downloads/release/python-360/
https://www.tensorflow.org/install/
Data Exploration

TensorFlow makes it really simple to obtain the MNIST dataset - just import input_data and call the method read_data_sets.

# import MNIST data
from tensorflow.examples.tutorials.mnist import input_data
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
Let's explore the 'mnist' object under the microscope and see what is inside it…

# We see that it's a Datasets type, which makes sense
type(mnist)

# Let's check the last 5 attributes of each of these objects
dir(mnist)[-5:]
dir(mnist.train)[-5:]
Images are typically stored as a two-dimensional array of pixels per channel. The MNIST dataset has only one channel, hence there is no colour. Below we see that there are 55,000 images in the training set, but each image is represented as a vector of length 784. This length represents the flattened version of a 28x28 pixel image.

mnist.train.images.shape
# Out: (55000, 784)

To view an image, we must first convert it back into matrix form. We do this using numpy's reshape method. Reshape the image into its original 28x28 form, then display the image in black and white using the cmap='gray' option. Notice the numbers and tick marks on the x and y axes of the plot, confirming our notion of the 28x28 pixel size of each image.

# Let's see an example of an image in the training set
plt.imshow(mnist.train.images[0].reshape((28, 28)), cmap='gray')
	
	
Ok, still with me? Let's now write a function to make it easier to sample a few images at a time, displaying them in a 3x3 grid. This makes sampling a faster process.

def show_grid_3x3(images):
    """Display a 3x3 grid of 9 randomly sampled images.

    :param images: A batch of image data; numpy array with shape
        (batch_size, 784).
    """
    plt.rcParams['figure.figsize'] = 6, 6
    fig, axes = plt.subplots(nrows=3, ncols=3, sharex=True, sharey=True)
    rand_idx = np.random.choice(images.shape[0], 9, replace=False)  # get 9 random indices
    images = images[rand_idx]
    for i in range(3):
        for j in range(3):
            axes[i, j].imshow(images[i + 3*j].reshape((28, 28)), cmap='gray')
    plt.tight_layout()
	
Cool! Now let's call the show_grid_3x3 function on the training set.

show_grid_3x3(mnist.train.images)
TensorBoard Setup

We'll use TensorBoard to visualize several aspects of our neural network, such as the distribution of the weights and biases over time, the classification accuracy of the training and validation sets, and the computational graph. We also need to create a log file directory for when the neural network starts running. We are going to write a function that creates a directory path with a timestamp; we wouldn't want TensorFlow overwriting our previous logs every time we run the code.
# For logging
from datetime import datetime

def logdir_timestamp(root_logdir="tf_logs"):
    """Return a string with a timestamp to use as the log directory for TensorBoard."""
    now = datetime.utcnow().strftime("%Y%m%d_%H%M%S")
    return os.path.join(root_logdir, "run-{}/".format(now))

logdir = logdir_timestamp()
We may now run TensorBoard from a terminal and instruct it to monitor the directory named tf_logs:

mkdir tf_logs
tensorboard --logdir=tf_logs

Navigate to localhost:6006 in your web browser to view the TensorBoard console.
Feel free to have a look around, but there won't be anything there until we use a FileWriter to write some data to disk while the neural network is running.
Graph Construction

In TensorFlow, we must first construct a graph. At this stage, we lay down the blueprint for our neural network, but no actual operations are executed. Once the graph is complete, we will create a TensorFlow session in which we can execute the operations defined in the graph.
Let's have a look at what the graph should look like when we are done. We'll step through one layer at a time, starting from the bottom, where X is reshaped and fed into the convolutional layer.
Create Data input tensors
The first step is to create placeholders for the data to feed into the graph. We'll create a variable X to represent a batch of images, and the variable y_ to represent the corresponding labels for each image. Notice that we expect the input as a flattened vector, because that is the form in which we obtained the MNIST data. But since we are performing convolutions in this neural network, we would like to retain the two-dimensional spatial structure of the image data, so we reshape X and assign it to the variable X_image.
Shown below are the two methods returning placeholders for the graph:

def neural_net_image_input(image_shape):
    """Construct a tensor for a batch of image input.

    :param image_shape: Shape of the images as a list-like object
    :return: Tensor for image input
    """
    shape = (None, *image_shape)
    return tf.placeholder(tf.float32, shape=shape, name="X")


def neural_net_label_input(n_classes):
    """Construct a tensor for a batch of label input.

    :param n_classes: Number of classes
    :return: Tensor for label input
    """
    shape = (None, n_classes)
    return tf.placeholder(tf.float32, shape=shape, name="y")

Below we input the length 784 into the Neural Network (NN); remember, this is the length of the flattened image vector. The labels placeholder, denoted y_, has a shape of 10, as there are ten different digits to be classified in the dataset. When creating a placeholder, we use the value None to indicate an arbitrarily sized batch of images or labels.

X = neural_net_image_input([784])
y_ = neural_net_label_input(10)
X_image = tf.reshape(X, [-1, 28, 28, 1])  # reshaped to [batch_size, rows, cols, channels]
	
Create the first convolutional layer
We can now write a function to create a convolutional layer, since we'll be repeating this step to create another layer.
We initialize the weights by sampling from a truncated normal distribution with a standard deviation of 0.1. A truncated normal distribution is similar to a normal distribution, but if a weight is more than two standard deviations away from the mean, it is dropped and re-picked. We hard-code the filter (also called a kernel) to have a size of 5x5. See this for a visualization of how convolutional filters work.
In the first layer we input a single-channel image, so the size_in variable is set to 1. size_out is the number of convolutional filters we want to create; in this case, 32. The size of the filter and the number of filters are hyperparameters we can experiment with in an effort to improve performance - the current values are by no means optimal!
The image placeholder and the newly initialized weights are passed into the tf.nn.conv2d TensorFlow library function. To learn more about strides and padding, please refer to the TensorFlow documentation.
tf.nn.relu is another TensorFlow library function, applied to the result of the conv2d operation. ReLU is an abbreviation for rectified linear unit, which returns the value of its argument or 0, whichever is greater.
def convolution_layer(inp, size_in, size_out, name="convolution"):
    """Create a convolutional layer with a filter of size 5x5 and
    size_out number of filters.

    Applies a stride of [1, 1, 1, 1] with SAME padding, and applies the
    ReLU activation. No downsampling within this layer - returns the
    tensor with the activation function applied only.
    """
    with tf.name_scope(name):
        # Hard-code a convolutional filter of size 5x5
        W = tf.Variable(tf.truncated_normal([5, 5, size_in, size_out], stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[size_out]), name="b")
        conv = tf.nn.conv2d(inp, W, strides=[1, 1, 1, 1], padding='SAME')
        act = tf.nn.relu(conv + b)
        tf.summary.histogram("weights", W)
        tf.summary.histogram("biases", b)
        tf.summary.histogram("activations", act)
        return act
	
	
Turning to the TensorFlow graph, let's look at what is actually happening inside the first convolutional layer. The graph appears to show a fairly straightforward representation of the code…
http://setosa.io/ev/image-kernels/
https://www.tensorflow.org/versions/r1.0/api_docs/python/tf/nn/conv2d
	
	
Assign the output of the convolution_layer function to a variable named act1. This will be used as the input for the next layer.

act1 = convolution_layer(X_image, 1, 32, "convolution1")
Create the first downsampling layer
The output of the convolution layer is downsampled using maxpooling with a kernel of size 2x2. This means that the maximum value is taken over every 2x2 region of the input. It reduces the spatial size of the input, effectively reducing the number of parameters in later layers of the network and thereby reducing computational complexity and the propensity to overfit. We'll return to the topic of overfitting when we discuss the TensorBoard graphs showing the training and validation set accuracies.
https://en.wikipedia.org/wiki/Overfitting

def downsample_layer(act, name="maxpooling"):
    """Create a downsampling layer by applying maxpooling with a
    hard-coded kernel size of [1, 2, 2, 1] and strides of [1, 2, 2, 1]
    with SAME padding.
    """
    with tf.name_scope(name):
        return tf.nn.max_pool(act, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')

Notice below how the spatial size is reduced after the maxpool operation - from 28x28 to 14x14.
	
Store the output of the downsampling layer in the variable h_pool1.

h_pool1 = downsample_layer(act1, "downsample1")
Create the second convolutional layer
The structure of the second convolutional layer is identical to the first one. It might be hard to see below, but notice the size of the tensors coming in and the tensors going out - 14x14x32 to 14x14x64.

This time, set the input size to 32, and create 64 convolutional filters.

act2 = convolution_layer(h_pool1, 32, 64, "convolution2")
Create the second downsampling layer
Once again, notice the shape of the outgoing tensor. We would like to flatten this tensor into a vector so that we can connect every single neuron together in the dense layer, a.k.a. a fully connected layer. This is the reason for the 7*7*64 value in the reshape operation - the input is a 7x7x64 tensor, which will be converted into a vector of length 7*7*64 = 3136. The same value is then passed into the dense_layer function to create tensors of weights and biases sized appropriately.

h_pool2 = downsample_layer(act2, "downsample2")
Create the first dense layer
The dense layer performs a simple matrix multiplication followed by adding the biases. This time, we do not apply an activation function within the layer. Why? So we can apply a different activation function (softmax) to the output of the final layer. After the first dense layer, the ReLU activation function is applied separately, outside the dense_layer function.

def dense_layer(inp, size_in, size_out, name="dense"):
    """Create a fully connected layer with size [size_in, size_out].

    Initializes the weights with standard deviation 0.1. Returns the
    tensor without applying any activation function.
    """
    with tf.name_scope(name):
        W = tf.Variable(tf.truncated_normal([size_in, size_out], stddev=0.1), name="W")
        b = tf.Variable(tf.constant(0.1, shape=[size_out]), name="b")
        act = tf.matmul(inp, W) + b
        tf.summary.histogram("weights", W)
        tf.summary.histogram("biases", b)
        tf.summary.histogram("activations", act)
        return act
	
	
Notice the size of the output - 1024. This will be the number of neurons in the second fully connected layer. Before we get to the next layer, however, we apply the dropout technique.

h_pool2_flat = tf.reshape(h_pool2, [-1, 7*7*64])
h_fc1 = tf.nn.relu(dense_layer(h_pool2_flat, 7*7*64, 1024, "dense1"))
Dropout
Dropout is a regularization technique which controls overfitting. During the training phase, a fixed proportion of randomly selected neurons is disabled. In this example, we use a keep probability of 0.5, injected into a placeholder while the network is running. So, in every training iteration, half the neurons in this layer are disabled. Note that this is only done during training, and not when generating predictions on a test set.

def dropout(inp, keep_prob, name="dropout"):
    """Apply dropout with keep probability defined by the placeholder tensor keep_prob."""
    with tf.name_scope(name):
        return tf.nn.dropout(inp, keep_prob)

keep_prob = tf.placeholder(tf.float32)
h_fc1_drop = dropout(h_fc1, keep_prob)
Create the second dense layer

Set the output size for the final fully connected layer to equal the number of classes, which is 10 for the MNIST dataset.

y_conv = dense_layer(h_fc1_drop, 1024, 10, "dense2")

We want each of the 10 neurons to output a probability. We can apply the softmax activation function to do this. In order to evaluate the model, we will also need a cost function. For classification problems, a frequent choice is cross-entropy. TensorFlow has a function that will perform both of these operations in a way that is numerically stable.
As in the functions we created for each of the layers, we use name scopes so that TensorFlow groups all the ops in the with block inside the computational graph. This helps keep the graph looking nice and clean. You can try creating a graph without the name scopes, just to get a visual on how it looks.

with tf.name_scope("xentropy"):
    cross_entropy = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=y_, logits=y_conv))
Let's use the Adam optimizer to minimize the loss function. You might want to consider picking a learning rate with a smaller value, such as 1e-4. The learning rate is another important hyperparameter to tune - a value that is too small will require unnecessarily long training times, but a value that is too large may not achieve a good local minimum for the cross-entropy loss function.
https://en.wikipedia.org/wiki/Softmax_function
https://rdipietro.github.io/friendly-intro-to-cross-entropy-loss/

lr = 1e-2  # Learning rate

with tf.name_scope("train"):
    optimizer = tf.train.AdamOptimizer(learning_rate=lr)
    training_op = optimizer.minimize(cross_entropy)
We'll execute the training_op variable in the TensorFlow session. We'll also create an operation to compute the accuracy of our model.

with tf.name_scope("accuracy"):
    correct_prediction = tf.equal(tf.argmax(y_conv, 1), tf.argmax(y_, 1))
    accuracy = tf.reduce_mean(tf.cast(correct_prediction, dtype=tf.float32))
    tf.summary.scalar('accuracy', accuracy)
Create some file writers to save log data for TensorBoard to use for the visualizations.

write_op = tf.summary.merge_all()
writer_train = tf.summary.FileWriter(logdir + 'train', tf.get_default_graph())
writer_val = tf.summary.FileWriter(logdir + 'val', tf.get_default_graph())
Graph Execution

With the graph construction complete, we can now begin the execution stage. Here we create a TensorFlow session, in which we repeatedly run training_op. Even though we created variables earlier, they have to be initialized before we can actually use them. Rather than individually initializing each variable, you can use tf.global_variables_initializer(). Inside the for loop, a randomly sampled batch of 100 images is obtained from each of the training and validation sets. On every fifth iteration, TensorFlow writes information to disk via the write_op operation we defined earlier. Notice that we feed the placeholders with the feed_dict argument. Once training is complete, the model is evaluated by running it on the test set. The result is then printed to the console.

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(1001):
        batch_X, batch_y = mnist.train.next_batch(100)
        val_batch_X, val_batch_y = mnist.validation.next_batch(100)
        if i % 5 == 0:
            # Log training accuracy (dropout disabled: keep_prob=1.0)
            summary_str = sess.run(write_op, feed_dict={X: batch_X, y_: batch_y, keep_prob: 1.0})
            writer_train.add_summary(summary_str, i)
            writer_train.flush()
            # Log validation accuracy
            summary_str = sess.run(write_op, feed_dict={X: val_batch_X, y_: val_batch_y, keep_prob: 1.0})
            writer_val.add_summary(summary_str, i)
            writer_val.flush()
        # One training step, with dropout enabled
        training_op.run(feed_dict={X: batch_X, y_: batch_y, keep_prob: 0.5})

    test_accuracy = accuracy.eval(feed_dict={X: mnist.test.images, y_: mnist.test.labels, keep_prob: 1.0})
    print('Test accuracy {}'.format(test_accuracy))
TensorBoard Visualization

While the graph is executing, you can observe its progress through the TensorBoard interface. You should see some visualizations that look something like the following:

This is perhaps the most important graph. It shows the classification accuracy of the training set (green) and validation set (yellow). In general, we want the training and validation accuracies to track each other fairly closely. The gap between the training and validation accuracy shows how much your model is overfitting - if the training accuracy is noticeably higher than the validation accuracy, your model is overfitting. On the other hand, it is possible that the model is underfitting if the accuracies are too close together - this would mean the model is too simple to capture the complexity of the data.
For simplicity, the accuracy here is plotted against the number of iterations, but normally we would place the number of epochs on the x-axis. Check this out for more info.
http://cs231n.github.io/neural-networks-3/#accuracy
	
Other useful visualizations to look at are the distributions and histograms of the parameters and the activations for each layer of the network. The distribution and histogram plots essentially give you two different ways of visualizing the same thing - the distribution of parameters evolving over time. For example, in the top-right graph above (the dense1 layer biases), you can see the variance increasing over time, whereas the mean is decreasing, indicated by the distribution shifting slightly to the left.
You can use these plots to diagnose problems such as an incorrect initialization of the parameters in your model. Watch out for distributions getting stuck at 0, or at the extreme ends of the range of the activation function (in the case of bounded activations).
	
Want to learn more about TensorBoard? The official TensorFlow site offers tutorials for sharpening your skills at building predictive models. They can be found here:
https://www.tensorflow.org/tutorials/
Logging Off

Machine learning has already shown some great results thus far. There is also an increasingly large number of people flocking to the field of AI, which should give birth to better designs for agents and ML algorithms. The rise of Deep learning has brought applications that almost have a mind of their own. If you're looking to use these algorithms in your application, but think it will be too complicated… fear not. Large companies (Google, Amazon etc.) provide cloud services with already-built ML algorithms that can be used quite easily. There are also ML libraries out there for integrating ML algorithms into your existing application; TensorFlow is just one. Machine Learning is incredibly interesting, and there is still so much more to come. The best thing about ML: there are always new and exciting concepts to learn!
Thanks for reading! Please leave a review if you liked the book :)
https://www.tensorflow.org/