Porn Data Anaylize — 上传者 分类信息分析(github)

'''
视频作者 视频分类信息分析
http://www.h4ck.org.cn
by obaby
obaby@mars
email:root@obaby.org.cn
date: 2020.09.04
'''
from pyspark.sql.functions import col
import altair as alt
import pandas as pd
from matplotlib import pyplot as plt
%matplotlib inline
csv = spark.read.option("header",True).csv("hdfs://localhost:9000/data2/porn_data_movie.csv")
csv.printSchema()
root
 |-- id: string (nullable = true)
 |-- create: string (nullable = true)
 |-- update: string (nullable = true)
 |-- name: string (nullable = true)
 |-- describe: string (nullable = true)
 |-- source_id: string (nullable = true)
 |-- publish_time: string (nullable = true)
 |-- play_count: string (nullable = true)
 |-- good_count: string (nullable = true)
 |-- bad_count: string (nullable = true)
 |-- link_count: string (nullable = true)
 |-- comment_count: string (nullable = true)
 |-- designation: string (nullable = true)
 |-- category_id: string (nullable = true)
 |-- porn_site_id: string (nullable = true)
 |-- uploader_id: string (nullable = true)
 |-- producer: string (nullable = true)
csv.select('name', 'describe', 'uploader_id').show()
Continue Reading

Porn Data Anaylize — 标签 模特信息分析(github)

from pyspark.sql.functions import col
import altair as alt

import pandas as pd
from matplotlib import pyplot as plt
get_ipython().run_line_magic('matplotlib', 'inline')
csv = spark.read.option("header",True).csv("hdfs://localhost:9000/data2/porn_data_movie_tags.csv")
tag_csv = spark.read.option("header",True).csv("hdfs://localhost:9000/data2/porn_data_tag.csv")
csv.show()

+---+--------+------+
| id|movie_id|tag_id|
+---+--------+------+
|  1|    9909|     1|
|  2|    9909|     2|
|  3|    9909|     3|
|  4|    9909|     4|
|  5|    9910|     5|
|  6|    9910|     6|
|  7|    9910|     7|
|  8|    9910|     8|
|  9|    9910|     9|
| 10|    9910|    10|
| 11|    9911|    12|
| 12|    9911|     2|
| 13|    9911|     1|
| 14|    9911|    13|
| 15|    9910|    11|
| 16|    9911|    14|
| 17|    9911|    15|
| 18|    9911|     5|
| 19|    9910|    16|
| 20|    9910|    17|
+---+--------+------+
only showing top 20 rows

Continue Reading

jupyter notebook 调整字体 以及matplotlib显示中文

原生的jupyter theme看起来比较蛋疼,尤其是字体和字号。为了修改这个配置可以安装 jupyter theme。

项目链接: https://github.com/dunovank/jupyter-themes 如果不喜欢英文可以参考这个链接:https://www.jianshu.com/p/6de5f6cce06d

上面的样式对应的配置命令:
jt  -f fira -fs 11 -cellw 90% -ofs 11 -dfs 11 -T -t solarizedl

除此之外matplotlib 默认不支持中文显示,主要是字体问题,可以通过下面的代码配置来让matplotlib 支持中文

from matplotlib import pyplot as plt
%matplotlib inline
font = {'family' : 'MicroSoft YaHei',
'weight' : 'bold',
'size' : 10}
plt.rc("font", **font)

实际效果,另外还可以使用altair ,altair 默认支持中文显示 https://altair-viz.github.io