Bibliometric Analysis in Practice (2): pyBibX

pyBibX is a Python library well suited to bibliometric analysis, with a rich set of methods.

1 Installation

pip install pyBibx

2 Data

See the resource attached to this article.

3 Hands-on analysis

3.1 Loading the data and generating a summary report

import numpy as np
import pandas as pd
import textwrap
from pyBibX.base import pbx_probe
from prettytable import PrettyTable
file_name = 'pubmed.txt'
database = 'pubmed'
# load data
bibfile = pbx_probe(file_bib=file_name, db=database, del_duplicated=True)
# Generate EDA (Exploratory Data Analysis) Report
report = bibfile.eda_bib()
# Check Report
report

3.2 Word cloud

bibfile.word_cloud_plot(entry='kwa', size_x=15, size_y=10, wordsn=500)

3.3 Ranking words by importance

# Check Table for important word
table = PrettyTable()
data_wd = bibfile.ask_gpt_wd
table.field_names = ['Word', 'Importance']
for key, value in data_wd.items():
    table.add_row([key, round(value, 4)])
print(table)
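
The table above lists words in whatever order the dictionary stores them. To get an actual ranking, the entries can be sorted by value first. A minimal sketch, assuming `data_wd` is a plain word-to-importance dictionary (the loop above already treats it as one); the words and scores below are made up for illustration:

```python
# Hypothetical word -> importance mapping; in the tutorial the real values
# come from bibfile.ask_gpt_wd after the word cloud has been drawn.
data_wd = {"cancer": 0.1234567, "therapy": 0.0456789, "cell": 0.0891011}

# Sort by importance, highest first
ranked = sorted(data_wd.items(), key=lambda kv: kv[1], reverse=True)

for word, importance in ranked:
    print(f"{word:<10}{importance:.4f}")
```

The same `ranked` list can be fed to `table.add_row` in place of `data_wd.items()`.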

3.4 N-grams: an n-gram is a sequence of n consecutive items in a text document; the items may be words, numbers, symbols, or punctuation marks.

bibfile.get_top_ngrams(view='notebook', entry='kwp', ngrams=4, stop_words=[], rmv_custom_words=[], wordsn=15)
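
To make the definition concrete, here is a small stand-alone sketch that extracts and counts n-grams with the standard library only. It is not how pyBibX computes them internally; the `ngrams` helper and the sample text are assumptions for illustration:

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Hypothetical keyword text, used only for illustration
text = "deep learning for medical image analysis and deep learning for genomics"
tokens = text.split()

# Count the 2-grams (bigrams) and keep the most frequent ones
bigram_counts = Counter(ngrams(tokens, 2))
top_bigrams = bigram_counts.most_common(3)
print(top_bigrams)
```

With `ngrams=4`, as in the call above, the window simply widens to four consecutive tokens.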

3.5 Document clustering

projection, labels = bibfile.docs_projection(view='notebook',corpus_type='abs',
                                             stop_words=['en'],rmv_custom_words=[],
                                             custom_label=[],custom_projection=[],
                                             n_components=2,n_clusters=5,
                                             tf_idf=False,embeddings=False,
                                             method='umap')
data_pr = pd.DataFrame(np.hstack([projection, labels.reshape(-1,1)]))
# Check Articles per Cluster
cluster = 1
idx_articles = [i for i in range(0, labels.shape[0]) if labels[i] == cluster]
print(*idx_articles, sep=', ')
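
The snippet above inspects one cluster at a time. A rough sketch of listing every cluster in one pass is shown below; it uses a plain Python list as a stand-in for the `labels` array returned by `docs_projection`, and the label values are invented:

```python
from collections import defaultdict

# Hypothetical cluster labels, one per article (stand-in for the `labels` array)
labels = [0, 1, 1, 2, 0, 1, 4, 3, 2, 1]

# Group article indices by the cluster they were assigned to
articles_by_cluster = defaultdict(list)
for idx, label in enumerate(labels):
    articles_by_cluster[int(label)].append(idx)

for cluster in sorted(articles_by_cluster):
    ids = articles_by_cluster[cluster]
    print(f"cluster {cluster}: {len(ids)} articles -> {ids}")
```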

3.6 Keyword evolution by year

bibfile.plot_evolution_year(view='notebook',
                            stop_words=['en'],
                            rmv_custom_words=[],
                            key='kwp',
                            topn=10,
                            start=2010,
                            end=2021)
# View Table
data_ep = bibfile.ask_gpt_ep
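
As a rough illustration of the idea behind a per-year keyword view, the sketch below counts keywords per year and keeps the top n within a year range. It is an assumption about the general technique, not pyBibX's implementation; the records and the `start`, `end`, and `topn` values are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical (year, keywords) records, one per article
records = [
    (2019, ["sepsis", "mortality"]),
    (2019, ["sepsis", "biomarkers"]),
    (2020, ["covid-19", "sepsis"]),
    (2020, ["covid-19", "mortality"]),
    (2021, ["covid-19", "vaccination"]),
]
start, end, topn = 2019, 2021, 2

# Tally keyword frequencies separately for each year in the range
counts_by_year = defaultdict(Counter)
for year, keywords in records:
    if start <= year <= end:
        counts_by_year[year].update(keywords)

# Keep only the topn most frequent keywords of each year
top_by_year = {
    year: counts_by_year[year].most_common(topn)
    for year in sorted(counts_by_year)
}
for year, top in top_by_year.items():
    print(year, top)
```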

3.7 Sankey diagram (authors, countries, institutions, languages)

bibfile.sankey_diagram(view='notebook', entry=['aut', 'cout', 'inst', 'lan'], topn=10)
# View Table
data_sk = bibfile.ask_gpt_sk
pd.DataFrame(data_sk)

3.8 Journal treemap

bibfile.tree_map(entry='jou', topn=20, size_x=30, size_y=30)

3.9 Author productivity plot

bibfile.authors_productivity(view='notebook', topn=20)

3.10 Bar chart of author output

bibfile.plot_bars(statistic='apd', topn=20, size_x=15, size_y=10)
# View Table
data_bp = bibfile.ask_gpt_bp

3.11 Collaboration network

bibfile.network_adj(view = 'notebook', adj_type = 'aut', min_count = 5, node_labels = True, label_type = 'name', centrality = None)
# PS: If a centrality criterion is used then the values can be obtained by the following command:  bibfile.table_centr
# View Table
data_adj = bibfile.ask_gpt_adj
bibfile.find_nodes(node_ids = [], node_name = ['youngkong s'], node_only = False)

3.12 World map of collaborations

bibfile.network_adj_map(view = 'browser', connections = True, country_lst = [])

3.13 NLP: natural language processing

# NLP
# Arguments: corpus_type       = 'abs', 'title', 'kwa', or 'kwp';
#            stop_words        = A list of stopwords to clean the corpus. ['ar', 'bn', 'bg', 'cs', 'en', 'fi', 'fr', 'de', 'el', 'hi', 'he', 'hu', 'it', 'ja', 'ko',  'mr', 'fa', 'pl', 'pt-br', 'ro', 'ru', 'es', 'sv', 'sk', 'zh', 'th', 'uk'];
#                                'ar' = Arabic; 'bn' = Bengali; 'bg' = Bulgarian; 'cs' = Czech; 'en' = English; 'fi' = Finnish; 'fr' = French; 'de' = German; 'el' = Greek; 'he' = Hebrew;…n;
#                                'ja' = Japanese; 'ko' = Korean; 'mr' =  Marathi; 'fa' =  Persian; 'pl' =  Polish; 'pt-br' = Portuguese-Brazilian; 'ro' = Romanian; 'ru' = Russian; 'es' =  Spanish; 'sk' = Slovak; 'sv' = Swedish;
#                                'zh' = Chinese; 'th' = Thai; 'uk' = Ukrainian
#            rmv_custom_words  = A list of custom stopwords to clean the corpus;
bibfile.create_embeddings(stop_words = ['en'], rmv_custom_words = [], corpus_type = 'abs')
emb = bibfile.embds
# NLP #-1 refers to all outliers and should typically be ignored.
# Arguments: stop_words        = A list of stopwords to clean the corpus. ['ar', 'bn', 'bg', 'cs', 'en', 'fi', 'fr', 'de', 'el', 'hi', 'he', 'hu', 'it', 'ja', 'ko',  'mr', 'fa', 'pl', 'pt-br', 'ro', 'ru', 'es', 'sv', 'sk', 'zh', 'th', 'uk'];
#                               'ar' = Arabic; 'bn' = Bengali; 'bg' = Bulgarian; 'cs' = Czech; 'en' = English; 'fi' = Finnish; 'fr' = French; 'de' = German; 'el' = Greek; 'he' = Hebrew;'hi' = Hindi; 'hu' = Hungarian; 'it' = Italian;
#                               'ja' = Japanese; 'ko' = Korean; 'mr' =  Marathi; 'fa' =  Persian; 'pl' =  Polish; 'pt-br' = Portuguese-Brazilian; 'ro' = Romanian; 'ru' = Russian; 'es' =  Spanish; 'sk' = Slovak; 'sv' = Swedish;
#                               'zh' = Chinese; 'th' = Thai; 'uk' = Ukrainian
#            rmv_custom_words  = A list of custom stopwords to clean the corpus;
#            embeddings        = True or False. If True then word embeddings are used to create the topics
bibfile.topics_creation(stop_words = ['en'], rmv_custom_words = [], embeddings = True)
# NLP
# Topic assigned to each document
topics = bibfile.topics
# NLP
# Probability of each document belonging to its topic
probs = bibfile.probs
# NLP
# Arguments (same for all four plotting calls below):
#            view = 'notebook', 'browser' ('notebook' -> To plot in your preferred Notebook App. 'browser' -> To plot in your preferred browser window)
bibfile.graph_topics_distribution(view = 'notebook')
bibfile.graph_topics(view = 'notebook')
bibfile.graph_topics_projection(view = 'notebook')
bibfile.graph_topics_heatmap(view = 'notebook')
# NLP
similar_topics, similarity = bibfile.topic_model.find_topics('electre', top_n = 10)
for i in range(0, len(similar_topics)):
  print('Topic: ', similar_topics[i], 'Correlation: ', round(similarity[i], 3))
# NLP
bibfile.topic_model.save('my_topic_model')
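
The comments above note that topic -1 collects the outliers and should usually be ignored. A small sketch of doing that when tallying topic sizes, assuming `bibfile.topics` holds one integer topic id per document; the ids below are invented for illustration:

```python
from collections import Counter

# Hypothetical per-document topic ids (stand-in for bibfile.topics); -1 marks outliers
topics = [0, -1, 2, 0, 1, -1, 0, 2, 1, 0]

# Count documents per topic, skipping the outlier bucket
topic_sizes = Counter(t for t in topics if t != -1)

# Share of documents that were not assigned to any topic
outlier_share = topics.count(-1) / len(topics)

for topic, size in sorted(topic_sizes.items()):
    print(f"topic {topic}: {size} documents")
print(f"outliers: {outlier_share:.0%} of documents")
```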

abs_summary = bibfile.summarize_abst_peg(article_ids = [305, 34, 176], model_name = './pegasus-xsum')
# NLP - Check Abstractive Summarization
print(textwrap.fill(abs_summary, 150))
abs_summary_chat = bibfile.summarize_abst_chatgpt(article_ids = [305, 34, 176], join_articles = True, api_key = 'your_api_key_here', query = 'from the following scientific abstracts, summarize the main information in a single paragraph using around 250 words', model = 'gpt-4')
# NLP - Check Abstractive Summarization
print(textwrap.fill(abs_summary_chat, 250))
# NLP - Extractive Summarization
# Arguments: article_ids = A list of documents to perform an extractive summarization with the available abstracts. If the list is empty then all documents will be used
ext_summary = bibfile.summarize_ext_bert(article_ids = [305, 34, 176])
# NLP - Check Extractive Summarization
print(textwrap.fill(ext_summary, 150))

The calls above generate summaries of the selected abstracts.

3.14 Filtering articles

bibfile.filter_bib(documents = [], doc_type = [], year_str = -1, year_end = -1, sources = [], core = -1, country = [], language = [], abstract = False)

If you are interested, see the original guide:

https://colab.research.google.com/drive/13CU-KvZMnazga1BmQf2J8wYM9mhHL2e1?usp=sharing#scrollTo=_11EAT72ED4N

Article link: https://blog.csdn.net/weixin_49320263/article/details/136015906
